CN111933129A - Audio processing method, language model training method and device and computer equipment

Audio processing method, language model training method and device and computer equipment

Info

Publication number
CN111933129A
CN111933129A
Authority
CN
China
Prior art keywords
sequence, audio, context, word, language model
Prior art date
Legal status
Granted
Application number
CN202010952838.6A
Other languages
Chinese (zh)
Other versions
CN111933129B (en)
Inventor
黄江泉
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010952838.6A priority Critical patent/CN111933129B/en
Publication of CN111933129A publication Critical patent/CN111933129A/en
Application granted granted Critical
Publication of CN111933129B publication Critical patent/CN111933129B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L 15/19 Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 2015/025 Phonemes, fenemes or fenones being the recognition units

Abstract

The application discloses an audio processing method, a language model training method, an apparatus, a system, a computer device, and a storage medium, and belongs to the technical field of signal processing. After the phoneme sequence of a target audio is obtained, a first word sequence is obtained based on conventional speech recognition; in addition, context information is introduced to perform speech recognition again under a specific context, yielding a second word sequence that matches the context information of that specific context. The first word sequence and the second word sequence are then considered together to decode the final semantic information. Introducing the second word sequence strengthens the occurrence probability, in the semantic information, of words that conform to the specific context, reduces misjudgment of key words in the semantic information, and improves the accuracy of the automatic speech recognition process, thereby improving the accuracy of the audio processing process.

Description

Audio processing method, language model training method and device and computer equipment
Technical Field
The present application relates to the field of signal processing technologies, and in particular, to an audio processing method, a language model training method, an apparatus, a system, a computer device, and a storage medium.
Background
In the field of signal processing, Automatic Speech Recognition (ASR) is an active research topic. ASR is a technology for converting human speech into text and can be applied, for example, to intelligent scoring systems for spoken-language examinations in the education field.
In such an intelligent scoring system, the examinee's audio is first recognized as text by ASR, and scoring is then performed based on specified rules (e.g., keyword matching), machine learning, or Natural Language Processing (NLP). In current ASR technology, the examinee's audio is used as the input signal; features such as waveform, spectrum, and tone are extracted and fed into an ASR model, a list of candidate texts is obtained by decoding, and the final recognition text is selected from the candidates through a scoring mechanism.
In the above process, the score points of a question are often controlled by keywords. For example, for the question "How do you go home after school?", if the ASR model incorrectly recognizes the examinee's audio "Walking home" as "Working home", the misrecognition of the keyword "Walking" leads to an incorrect judgment of the examinee's score. A method that can improve the accuracy of the audio processing process is therefore urgently needed.
Disclosure of Invention
The embodiments of the present application provide an audio processing method, a language model training method, an apparatus, a system, a computer device, and a storage medium, which can improve the accuracy of the audio processing process. The technical solutions are as follows.
In one aspect, an audio processing method is provided, and the method includes:
acquiring a phoneme sequence used for representing the pronunciation sequence of the syllables in the target audio;
acquiring a first word sequence matched with the phoneme sequence based on the phoneme sequence;
acquiring a second word sequence matched with the phoneme sequence and the context information based on the context information of the target audio and the phoneme sequence, wherein the context information is used for representing the context associated with the target audio;
and determining semantic information corresponding to the target audio based on the first word sequence and the second word sequence.
In one aspect, a method for training a language model is provided, the method comprising:
based on a reference text of a sample audio, acquiring an extended text with semantic similarity higher than a first threshold value with the reference text, wherein the reference text is a context text related to the generation context of the sample audio;
acquiring extended keywords with semantic similarity higher than a second threshold value with the reference keywords based on the reference keywords in the reference text;
acquiring the reference text, the reference keywords, the extended text and the extended keywords as context information of the sample audio;
training an initial language model based on the phoneme sequence of the sample audio, the semantic information of the sample audio and the context information of the sample audio to obtain a context language model.
In one aspect, an audio processing system is provided, which includes a terminal and a server;
the terminal is used for sending a target audio to the server;
the server is used for acquiring a phoneme sequence used for representing the pronunciation sequence of the syllables in the target audio; acquiring a first word sequence matched with the phoneme sequence based on the phoneme sequence; acquiring a second word sequence matched with the phoneme sequence and the context information based on the context information of the target audio and the phoneme sequence, wherein the context information is used for representing the context associated with the target audio; and determining semantic information corresponding to the target audio based on the first word sequence and the second word sequence.
In one aspect, an audio processing apparatus is provided, the apparatus comprising:
a first acquisition module, configured to acquire a phoneme sequence representing the pronunciation order of syllables in a target audio;
the second acquisition module is used for acquiring a first word sequence matched with the phoneme sequence based on the phoneme sequence;
a third obtaining module, configured to obtain a second word sequence that matches both the phoneme sequence and the context information based on context information of the target audio and the phoneme sequence, where the context information is used to represent a context associated with the target audio;
and the determining module is used for determining semantic information corresponding to the target audio based on the first word sequence and the second word sequence.
In one possible implementation, the third obtaining module includes:
and the processing unit is used for calling a context language model to process the phoneme sequence and outputting the second word sequence, and the context language model is used for converting the input phoneme sequence into the second word sequence matched with the context information.
In one possible implementation, the processing unit is configured to:
inputting at least one phoneme in the phoneme sequence into the context language model, and acquiring a plurality of matching probabilities through the context language model, wherein one matching probability is used for expressing the matching degree between one phoneme and one alternative word in the context information;
and determining a sequence formed by at least one alternative word with the maximum matching probability with the at least one phoneme as the second word sequence.
In one possible embodiment, the determining module is configured to:
determining a plurality of alternative texts based on the first word sequence and the second word sequence, wherein one alternative text is used for representing a combination condition of alternative words in the first word sequence and the second word sequence;
and scoring the plurality of candidate texts, and determining the candidate text with the highest score as the semantic information.
In one possible implementation, the second obtaining module is configured to:
and calling a basic language model to process the phoneme sequence and output the first word sequence, wherein the basic language model is used for converting the input phoneme sequence into the first word sequence with consistent pronunciation.
In one aspect, an apparatus for training a language model is provided, the apparatus comprising:
a first acquisition module, configured to acquire, based on a reference text of a sample audio, an extended text whose semantic similarity with the reference text is higher than a first threshold, the reference text being a context text related to the generation context of the sample audio;
the second acquisition module is used for acquiring expanded keywords with semantic similarity higher than a second threshold value with the reference keywords based on the reference keywords in the reference text;
a third obtaining module, configured to obtain the reference text, the reference keyword, the extended text, and the extended keyword as context information of the sample audio;
and the training module is used for training an initial language model based on the phoneme sequence of the sample audio, the semantic information of the sample audio and the context information of the sample audio to obtain a context language model.
In one possible implementation, the first obtaining module is configured to:
acquiring non-stop words in the reference text;
and replacing the non-stop words in the reference text with similar words with semantic similarity higher than a third threshold value with the non-stop words to obtain an expanded text.
In one possible implementation, the second obtaining module is configured to:
embedding the reference keywords to obtain target embedded vectors of the reference keywords;
inquiring target quantity embedded vectors with the closest distance to the target embedded vectors;
and determining the target number of terms corresponding to the inquired target number of embedded vectors as the expanded keywords.
In one possible embodiment, if the target audio is an examination recording, the reference text includes at least one of an examination question or an answer to the examination question.
In one aspect, a computer device is provided, the computer device comprising one or more processors and one or more memories, the one or more memories having stored therein at least one program code, the at least one program code being loaded by the one or more processors and executed to implement an audio processing method or a training method of a language model as described in any one of the possible implementations above.
In one aspect, a storage medium is provided, in which at least one program code is stored, the at least one program code being loaded and executed by a processor to implement the audio processing method or the training method of the language model according to any one of the above possible implementations.
In one aspect, a computer program product or computer program is provided that includes one or more program codes stored in a computer readable storage medium. The one or more program codes can be read by one or more processors of the computer device from a computer-readable storage medium, and the one or more processors execute the one or more program codes, so that the computer device can execute the audio processing method or the language model training method of any one of the above-mentioned possible embodiments.
The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:
after the phoneme sequence of the target audio is obtained, a first word sequence is obtained based on conventional speech recognition; in addition, context information is introduced to perform speech recognition again under a specific context, yielding a second word sequence that matches the context information of that specific context. The first word sequence and the second word sequence are considered together to decode the final semantic information. Introducing the second word sequence strengthens the occurrence probability, in the semantic information, of words that conform to the specific context, reduces misjudgment of key words in the semantic information, improves the accuracy of the automatic speech recognition process, and thereby improves the accuracy of the audio processing process.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can derive other drawings from these drawings without creative effort.
Fig. 1 is a schematic diagram of an implementation environment of an audio processing method according to an embodiment of the present application;
fig. 2 is a flowchart of an audio processing method provided in an embodiment of the present application;
FIG. 3 is a schematic flow chart of an ASR system provided by an embodiment of the present application;
FIG. 4 is a flowchart of a method for training a language model according to an embodiment of the present disclosure;
fig. 5 is a block diagram of a process for obtaining an extended text according to an embodiment of the present application;
fig. 6 is a flowchart of an audio processing method provided in an embodiment of the present application;
fig. 7 is a schematic diagram illustrating input and output results of an audio processing method according to an embodiment of the present application;
fig. 8 is a schematic diagram illustrating input and output results of an audio processing method according to an embodiment of the present application;
fig. 9 is a schematic diagram illustrating input and output results of an audio processing method according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present application;
FIG. 11 is a schematic structural diagram of an apparatus for training a language model according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The terms "first," "second," and the like in this application are used for distinguishing between similar items and items that have substantially the same function or similar functionality, and it should be understood that "first," "second," and "nth" do not have any logical or temporal dependency or limitation on the number or order of execution.
The term "at least one" in this application means one or more, and the meaning of "a plurality" means two or more, for example, a plurality of first locations means two or more first locations.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include audio processing technology, computer vision technology, natural language processing technology, and machine learning/deep learning.
The development direction of future human-computer interaction is to enable computers to listen, see, speak, and feel. Audio processing technology (also called speech processing technology) is regarded as one of the most promising modes of human-computer interaction in the future, and specifically includes Automatic Speech Recognition (ASR), speech synthesis (Text To Speech, TTS, also called text-to-speech technology), speech separation, voiceprint recognition, and the like.
With the development of the AI technology, research and application of the audio processing technology have been developed in a plurality of fields, such as common intelligent voice assistants, voice shopping systems, intelligent speakers, voice front-end processing on vehicle-mounted or television boxes, voice recognition products, voiceprint recognition products, and the like.
The embodiment of the application relates to an ASR technology in the field of audio processing, in particular to a technology for converting human voice into text. ASR technology is a multidisciplinary intersection field that is tightly coupled to numerous disciplines, such as acoustics, phonetics, linguistics, digital signal processing theory, information theory, computer science, and the like. Due to the diversity and complexity of audio signals, ASR systems can only achieve satisfactory performance under certain constraints, or can only be used in certain specific situations.
In an exemplary scenario, in an intelligent scoring system for spoken-language examinations, ASR is generally used as the system entry. After the examinee's audio is recognized and converted into text by ASR, the recognized text can be scored intelligently using rule-based techniques (such as keyword matching), machine learning, Natural Language Processing (NLP), and the like. Therefore, the accuracy of ASR is key to the entire intelligent scoring system for spoken-language examinations.
When the traditional ASR performs speech recognition, an audio signal is used as input, characteristics such as waveform, frequency spectrum and tone in the audio signal are extracted, the extracted characteristics are input into an ASR model which is trained by using a large amount of audio data, a final alternative text list is obtained by decoding, and a final recognition text is screened out through a scoring mechanism of the alternative text.
However, in a spoken-language examination scenario, since the score points of a question are often controlled using keywords, accurate recognition of the keywords corresponding to the score points in the examinee's audio is very important. For example, for the question "How do you go home after school?", if the ASR model incorrectly recognizes the examinee's audio "Walking home" as "Working home", the misrecognition of the keyword "Walking" means that the examinee's correct answer cannot be scored, resulting in an incorrect evaluation of the examinee's performance.
In view of this, an embodiment of the present application provides an audio processing method, which generates corresponding extended texts and extended keywords based on reference texts and reference keywords in a context associated with an audio by using an NLP technique, takes the reference texts, the reference keywords, the extended texts, and the extended keywords as context information, so as to train a targeted context language model, and applies the context language model to an ASR recognition technique, so as to improve recognition accuracy of some speech keywords, thereby improving accuracy of an ASR speech recognition process, that is, improving accuracy of an audio processing process.
Fig. 1 is a schematic diagram of an implementation environment of an audio processing method according to an embodiment of the present application. Referring to fig. 1, the implementation environment includes a terminal 101 and a server 102, and the terminal 101 and the server 102 are both computer devices.
The terminal 101 is provided with an audio signal acquisition component for acquiring an audio signal of a speaker, for example, the acquisition component is a recording component such as a microphone, or the terminal 101 downloads a segment of audio file and decodes the audio file to obtain the audio signal.
In some embodiments, after the terminal 101 acquires the target audio to be processed through the acquisition component, the target audio is sent to the server 102, and the server 102 performs ASR processing on the target audio, for example, after the server 102 preprocesses the target audio, an acoustic model is called to separate a phoneme sequence of the preprocessed target audio, a basic language model is called to recognize a first word sequence from the phoneme sequence, a context language model is called to recognize a second word sequence from the phoneme sequence, then the first word sequence and the second word sequence are synthesized to perform speech decoding, and final semantic information is determined based on a scoring mechanism.
The terminal 101 and the server 102 can be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
The server 102 may be configured to process audio signals, and the server 102 may include at least one of a server, a plurality of servers, a cloud computing platform, or a virtualization center. Alternatively, the server 102 may undertake primary computational tasks and the terminal 101 may undertake secondary computational tasks; or, the server 102 undertakes the secondary computing work, and the terminal 101 undertakes the primary computing work; alternatively, the terminal 101 and the server 102 perform cooperative computing by using a distributed computing architecture.
Optionally, the terminal 101 generally refers to one of a plurality of terminals, and the device type of the terminal 101 includes but is not limited to: at least one of a smart phone, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop portable computer, or a desktop computer. The following embodiments are illustrated with the terminal being a desktop computer.
Optionally, the server 102 is an independent physical server, or a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud service, cloud database, cloud computing, cloud function, cloud storage, web service, cloud communication, middleware service, domain name service, security service, Content Delivery Network (CDN), big data and artificial intelligence platform, and the like. Optionally, the terminal is a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, etc., but is not limited thereto.
Those skilled in the art will appreciate that the number of terminals 101 described above can be more or less. For example, the number of the terminals 101 is only one, or the number of the terminals 101 is tens or hundreds, or more. The number and the device type of the terminals 101 are not limited in the embodiment of the present application.
In an exemplary scenario, taking a spoken-language examination as an example, an examinee records a target audio on a terminal 101 in the examination room. After the terminal 101 finishes collecting the target audio, it uploads the target audio to the server 102 in the background. The server 102 performs scoring through an intelligent scoring system for spoken-language examinations: the target audio is first input to the ASR part, which performs speech recognition on the target audio to obtain semantic information; the semantic information can then be scored based on rules, machine learning, NLP, and similar techniques, and finally the examinee's spoken-language examination score is output.
Fig. 2 is a flowchart of an audio processing method according to an embodiment of the present application. Referring to fig. 2, the embodiment may be applied to a computer device, and the following description takes the computer device as a server as an example, and the embodiment includes the following steps.
201. The server acquires a phoneme sequence representing the pronunciation order of the syllables in the target audio.
The target audio refers to an audio signal to be processed in the ASR speech recognition process.
In some embodiments, the server obtains a target audio, preprocesses the target audio, inputs the preprocessed target audio into an Acoustic Model (AM), and extracts the phoneme sequence through the acoustic model, where the acoustic model is used for converting an input audio signal into a phoneme sequence in syllable pronunciation order.
Optionally, the server receives a target audio sent by the terminal, where the target audio is an audio recorded by the user on the terminal, for example, the target audio is a spoken test audio recorded by an examinee on the terminal, or the target audio is a voice instruction input by the user when the user requests a song by voice. In one example, a user triggers an audio acquisition instruction in an application program on the terminal, the terminal operating system calls a recording interface in response to the audio acquisition instruction, drives an acquisition component of an audio signal to acquire target audio in the form of audio stream, and uploads the target audio to the server after the acquisition is completed.
Optionally, the server reads a segment of audio from the local database as the target audio, or the server downloads a segment of audio from the cloud database as the target audio, and the embodiment of the application does not specifically limit the obtaining mode of the target audio.
In some embodiments, the server preprocesses the target audio as follows: first, Voice Activity Detection (VAD, also called voice endpoint detection) is performed on the target audio to identify the portions whose signal energy is lower than an energy threshold (colloquially called "silence periods"), and these silence periods are deleted from the target audio to obtain a first audio; next, pre-emphasis is applied to the first audio to enhance its high-frequency components and obtain a second audio, which prevents the high-frequency components from being lost through signal attenuation and thus improves the signal-to-noise ratio; then, the second audio is windowed by a window function (such as a Hamming window, a Hanning window, or a rectangular window) and framed into a plurality of audio frames; next, a Short-Time Fourier Transform (STFT) is performed on the audio frames of the second audio to convert them from the time domain to the frequency domain, obtaining a third audio; then, frequency components of the third audio that do not match human auditory perception are filtered out by a Mel filter bank, and a Mel nonlinear spectrum of the target audio is output; the logarithm of the Mel nonlinear spectrum is taken to obtain a log spectrum of the target audio; finally, a Discrete Cosine Transform (DCT) is performed on the log spectrum to obtain a cepstrum of the target audio, and Mel Frequency Cepstral Coefficients (MFCCs) of the target audio are extracted based on this cepstrum. The MFCCs of the target audio can be used as the feature vector of the target audio to represent the preprocessed target audio.
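For illustration only, the following is a minimal Python sketch of such a preprocessing pipeline, assuming the librosa library is available; the file path, sampling rate, and frame parameters are illustrative assumptions rather than values taken from this disclosure.

```python
import librosa

def extract_mfcc(audio_path, sr=16000, n_mfcc=13):
    # Load the target audio and trim low-energy segments, a simple
    # energy-based stand-in for voice activity detection (silence removal).
    y, sr = librosa.load(audio_path, sr=sr)
    y, _ = librosa.effects.trim(y, top_db=30)

    # Pre-emphasis enhances the high-frequency components of the first audio.
    y = librosa.effects.preemphasis(y, coef=0.97)

    # Windowing, STFT, Mel filter bank, log, and DCT are bundled inside
    # librosa.feature.mfcc; a 25 ms Hamming window with a 10 ms hop is assumed.
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc,
        n_fft=400, hop_length=160, window="hamming",
    )
    return mfcc.T  # one n_mfcc-dimensional feature vector per audio frame
```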
The foregoing description only takes extracting the MFCCs of the target audio during preprocessing as an example; in some embodiments, the preprocessing may instead extract Linear Prediction Cepstral Coefficients (LPCC) of the target audio, and the embodiments of the present application do not specifically limit which audio features are extracted during preprocessing. By preprocessing the target audio, the waveform of each audio frame is turned into a multi-dimensional vector containing sound information, so that the acoustic model can extract the phoneme sequence.
After the target audio is preprocessed, the preprocessed target audio (namely, the MFCC feature vector) is input into an acoustic model, the score of the MFCC vector on the acoustic feature is calculated by the acoustic model, and a phoneme sequence of the target audio is output. Optionally, the acoustic model is trained from a large amount of audio data, the input of the acoustic model is a feature vector of the target audio, and the output is a phoneme sequence of the target audio, for example, the acoustic model includes but is not limited to: hidden Markov Models (HMMs), Deep Neural Networks (DNNs), Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and the like, and the embodiments of the present application do not specifically limit the Model structure of the acoustic Model.
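As a sketch of this step (not the structure actually used in this disclosure), the acoustic model can be pictured as a small recurrent network that maps per-frame MFCC vectors to per-frame phoneme posteriors, from which a phoneme sequence is read off greedily; the layer sizes and the phoneme inventory size below are assumptions.

```python
import torch
import torch.nn as nn

class FrameAcousticModel(nn.Module):
    """Toy frame-level acoustic model: MFCC frames -> phoneme log-posteriors."""
    def __init__(self, n_mfcc=13, hidden=128, n_phonemes=40):
        super().__init__()
        self.rnn = nn.GRU(n_mfcc, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_phonemes)

    def forward(self, frames):                   # frames: (batch, time, n_mfcc)
        h, _ = self.rnn(frames)
        return self.out(h).log_softmax(dim=-1)   # (batch, time, n_phonemes)

def greedy_phoneme_sequence(log_probs, id_to_phoneme):
    # Pick the most likely phoneme per frame and collapse consecutive repeats,
    # a crude stand-in for the decoding performed by a full acoustic model.
    ids = log_probs.argmax(dim=-1).squeeze(0).tolist()
    sequence, previous = [], None
    for i in ids:
        if i != previous:
            sequence.append(id_to_phoneme[i])
        previous = i
    return sequence
```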
202. The server obtains a first word sequence matched with the phoneme sequence based on the phoneme sequence.
Optionally, the server invokes a basic language model to process the phoneme sequence and output the first word sequence, where the basic language model is used for converting the input phoneme sequence into a pronunciation-consistent first word sequence. For example, the basic language model includes but is not limited to: an N-gram Language Model (N-gram LM), a Markov N-gram model, an exponential model, a decision tree model, an RNN, and the like; in particular, N-gram language models are further divided into binary language models (Bi-gram), ternary language models (Tri-gram), and so on. The embodiment of the present application does not specifically limit the model structure of the basic language model.
In some embodiments, the server trains the initial language model based on the sample data to obtain a basic language model, where the basic language model is a language model independent of context and is obtained as the ASR system as a whole (including the acoustic model and the basic language model) is trained, and after the ASR system is trained, the basic language model is not changed with different context information.
203. The server obtains a second word sequence matched with the phoneme sequence and the context information based on the context information of the target audio and the phoneme sequence, wherein the context information is used for representing the context associated with the target audio.
Optionally, the server invokes a context language model to process the phoneme sequence and output the second word sequence, where the context language model is configured to convert the input phoneme sequence into the second word sequence matching the context information. For example, the context language model includes but is not limited to: an N-gram model, a Markov N-gram model, an exponential model, a decision tree model, an RNN, and the like; in particular, the N-gram model is further divided into a Bi-gram, a Tri-gram, and so on. The embodiment of the present application does not specifically limit the model structure of the context language model.
In some embodiments, the server trains the initial language model based on the phoneme sequence of the sample audio, semantic information of the sample audio, and context information of the sample audio to obtain a context language model, wherein the context information of the sample audio is the same as the context information of the target audio, that is, the sample audio and the target audio are associated with the same context, where the context language model refers to a language model related to the context, and a context language model can be specifically constructed by using the context information, so as to improve ASR recognition accuracy under a specific context.
Optionally, the context information includes a reference text, a reference keyword, an extended text, and an extended keyword, where the reference text refers to the context related to the generation context of the sample audio, the reference keyword refers to one or more keywords extracted from the reference text, the extended text refers to a text obtained by extending the reference text by using an NLP technique, and the extended keyword refers to a word obtained by extending the reference keyword by using an NLP technique.
Illustratively, in a spoken-language examination scenario, the reference text includes the test questions and the reference answers, the reference keywords are the scoring points in the reference answers, the expanded text is obtained by expanding the test questions and the reference answers using NLP technology, and the expanded keywords are synonyms or near-synonyms of the reference keywords.
Illustratively, in a voice ordering scenario, the reference text includes the menu information and the historical human-machine conversation of previous rounds, the reference keywords are dishes and portions, the expanded text is obtained by expanding the menu information and the historical human-machine conversation using NLP technology, and the expanded keywords are synonyms or near-synonyms of the reference keywords.
In this process, different context language models can be trained for different context information, so that the audio decoding process finally tends to output semantic information conforming to the context information. For example, a different context language model is trained for each spoken-test question while the same basic language model is used, which improves the accuracy of the ASR process.
204. The server determines semantic information corresponding to the target audio based on the first word sequence and the second word sequence.
In this process, during audio decoding the server considers both the first word sequence output by the basic language model and the second word sequence output by the context language model, superimposes the output results of the two different language models, and increases the path weights of words appearing in the context information. As a result, the decoding algorithm tends to output words conforming to the context information when computing the optimal word-sequence path, thereby improving the accuracy of the determined semantic information.
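The superposition of the two models can be illustrated by the following sketch, which rescores candidate texts with a weighted sum of the two language-model log-probabilities; the interpolation weight and the two scoring callables are assumptions made for illustration, not an interface defined by this disclosure.

```python
def rescore_candidates(candidates, base_lm_logprob, context_lm_logprob, weight=0.5):
    """Pick the best candidate text under a weighted combination of the
    basic language model and the context language model.

    candidates:          candidate texts built from the first and second word sequences
    base_lm_logprob:     callable returning log P(text) under the basic LM
    context_lm_logprob:  callable returning log P(text) under the context LM
    weight:              interpolation weight for the context LM (illustrative value)
    """
    scored = []
    for text in candidates:
        score = (1 - weight) * base_lm_logprob(text) + weight * context_lm_logprob(text)
        scored.append((score, text))
    # The highest-scoring candidate is taken as the semantic information.
    return max(scored, key=lambda pair: pair[0])[1]
```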
Fig. 3 is a schematic flowchart of an ASR system according to an embodiment of the present application. Referring to fig. 3, in the ASR system the input speech is encoded (including the preprocessing process); during decoding, the phoneme sequence of the input speech is obtained by the acoustic model 301, speech recognition is performed separately by the basic language model 302 (lm_base) and the context language model 303 (lm_context), and the first word sequence and the second word sequence are output. The basic language model 302 is trained together with the acoustic model 301, and its training data comes from a large number of Internet audio samples, whereas the context language model 303 is trained specifically for the current context, and its training data comes from the context information of the sample audio. For example, in a spoken-language examination scenario, the test question, the reference answer, and the reference keywords are obtained first, then expanded, and finally the information before and after expansion is used as the context information, so that a highly context-specific context language model 303 can be trained. It can be understood that before the ASR system recognizes a new question, a new context language model 303 needs to be trained for that question, but the acoustic model 301 and the basic language model 302 do not need to be retrained.
All the above optional technical solutions can be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.
According to the method provided by the embodiment of the application, after the phoneme sequence of the target audio is obtained, a first word sequence is obtained based on conventional speech recognition; in addition, context information is introduced to perform speech recognition again under the specific context, yielding a second word sequence that matches the context information of that specific context. The first word sequence and the second word sequence are considered together to decode the final semantic information. In other words, introducing the second word sequence strengthens the occurrence probability, in the semantic information, of words that conform to the specific context, reduces misjudgment of key words in the semantic information, improves the accuracy of the automatic speech recognition process, and thereby improves the accuracy of the audio processing process.
Fig. 4 is a flowchart of a method for training a language model according to an embodiment of the present application. Referring to fig. 4, the method is applied to a computer device, and the following description takes the computer device as a server as an example. This embodiment describes the training process of the acoustic model, the basic language model, and the context language model in the ASR system, and includes the following steps.
401. And the server performs combined training on the initial language model and the initial acoustic model based on the sample data to respectively obtain a basic language model and an acoustic model.
Optionally, the initial language model includes, but is not limited to: an N-gram model, a Markov N-gram model, an exponential model, a decision tree model, an RNN, and the like; the model structure of the initial language model is not specifically limited in the embodiments of the present application.
Optionally, the initial acoustic model includes, but is not limited to: HMM model, DNN, CNN, RNN, etc., and the embodiment of the present application does not specifically limit the model structure of the initial acoustic model.
In this process, the server extracts a plurality of sample audios from an audio library, and a plurality of sample texts corresponding to the sample audios can be labelled manually. The sample audios are used as input signals and fed in sequence into the initial acoustic model and the initial language model, and the initial language model predicts a plurality of corresponding predicted texts. A loss function value for this round of training is calculated from the errors between the sample texts and the predicted texts. In response to the loss function value being smaller than or equal to a loss threshold, or the number of iterations being greater than or equal to an iteration threshold, training is stopped and the basic language model and the acoustic model are obtained; otherwise, in response to the loss function value being greater than the loss threshold and the number of iterations being smaller than the iteration threshold, the next round of training is executed iteratively.
402. The server obtains a reference text of the sample audio and a reference keyword in the reference text, wherein the reference text is context text related to the generation context of the sample audio.
The reference text is a context text related to the context in which the sample audio is generated. In an exemplary scenario, if the sample audio is an examination recording, the reference text includes at least one of a test question or an answer to the test question; for example, in a spoken-language examination scenario, the reference text includes the test question (as the preceding text) and a reference answer (as the following text). In another exemplary scenario, if the sample audio is human-machine conversation speech, the reference text includes one or more rounds of historical human-machine conversation information; for example, in a voice question-and-answer scenario, the reference text refers to the previous N (N ≥ 1) rounds of human-machine conversation content.
The reference keywords are core words corresponding to the reference text that distil its main content or central idea. For example, in a spoken-language examination scenario, the reference keywords are the scoring points in the reference answers; in a voice question-and-answer scenario, the reference keywords are semantic tags extracted from the previous N rounds of human-machine conversation content.
In the above process, the user enters the reference text and the reference keywords on the server; or the user enters them on a management device and the server imports them from the management device; or the user enters only the reference text on the server, and the server extracts the reference keywords automatically based on a machine learning model, for example by extracting the words with the highest word frequency or the semantic tags with the highest semantic similarity.
403. And the server acquires the expanded text with semantic similarity higher than a first threshold value with the reference text based on the reference text.
In some embodiments, the server obtains the non-stop word in the reference text, and then replaces the non-stop word in the reference text with a similar word whose semantic similarity with the non-stop word is higher than a third threshold value, so as to obtain an expanded text.
A non-stop word is a word in the reference text that is not a stop word (Stopword). A stop word is a word that, when encountered during text processing, is immediately skipped and discarded by the server. Stop words can be customized by the user on the server; after the stop words are set, a stop word list is generated so that the server can filter out the non-stop words in the reference text.
Optionally, the server traverses each word in the reference text; for any word, if the word hits any entry in the stop word list, the word is discarded; otherwise, if the word does not hit any entry in the stop word list, the word is determined to be a non-stop word. These steps are then performed iteratively on the next word in the reference text.
Optionally, for any non-stop word, when obtaining an expanded text the server replaces the non-stop word with one of its near-synonyms to obtain the expanded text. Optionally, a new round of non-stop words is then selected from the expanded text and replaced with near-synonyms to generate further expanded texts. Optionally, the number of non-stop words replaced in each round is one or more, that is, multiple non-stop words can be replaced at once to generate a new expanded text. It should be noted that, because expanded texts can be generated in many ways (for example, replacing multiple non-stop words at once, or replacing a single non-stop word per round over multiple rounds), duplicate expanded texts may be produced, and the expanded texts need to be de-duplicated to delete the duplicates.
In some embodiments, when determining near-synonyms of a non-stop word, the server maps each word in the word library with a word vector model to obtain its vector representation (called a word vector or embedding vector) in an embedding space, determines the embedding vector of the non-stop word in the embedding space, queries the N (N ≥ 1) words closest to the embedding vector of the non-stop word as N near-synonyms, and replaces the non-stop word with each of the N near-synonyms, thereby obtaining N expanded texts at once. For example, N is typically 3, 4, or 5, or is specified according to experimental results. Optionally, the word vector model includes GloVe, Word2Vec, and other embedding-based models.
Optionally, "closest" above means that the Euclidean distance is smallest or that the cosine similarity is highest. For example, for any non-stop word, the Euclidean distances between the remaining words in the word library and the non-stop word are computed, the remaining words are sorted in ascending order of Euclidean distance, and the top N words are selected as the N near-synonyms of the non-stop word. For another example, for any non-stop word, the cosine similarities between the remaining words in the word library and the non-stop word are computed, the remaining words are sorted in descending order of cosine similarity, and the top N words are selected as the N near-synonyms of the non-stop word.
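A minimal sketch of such a nearest-word query by cosine similarity is given below; the embedding matrix and vocabulary list are assumed to come from a pre-trained word vector model such as Word2Vec or GloVe.

```python
import numpy as np

def nearest_words(word, embeddings, vocab, n=3):
    """Return the n words whose embedding vectors are closest to `word`
    by cosine similarity; `embeddings` is a (V, d) matrix and `vocab` the
    corresponding list of V words."""
    idx = vocab.index(word)
    query = embeddings[idx]
    # Cosine similarity between the query vector and every word vector.
    sims = embeddings @ query / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query) + 1e-8
    )
    order = np.argsort(-sims)
    # Skip the word itself and keep the top-n neighbours as near-synonyms.
    return [vocab[i] for i in order if i != idx][:n]
```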
In one example, assuming that the reference text is "By bike", where both "By" and "bike" are non-stop words, the word library is searched and the word closest to "By" is found to be "Ride" while the word closest to "bike" is "bicycle", so two expanded texts are obtained: "By bicycle" and "Ride bike". Similarly, if a non-stop word in the reference text is "walking", it can be replaced with the near-synonym "walk".
Fig. 5 is a block diagram of a process 500 for obtaining expanded text according to an embodiment of the present application, which shows the processing logic for generating a plurality of expanded texts by loop traversal. Taking a spoken-language examination scenario as an example: first, the test question and the reference answer are prepared as reference texts, a sentence list sens of the reference texts is created, the sentence index is initialized to i = 0, and the expanded-sentence list is initialized to expandsens = []. Then, while i < len(sens) (the sentence index is smaller than the length of the sentence list), let i += 1, tokenize the i-th sentence sens[i] to obtain its words, and initialize the word index j = 0. While j < len(words) (the word index is smaller than the number of words in the i-th sentence), let j += 1; if the j-th word word[j] is a non-stop word (if word[j] not a stop word), find a semantically similar word word[j]' for word[j], replace word[j] with word[j]' to obtain a new sentence sens[i]', and add the expanded new sentence sens[i]' to the expanded-text sentence list expandsens.
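A Python sketch of the traversal described above is shown below; the `near_synonym` callable (for example, the nearest-word query sketched earlier) and the stop-word set are assumptions supplied by the caller.

```python
def expand_sentences(sentences, stop_words, near_synonym):
    """Generate expanded sentences by replacing each non-stop word with a
    semantically similar word, following the loop of FIG. 5.

    sentences:     reference-text sentences (test question and reference answers)
    stop_words:    set of stop words that are left untouched
    near_synonym:  callable mapping a word to a similar word, or None
    """
    expanded = []
    for sentence in sentences:                  # i < len(sens)
        words = sentence.split()                # tokenize sens[i]
        for j, word in enumerate(words):        # j < len(words)
            if word.lower() in stop_words:      # only replace non-stop words
                continue
            similar = near_synonym(word)
            if similar is None or similar == word:
                continue
            new_words = words[:j] + [similar] + words[j + 1:]
            expanded.append(" ".join(new_words))
    # De-duplicate: different replacements can produce the same sentence.
    return list(dict.fromkeys(expanded))
```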
In some embodiments, when performing text expansion, the server can also cover the non-stop word to be replaced in the current round with a MASK and then use a language model to predict near-synonyms that are semantically similar to the non-stop word, which likewise achieves the purpose of text expansion. For example, the language model includes BERT (Bidirectional Encoder Representations from Transformers), NNLM (Neural Network Language Model), ELMo (Embeddings from Language Models), and the like.
In some embodiments, when performing text expansion, the server collects a large amount of text data from the relevant context. After specific context information is given, corpus data related to the context information is screened from the collected text data by a machine learning or deep learning method such as LDA (Latent Dirichlet Allocation, a document topic generation model), and the screened corpus data is then used as expanded text to participate in the training of the context language model. For example, in a spoken-language examination scenario, a large amount of text data from daily spoken conversation, spoken-language examinations, and daily spoken-language practice is collected; after the test question and the reference answer of a particular spoken-language examination are given, text classification is performed by an LDA model, the corpus data most relevant to that examination is screened from the collected daily text data, and this corpus data is used as expanded text in the training of the context language model.
404. And the server acquires the expanded keywords with the semantic similarity higher than a second threshold value with the reference keywords based on the reference keywords.
In some embodiments, the server performs embedding processing on the reference keyword to obtain a target embedded vector of the reference keyword, queries a target number of embedded vectors closest to the target embedded vector, and determines a target number of terms corresponding to the queried target number of embedded vectors as the extended keyword. The process of querying the expanded keywords is similar to the process of querying the synonyms of the non-stop words in the reference text in step 403, and is not described herein again.
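For illustration, assuming pre-trained word vectors loaded with gensim (the vector file name is hypothetical), the expanded keywords can be obtained as the nearest neighbours of each reference keyword in the embedding space:

```python
from gensim.models import KeyedVectors

# Hypothetical path to pre-trained vectors in word2vec format.
vectors = KeyedVectors.load_word2vec_format("word_vectors.bin", binary=True)

def expand_keywords(reference_keywords, n=3):
    expanded = {}
    for keyword in reference_keywords:
        if keyword in vectors:
            # The n nearest embedding vectors give the expanded keywords.
            expanded[keyword] = [w for w, _ in vectors.most_similar(keyword, topn=n)]
        else:
            expanded[keyword] = []
    return expanded

# Example: expand_keywords(["walking", "bike"]) might return near-synonyms such
# as {"walking": ["walk", ...], "bike": ["bicycle", ...]}, depending on the vectors.
```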
In some embodiments, the server stores a corresponding similar-word list for each word in the word library; for any reference keyword, the server then queries the stored similar-word list corresponding to that keyword and takes one or more similar words from the list as its expanded keywords, which simplifies the processing logic of obtaining the expanded keywords.
405. And the server acquires the reference text, the reference keywords, the expanded text and the expanded keywords as the context information of the sample audio.
In this process, the server not only obtains the readily available reference text and reference keywords, but also fully expands them using NLP technology to obtain the expanded text and expanded keywords. This greatly enriches the context information that can be collected, widens the word coverage of the context information, and increases the sample size for training the context language model, which helps improve the accuracy of the context language model and avoids missing cases where synonyms appear.
In some embodiments, the server directly uses the reference text and the reference keywords as the context information and performs the following step 406 to train the context language model, which simplifies the otherwise tedious model training process.
406. And the server trains the initial language model based on the phoneme sequence of the sample audio, the semantic information of the sample audio and the context information of the sample audio to obtain the context language model.
Wherein the context language model is used for converting the input phoneme sequence into a second word sequence matched with the context information.
Optionally, the initial language model includes, but is not limited to: an N-gram model, a Markov N-gram model, an exponential model, a decision tree model, an RNN, and the like; the model structure of the initial language model is not specifically limited in the embodiments of the present application. It should be noted that the initial language model in step 406 and the initial language model in step 401 may be language models with the same structure or with different structures; the embodiment of the present application does not specifically limit whether language models with the same structure are used, for example both may be N-gram models.
In some embodiments, taking a sample audio that conforms to the context information (e.g., the audio of an examinee who answered correctly) as an example, the sample audio is input as an input signal into the trained acoustic model to obtain the phoneme sequence of the sample audio. The phoneme sequence of the sample audio is then input into the initial language model, which predicts, for each phoneme in the phoneme sequence, the probability that it corresponds to each word in the given context information; the word with the highest probability for each phoneme is extracted to form a sequence. The error between the word sequence output by the model and the manually labelled semantic information of the sample audio is obtained, and the loss function value of this round of training is calculated based on the error. In response to the loss function value being less than or equal to a loss threshold, or the number of iterations being greater than or equal to an iteration threshold, training is stopped and the context language model is obtained; otherwise, in response to the loss function value being greater than the loss threshold and the number of iterations being smaller than the iteration threshold, the next round of training is executed iteratively.
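The stopping rule described above can be sketched as the following toy training loop; the loss threshold, iteration budget, model, and optimizer are all illustrative assumptions, not values from this disclosure.

```python
import torch.nn as nn

def train_context_lm(model, optimizer, phoneme_batches, target_batches,
                     loss_threshold=0.05, max_iterations=10000):
    """Toy training loop for the context language model.

    model:            initial language model mapping phoneme-id sequences to
                      per-position distributions over the context vocabulary
    phoneme_batches:  iterable of phoneme-id tensors from the sample audio
    target_batches:   iterable of word-id tensors from the labelled semantic
                      information of the sample audio
    """
    criterion = nn.CrossEntropyLoss()
    iteration = 0
    while iteration < max_iterations:
        for phonemes, targets in zip(phoneme_batches, target_batches):
            logits = model(phonemes)                        # (batch, time, vocab)
            loss = criterion(logits.reshape(-1, logits.size(-1)),
                             targets.reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            iteration += 1
            # Stop when the loss reaches the threshold or the iteration budget
            # is exhausted; otherwise continue with the next round of training.
            if loss.item() <= loss_threshold or iteration >= max_iterations:
                return model
    return model
```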
In an exemplary scenario, taking a spoken language examination as an example, assume that the question-and-answer topic is "How do you go home after school" and the reference answers include: "I go home by bus", "I go home by bike", "I go home on foot" and "I walk home". By expanding the question-and-answer topic in the above step 403, the expanded texts "How do you go home" and "How do you get home" are obtained; by expanding the reference answers, the expanded texts "I come home by bus", "I come home by bike", "I come home on foot", "I come home" and "I ride home" are obtained.
Then, the original question-and-answer topic "How do you go home after school", the original reference answers "I go home by bus", "I go home by bike", "I go home on foot" and "I walk home", and the expanded texts "How do you go home", "How do you get home", "I come home by bus", "I come home by bike", "I come home on foot", "I come home" and "I ride home" are used together to train a context-dependent N-gram language model.
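For illustration only, the following Python sketch shows how a simple context-dependent bigram (2-gram) model could be estimated by counting over the reference texts and expanded texts of this example; smoothing, sentence-boundary symbols and the coupling with the acoustic model are omitted, and the corpus shown is just the example above.

```python
from collections import defaultdict

# Sketch: build bigram counts from the context texts of the spoken-exam example.
corpus = [
    "How do you go home after school",
    "I go home by bus", "I go home by bike", "I go home on foot", "I walk home",
    "How do you go home", "How do you get home",
    "I come home by bus", "I come home by bike", "I come home on foot",
    "I come home", "I ride home",
]

bigram_counts = defaultdict(lambda: defaultdict(int))
unigram_counts = defaultdict(int)
for sentence in corpus:
    words = sentence.lower().split()
    for prev, curr in zip(words[:-1], words[1:]):
        bigram_counts[prev][curr] += 1
        unigram_counts[prev] += 1

def bigram_prob(prev, curr):
    """P(curr | prev) estimated from the context corpus."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[prev][curr] / unigram_counts[prev]

print(bigram_prob("go", "home"))   # high: "go home" is frequent in this context
print(bigram_prob("go", "work"))   # 0.0: not part of the context information
```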
All the above optional technical solutions can be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.
In the embodiment of the application, an acoustic model and a context-independent basic language model are trained from a large amount of audio data; these two models can be migrated to any context or scene and have good universality and generalization. A personalized context language model is then trained separately for the context information of the current sample audio, which improves the accuracy of speech recognition under a specific context while retaining the generalization capability. By comprehensively using the acoustic model, the basic language model and the context language model in the ASR system, the accuracy of the ASR system, and hence of audio processing, can be greatly improved. Moreover, for a brand-new context or scene, only a new context language model needs to be trained with a small amount of computation, without retraining a new acoustic model or a new basic language model, so the ASR system has extremely high portability while keeping high accuracy.
Fig. 6 is a flowchart of an audio processing method according to an embodiment of the present application; please refer to fig. 6. The method is applied to a computer device; taking the computer device being a server as an example, the following describes how the acoustic model, the basic language model and the context language model obtained by training in the foregoing embodiments are applied to audio processing. The embodiment includes the following steps.
601. The server acquires a target audio.
The target audio refers to an audio signal to be processed in the ASR speech recognition process.
Optionally, the server receives a target audio sent by the terminal, where the target audio is an audio recorded by the user on the terminal, for example, the target audio is a spoken test audio recorded by an examinee on the terminal, or the target audio is a voice instruction input by the user when the user requests a song by voice.
In one example, a user triggers an audio acquisition instruction in an application program on the terminal; in response to the instruction, the terminal operating system calls a recording interface, drives the audio acquisition component to record the target audio as an audio stream, and uploads the target audio to the server after the acquisition is completed.
Optionally, the server reads a segment of audio from the local database as the target audio, or the server downloads a segment of audio from the cloud database as the target audio, and the embodiment of the application does not specifically limit the obtaining mode of the target audio.
602. And the server calls the acoustic model to process the target audio to acquire a phoneme sequence used for expressing the pronunciation sequence of the syllables in the target audio.
In the above process, the server preprocesses the target audio, inputs the preprocessed target audio into an acoustic model, and extracts the phoneme sequence through the acoustic model, where the acoustic model is used for converting the input audio signal into a phoneme sequence representing the pronunciation order of its syllables.
In some embodiments, the server preprocesses the target audio as follows: performing VAD detection on the target audio, identifying the parts of the target audio (commonly called "silence segments") whose signal energy is lower than an energy threshold, and deleting these silence segments from the target audio to obtain a first audio; performing pre-emphasis on the first audio to enhance its high-frequency components and obtain a second audio, where pre-emphasis prevents the high-frequency components from being lost due to signal attenuation and thus improves the signal-to-noise ratio; windowing the second audio with a window function (such as a Hamming window, a Hanning window or a rectangular window) to frame it into a plurality of audio frames; performing STFT on the audio frames of the second audio to convert them from the time domain to the frequency domain and obtain a third audio; filtering out, through a Mel filter bank, the frequency components in the third audio that do not match human auditory perception, and outputting the Mel nonlinear spectrum of the target audio; taking the logarithm of the Mel nonlinear spectrum to obtain the log spectrum of the target audio; and finally performing DCT (Discrete Cosine Transform) on the log spectrum to obtain the cepstrum of the target audio, and extracting the MFCC (Mel Frequency Cepstrum Coefficient) vector of the target audio based on the cepstrum. The MFCC vector of the target audio can be used as the feature vector of the target audio to represent the preprocessed target audio.
The foregoing process takes extracting the MFCC vector of the target audio during preprocessing as an example. In some embodiments, the preprocessing can instead extract the LPCC (Linear Prediction Cepstrum Coefficient) vector of the target audio; the embodiment of the present application does not specifically limit which audio features are extracted during preprocessing. By preprocessing the target audio, the waveform of each audio frame is converted into a multi-dimensional vector containing sound information, so that the acoustic model can extract the phoneme sequence from it.
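As a rough illustration of the preprocessing pipeline described above (pre-emphasis, framing, windowing, STFT, Mel filtering, logarithm and DCT), the following Python sketch computes MFCC vectors with NumPy and SciPy; the frame length, number of Mel filters and other parameters are assumed values rather than the parameters used by the embodiment, and the VAD step is assumed to have been performed beforehand.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_features(signal, sr=16000, frame_len=0.025, frame_step=0.010,
                  n_fft=512, n_mels=26, n_mfcc=13, energy_floor=1e-10):
    """Sketch of the MFCC pipeline; assumes the signal is at least one frame long
    and that silence removal (VAD) has already been applied."""
    # 1. Pre-emphasis: boost high-frequency components.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # 2. Framing and Hamming windowing.
    flen, fstep = int(sr * frame_len), int(sr * frame_step)
    n_frames = 1 + max(0, (len(emphasized) - flen) // fstep)
    idx = np.arange(flen)[None, :] + fstep * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(flen)

    # 3. Short-time Fourier transform -> power spectrum.
    power = (np.abs(np.fft.rfft(frames, n_fft)) ** 2) / n_fft

    # 4. Mel filter bank (triangular filters spaced on the Mel scale).
    low, high = 0.0, 2595 * np.log10(1 + (sr / 2) / 700)
    mel_pts = np.linspace(low, high, n_mels + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    mel_spec = np.maximum(power @ fbank.T, energy_floor)

    # 5. Log spectrum, then DCT to get the cepstrum; keep the first n_mfcc coefficients.
    log_mel = np.log(mel_spec)
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_mfcc]
```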
After the target audio is preprocessed, the preprocessed target audio (namely, the MFCC feature vector) is input into the acoustic model; the acoustic model scores the MFCC vector on its acoustic features and outputs the phoneme sequence of the target audio. Optionally, the acoustic model is trained from a large amount of audio data; its input is the feature vector of the target audio and its output is the phoneme sequence of the target audio. For example, the acoustic model includes but is not limited to an HMM model, a DNN, a CNN, an RNN, and the like; the embodiment of the present application does not specifically limit the model structure of the acoustic model.
603. And the server calls a basic language model to process the phoneme sequence and outputs a first word sequence matched with the phoneme sequence.
Wherein the basic language model is used for converting the input phoneme sequence into a first word sequence whose pronunciation is consistent with it. The model structure of the basic language model is not specifically limited in the embodiment of the application.
In the above process, the server acquires the first word sequence matching the phoneme sequence based on the phoneme sequence. Specifically, the phoneme sequence includes at least one phoneme. Taking any phoneme in the phoneme sequence as an example, the basic language model predicts the probability that this phoneme matches each of a plurality of words in the word library, and the word with the highest probability is added to the position in the first word sequence corresponding to this phoneme; this process is repeated to obtain the first word sequence matching the phoneme sequence. Here, "matching" means that the pronunciation of the phoneme is consistent with the pronunciation of the word.
In some embodiments, for each phoneme in the phoneme sequence, the basic language model retains the N words whose predicted probabilities rank in the top N (N ≥ 1), so that the words at each position in the first word sequence are represented in the form of an N-tuple array; this is not specifically limited by the embodiment of the present application.
604. And the server calls a context language model to process the phoneme sequence and outputs a second word sequence matched with the phoneme sequence and the context information of the target audio.
Wherein the context information is used to represent a context associated with the target audio.
Optionally, the context information includes a reference text, reference keywords, expanded texts and expanded keywords. Illustratively, in a spoken language examination scene, the reference text includes the test question and a reference answer, the reference keywords are the scoring points in the reference answer, the expanded texts are texts obtained by expanding the test question and the reference answer with NLP techniques, and the expanded keywords are synonyms or near-synonyms of the reference keywords.
Wherein the context language model is used for converting the input phoneme sequence into a second word sequence matched with the context information. The context language model refers to a context-related language model, and a context language model can be specifically constructed by using context information so as to improve the accuracy of ASR recognition under a specific context.
Optionally, the context language model includes, but is not limited to: an N-gram model, a Markov N-gram model, an exponential model, a decision tree model, an RNN, and the like. Specifically, the N-gram model is further divided into Bi-gram, Tri-gram, and so on. The embodiment of the application does not specifically limit the model structure of the context language model.
In some embodiments, the server inputs at least one phoneme in the phoneme sequence into the context language model, and obtains a plurality of matching probabilities through the context language model, wherein one matching probability is used for indicating the matching degree between one phoneme and one alternative word in the context information; and determining a sequence formed by at least one alternative word with the maximum matching probability with the at least one phoneme as the second word sequence. The above alternative words refer to any word in the context information, such as a word in the reference text, a word in the expanded text, a reference keyword, or an expanded keyword, which are all called alternative words.
In some embodiments, for each phoneme in the phoneme sequence, the context language model retains the N candidate words whose matching probabilities rank in the top N (N ≥ 1), so that the candidate words at each position in the second word sequence are represented in the form of an N-tuple array; this is not specifically limited by the embodiment of the present application.
In the above process, the server obtains, based on the context information of the target audio and the phoneme sequence, a second word sequence matching both the phoneme sequence and the context information. The second word sequence tends to match the alternative words in the context information, so the context information of the target audio is fully utilized when processing the phoneme sequence; this prevents phonemes with similar pronunciations from being recognized as homophones belonging to other contexts and can improve the accuracy of speech recognition.
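As a rough sketch of step 604, the following Python snippet restricts the candidate vocabulary to the alternative words drawn from the context information and keeps, for each phoneme, the alternative words with the highest matching probability; the match_prob scoring function and the structure of the context_info dictionary are hypothetical stand-ins for the context language model, not the embodiment's actual implementation.

```python
import heapq

def second_word_sequence(phonemes, context_info, match_prob, n=1):
    """Sketch: pick, for each phoneme, the top-n alternative words drawn from
    the context information (reference texts, expanded texts, keywords)."""
    # Alternative words = every word appearing anywhere in the context information.
    alternative_words = set()
    for text in context_info["reference_texts"] + context_info["expanded_texts"]:
        alternative_words.update(text.lower().split())
    alternative_words.update(w.lower() for w in context_info["reference_keywords"])
    alternative_words.update(w.lower() for w in context_info["expanded_keywords"])

    sequence = []
    for phoneme in phonemes:
        scored = ((match_prob(phoneme, word), word) for word in alternative_words)
        top_n = heapq.nlargest(n, scored)              # [(prob, word), ...] best first
        sequence.append([word for _, word in top_n])   # N-tuple of candidates per position
    return sequence
```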
605. The server determines semantic information corresponding to the target audio based on the first word sequence and the second word sequence.
In some embodiments, the server determines a plurality of candidate texts based on the first word sequence and the second word sequence, wherein one candidate text is used for representing a combination of candidate words in the first word sequence and the second word sequence; and scoring the plurality of candidate texts, and determining the candidate text with the highest score as the semantic information.
In the above process, an alternative text is a new word sequence obtained by arranging and combining alternative words from the first word sequence with alternative words from the second word sequence according to the original word order. It can be understood that the alternative texts include the first word sequence itself and the second word sequence itself, and also include new word sequences obtained by recombining words from the first word sequence and the second word sequence.
In one example, if the first word sequence is "Working home" and the second word sequence is "Walking the road", then arranging and combining the words in the first word sequence with the words in the second word sequence yields alternative texts that include at least: "Working the road" and "Walking home".
In the process of scoring the multiple alternative texts, optionally, path planning is performed with the dynamic programming algorithm Viterbi. For words that appear in both the first word sequence and the second word sequence, the path weights of these words are increased; this raises the path weights of the alternative words coming from the reference text, the expanded text, the reference keywords and the expanded keywords, and thereby increases the probability that these alternative words appear in the finally determined semantic information.
The above process is the process of decoding the speech using the output results of the acoustic model and the two language models. Decoding is a core component of the ASR system: the trained decoder decodes the target audio to obtain the word string with the highest probability (i.e., the semantic information). The core algorithm of the decoder is the dynamic programming algorithm Viterbi; during scoring with the Viterbi algorithm, the decoder tends to output, in the semantic information, words that conform to the context information.
In the audio decoding process, the server comprehensively considers the first word sequence output by the basic language model and the second word sequence output by the context language model, superimposes the output results of the two different language models, and increases the path weights of certain words in the context information. Therefore, when the decoding algorithm computes the optimal word-sequence path, it tends to output words that conform to the context information, which improves the accuracy of the determined semantic information.
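As an illustration of this decoding idea, the sketch below performs a Viterbi-style dynamic-programming search over per-position candidate words and adds an extra path weight to words that come from the context information; the bigram_score function is a hypothetical stand-in for the language-model transition score, not the decoder of the embodiment.

```python
# Rough sketch: Viterbi-style search over per-position candidates from the first
# and second word sequences, with a bonus weight for context-information words.
def decode(candidates_per_position, context_words, bigram_score, context_bonus=1.0):
    # candidates_per_position: list of lists, e.g. [["working", "walking"], ["home", "the road"]]
    best = {word: (context_bonus if word in context_words else 0.0, [word])
            for word in candidates_per_position[0]}
    for position in candidates_per_position[1:]:
        new_best = {}
        for word in position:
            bonus = context_bonus if word in context_words else 0.0
            # Extend the best previous path for each candidate at this position.
            score, path = max(
                (prev_score + bigram_score(prev_word, word) + bonus, prev_path + [word])
                for prev_word, (prev_score, prev_path) in best.items()
            )
            new_best[word] = (score, path)
        best = new_best
    return max(best.values())[1]   # word sequence on the highest-scoring path
```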
In an exemplary scenario, such as a spoken language examination, the content of the examinee's answer is very likely to be related to the test question and the reference answer, so increasing the output probability of these alternative words can effectively reduce the Word Error Rate (WER, the proportion of incorrectly recognized words among all words) of the ASR system. By improving the recognition accuracy of the ASR system and reducing its WER, the intelligent marking system can give correct scores based on correct ASR recognition results (semantic information), which improves the reliability and stability of the whole intelligent spoken-language-examination marking system.
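For reference, the WER mentioned above can be computed as the word-level edit distance between the recognized text and the reference transcript divided by the number of reference words, as in the following sketch.

```python
# Sketch of the Word Error Rate (WER) metric: edit distance (substitutions +
# deletions + insertions) over words, divided by the number of reference words.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("i go home by bus", "i go home by bike"))  # 0.2 (one substitution)
```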
Fig. 7 is a schematic diagram of input and output results of an audio processing method according to an embodiment of the present application; please refer to 700, which shows the ASR recognition result for a short-text reading question type in a spoken language examination. Fig. 8 is a schematic diagram of input and output results of an audio processing method according to an embodiment of the present application; please refer to 800, which shows the ASR recognition result for a question-and-answer question type in a spoken language examination. Fig. 9 is a schematic diagram of input and output results of an audio processing method according to an embodiment of the present application; please refer to 900, which shows the ASR recognition result for a semi-open question type in a spoken language examination. It can be seen that the audio processing method provided by the embodiment of the present application can be flexibly applied to various question types in a spoken language examination; in addition, question types such as scenario questions and picture descriptions can also be covered by the ASR system. Moreover, both an API (Application Programming Interface) access version and a privatized deployment version can be provided externally to meet the marking requirements of scenarios with different security requirements. For example, in spoken language examinations with a high security level, since the data is strictly confidential, the intelligent marking system should adopt a privatized deployment scheme, be deployed in the machine room of the examination site, and complete the scoring process offline; for spoken language examinations with a lower security level, the intelligent marking system deployed in the cloud can be accessed directly through the online API, avoiding the complicated process of privatized deployment.
All the above optional technical solutions can be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.
According to the method provided by the embodiment of the application, after the phoneme sequence of the target audio is obtained, a first word sequence is obtained based on the traditional speech recognition approach. In addition, context information is introduced to perform speech recognition again under the specific context, so as to obtain a second word sequence that conforms to the context information of that specific context. The first word sequence and the second word sequence are then considered together to plan and decode the final semantic information. Introducing the second word sequence strengthens the occurrence probability, in the semantic information, of words that conform to the specific context, reduces the misjudgment of certain keywords in the semantic information, and improves the accuracy of the automatic speech recognition process, thereby improving the accuracy of the audio processing process.
Fig. 10 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present application, please refer to fig. 10, where the apparatus includes:
a first obtaining module 1001, configured to obtain a phoneme sequence used for representing a pronunciation order of syllables in a target audio;
a second obtaining module 1002, configured to obtain, based on the phoneme sequence, a first word sequence matched with the phoneme sequence;
a third obtaining module 1003, configured to obtain, based on the context information of the target audio and the phoneme sequence, a second word sequence matching both the phoneme sequence and the context information, where the context information is used to represent a context associated with the target audio;
a determining module 1004, configured to determine semantic information corresponding to the target audio based on the first word sequence and the second word sequence.
According to the apparatus provided by the embodiment of the application, after the phoneme sequence of the target audio is obtained, a first word sequence is obtained based on the traditional speech recognition approach. In addition, context information is introduced to perform speech recognition again under the specific context, so as to obtain a second word sequence that conforms to the context information of that specific context. The first word sequence and the second word sequence are then considered together to plan and decode the final semantic information. Introducing the second word sequence strengthens the occurrence probability, in the semantic information, of words that conform to the specific context, reduces the misjudgment of certain keywords in the semantic information, and improves the accuracy of the automatic speech recognition process, thereby improving the accuracy of the audio processing process.
In a possible implementation manner, based on the apparatus composition of fig. 10, the third obtaining module 1003 includes:
and the processing unit is used for calling a context language model to process the phoneme sequence and output the second word sequence, and the context language model is used for converting the input phoneme sequence into the second word sequence matched with the context information.
In one possible embodiment, the processing unit is configured to:
inputting at least one phoneme in the phoneme sequence into the context language model, and acquiring a plurality of matching probabilities through the context language model, wherein one matching probability is used for expressing the matching degree between one phoneme and one alternative word in the context information;
and determining a sequence formed by at least one alternative word with the maximum matching probability with the at least one phoneme as the second word sequence.
In one possible implementation, the determining module 1004 is configured to:
determining a plurality of alternative texts based on the first word sequence and the second word sequence, wherein one alternative text is used for representing a combination condition of alternative words in the first word sequence and the second word sequence;
and scoring the plurality of candidate texts, and determining the candidate text with the highest score as the semantic information.
In a possible implementation, the second obtaining module 1002 is configured to:
and calling a basic language model to process the phoneme sequence and output the first word sequence, wherein the basic language model is used for converting the input phoneme sequence into the pronunciation-consistent first word sequence.
All the above optional technical solutions can be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.
It should be noted that: in the audio processing apparatus provided in the above embodiment, when processing audio, only the division of the above functional modules is taken as an example, and in practical applications, the above functions can be distributed by different functional modules as needed, that is, the internal structure of the computer device is divided into different functional modules to complete all or part of the above described functions. In addition, the audio processing apparatus and the audio processing method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in detail in the audio processing method embodiments and are not described herein again.
Fig. 11 is a schematic structural diagram of a training apparatus for a language model according to an embodiment of the present application, please refer to fig. 11, the apparatus includes:
a first obtaining module 1101, configured to obtain, based on a reference text of a sample audio, an extended text with semantic similarity higher than a first threshold with the reference text, where the reference text is a context text that is context-related to generation of the sample audio;
a second obtaining module 1102, configured to obtain, based on a reference keyword in the reference text, an extended keyword with semantic similarity higher than a second threshold with the reference keyword;
a third obtaining module 1103, configured to obtain the reference text, the reference keyword, the expanded text, and the expanded keyword as context information of the sample audio;
the training module 1104 is configured to train the initial language model based on the phoneme sequence of the sample audio, the semantic information of the sample audio, and the context information of the sample audio, so as to obtain a context language model.
The apparatus provided by the embodiment of the application trains a personalized context language model for the context information of the sample audio, which can improve the accuracy of speech recognition under a specific context. By comprehensively using the acoustic model, the basic language model and the context language model in the ASR system, the accuracy of the ASR system, and hence of audio processing, can be greatly improved. Moreover, for a brand-new context or scene, only a new context language model needs to be trained with a small amount of computation, without retraining a new acoustic model or a new basic language model, so the ASR system has extremely high portability while keeping high accuracy.
In a possible implementation, the first obtaining module 1101 is configured to:
acquiring non-stop words in the reference text;
and replacing the non-stop word in the reference text with a similar word with the semantic similarity higher than a third threshold value with the non-stop word to obtain an expanded text.
In a possible implementation, the second obtaining module 1102 is configured to:
embedding the reference keyword to obtain a target embedded vector of the reference keyword;
querying a target number of embedded vectors that are closest in distance to the target embedded vector;
and determining the target number of terms corresponding to the queried embedded vectors as the expanded keywords.
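As an illustration of this keyword-expansion idea, the following sketch embeds the reference keyword via a pre-built embedding table and returns the target number of nearest words as expanded keywords; the embeddings dictionary and the use of Euclidean distance are assumptions for illustration, not the embedding model of the embodiment.

```python
import numpy as np

# Sketch of keyword expansion by embedding lookup: take the `target_number`
# words whose vectors are nearest to the reference keyword's vector.
# `embeddings` is assumed to be a pre-built dict mapping each word in the word
# library to its vector; it is an illustrative stand-in for the embedding model.
def expand_keyword(reference_keyword, embeddings, target_number=5):
    query = embeddings[reference_keyword]
    distances = []
    for word, vector in embeddings.items():
        if word == reference_keyword:
            continue
        distances.append((np.linalg.norm(vector - query), word))  # Euclidean distance
    distances.sort()                        # smallest distance first
    return [word for _, word in distances[:target_number]]
```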
In one possible embodiment, if the target audio is an examination recording, the reference text includes at least one of an examination question or an answer to the examination question.
All the above optional technical solutions can be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.
It should be noted that: in the training apparatus for a language model according to the above embodiments, when training a language model, only the division of the functional modules is exemplified, and in practical applications, the functions can be distributed by different functional modules as needed, that is, the internal structure of the computer device can be divided into different functional modules to complete all or part of the functions described above. In addition, the training device of the language model and the training method of the language model provided in the above embodiments belong to the same concept, and the specific implementation process is detailed in the embodiment of the training method of the language model, and is not described herein again.
Fig. 12 is a schematic structural diagram of a computer device 1200, where the computer device 1200 may have a relatively large difference due to different configurations or performances, and the computer device 1200 includes one or more processors (CPUs) 1201 and one or more memories 1202, where the memory 1202 stores at least one program code, and the at least one program code is loaded and executed by the processors 1201 to implement the audio Processing method or the language model training method provided by the above embodiments. Optionally, the computer device 1200 further has a wired or wireless network interface, a keyboard, an input/output interface, and other components to facilitate input and output, and the computer device 1200 further includes other components for implementing the device functions, which are not described herein again.
In an exemplary embodiment, there is also provided a computer readable storage medium, such as a memory including at least one program code, which is executable by a processor in a terminal to perform the audio processing method or the training method of a language model in the above embodiments. For example, the computer-readable storage medium includes a ROM (Read-Only Memory), a RAM (Random-Access Memory), a CD-ROM (Compact Disc Read-Only Memory), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product or computer program is also provided, comprising one or more program codes, the one or more program codes being stored in a computer readable storage medium. The one or more program codes can be read by one or more processors of the computer device from a computer-readable storage medium, and the one or more processors execute the one or more program codes, so that the computer device can execute to perform the audio processing method or the training method of the language model in the above-described embodiments.
Those skilled in the art will appreciate that all or part of the steps for implementing the above embodiments can be implemented by hardware, or can be implemented by a program instructing relevant hardware, and optionally, the program is stored in a computer readable storage medium, and optionally, the above mentioned storage medium is a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (15)

1. A method of audio processing, the method comprising:
acquiring a phoneme sequence used for representing the pronunciation sequence of the syllables in the target audio;
acquiring a first word sequence matched with the phoneme sequence based on the phoneme sequence;
acquiring a second word sequence matched with the phoneme sequence and the context information based on the context information of the target audio and the phoneme sequence, wherein the context information is used for representing the context associated with the target audio;
and determining semantic information corresponding to the target audio based on the first word sequence and the second word sequence.
2. The method of claim 1, wherein the obtaining a second word sequence matching both the phoneme sequence and the context information based on the context information of the target audio and the phoneme sequence comprises:
and calling a context language model to process the phoneme sequence and output the second word sequence, wherein the context language model is used for converting the input phoneme sequence into the second word sequence matched with the context information.
3. The method of claim 2, wherein said invoking the context language model to process the sequence of phonemes and outputting the second sequence of words comprises:
inputting at least one phoneme in the phoneme sequence into the context language model, and acquiring a plurality of matching probabilities through the context language model, wherein one matching probability is used for expressing the matching degree between one phoneme and one alternative word in the context information;
and determining a sequence formed by at least one alternative word with the maximum matching probability with the at least one phoneme as the second word sequence.
4. The method of claim 1, wherein determining semantic information corresponding to the target audio based on the first word sequence and the second word sequence comprises:
determining a plurality of alternative texts based on the first word sequence and the second word sequence, wherein one alternative text is used for representing a combination condition of alternative words in the first word sequence and the second word sequence;
and scoring the plurality of candidate texts, and determining the candidate text with the highest score as the semantic information.
5. The method of claim 1, wherein obtaining the first sequence of words matching the sequence of phonemes based on the sequence of phonemes comprises:
and calling a basic language model to process the phoneme sequence and output the first word sequence, wherein the basic language model is used for converting the input phoneme sequence into the first word sequence with consistent pronunciation.
6. A method for training a language model, the method comprising:
based on a reference text of a sample audio, acquiring an extended text with semantic similarity higher than a first threshold value with the reference text, wherein the reference text is a context text related to the generation context of the sample audio;
acquiring extended keywords with semantic similarity higher than a second threshold value with the reference keywords based on the reference keywords in the reference text;
acquiring the reference text, the reference keywords, the extended text and the extended keywords as context information of the sample audio;
training an initial language model based on the phoneme sequence of the sample audio, the semantic information of the sample audio and the context information of the sample audio to obtain a context language model.
7. The method according to claim 6, wherein the obtaining of the extended text with semantic similarity higher than a first threshold with the reference text based on the reference text of the sample audio comprises:
acquiring non-stop words in the reference text;
and replacing the non-stop words in the reference text with similar words with semantic similarity higher than a third threshold value with the non-stop words to obtain an expanded text.
8. The method according to claim 6, wherein the obtaining, based on the reference keywords in the reference text, of the expanded keywords with semantic similarity higher than a second threshold with the reference keywords comprises:
embedding the reference keywords to obtain target embedded vectors of the reference keywords;
querying a target number of embedded vectors with the closest distance to the target embedded vector;
and determining the target number of terms corresponding to the queried embedded vectors as the expanded keywords.
9. The method of claim 6, wherein if the sample audio is a test recording, the reference text comprises at least one of a test question or a test question answer.
10. An audio processing system, comprising a terminal and a server;
the terminal is used for sending a target audio to the server;
the server is used for acquiring a phoneme sequence used for representing the pronunciation sequence of the syllables in the target audio; acquiring a first word sequence matched with the phoneme sequence based on the phoneme sequence; acquiring a second word sequence matched with the phoneme sequence and the context information based on the context information of the target audio and the phoneme sequence, wherein the context information is used for representing the context associated with the target audio; and determining semantic information corresponding to the target audio based on the first word sequence and the second word sequence.
11. An audio processing apparatus, characterized in that the apparatus comprises:
the system comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring a phoneme sequence used for expressing the pronunciation sequence of the syllables in the target audio;
the second acquisition module is used for acquiring a first word sequence matched with the phoneme sequence based on the phoneme sequence;
a third obtaining module, configured to obtain a second word sequence that matches both the phoneme sequence and the context information based on context information of the target audio and the phoneme sequence, where the context information is used to represent a context associated with the target audio;
and the determining module is used for determining semantic information corresponding to the target audio based on the first word sequence and the second word sequence.
12. The apparatus of claim 11, wherein the third obtaining module comprises:
and the processing unit is used for calling a context language model to process the phoneme sequence and outputting the second word sequence, and the context language model is used for converting the input phoneme sequence into the second word sequence matched with the context information.
13. An apparatus for training a language model, the apparatus comprising:
the system comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring an extended text of which the semantic similarity with a reference text of sample audio is higher than a first threshold value on the basis of the reference text of the sample audio, and the reference text is context text related to the generation context of the sample audio;
the second acquisition module is used for acquiring expanded keywords with semantic similarity higher than a second threshold value with the reference keywords based on the reference keywords in the reference text;
a third obtaining module, configured to obtain the reference text, the reference keyword, the extended text, and the extended keyword as context information of the sample audio;
and the training module is used for training an initial language model based on the phoneme sequence of the sample audio, the semantic information of the sample audio and the context information of the sample audio to obtain a context language model.
14. A computer device, characterized in that the computer device comprises one or more processors and one or more memories having stored therein at least one program code, the at least one program code being loaded and executed by the one or more processors to implement the audio processing method of any one of claims 1 to 5; or to implement a method of training a language model as claimed in any one of claims 6 to 9.
15. A storage medium having stored therein at least one program code, the at least one program code being loaded and executed by a processor to implement the audio processing method of any one of claims 1 to 5; or to implement a method of training a language model as claimed in any one of claims 6 to 9.
CN202010952838.6A 2020-09-11 2020-09-11 Audio processing method, language model training method and device and computer equipment Active CN111933129B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010952838.6A CN111933129B (en) 2020-09-11 2020-09-11 Audio processing method, language model training method and device and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010952838.6A CN111933129B (en) 2020-09-11 2020-09-11 Audio processing method, language model training method and device and computer equipment

Publications (2)

Publication Number Publication Date
CN111933129A true CN111933129A (en) 2020-11-13
CN111933129B CN111933129B (en) 2021-01-05

Family

ID=73309367

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010952838.6A Active CN111933129B (en) 2020-09-11 2020-09-11 Audio processing method, language model training method and device and computer equipment

Country Status (1)

Country Link
CN (1) CN111933129B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112102812A (en) * 2020-11-19 2020-12-18 成都启英泰伦科技有限公司 Anti-false-wake-up method based on multiple acoustic models and voice recognition module
CN112597340A (en) * 2020-12-23 2021-04-02 杭州艾耕科技有限公司 Extraction method of short video ASR text keywords in vertical field, computer equipment and readable storage medium
CN112988965A (en) * 2021-03-01 2021-06-18 腾讯科技(深圳)有限公司 Text data processing method and device, storage medium and computer equipment
CN113035200A (en) * 2021-03-03 2021-06-25 科大讯飞股份有限公司 Voice recognition error correction method, device and equipment based on human-computer interaction scene
CN113609264A (en) * 2021-06-28 2021-11-05 国网北京市电力公司 Data query method and device for power system nodes
WO2022142011A1 (en) * 2020-12-30 2022-07-07 平安科技(深圳)有限公司 Method and device for address recognition, computer device, and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103577386A (en) * 2012-08-06 2014-02-12 腾讯科技(深圳)有限公司 Method and device for dynamically loading language model based on user input scene
CN105654946A (en) * 2014-12-02 2016-06-08 三星电子株式会社 Method and apparatus for speech recognition
CN105759983A (en) * 2009-03-30 2016-07-13 触摸式有限公司 System and method for inputting text into electronic devices
US9911409B2 (en) * 2015-07-23 2018-03-06 Samsung Electronics Co., Ltd. Speech recognition apparatus and method
US9934777B1 (en) * 2016-07-01 2018-04-03 Amazon Technologies, Inc. Customized speech processing language models
CN108140019A (en) * 2015-10-09 2018-06-08 三菱电机株式会社 Language model generating means, language model generation method and its program, speech recognition equipment and audio recognition method and its program
CN108242235A (en) * 2016-12-23 2018-07-03 三星电子株式会社 Electronic equipment and its audio recognition method
CN111128175A (en) * 2020-01-19 2020-05-08 大连即时智能科技有限公司 Spoken language dialogue management method and system
US10692493B2 (en) * 2018-05-01 2020-06-23 Dell Products, L.P. Intelligent assistance using voice services


Also Published As

Publication number Publication date
CN111933129B (en) 2021-01-05

Similar Documents

Publication Publication Date Title
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
CN110491382B (en) Speech recognition method and device based on artificial intelligence and speech interaction equipment
WO2017076222A1 (en) Speech recognition method and apparatus
JP2019514045A (en) Speaker verification method and system
US11475881B2 (en) Deep multi-channel acoustic modeling
US11380330B2 (en) Conversational recovery for voice user interface
WO2021051544A1 (en) Voice recognition method and device
CN109686383B (en) Voice analysis method, device and storage medium
US10872601B1 (en) Natural language processing
US20210312914A1 (en) Speech recognition using dialog history
CN112349289A (en) Voice recognition method, device, equipment and storage medium
Moyal et al. Phonetic search methods for large speech databases
Kumar et al. A comprehensive review of recent automatic speech summarization and keyword identification techniques
Kumar et al. Machine learning based speech emotions recognition system
Mary et al. Searching speech databases: features, techniques and evaluation measures
Xu et al. A Comprehensive Survey of Automated Audio Captioning
CN113763992A (en) Voice evaluation method and device, computer equipment and storage medium
CN112530408A (en) Method, apparatus, electronic device, and medium for recognizing speech
CN113421593A (en) Voice evaluation method and device, computer equipment and storage medium
CN112151015A (en) Keyword detection method and device, electronic equipment and storage medium
CN112397053B (en) Voice recognition method and device, electronic equipment and readable storage medium
Roy et al. Bengali consonants-voice to text conversion using machine learning tool
US11328713B1 (en) On-device contextual understanding
CN113555006B (en) Voice information identification method and device, electronic equipment and storage medium
JP7146038B2 (en) Speech recognition system and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant