CN115223562A - Voice recognition method and device - Google Patents

Voice recognition method and device Download PDF

Info

Publication number
CN115223562A
Authority
CN
China
Prior art keywords
accent
recognized
voice signal
training
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110405365.2A
Other languages
Chinese (zh)
Inventor
张晓明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Wuhan Co Ltd
Original Assignee
Tencent Technology Wuhan Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Wuhan Co Ltd filed Critical Tencent Technology Wuhan Co Ltd
Priority to CN202110405365.2A priority Critical patent/CN115223562A/en
Publication of CN115223562A publication Critical patent/CN115223562A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the present application relate to the field of artificial intelligence and disclose a voice recognition method and device. The method includes: acquiring a voice signal to be recognized; performing accent feature extraction on the voice signal to be recognized to obtain an accent vector corresponding to the voice signal to be recognized, where the accent vector reflects the accent characteristics of the user who produced the voice signal to be recognized; and calling a voice recognition model to perform voice recognition based on the voice signal to be recognized and its corresponding accent vector, obtaining a recognition result corresponding to the voice signal to be recognized. With the embodiments of the present application, the accuracy of voice recognition for accented voice signals can be improved.

Description

Voice recognition method and device
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a speech recognition method and apparatus.
Background
Speech recognition, also known as Automatic Speech Recognition (ASR), is an important research direction in the field of artificial intelligence, whose aim is to convert human speech into text that a computer can understand. In speech recognition application scenarios, a speech recognition device recognizes the speech input by a user; however, owing to geographic regions and individual differences, some speech input by users carries an accent, and users within a certain geographic area generally share the same or a similar accent. When recognizing accented speech, misrecognition easily occurs and the accuracy of speech recognition is low. Therefore, in the field of speech recognition, how to improve the accuracy of speech recognition for accented speech is one of the important problems in current research.
Disclosure of Invention
The embodiment of the application provides a voice recognition method and device, which can improve the accuracy of voice recognition of a voice signal with accent.
In one aspect, an embodiment of the present application provides a speech recognition method, including:
acquiring a voice signal to be recognized;
performing accent feature extraction processing on the voice signal to be recognized to obtain an accent vector corresponding to the voice signal to be recognized, wherein the accent vector corresponding to the voice signal to be recognized is used for reflecting accent features of a user generating the voice signal to be recognized;
and calling a voice recognition model to perform voice recognition processing based on the voice signal to be recognized and the accent vector corresponding to the voice signal to be recognized, so as to obtain a recognition result corresponding to the voice signal to be recognized.
In one aspect, an embodiment of the present application provides a speech recognition apparatus, including:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a voice signal to be recognized;
the processing unit is used for carrying out accent feature extraction processing on the voice signal to be recognized to obtain an accent vector corresponding to the voice signal to be recognized, and the accent vector corresponding to the voice signal to be recognized is used for reflecting the accent feature of a user generating the voice signal to be recognized;
the processing unit is further configured to call a voice recognition model to perform voice recognition processing based on the voice signal to be recognized and the accent vector corresponding to the voice signal to be recognized, so as to obtain a recognition result corresponding to the voice signal to be recognized.
In one aspect, an embodiment of the present application provides a speech recognition device, where the speech recognition device includes an input interface and an output interface, and further includes:
a processor adapted to implement one or more instructions; and the number of the first and second groups,
a computer storage medium having stored thereon one or more instructions adapted to be loaded by the processor and to execute the speech recognition method described above.
In one aspect, an embodiment of the present application provides a computer storage medium storing computer program instructions which, when executed by a processor, perform the foregoing speech recognition method.
In one aspect, embodiments of the present application provide a computer program product or a computer program, which includes computer instructions stored in a computer-readable storage medium; the processor of the speech recognition device reads the computer instructions from the computer-readable storage medium and executes them, thereby performing the speech recognition method described above.
In the embodiments of the present application, after acquiring the voice signal to be recognized, the voice recognition device performs accent feature extraction on it to obtain the corresponding accent vector, which reflects the accent characteristics of the user who produced the voice signal to be recognized; it then calls the voice recognition model to perform voice recognition based on the voice signal to be recognized and its accent vector, obtaining the corresponding recognition result. During recognition, the voice recognition model works on both the voice signal to be recognized and the accent vector; because the accent vector is derived from the voice signal to be recognized and reflects the accent characteristics of its speaker, the accent features are used to assist recognition of the voice signal, which improves speech recognition accuracy.
Drawings
To more clearly illustrate the technical solutions in the embodiments of the present application or in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a schematic structural diagram of a speech recognition system according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a speech recognition method according to an embodiment of the present application;
FIG. 3a is a schematic diagram of a trained accent feature extraction model according to an embodiment of the present application;
FIG. 3b is a schematic diagram of another trained accent feature extraction model provided in the embodiments of the present application;
FIG. 4 is a schematic diagram of an acoustic model provided by an embodiment of the present application;
FIG. 5 is a flow chart of another speech recognition method provided by the embodiments of the present application;
FIG. 6 is a schematic flow chart of a method for obtaining a trained accent feature extraction model according to an embodiment of the present application;
FIG. 7a is a schematic diagram of an accent recognition model according to an embodiment of the present application;
FIG. 7b is a diagram of another accent recognition model provided by embodiments of the present application;
fig. 8 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a speech recognition device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
The embodiments of the present application mainly relate to the field of speech recognition within artificial intelligence; the aim of speech recognition is to convert human speech into text that a computer can understand. In speech recognition application scenarios, accented speech signals often need to be recognized, but at present the recognition accuracy for accented speech signals is low. The embodiments of the present application therefore provide a speech recognition scheme. Specifically, after the speech signal to be recognized is obtained, a corresponding accent vector can be extracted from it, where the accent vector corresponding to the speech signal to be recognized reflects the accent characteristics of the user who produced that signal; a speech recognition model is then called to perform speech recognition based on the speech signal to be recognized and its corresponding accent vector, yielding the recognition result corresponding to the speech signal to be recognized.
The speech recognition scheme may be executed by a speech recognition device. The speech recognition device may be a voice interaction device integrated with a speech recognition service, such as a robot, smartphone, tablet computer, notebook computer, desktop computer, smart television, smart in-vehicle device or smart wearable device; it may also be a server providing the speech recognition service, or any terminal device capable of controlling and managing the voice interaction device. The server may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, Content Delivery Network (CDN), big data and artificial intelligence platforms.
Based on the foregoing speech recognition scheme, an embodiment of the present application provides a speech recognition system, and referring to fig. 1, a schematic structural diagram of the speech recognition system provided in the embodiment of the present application is shown. The speech recognition system shown in fig. 1 may comprise a speech recognition device 101 and at least one speech interaction device 102. The voice interaction device 102 may include any one or more of a robot, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart television, a smart car, and a smart wearable device. The voice recognition device 101 and the voice interaction device 102 may be directly or indirectly connected in a wired or wireless communication manner, and the present application is not limited thereto.
In one embodiment, the voice interaction device 102 is integrated with a speech collection device such as a microphone, which can collect the speech signal to be recognized produced by the user. After the voice interaction device 102 acquires the speech signal to be recognized input by the user through the collection device, it sends the speech signal to the speech recognition device 101. After acquiring the speech signal to be recognized, the speech recognition device 101 performs accent feature extraction on it to obtain the corresponding accent vector, which reflects the accent characteristics of the user who produced the speech signal to be recognized; it then calls the speech recognition model to perform speech recognition based on the speech signal to be recognized and its accent vector, obtaining the corresponding recognition result. Optionally, after obtaining the recognition result corresponding to the speech signal to be recognized, the speech recognition device 101 may send the recognition result to the voice interaction device 102, which displays it after receiving it. Optionally, after obtaining the recognition result, the speech recognition device 101 may generate a control instruction according to the recognition result and send the control instruction to the voice interaction device 102, which executes the operation indicated by the control instruction after receiving it. The recognition result may be text.
For example, suppose a social application runs on the voice interaction device 102 and user A sends a voice conversation message to user B through the social application. After receiving the voice conversation message sent by user A, user B may trigger a speech recognition function for that message; the voice interaction device 102 then sends the voice conversation message to the speech recognition device 101 as the speech signal to be recognized. After receiving the speech signal to be recognized, the speech recognition device 101 processes it through the foregoing speech recognition scheme to obtain the text conversation message corresponding to the voice conversation message and returns the text conversation message to the voice interaction device 102; after receiving it, the voice interaction device 102 displays the text conversation message through the social application.
For another example, if the voice interaction device 102 is a smart television, it is assumed that the user a inputs a voice of "please open the animation film 1" (i.e., a voice signal to be recognized) through the smart television; after acquiring a voice signal to be recognized, the smart television sends the voice signal to be recognized to the voice recognition device 101; after receiving the voice signal to be recognized, the voice recognition device 101 processes the voice signal to be recognized through the voice recognition scheme to obtain a text corresponding to the voice signal to be recognized, then generates a control instruction according to the text corresponding to the voice signal to be recognized, and sends the control instruction to the smart television; after the smart television receives the control instruction, the operation indicated by the control instruction is executed, namely the smart television opens the cartoon 1.
For another example, in an application scenario of intelligent navigation, if the voice interaction device 102 is an intelligent vehicle running an intelligent navigation application. Suppose that when a user a requests a navigation service through an intelligent navigation application, the user a inputs a voice of "i want to go to address a" (i.e., a voice signal to be recognized) through the intelligent navigation application; then, the smart car may send the voice signal to be recognized to the voice recognition device 101; after receiving the voice signal to be recognized, the voice recognition device 101 processes the voice signal to be recognized through the voice recognition scheme to obtain a text corresponding to the voice signal to be recognized, then performs address searching and route planning according to the text corresponding to the voice signal to be recognized, sends the planned route to the intelligent vehicle, and displays and navigates through the intelligent vehicle.
In application scenarios involving voice interaction, accurate recognition of the speech signal to be recognized is the basis of voice interaction; however, owing to geographic regions and individual differences, some speech signals input by users may carry accents. Generally, users within a certain geographic area share the same or a similar accent, while users in different geographic areas often have different accents; for example, users in area 1 often have the accent corresponding to area 1, and users in area 2 often have the accent corresponding to area 2. According to the differences between geographic areas and the similarity of accents within a geographic area, speech with the same or similar accents can be grouped into the same dialect, and the geographic area in which that dialect is spoken can be regarded as a dialect area. Taking Chinese as an example, Chinese can be divided into nine major dialect groups: Mandarin (Guanhua), Jin, Gan, Hui, Wu, Xiang, Hakka, Yue (Cantonese) and Min, where the official language is Mandarin; accented Chinese speech can accordingly be classified into Mandarin-accented speech and speech with Jin, Gan, Hui, Wu, Xiang, Hakka, Yue and Min accents. The speech recognition scheme provided by the embodiments of the present application can recognize accented speech signals well.
Based on the voice recognition scheme, the embodiment of the application provides a voice recognition method. Referring to fig. 2, a flow chart of a speech recognition method according to an embodiment of the present application is schematically shown. The speech recognition method shown in fig. 2 may be performed by a speech recognition device. The speech recognition method shown in fig. 2 may include the steps of:
s201, acquiring a voice signal to be recognized.
The speech signal to be recognized may be a speech signal with an accent input by a user, and the speech signal to be recognized may be input by the user in real time, and specifically may be collected by a speech collection device such as a microphone based on the speech recognition device. The speech signal to be recognized can also be generated by the user at any historical moment.
In an embodiment, after acquiring the speech signal to be recognized, the speech recognition device may sample it according to a predetermined sampling rate to obtain the sample values corresponding to the speech signal to be recognized. The predetermined sampling rate is preset and is usually 16000 Hz, i.e., 16000 values are sampled per second; all subsequent processing performed on the speech signal to be recognized is performed on the sample values corresponding to it.
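As a purely illustrative sketch of the sampling step above (the use of the librosa library, the file path and the function name are assumptions, not details taken from the embodiments described here):

```python
# Hypothetical sketch: load the speech signal to be recognized and sample it at the
# predetermined sampling rate (assumed here to be 16000 Hz, i.e. 16000 values per second).
import librosa

PREDETERMINED_SAMPLING_RATE = 16000  # preset, as described above

def acquire_speech_signal(wav_path: str):
    # librosa resamples to the requested rate while loading; "wav_path" is a
    # hypothetical file holding the user's speech signal to be recognized.
    samples, _ = librosa.load(wav_path, sr=PREDETERMINED_SAMPLING_RATE, mono=True)
    return samples  # all subsequent processing operates on these sample values
```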
S202, performing accent feature extraction processing on the voice signal to be recognized to obtain an accent vector corresponding to the voice signal to be recognized.
The accent vector corresponding to the voice signal to be recognized is used for reflecting the accent characteristics of the user generating the voice signal to be recognized.
In one embodiment, the speech recognition device may perform an accent feature extraction process on the speech signal to be recognized through the trained accent feature extraction model to obtain an accent vector corresponding to the speech signal to be recognized. The trained accent feature extraction model is used for extracting an accent vector corresponding to any voice signal, the trained accent feature extraction model can be obtained by performing model adjustment processing on the trained accent recognition model, and the trained accent recognition model can be obtained by training the accent recognition model based on training samples. The method includes training an accent recognition model based on a training sample to obtain a relevant process of the trained accent recognition model, and performing model adjustment processing on the trained accent recognition model to obtain a relevant process of the trained accent feature extraction model, which will be introduced in the following embodiments.
In the specific implementation, the speech recognition device can perform frame-level feature extraction processing on the speech signal to be recognized through a trained accent feature extraction model to obtain frame-level features corresponding to the speech signal to be recognized; carrying out feature aggregation processing on frame-level features corresponding to a voice signal to be recognized to obtain a primary feature vector of a target length; and carrying out accent vector analysis based on the primary feature vector to obtain an accent vector corresponding to the voice signal to be recognized.
In an embodiment, the trained accent feature extraction model may be a Deep Neural Network (DNN) based model, as shown in fig. 3a, which is a schematic diagram of a trained accent feature extraction model provided in an embodiment of the present application, and the trained accent feature extraction model may include an input layer, a hidden layer, and an accent vector layer. Specifically, the voice recognition device can receive a voice signal to be recognized through an input layer in a trained accent feature extraction model and send the voice signal to be recognized to a hidden layer; then, calling a hidden layer to perform frame-level feature extraction processing on the voice signal to be recognized to obtain frame-level features corresponding to the voice signal to be recognized, and performing feature aggregation processing on the frame-level features corresponding to the voice signal to be recognized to obtain a primary feature vector of a target length; and then calling an accent vector layer to perform accent vector analysis based on the primary feature vector to obtain an accent vector corresponding to the voice signal to be recognized. The accent vector layer may be a single-layer accent vector layer or a multi-layer accent vector layer, and both the single-layer accent vector layer and the multi-layer accent vector layer are used for performing accent vector analysis based on the primary feature vector to obtain an accent vector corresponding to the speech signal to be recognized, and the number of layers of the specific accent vector layer is not limited by the embodiment of the present application, for example, as shown in fig. 3b, a schematic diagram of another trained accent feature extraction model provided by the embodiment of the present application is provided, and the accent vector layer in the trained accent feature extraction model is two layers.
When the hidden layer is called to perform frame-level feature extraction on the speech signal to be recognized, each speech frame signal is not processed in isolation: a preset number of speech frame signals before and after each speech frame signal are considered at the same time. That is, when frame-level feature extraction needs to be performed on one speech frame signal, the preset number of speech frame signals before it and the preset number of speech frame signals after it are processed together with it. In this way the temporal continuity of the speech signal to be recognized is taken into account, the frame-level feature extraction is more accurate, and the accent vector subsequently obtained for the speech signal to be recognized is more accurate. The preset number can be set according to different requirements.
Calling the hidden layer to perform feature aggregation on the frame-level features corresponding to the speech signal to be recognized yields a primary feature vector of a target length; that is, the frame-level features corresponding to the speech signal to be recognized are converted into a primary feature vector of a fixed dimension, realizing dimension reduction, and different target lengths can be set according to different requirements. Optionally, the feature aggregation may be performed on the frame-level features corresponding to the speech signal to be recognized through a Pooling layer included in the hidden layer, or through a Statistics Pooling layer included in the hidden layer.
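The following is a minimal, non-authoritative sketch of a trained accent feature extraction model of the kind described above (input layer, frame-level hidden layers, statistics pooling, accent vector layer). The PyTorch framework, the TDNN-style 1-D convolutions used so that each frame-level feature also considers a preset number of neighbouring frames, and all layer sizes are assumptions for illustration only:

```python
# Hypothetical sketch of an accent feature extraction model: input ->
# frame-level hidden layers -> statistics pooling -> accent vector layer.
import torch
import torch.nn as nn

class AccentFeatureExtractor(nn.Module):
    def __init__(self, feat_dim=40, hidden_dim=512, accent_vec_dim=128):
        super().__init__()
        # Frame-level hidden layers: the kernel sizes and dilations make each output
        # frame depend on a preset number of frames before and after it.
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, hidden_dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, dilation=2, padding=2), nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, dilation=3, padding=3), nn.ReLU(),
        )
        # Accent vector layer: maps the pooled (fixed-length) primary feature vector
        # to the accent vector; a single layer is used here for brevity.
        self.accent_vector_layer = nn.Linear(hidden_dim * 2, accent_vec_dim)

    def forward(self, frames):              # frames: (batch, feat_dim, num_frames)
        h = self.frame_layers(frames)       # frame-level features
        # Statistics pooling: aggregate the variable-length frame-level features
        # into a primary feature vector of fixed (target) length.
        primary = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        return self.accent_vector_layer(primary)   # accent vector

# Usage sketch: 200 frames of 40-dimensional acoustic features for one utterance.
accent_vector = AccentFeatureExtractor()(torch.randn(1, 40, 200))
```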
The accent vector corresponding to the speech signal to be recognized, obtained by calling the accent vector layer to perform accent vector analysis based on the primary feature vector, can indicate the features of the different accent categories contained in the speech signal to be recognized; specifically, it indicates the features of the preset accent categories that the trained accent feature extraction model is able to extract. For example, if the preset accent categories of the trained accent feature extraction model are accent a, accent b, accent c, accent d and others, the trained accent feature extraction model can only extract the features of accent a, accent b, accent c and accent d that may be contained in the speech signal to be recognized. If the speech signal to be recognized carries accent e, the trained accent feature extraction model extracts the features of accent e as "other" features, and the accent vector corresponding to that speech signal indicates that it carries more "other" features and fewer features of accent a, accent b, accent c and accent d. If the speech signal to be recognized carries accent b, the corresponding accent vector indicates that it has more features of accent b and fewer features of accent a, accent c, accent d and others. The vector elements of the accent vector correspond to the preset accent categories, and the feature values in the vector can be used to represent the features of each preset accent category contained in the speech signal to be recognized. For example, if the preset accent categories are accent a, accent b, accent c, accent d and others, and the correspondence between vector elements and preset accent categories is {accent a, accent b, accent c, accent d, others}, the accent vector corresponding to the speech signal to be recognized may be {feature value 1, feature value 2, feature value 3, feature value 4, feature value 5}.
S203, calling the voice recognition model to perform voice recognition processing based on the voice signal to be recognized and the accent vector corresponding to the voice signal to be recognized, and obtaining a recognition result corresponding to the voice signal to be recognized.
The recognition result corresponding to the voice signal to be recognized may be a text corresponding to the voice signal to be recognized.
In one embodiment, the speech recognition model may be a hybrid-model (Hybrid) based speech recognition model. A hybrid-model based speech recognition model generally consists of two parts, an acoustic model and a language model. The input to the acoustic model is generally a sequence of speech frames divided according to a fixed time length, where the fixed time length is preset and is generally 10 to 30 milliseconds; the output of the acoustic model is the probability that each speech frame corresponds to each acoustic modeling unit. Taking Chinese speech recognition as an example, the commonly used acoustic modeling units are initials and finals, i.e., the output of the acoustic model is the probability that each speech frame corresponds to each initial and final. Different combinations of acoustic modeling units for the speech frame sequence are then mapped into different candidate texts through a dictionary, i.e., different combinations of initials and finals corresponding to the speech frame sequence are mapped into different candidate texts. The role of the language model is to give linguistic scores for the different candidate texts. Together, the acoustic model and the language model can convert an input speech signal into a series of possible candidate texts, each with a corresponding text probability that jointly considers the language model probability and the acoustic model probability, and finally the candidate text with the maximum text probability is determined as the text corresponding to the speech signal. Accordingly, the speech recognition device calling the speech recognition model to perform speech recognition based on the speech signal to be recognized and its corresponding accent vector, so as to obtain the recognition result, may include: the speech recognition device frames the speech signal to be recognized according to the fixed time length to obtain the framed speech signal to be recognized; processes the framed speech signal and the accent vector corresponding to the speech signal to be recognized through the acoustic model to obtain the probability that each frame corresponds to each acoustic modeling unit; maps different combinations of acoustic modeling units for the framed speech signal into different candidate texts through the dictionary; scores the different candidate texts through the language model to obtain the text probability corresponding to each candidate text; and determines the candidate text with the maximum text probability as the text corresponding to the speech signal to be recognized. The acoustic model may be a deep-neural-network-based model, as shown in fig. 4, which is a schematic diagram of an acoustic model provided in an embodiment of the present application.
For example, if the speech signal to be recognized is an accented speech signal meaning "I love China", the acoustic model may yield two combinations of acoustic modeling units, "w o ai zh ong h u a" and "w o ai z ong h u a", i.e., two combinations of initials and finals; through the dictionary these two combinations may be mapped to candidate text 1 "I love China", candidate text 2 "I love flowers" and candidate text 3 "I love zonghua". If the text probability corresponding to candidate text 1 is the maximum text probability, candidate text 1 "I love China" is taken as the text corresponding to the speech signal to be recognized.
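A toy sketch of the hybrid decoding flow described above is given below; the candidate texts, the probability values and the simple product used to combine acoustic and language scores are all invented for illustration:

```python
# Toy sketch of hybrid decoding: combine acoustic-model and language-model scores
# per candidate text and keep the candidate with the maximum overall probability.
candidates = {
    # candidate text: (acoustic model probability, language model probability)
    "I love China":   (0.40, 0.50),
    "I love flowers": (0.35, 0.10),
    "I love zonghua": (0.25, 0.01),
}

def decode(cands):
    # The text probability jointly considers the acoustic and language model probabilities;
    # a plain product is used here purely as an illustrative combination rule.
    scored = {text: am_p * lm_p for text, (am_p, lm_p) in cands.items()}
    return max(scored, key=scored.get)

print(decode(candidates))  # -> "I love China"
```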
In one embodiment, the speech recognition model may be an end-to-end (End-to-End) speech recognition model. An end-to-end speech recognition model does not separate the acoustic model and the language model; the entire model is built with a deep neural network, and the input speech signal is processed directly to output text. That is, the speech recognition device calls the end-to-end speech recognition model to perform speech recognition on the speech signal to be recognized and its corresponding accent vector, obtaining the text corresponding to the speech signal to be recognized. The end-to-end speech recognition model may include the Listen, Attend and Spell (LAS) model and Transformer-based end-to-end models.
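One plausible way of feeding the accent vector into an end-to-end speech recognition model is sketched below, under the assumption that the utterance-level accent vector is simply broadcast and concatenated to every acoustic frame before the encoder; the text above does not prescribe this exact mechanism, and the shapes are invented:

```python
# Hypothetical sketch: append the utterance-level accent vector to every acoustic
# frame before the end-to-end encoder processes the sequence.
import torch

def augment_with_accent(features, accent_vector):
    # features:      (batch, num_frames, feat_dim)
    # accent_vector: (batch, accent_dim) -- one vector per utterance
    expanded = accent_vector.unsqueeze(1).expand(-1, features.size(1), -1)
    return torch.cat([features, expanded], dim=-1)

frames = torch.randn(1, 200, 80)   # 200 frames of 80-dimensional features
accent = torch.randn(1, 128)       # accent vector from the extractor
encoder_input = augment_with_accent(frames, accent)   # shape (1, 200, 208)
```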
In an embodiment, if the to-be-recognized voice signal is a voice signal for controlling the voice interaction device, after obtaining a recognition result corresponding to the to-be-recognized voice signal, the voice recognition device may generate a control instruction according to the recognition result, and send the control instruction to the voice interaction device to instruct the voice interaction device to execute an operation indicated by the control instruction. For example, if the voice signal to be recognized input by the user is "please open the animation film 1", the voice recognition device may generate a control instruction according to the recognition result after obtaining the recognition result corresponding to the voice signal to be recognized, and send the control instruction to the voice interaction device, and after receiving the control instruction, the voice interaction device may execute the operation indicated by the control instruction, that is, open the animation film 1.
In one embodiment, after obtaining a recognition result corresponding to a speech signal to be recognized, if feedback information indicating speech recognition errors is received, the speech recognition device adds the speech signal to be recognized as a training sample to a first training set and a second training set respectively, where the first training set is used for training a trained accent recognition model, the second training set is used for training a speech recognition model, and the feedback information indicating speech recognition errors is generated according to an error feedback operation performed on the recognition result. Wherein, an error feedback control can be arranged in the voice interaction device for the user to perform error feedback. For example, in an application scenario of converting speech into text, if a text corresponding to a speech signal to be recognized displayed by the speech interaction device is not a content that the user actually wants to express, the user may trigger the error feedback control, so that the speech interaction device sends feedback information indicating that speech recognition is incorrect to the speech recognition device. For another example, if the voice signal to be recognized input by the user is a voice signal for controlling the voice interaction device, but an operation performed after the voice interaction device receives the control instruction sent by the voice recognition device is not an operation that the user originally wants to perform by the voice interaction device, the user may trigger the error feedback control, so that the voice interaction device sends feedback information indicating that the voice recognition is incorrect to the voice recognition device.
In the specific implementation, if the voice recognition device receives feedback information indicating voice recognition errors, a voice signal to be recognized is acquired, a target accent category corresponding to the voice signal to be recognized is acquired, a target text corresponding to the voice signal to be recognized is acquired, the voice signal to be recognized and the target accent category corresponding to the voice signal to be recognized are taken as training samples to be added into a first training set, the voice signal to be recognized and the target text corresponding to the voice signal to be recognized are added into a second training set, a trained accent recognition model is trained on the basis of the first training set, the trained accent recognition model is updated, and then the trained accent feature extraction model is updated; and training the speech recognition model based on the second training set to update the speech recognition model. The target accent category corresponding to the speech signal to be recognized and the target text corresponding to the speech signal to be recognized may be marked by a technician and input to the speech recognition device.
In one embodiment, a speech recognition accuracy test was performed using speech test data from the nine major Chinese dialect areas, with the Word Error Rate (WER) metric commonly used in the speech recognition field as the quantitative index; the speech recognition method provided by the embodiments of the present application can relatively reduce the speech recognition error rate for accented speech signals by 5% to 10%.
In the embodiments of the present application, after acquiring the speech signal to be recognized, the speech recognition device performs accent feature extraction on it to obtain the corresponding accent vector, which reflects the accent characteristics of the user who produced the speech signal to be recognized; it then calls the speech recognition model to perform speech recognition based on the speech signal to be recognized and its accent vector, obtaining the corresponding recognition result. During recognition, the speech recognition model works on both the speech signal to be recognized and the accent vector; because the accent vector is derived from the speech signal to be recognized and reflects the accent characteristics of its speaker, the accent features are used to assist recognition of the speech signal, which improves speech recognition accuracy.
Based on the foregoing speech recognition method embodiment, another speech recognition method is provided in the embodiments of the present application. Referring to fig. 5, a schematic flow chart of another speech recognition method provided in the embodiment of the present application is shown. The speech recognition method shown in fig. 5 may be performed by a speech recognition apparatus. The speech recognition method shown in fig. 5 may include the steps of:
s501, acquiring a voice signal to be recognized.
And S502, acquiring a noise value in the voice signal to be recognized.
In specific implementation, the speech recognition device may analyze, extract and process the speech signal to be recognized to obtain a noise signal included in the speech signal to be recognized, and then determine a noise value of the noise signal. Optionally, the speech recognition device may analyze, extract and process the speech signal to be recognized according to characteristics of the noise signal, to obtain the noise signal included in the speech signal to be recognized, where the characteristics of the noise signal may include a sound source position generating the noise signal, a spectrum feature of the noise signal, and the like.
S503, if the noise value is greater than the noise threshold, outputting prompt information prompting the user to input the speech signal again.
The noise threshold is preset according to requirements, and the prompt information is used to prompt the user to input the speech signal again. Optionally, the prompt information may be text information or voice information; the specific format and content of the prompt information are not limited in the embodiments of the present application. For example, the prompt information may be the text message "Sorry, I didn't hear you clearly, please say that again", or the voice message "You spoke a little too fast, I couldn't keep up, please say that again".
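An energy-based sketch of the noise check in steps S502 and S503 is given below; the way the noise value is estimated, the threshold value and the prompt text are assumptions, since the embodiments above only require that some noise value be compared with a noise threshold:

```python
# Simplified sketch: estimate a noise value from the speech signal to be recognized
# and prompt the user to re-input the speech if it exceeds the noise threshold.
import numpy as np

NOISE_THRESHOLD = 0.02  # preset according to requirements (assumed value)

def estimate_noise_value(samples: np.ndarray, frame_len: int = 400) -> float:
    # Crude estimate: average energy of the quietest 10% of frames is taken as noise
    # (assumes the signal is at least one frame long).
    frames = samples[: len(samples) // frame_len * frame_len].reshape(-1, frame_len)
    energies = np.sort((frames ** 2).mean(axis=1))
    return float(energies[: max(1, len(energies) // 10)].mean())

def check_noise(samples: np.ndarray) -> bool:
    if estimate_noise_value(samples) > NOISE_THRESHOLD:
        print("Sorry, I didn't hear you clearly, please say that again.")  # prompt info
        return False  # do not proceed to recognition
    return True  # proceed to accent feature extraction and speech recognition
```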
S504, if the noise value is smaller than the noise threshold, the accent feature extraction processing is carried out on the voice signal to be recognized, and an accent vector corresponding to the voice signal to be recognized is obtained.
In an embodiment, the speech recognition device may perform the accent feature extraction processing on the speech signal to be recognized through the trained accent feature extraction model to obtain an accent vector corresponding to the speech signal to be recognized, which has been described in step S202, and is not described herein again.
In one embodiment, the speech recognition device may perform extraction processing on the speech signal to be recognized based on factorization, so as to obtain a voiceprint feature vector of a user generating the speech signal to be recognized; and extracting the accent vector corresponding to the voice signal to be recognized from the voiceprint feature vector.
In specific implementation, the speech recognition device may process the speech signal to be recognized through a combined model of a Gaussian Mixture Model (GMM) and a Universal Background Model (UBM), i.e., a GMM-UBM model, to obtain the Gaussian mean supervector corresponding to the speech signal to be recognized; then obtain the global difference space factor based on the Gaussian mean supervector corresponding to the UBM model and the global difference space matrix; then obtain the voiceprint feature vector of the user producing the speech signal to be recognized by computing the posterior mean of the global difference space factor; and then extract the accent vector corresponding to the speech signal to be recognized from the voiceprint feature vector. The global difference space factor is given by equation (1):
M=m+Tω (1)
wherein, M is a gaussian mean value supervector corresponding to the speech signal to be recognized, M is a gaussian mean value supervector corresponding to the UBM model, T is a global difference space matrix, and ω is a global difference space factor.
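A much-simplified numerical sketch of equation (1) follows; here ω is recovered by a least-squares solve rather than by the posterior-mean estimate described above, purely to make the relationship between M, m, T and ω concrete, and the dimensions are invented:

```python
# Simplified sketch of M = m + T*omega (equation (1)): given the utterance supervector M,
# the UBM supervector m and the global difference space matrix T, recover the global
# difference space factor omega. A least-squares solve replaces the posterior mean here.
import numpy as np

supervector_dim, factor_dim = 2048, 400            # invented dimensions
m = np.random.randn(supervector_dim)               # UBM Gaussian mean supervector
T = np.random.randn(supervector_dim, factor_dim)   # global difference space matrix
omega_true = np.random.randn(factor_dim)
M = m + T @ omega_true                             # supervector of the utterance

omega_est, *_ = np.linalg.lstsq(T, M - m, rcond=None)
# The voiceprint feature vector would be derived from omega_est, and the accent vector
# corresponding to the speech signal to be recognized extracted from it.
```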
And S505, calling the voice recognition model to perform voice recognition processing based on the voice signal to be recognized and the accent vector corresponding to the voice signal to be recognized, so as to obtain a recognition result corresponding to the voice signal to be recognized.
Step S505 is the same as step S203, and is not described herein again.
In the embodiments of the present application, after acquiring the speech signal to be recognized, the speech recognition device may obtain the noise value in the speech signal to be recognized. If the noise value is greater than the noise threshold, prompt information for re-inputting the speech signal is output; if the noise value is smaller than the noise threshold, accent feature extraction is performed on the speech signal to be recognized to obtain the corresponding accent vector, and the speech recognition model is then called to perform speech recognition based on the speech signal to be recognized and its accent vector, obtaining the corresponding recognition result. By checking the noise value in the speech signal to be recognized, prompt information can be output when the noise value exceeds the noise threshold and subsequent speech recognition can be skipped, saving computing resources; and because the accent vector reflects the accent characteristics of the user who produced the speech signal to be recognized, the accuracy of speech recognition for accented speech signals can be improved.
Based on the embodiment of the speech recognition method, the embodiment of the application trains the accent recognition model based on the training sample to obtain the relevant flow of the trained accent recognition model, and introduces the relevant flow of the trained accent feature extraction model by carrying out model adjustment processing on the trained accent recognition model. Referring to fig. 6, a schematic flow chart of obtaining a trained accent feature extraction model according to an embodiment of the present application is provided. The related process of obtaining the trained accent feature extraction model shown in fig. 6 may be executed by the speech recognition device, and specifically may include the following steps:
and S601, obtaining a training sample.
The training samples are samples used for training the accent recognition model, and the training samples comprise training voice signals and target accent categories corresponding to the training voice signals. The training speech signal is a speech signal with accent, taking the chinese speech as an example, the training speech signal is a chinese speech signal with accent of chinese dialect, taking other languages as an example, and the training speech signal is a speech signal with accent of dialect corresponding to the language. The target accent category corresponding to the training speech signal indicates what kind of preset accent category the training speech signal corresponds to, wherein the preset accent category is preset according to a specific application scenario and requirements, and indicates a training target of the accent recognition model, that is, indicates which kinds of preset accent categories the trained accent recognition model trained on the accent recognition model can recognize, for example, if the preset accent categories are set to be 3, which are respectively jin language accent, gan language accent and other accents, the trained accent recognition model can only recognize the jin language accent and gan language accent, and non-jin language accent and gan language accent will be recognized as other accents. For another example, if the preset accent categories are accent a, accent b, accent c, accent d, and others, respectively, and the target accent category corresponding to the training speech signal is accent b, it indicates that the training speech signal corresponds to accent b in the preset accent categories.
In one embodiment, the target accent category corresponding to the training speech signal may be represented in vector form, where the vector elements are the preset accent categories. For example, if the preset accent categories are accent a, accent b, accent c, accent d and others, and the correspondence between vector elements and preset accent categories is {accent a, accent b, accent c, accent d, others}, then, assuming that the target accent category corresponding to the training speech signal is accent b, the target accent category corresponding to the training speech signal is represented in vector form as {0, 1, 0, 0, 0}.
In an embodiment, after acquiring the training speech signal, the speech recognition device may sample the training speech signal according to a predetermined sampling rate to obtain a sampling value corresponding to the training speech signal, where the predetermined sampling rate is preset, and is typically 16000, that is, 16000 values are sampled in 1 second, and the correlation processing performed on the training speech signal is the correlation processing performed on the sampling value corresponding to the training speech signal.
And S602, taking the training voice signal as input, taking the target accent type corresponding to the training voice signal as expected output, and training the accent recognition model based on the training sample to obtain the trained accent recognition model.
The accent recognition model is used for extracting training accent vectors corresponding to the training voice signals and recognizing prediction accent categories corresponding to the training voice signals based on the training accent vectors. The training accent vector corresponding to the training speech signal indicates the characteristics of different preset accent categories contained in the training speech signal predicted and obtained after the accent recognition model is processed; the predicted accent category corresponding to the training speech signal indicates to which preset accent category the training speech signal predicted after being processed by the accent recognition model corresponds, and also indicates to which predicted accent category of the user's accent generating the training speech signal predicted after being processed by the accent recognition model corresponds.
In one embodiment, the accent recognition model may be a deep neural network-based model, as shown in fig. 7a, which is a schematic diagram of an accent recognition model provided for an embodiment of the present application, and the accent recognition model may include an input layer, a hidden layer, an accent vector layer, and an output layer. Specifically, the speech recognition device may receive a training speech signal through the input layer and send the training speech signal to the hidden layer; calling a hidden layer to perform frame-level feature extraction processing on the training voice signal to obtain frame-level features corresponding to the training voice signal, and performing feature aggregation processing on the frame-level features corresponding to the training voice signal to obtain a predicted primary feature vector of a target length; calling an accent vector layer to perform accent vector analysis based on the predicted primary feature vector to obtain a training accent vector corresponding to the training speech signal; calling an output layer to determine a predicted accent category corresponding to the training speech signal based on the training accent vector, wherein the predicted accent category corresponding to the training speech signal indicates a predicted accent category to which a accent of a user generating the training speech signal belongs; and training the accent recognition model based on the target accent category and the predicted accent category to obtain the trained accent recognition model. The accent vector layer may be a single-layer accent vector layer or a multi-layer accent vector layer, and both the single-layer and the multi-layer are used to perform accent vector analysis based on the predicted primary feature vector to obtain a training accent vector corresponding to the training speech signal, and the number of layers of the specific accent vector layer is not limited in the embodiment of the present application, for example, as shown in fig. 7b, the number of layers of the accent vector layer in the accent recognition model is two.
The voice recognition equipment receives a training voice signal through an input layer and sends the training voice signal to a hidden layer; calling a hidden layer to perform frame-level feature extraction processing on the training voice signal to obtain frame-level features corresponding to the training voice signal, and performing feature aggregation processing on the frame-level features corresponding to the training voice signal to obtain a predicted primary feature vector of a target length; and calling an accent vector layer to perform accent vector analysis based on the predicted primary feature vector to obtain a correlation process of a training accent vector corresponding to the training speech signal, wherein the correlation process is similar to the correlation process of performing accent feature extraction processing on the speech signal to be recognized through the trained accent feature extraction model to obtain the accent vector corresponding to the speech signal to be recognized, and the description is omitted here.
In one embodiment, the speech recognition device invoking the output layer to determine a predicted accent category corresponding to the training speech signal based on the training accent vector may include: and calling an output layer by the voice recognition equipment to carry out normalization processing on the training accent vector to obtain a predicted accent category corresponding to the training voice signal. Wherein the output layer may be a normalization (Softmax) layer. In a specific implementation, a training accent vector corresponding to a training speech signal may be mapped to a probability that the training speech signal corresponds to each preset accent category through a Softmax layer. And the preset accent category corresponding to the maximum probability in the probabilities of the training voice signals corresponding to the preset accent categories is the predicted accent category corresponding to the training voice signals. For example, if the predetermined accent categories are accent a, accent b, accent c, accent d, and others, and it is assumed that the probability of the training speech signal corresponding to accent a is 0.5%, the probability corresponding to accent b is 99%, the probability corresponding to accent c is 0.3%, the probability corresponding to accent d is 0.1%, and the probabilities corresponding to others are 0.1%, the predicted accent category corresponding to the training speech signal is accent b. Optionally, the predicted accent categories corresponding to the training speech signal may be represented in a vector form, where a vector element is a preset accent category, and if a corresponding relationship between the vector element and the preset accent category is { accent a, accent b, accent c, accent d, or other }, the predicted accent categories corresponding to the training speech signal may be represented as {0.5%,99%,0.3%,0.1%,0.1% }.
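A worked sketch of the Softmax output layer described above (the raw accent vector values and the category names are invented for illustration):

```python
# Sketch of the output layer: map a training accent vector to probabilities over the
# preset accent categories with Softmax, then pick the predicted accent category.
import numpy as np

PRESET_CATEGORIES = ["accent a", "accent b", "accent c", "accent d", "others"]

def predict_accent_category(training_accent_vector: np.ndarray) -> str:
    exp = np.exp(training_accent_vector - training_accent_vector.max())
    probs = exp / exp.sum()                        # Softmax normalization
    return PRESET_CATEGORIES[int(probs.argmax())]  # category with maximum probability

print(predict_accent_category(np.array([1.0, 6.3, 0.5, 0.2, 0.2])))  # -> "accent b"
```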
Further, the speech recognition device trains the accent recognition model based on the target accent category and the predicted accent category to obtain the trained accent recognition model, which may include: determining a loss value of a loss function based on the target accent category and the predicted accent category, and training the accent recognition model based on the loss values generated by different training samples until the loss value produced by the training samples used to test the accent recognition model is smaller than a preset loss value, thereby obtaining the trained accent recognition model. The preset loss value can be set in advance according to requirements and indicates whether training of the accent recognition model can be finished. Specifically, the speech recognition device may iteratively update the model parameters of the accent recognition model based on the loss values generated by different training samples; when the loss value produced by the test training samples under the updated model parameters is smaller than the preset loss value, the trained accent recognition model is obtained based on the updated model parameters.
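A minimal sketch of this iterative update with the preset-loss stopping condition is shown below, assuming the AccentRecognitionModel sketched earlier. The choice of the Adam optimizer, the learning rate, and the use of a single held-out batch for the test check are assumptions, not specified by the application.

```python
import torch
import torch.nn as nn

def train_accent_model(model, train_loader, test_batch, preset_loss=0.1, max_epochs=50):
    """Update the model parameters until the loss on a held-out test batch
    drops below the preset loss value (or max_epochs is reached)."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    test_frames, test_labels = test_batch
    for epoch in range(max_epochs):
        for frames, labels in train_loader:      # labels: target accent categories
            _, logits = model(frames)
            loss = criterion(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        with torch.no_grad():                    # check the preset-loss condition
            _, test_logits = model(test_frames)
            test_loss = criterion(test_logits, test_labels).item()
        if test_loss < preset_loss:              # training can be finished
            break
    return model
```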
In one embodiment, the loss function may be a cross-entropy loss function, which can be determined by equation (2):

\[ \text{Loss} = -\sum_{i=1}^{n} y_i \log(\hat{y}_i) \tag{2} \]

where \(n\) is the number of preset accent categories, \(y_i\) is the value corresponding to the i-th preset accent category in the target accent category corresponding to the training speech signal, and \(\hat{y}_i\) is the value corresponding to the i-th preset accent category in the predicted accent category corresponding to the training speech signal, that is, the probability that the training speech signal corresponds to the i-th preset accent category. For example, if the target accent category corresponding to the training speech signal is {0, 1, 0, 0, 0} and the predicted accent category corresponding to the training speech signal is {0.5%, 99%, 0.3%, 0.1%, 0.1%}, the loss value of the cross-entropy loss function is -(0 × log(0.5%) + 1 × log(99%) + 0 × log(0.3%) + 0 × log(0.1%) + 0 × log(0.1%)) = -log(99%).
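The worked example of equation (2) can be checked with a few lines of Python; this is only a numeric illustration of the formula, with the probabilities written as plain floats.

```python
import math

# Target accent category (one-hot) and predicted accent category (probabilities)
# for the five preset categories {accent a, accent b, accent c, accent d, other}.
target = [0, 1, 0, 0, 0]
predicted = [0.005, 0.99, 0.003, 0.001, 0.001]

# Cross-entropy loss from equation (2): only the term of the target category survives.
loss = -sum(y * math.log(p) for y, p in zip(target, predicted))
print(loss)   # -log(0.99) ≈ 0.01005
```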
S603, performing model adjustment processing on the trained accent recognition model to obtain the trained accent feature extraction model.
In specific implementation, the speech recognition device may obtain a trained accent feature extraction model based on an input layer, a hidden layer, and an accent vector layer of the trained accent recognition model, where the trained accent feature extraction model takes an output of the accent vector layer of the trained accent recognition model as a model output, and the trained accent feature extraction model is used to extract an accent vector corresponding to any speech signal.
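Assuming the AccentRecognitionModel sketch given earlier, this model adjustment step can be illustrated as follows: the trained input/hidden and accent vector layers are reused, the output (Softmax) layer is dropped, and the accent vector layer's output is exposed as the model output. The wrapper below is a sketch under those assumptions, not the application's mandated implementation.

```python
import torch
import torch.nn as nn

class AccentFeatureExtractor(nn.Module):
    """Sketch of the model-adjustment step: keep the trained hidden and accent
    vector layers, discard the output layer, and return the accent vector."""
    def __init__(self, trained_model):
        super().__init__()
        self.hidden = trained_model.hidden              # reused, already trained
        self.accent_vector = trained_model.accent_vector

    def forward(self, frames):                          # frames: (batch, T, feat_dim)
        frame_feats = self.hidden(frames)
        pooled = torch.cat([frame_feats.mean(dim=1), frame_feats.std(dim=1)], dim=-1)
        return self.accent_vector(pooled)               # accent vector, no Softmax
```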
In an embodiment, the reason why the speech recognition apparatus performs model adjustment processing on the trained accent recognition model to obtain a trained accent feature extraction model whose model output is the output of the accent vector layer, rather than directly using the trained accent recognition model as the trained accent feature extraction model, is as follows. If the trained accent recognition model were used directly to process the speech signal to be recognized, it would directly output a predicted accent category for that signal, namely the preset accent category with the largest probability among the probabilities of the speech signal to be recognized corresponding to each preset accent category. In other words, the predicted accent category would be a single, determined preset accent category, and for a speech signal to be recognized, using one determined preset accent category to represent the accent features carried in the signal is not accurate. For example, if the preset accent categories are accent a, accent b, accent c, accent d, and other, and a user carries two relatively obvious accents, accent a and accent b, such that the probability of the speech signal to be recognized generated by that user corresponding to accent a is 49%, the probability corresponding to accent b is 50.5%, the probability corresponding to accent c is 0.3%, the probability corresponding to accent d is 0.1%, and the probability corresponding to other is 0.1%, then the trained accent recognition model determines accent b as the predicted accent category and ignores the influence of the remaining preset accent categories, including accent a, which is nearly as likely; this makes the subsequent speech recognition less accurate. Moreover, when the duration of the speech signal to be recognized is short, the reliability of the determined predicted accent category is further reduced.
Furthermore, even if the predicted accent category corresponding to the speech signal to be recognized is not collapsed into a single preset accent category but is instead represented directly as a vector of probabilities, and the speech recognition model is called to perform speech recognition processing based on that probability vector and the speech signal to be recognized, the normalization performed by the output layer still introduces an error when mapping the feature values of the accent vector in the vector space to probabilities. To avoid the error generated in this normalization process, the trained accent feature extraction model obtained in the embodiment of the present application takes the output of the accent vector layer of the trained accent recognition model as the model output.
In an embodiment, the trained accent feature extraction model provided in the embodiment of the present application may also be used to determine the accent similarity between different speech signals to be recognized, and thereby to determine whether the different users generating those speech signals carry similar accents; further, because different accents correspond to different geographical areas, it can also be determined whether those users come from the same geographical area. In specific implementation, the speech recognition device may use the vector similarity between different accent vectors to represent the accent similarity between the corresponding speech signals to be recognized; if the accent similarity is larger than an accent similarity threshold, the different users generating the different speech signals to be recognized are considered to carry similar accents and are further judged to be within the same geographical area. The accent similarity threshold is set according to requirements. Optionally, cosine similarity may be used to characterize the vector similarity between different accent vectors.
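A minimal sketch of this accent-similarity comparison is given below; the cosine similarity and the threshold test follow the description above, while the concrete threshold value of 0.8 is only an illustrative assumption.

```python
import numpy as np

def accent_similarity(vec_a, vec_b):
    """Cosine similarity between two accent vectors."""
    a = np.asarray(vec_a, dtype=float)
    b = np.asarray(vec_b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_region(vec_a, vec_b, threshold=0.8):
    """If the accent similarity exceeds the threshold, the two users are judged to
    carry similar accents and hence to be within the same geographical area.
    The threshold value 0.8 is an illustrative assumption."""
    return accent_similarity(vec_a, vec_b) > threshold
```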
In the embodiment of the present application, after acquiring a training sample comprising a training speech signal and the target accent category corresponding to the training speech signal, the speech recognition device takes the training speech signal as input and the target accent category as expected output, and trains the accent recognition model based on the training sample to obtain the trained accent recognition model; it then performs model adjustment processing on the trained accent recognition model to obtain the trained accent feature extraction model, which is used to extract the accent vector corresponding to any speech signal. The trained accent feature extraction model obtained in the embodiment of the present application is based on the input layer, the hidden layer, and the accent vector layer of the trained accent recognition model: the output layer of the trained accent recognition model is discarded, and the output of the accent vector layer is used directly as the model output. When an accent feature extraction model obtained in this way is used to perform accent feature extraction processing on a speech signal to be recognized, the resulting accent vector can better represent the accent features of the speech signal to be recognized.
Based on the above method embodiments, an embodiment of the present application further provides a speech recognition apparatus. Referring to fig. 8, which is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application, the speech recognition apparatus 80 may include an obtaining unit 801 and a processing unit 802. The speech recognition apparatus 80 shown in fig. 8 may operate as follows:
an obtaining unit 801, configured to obtain a speech signal to be recognized;
a processing unit 802, configured to perform an accent feature extraction process on the speech signal to be recognized to obtain an accent vector corresponding to the speech signal to be recognized, where the accent vector corresponding to the speech signal to be recognized is used to reflect an accent feature of a user generating the speech signal to be recognized;
the processing unit 802 is further configured to invoke a speech recognition model to perform speech recognition processing based on the speech signal to be recognized and the accent vector corresponding to the speech signal to be recognized, so as to obtain a recognition result corresponding to the speech signal to be recognized.
In an embodiment, when the processing unit 802 performs an accent feature extraction process on the speech signal to be recognized to obtain an accent vector corresponding to the speech signal to be recognized, the following operations are specifically performed:
acquiring a noise value in the voice signal to be recognized;
and if the noise value is smaller than a noise threshold value, performing accent feature extraction processing on the voice signal to be recognized to obtain an accent vector corresponding to the voice signal to be recognized.
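The noise-gating step above (acquiring a noise value and comparing it with a noise threshold) can be sketched as follows. The way the noise value is estimated here, the frame length, and the threshold value are all assumptions for illustration; the application does not prescribe a particular noise-estimation method.

```python
import numpy as np

def noise_value(signal, frame_len=400, quietest_ratio=0.1):
    """Rough noise estimate: average energy of the quietest frames of the signal.
    The estimation method and the parameter values are illustrative assumptions."""
    signal = np.asarray(signal, dtype=float)
    num_frames = max(1, len(signal) // frame_len)
    energies = sorted(
        float(np.mean(np.square(signal[i * frame_len:(i + 1) * frame_len])))
        for i in range(num_frames)
    )
    k = max(1, int(num_frames * quietest_ratio))
    return sum(energies[:k]) / k

def maybe_extract_accent_vector(signal, extract_fn, noise_threshold=0.01):
    """Run accent feature extraction only when the noise value is below the threshold."""
    if noise_value(signal) < noise_threshold:
        return extract_fn(signal)   # accent vector corresponding to the signal
    return None                     # too noisy: accent extraction is skipped
```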
In an embodiment, when the processing unit 802 performs an accent feature extraction process on the speech signal to be recognized to obtain an accent vector corresponding to the speech signal to be recognized, the following operations are specifically performed:
performing accent feature extraction processing on the voice signal to be recognized through a trained accent feature extraction model to obtain an accent vector corresponding to the voice signal to be recognized;
before the processing unit 802 performs the accent feature extraction on the speech signal to be recognized to obtain the accent vector corresponding to the speech signal to be recognized:
the obtaining unit 801 is further configured to obtain a training sample, where the training sample includes a training speech signal and a target accent category corresponding to the training speech signal;
the processing unit 802 is further configured to take the training speech signal as input, take a target accent category corresponding to the training speech signal as expected output, train an accent recognition model based on the training sample, and obtain a trained accent recognition model, where the accent recognition model is configured to extract a training accent vector corresponding to the training speech signal and recognize a predicted accent category corresponding to the training speech signal based on the training accent vector;
the processing unit 802 is further configured to perform model adjustment processing on the trained accent recognition model to obtain the trained accent feature extraction model, where the trained accent feature extraction model is used to extract an accent vector corresponding to any speech signal.
In one embodiment, the accent recognition model includes an input layer, a hidden layer, an accent vector layer, and an output layer, and the processing unit 802 specifically performs the following operations when training the accent recognition model based on the training samples to obtain a trained accent recognition model:
receiving the training voice signal through the input layer and sending the training voice signal to the hidden layer;
calling the hidden layer to perform frame-level feature extraction processing on the training voice signal to obtain frame-level features corresponding to the training voice signal, and performing feature aggregation processing on the frame-level features corresponding to the training voice signal to obtain a predicted primary feature vector of a target length;
calling the accent vector layer to perform accent vector analysis based on the predicted primary feature vector to obtain a training accent vector corresponding to the training speech signal;
calling the output layer to determine a predicted accent category corresponding to the training speech signal based on the training accent vector, wherein the predicted accent category corresponding to the training speech signal indicates a predicted accent category of a user accent generating the training speech signal;
and training the accent recognition model based on the target accent category and the predicted accent category to obtain the trained accent recognition model.
In an embodiment, when the processing unit 802 performs model adjustment processing on the trained accent recognition model to obtain the trained accent feature extraction model, the following operations are specifically performed:
and obtaining the trained accent feature extraction model based on an input layer, a hidden layer and an accent vector layer of the trained accent recognition model, wherein the trained accent feature extraction model takes the output of the accent vector layer of the trained accent recognition model as a model output.
In an embodiment, when the processing unit 802 performs an accent feature extraction process on the to-be-recognized speech signal through a trained accent feature extraction model to obtain an accent vector corresponding to the to-be-recognized speech signal, the following operations are specifically performed:
carrying out frame-level feature extraction processing on the voice signal to be recognized through the trained accent feature extraction model to obtain frame-level features corresponding to the voice signal to be recognized;
performing feature aggregation processing on the frame-level features corresponding to the voice signal to be recognized to obtain a primary feature vector of a target length;
and carrying out accent vector analysis based on the primary feature vector to obtain an accent vector corresponding to the voice signal to be recognized.
In an embodiment, when the processing unit 802 performs an accent feature extraction process on the speech signal to be recognized to obtain an accent vector corresponding to the speech signal to be recognized, the following operations are specifically performed:
extracting the voice signal to be recognized based on factorization to obtain a voiceprint feature vector of the user generating the voice signal to be recognized;
and extracting the accent vector corresponding to the voice signal to be recognized from the voiceprint feature vector.
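For illustration only, the factorization-based extraction described above can be sketched roughly as follows. The application does not specify which factorization is used, so this sketch substitutes scikit-learn's FactorAnalysis over frame-level features as a stand-in, and treats a fixed sub-range of the voiceprint vector as the accent vector; both choices, and all dimensions below, are assumptions.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

def voiceprint_vector(frame_features, n_components=20):
    """Factorization-based extraction (stand-in): fit a factor-analysis model on
    the frame-level features and average the latent factors into one voiceprint
    feature vector. frame_features: (num_frames, feat_dim) array."""
    fa = FactorAnalysis(n_components=n_components)
    latent = fa.fit_transform(np.asarray(frame_features))   # (num_frames, n_components)
    return latent.mean(axis=0)                               # (n_components,)

def accent_vector_from_voiceprint(voiceprint, accent_dims=slice(0, 8)):
    """Hypothetical: take a sub-range of the voiceprint vector as the accent vector;
    which dimensions carry accent information is an assumption made here."""
    return np.asarray(voiceprint)[accent_dims]
```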
In an embodiment, after obtaining the recognition result corresponding to the speech signal to be recognized, the processing unit 802 is further configured to:
if feedback information indicating voice recognition errors is received, the voice signal to be recognized is used as a training sample and is respectively added into a first training set and a second training set, the first training set is used for training the trained accent recognition model, and the second training set is used for training the voice recognition model; wherein the feedback information indicating the voice recognition error is generated according to an error feedback operation performed on the recognition result.
In one embodiment, the speech recognition apparatus 80 further comprises an output unit 803, and the speech signal to be recognized is a speech signal for controlling a speech interaction device; after the processing unit 802 obtains the recognition result corresponding to the voice signal to be recognized:
the processing unit 802 is further configured to generate a control instruction according to the identification result;
the output unit 803 is configured to send the control instruction to the voice interaction device, so as to instruct the voice interaction device to execute the operation indicated by the control instruction.
According to an embodiment of the present application, the steps involved in the speech recognition method shown in fig. 2 and 5 and the method of obtaining a trained accent feature extraction model shown in fig. 6 may be performed by units in the speech recognition apparatus 80 shown in fig. 8. For example, step S201 shown in fig. 2 may be performed by the acquisition unit 801 in the speech recognition apparatus 80 shown in fig. 8, and step S202 and step S203 shown in fig. 2 may be performed by the processing unit 802 in the speech recognition apparatus 80 shown in fig. 8. As another example, step S501 shown in fig. 5 may be performed by the acquisition unit 801 in the speech recognition apparatus 80 shown in fig. 8, steps S502, S504, and S505 shown in fig. 5 may be performed by the processing unit 802 in the speech recognition apparatus 80 shown in fig. 8, and step S503 shown in fig. 5 may be performed by the output unit 803 in the speech recognition apparatus 80 shown in fig. 8. For another example, step S601 shown in fig. 6 may be executed by the acquisition unit 801 in the speech recognition apparatus 80 shown in fig. 8, and step S602 and step S603 shown in fig. 6 may be executed by the processing unit 802 in the speech recognition apparatus 80 shown in fig. 8.
According to another embodiment of the present application, the units in the speech recognition apparatus 80 shown in fig. 8 may be respectively or entirely combined into one or several other units, or one (or more) of the units may be further split into multiple functionally smaller units, which can achieve the same operation without affecting the realization of the technical effects of the embodiments of the present application. The above units are divided based on logical functions; in practical applications, the function of one unit may be realized by multiple units, or the functions of multiple units may be realized by one unit. In other embodiments of the present application, the speech recognition apparatus 80 may likewise include other units divided based on logical functions; in practical applications, these functions may also be realized with the assistance of other units, and may be realized by multiple units in cooperation.
According to another embodiment of the present application, the speech recognition apparatus 80 shown in fig. 8 may be constructed, and the speech recognition method of the embodiments of the present application implemented, by running a computer program (including program code) capable of executing the steps involved in the methods shown in fig. 2, fig. 5, and fig. 6 on a general-purpose computing device, such as a computer, that includes processing elements and storage elements such as a central processing unit (CPU), a random access memory (RAM), and a read-only memory (ROM). The computer program may, for example, be recorded on a computer-readable storage medium, loaded into the above-mentioned computing device via the computer-readable storage medium, and executed therein.
In this embodiment of the present application, after the obtaining unit 801 obtains a speech signal to be recognized, the processing unit 802 performs accent feature extraction processing on the speech signal to be recognized to obtain an accent vector corresponding to the speech signal to be recognized, where the accent vector is used to reflect the accent features of the user generating the speech signal to be recognized; the processing unit 802 then calls a speech recognition model to perform speech recognition processing based on the speech signal to be recognized and the corresponding accent vector, so as to obtain a recognition result corresponding to the speech signal to be recognized. In the speech recognition process, the speech recognition model performs recognition processing based on both the speech signal to be recognized and the accent vector, and because the accent vector is obtained from the speech signal to be recognized and can reflect the accent features of the user generating that signal, the accent features assist the speech recognition of the speech signal to be recognized, and the speech recognition accuracy can be improved.
Based on the foregoing method embodiments and apparatus embodiments, an embodiment of the present application further provides a speech recognition device. Referring to fig. 9, a schematic structural diagram of a speech recognition device provided in an embodiment of the present application is shown. The speech recognition device 90 shown in fig. 9 may include at least a processor 901, an input interface 902, an output interface 903, and a computer storage medium 904. The processor 901, the input interface 902, the output interface 903, and the computer storage medium 904 may be connected by a bus or in other ways.
The computer storage medium 904 may be stored in a memory of the speech recognition device 90. The computer storage medium 904 is used for storing a computer program, the computer program includes program instructions, and the processor 901 is used for executing the program instructions stored in the computer storage medium 904. The processor 901 (or CPU) is the computing core and control core of the speech recognition device 90; it is adapted to implement one or more instructions and, specifically, adapted to load and execute the one or more instructions so as to implement the above-described speech recognition method flow or the corresponding functions.
An embodiment of the present application further provides a computer storage medium (Memory), which is a Memory device in the speech recognition device 90 and is used for storing programs and data. It is understood that the computer storage medium herein may include a built-in storage medium in the terminal, and may also include an extended storage medium supported by the terminal. The computer storage medium provides a storage space that stores an operating system of the terminal. Also stored in this memory space are one or more instructions, which may be one or more computer programs (including program code), suitable for loading and execution by processor 901. It should be noted that the computer storage medium herein may be a Random Access Memory (RAM) memory, or a non-volatile memory (non-volatile memory), such as at least one disk memory; and optionally at least one computer storage medium located remotely from the processor.
In one embodiment, one or more instructions stored in a computer storage medium may be loaded and executed by the processor 901, the input interface 902, and the output interface 903 to implement the corresponding steps in the speech recognition method shown in fig. 2 and fig. 5 and the method for obtaining a trained accent feature extraction model shown in fig. 6, in particular, the one or more instructions in the computer storage medium are loaded and executed by the processor 901 and the input interface 902 to implement the following steps:
an input interface 902 for acquiring a speech signal to be recognized;
a processor 901, configured to perform an accent feature extraction process on the speech signal to be recognized to obtain an accent vector corresponding to the speech signal to be recognized, where the accent vector corresponding to the speech signal to be recognized is used to reflect an accent feature of a user generating the speech signal to be recognized;
the processor 901 is configured to invoke a speech recognition model to perform speech recognition processing based on the speech signal to be recognized and the accent vector corresponding to the speech signal to be recognized, so as to obtain a recognition result corresponding to the speech signal to be recognized.
In an embodiment, when the processor 901 performs an accent feature extraction process on the speech signal to be recognized to obtain an accent vector corresponding to the speech signal to be recognized, the following operations are specifically performed:
acquiring a noise value in the voice signal to be recognized;
and if the noise value is smaller than a noise threshold value, performing accent feature extraction processing on the voice signal to be recognized to obtain an accent vector corresponding to the voice signal to be recognized.
In an embodiment, when the processor 901 performs an accent feature extraction process on the speech signal to be recognized to obtain an accent vector corresponding to the speech signal to be recognized, the following operations are specifically performed:
carrying out accent feature extraction processing on the voice signal to be recognized through a trained accent feature extraction model to obtain an accent vector corresponding to the voice signal to be recognized;
before the processor 901 performs an accent feature extraction process on the speech signal to be recognized to obtain an accent vector corresponding to the speech signal to be recognized:
the input interface 902 is further configured to obtain a training sample, where the training sample includes a training speech signal and a target accent category corresponding to the training speech signal;
the processor 901 is further configured to take the training speech signal as input, take a target accent category corresponding to the training speech signal as expected output, train an accent recognition model based on the training sample, and obtain a trained accent recognition model, where the accent recognition model is configured to extract a training accent vector corresponding to the training speech signal and recognize a predicted accent category corresponding to the training speech signal based on the training accent vector;
the processor 901 is further configured to perform model adjustment processing on the trained accent recognition model to obtain the trained accent feature extraction model, where the trained accent feature extraction model is used to extract an accent vector corresponding to any speech signal.
In an embodiment, the accent recognition model includes an input layer, a hidden layer, an accent vector layer, and an output layer, and when the processor 901 trains the accent recognition model based on the training samples to obtain a trained accent recognition model, the following operations are specifically performed:
receiving the training voice signal through the input layer and sending the training voice signal to the hidden layer;
calling the hidden layer to perform frame-level feature extraction processing on the training voice signal to obtain frame-level features corresponding to the training voice signal, and performing feature aggregation processing on the frame-level features corresponding to the training voice signal to obtain a predicted primary feature vector of a target length;
calling the accent vector layer to perform accent vector analysis based on the predicted primary feature vector to obtain a training accent vector corresponding to the training speech signal;
calling the output layer to determine a predicted accent category corresponding to the training speech signal based on the training accent vector, the predicted accent category corresponding to the training speech signal indicating a predicted accent category of a user accent from which the training speech signal was generated;
and training the accent recognition model based on the target accent category and the predicted accent category to obtain the trained accent recognition model.
In an embodiment, the processor 901 performs a model adjustment process on the trained accent recognition model to obtain the trained accent feature extraction model, and specifically performs the following operations:
and obtaining the trained accent feature extraction model based on an input layer, a hidden layer and an accent vector layer of the trained accent recognition model, wherein the trained accent feature extraction model takes the output of the accent vector layer of the trained accent recognition model as a model output.
In an embodiment, the processor 901 performs an accent feature extraction process on the speech signal to be recognized through a trained accent feature extraction model, and when an accent vector corresponding to the speech signal to be recognized is obtained, the following operations are specifically executed:
carrying out frame-level feature extraction processing on the voice signal to be recognized through the trained accent feature extraction model to obtain frame-level features corresponding to the voice signal to be recognized;
performing feature aggregation processing on the frame-level features corresponding to the voice signal to be recognized to obtain a primary feature vector of a target length;
and carrying out accent vector analysis based on the primary feature vector to obtain an accent vector corresponding to the voice signal to be recognized.
In an embodiment, when the processor 901 performs an accent feature extraction process on the speech signal to be recognized to obtain an accent vector corresponding to the speech signal to be recognized, the following operations are specifically performed:
extracting the voice signal to be recognized based on factorization to obtain a voiceprint feature vector of the user generating the voice signal to be recognized;
and extracting the accent vector corresponding to the voice signal to be recognized from the voiceprint feature vector.
In an embodiment, after obtaining the recognition result corresponding to the speech signal to be recognized, the processor 901 is further configured to:
if feedback information indicating voice recognition errors is received, the voice signal to be recognized is used as a training sample and is respectively added into a first training set and a second training set, the first training set is used for training the trained accent recognition model, and the second training set is used for training the voice recognition model; wherein the feedback information indicating the voice recognition error is generated according to an error feedback operation performed on the recognition result.
In one embodiment, the voice signal to be recognized is a voice signal for controlling a voice interaction device; after the processor 901 obtains the recognition result corresponding to the speech signal to be recognized:
the processor 901 is further configured to generate a control instruction according to the recognition result;
the output interface 903 is configured to send the control instruction to the voice interaction device, so as to instruct the voice interaction device to execute an operation indicated by the control instruction.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the speech recognition device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, causing the speech recognition device to perform the method embodiments described above as shown in fig. 2, 5 or 6. The computer-readable storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A speech recognition method, comprising:
acquiring a voice signal to be recognized;
performing accent feature extraction processing on the voice signal to be recognized to obtain an accent vector corresponding to the voice signal to be recognized, wherein the accent vector corresponding to the voice signal to be recognized is used for reflecting accent features of a user generating the voice signal to be recognized;
and calling a voice recognition model to perform voice recognition processing based on the voice signal to be recognized and the accent vector corresponding to the voice signal to be recognized, so as to obtain a recognition result corresponding to the voice signal to be recognized.
2. The method according to claim 1, wherein the performing an accent feature extraction process on the speech signal to be recognized to obtain an accent vector corresponding to the speech signal to be recognized comprises:
acquiring a noise value in the voice signal to be recognized;
and if the noise value is smaller than a noise threshold value, performing accent feature extraction processing on the voice signal to be recognized to obtain an accent vector corresponding to the voice signal to be recognized.
3. The method according to claim 1, wherein the performing an accent feature extraction process on the speech signal to be recognized to obtain an accent vector corresponding to the speech signal to be recognized comprises:
carrying out accent feature extraction processing on the voice signal to be recognized through a trained accent feature extraction model to obtain an accent vector corresponding to the voice signal to be recognized;
before the accent feature extraction processing is performed on the speech signal to be recognized to obtain an accent vector corresponding to the speech signal to be recognized, the method further includes:
acquiring a training sample, wherein the training sample comprises a training voice signal and a target accent type corresponding to the training voice signal;
taking the training voice signal as input, taking a target accent category corresponding to the training voice signal as expected output, training an accent recognition model based on the training sample to obtain a trained accent recognition model, wherein the accent recognition model is used for extracting a training accent vector corresponding to the training voice signal and recognizing a predicted accent category corresponding to the training voice signal based on the training accent vector;
and carrying out model adjustment processing on the trained accent recognition model to obtain the trained accent feature extraction model, wherein the trained accent feature extraction model is used for extracting an accent vector corresponding to any voice signal.
4. The method of claim 3, wherein the accent recognition model comprises an input layer, a hidden layer, an accent vector layer, and an output layer, and wherein training the accent recognition model based on the training samples to obtain a trained accent recognition model comprises:
receiving the training voice signal through the input layer and sending the training voice signal to the hidden layer;
calling the hidden layer to perform frame-level feature extraction processing on the training voice signal to obtain frame-level features corresponding to the training voice signal, and performing feature aggregation processing on the frame-level features corresponding to the training voice signal to obtain a predicted primary feature vector of a target length;
calling the accent vector layer to perform accent vector analysis based on the predicted primary feature vector to obtain a training accent vector corresponding to the training speech signal;
calling the output layer to determine a predicted accent category corresponding to the training speech signal based on the training accent vector, the predicted accent category corresponding to the training speech signal indicating a predicted accent category of a user accent from which the training speech signal was generated;
and training the accent recognition model based on the target accent category and the predicted accent category to obtain the trained accent recognition model.
5. The method as claimed in claim 4, wherein said performing model adjustment on said trained accent recognition model to obtain said trained accent feature extraction model comprises:
and obtaining the trained accent feature extraction model based on an input layer, a hidden layer and an accent vector layer of the trained accent recognition model, wherein the trained accent feature extraction model takes the output of the accent vector layer of the trained accent recognition model as a model output.
6. The method as claimed in claim 3, wherein said performing an accent feature extraction process on the speech signal to be recognized through the trained accent feature extraction model to obtain an accent vector corresponding to the speech signal to be recognized comprises:
performing frame-level feature extraction processing on the voice signal to be recognized through the trained accent feature extraction model to obtain frame-level features corresponding to the voice signal to be recognized;
carrying out feature aggregation processing on the frame-level features corresponding to the voice signal to be recognized to obtain a primary feature vector of a target length;
and carrying out accent vector analysis based on the primary feature vector to obtain an accent vector corresponding to the voice signal to be recognized.
7. The method according to claim 1, wherein the performing an accent feature extraction process on the speech signal to be recognized to obtain an accent vector corresponding to the speech signal to be recognized comprises:
extracting the voice signal to be recognized based on factorization to obtain a voiceprint feature vector of the user generating the voice signal to be recognized;
and extracting the accent vector corresponding to the voice signal to be recognized from the voiceprint feature vector.
8. The method according to claim 3, wherein after obtaining the recognition result corresponding to the speech signal to be recognized, the method further comprises:
if feedback information indicating voice recognition errors is received, the voice signal to be recognized is used as a training sample and is respectively added into a first training set and a second training set, the first training set is used for training the trained accent recognition model, and the second training set is used for training the voice recognition model; wherein the feedback information indicating the voice recognition error is generated according to an error feedback operation performed on the recognition result.
9. The method of claim 1, wherein the voice signal to be recognized is a voice signal for controlling a voice interaction device; after obtaining the recognition result corresponding to the voice signal to be recognized, the method further includes:
and generating a control instruction according to the recognition result, and sending the control instruction to the voice interaction equipment so as to instruct the voice interaction equipment to execute the operation indicated by the control instruction.
10. A speech recognition apparatus, comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a voice signal to be recognized;
the processing unit is used for carrying out accent feature extraction processing on the voice signal to be recognized to obtain an accent vector corresponding to the voice signal to be recognized, and the accent vector corresponding to the voice signal to be recognized is used for reflecting accent features of a user generating the voice signal to be recognized;
the processing unit is further configured to call a voice recognition model to perform voice recognition processing based on the voice signal to be recognized and the accent vector corresponding to the voice signal to be recognized, so as to obtain a recognition result corresponding to the voice signal to be recognized.