WO2021190389A1 - Speech processing method, speech encoder, speech decoder, and speech recognition system - Google Patents

Speech processing method, speech encoder, speech decoder, and speech recognition system

Info

Publication number
WO2021190389A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
information
neural network
voice
feature information
Prior art date
Application number
PCT/CN2021/081457
Other languages
English (en)
French (fr)
Inventor
张仕良 (Shiliang Zhang)
高志付 (Zhifu Gao)
雷鸣 (Ming Lei)
Original Assignee
阿里巴巴集团控股有限公司 (Alibaba Group Holding Limited)
Priority date
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司 (Alibaba Group Holding Limited)
Publication of WO2021190389A1
Priority to US17/951,569 (published as US20230009633A1)

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/1815 - Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis

Definitions

  • the present invention relates to the technical field of data processing, in particular to a voice processing method, a voice encoder, a voice decoder and a voice recognition system.
  • Speech recognition technology can convert human speech waveforms into text that can be recognized by machines.
  • the speech recognition rate is an important indicator for evaluating speech recognition performance.
  • Google proposed a Transformer model that can perform speech recognition.
  • the Transformer model can use a text-related self-attention mechanism to model the long-term relevance of speech to obtain a speech recognition model.
  • the speech recognition operation is realized through the established speech recognition model.
  • the embodiments of the present invention provide a voice processing method, a voice encoder, a voice decoder, and a voice recognition system, which can not only reduce the complexity of processing voice signals, but also improve the quality and efficiency of voice signal recognition.
  • an embodiment of the present invention provides a voice processing method, including:
  • target feature information used to characterize the semantics in the speech signal is determined.
  • an embodiment of the present invention provides a speech encoder, including:
  • the first acquiring unit is used to acquire the voice signal to be processed
  • the first processing unit is configured to use a first neural network to process the voice signal to obtain first feature information corresponding to the voice signal, and the first feature information is used to identify the semantics in the voice signal;
  • the first processing unit is further configured to use a second neural network to process the voice signal to obtain second feature information corresponding to the voice signal, and the second feature information is used to identify the semantics in the voice signal, wherein the second feature information is different from the first feature information;
  • the first determining unit is configured to determine target feature information used to characterize the semantics in the voice signal according to the first feature information and the second feature information.
  • an embodiment of the present invention provides an electronic device, including: a memory and a processor; wherein the memory is used to store one or more computer instructions, and the one or more computer instructions, when executed by the processor, implement the voice processing method in the first aspect.
  • an embodiment of the present invention provides a computer storage medium for storing a computer program that, when executed by a computer, enables the computer to implement the voice processing method in the first aspect.
  • an embodiment of the present invention provides a voice processing method, including:
  • the multi-head attention mechanism and the historical prediction information are used to process the target feature information to obtain text information corresponding to the voice signal.
  • an embodiment of the present invention provides a speech decoder, including:
  • the second receiving module is configured to receive target feature information sent by the encoder, where the target feature information corresponds to a voice signal;
  • the second acquisition module is used to acquire historical prediction information
  • the second processing module is used to process the target feature information by using the multi-head attention mechanism and the historical prediction information to obtain text information corresponding to the voice signal.
  • an embodiment of the present invention provides an electronic device, including: a memory and a processor; wherein the memory is used to store one or more computer instructions, and the one or more computer instructions, when executed by the processor, implement the voice processing method in the fifth aspect.
  • an embodiment of the present invention provides a computer storage medium for storing a computer program that, when executed by a computer, enables the computer to implement the voice processing method in the fifth aspect.
  • an embodiment of the present invention provides a voice recognition system, including:
  • the voice encoder described in the second aspect is used to perform data dimensionality reduction processing on the acquired voice signal to obtain voice feature information corresponding to the voice signal.
  • an embodiment of the present invention provides a data processing method, including:
  • the first neural network and the second neural network are respectively used to process the voice signal to obtain the first feature information and the second feature information corresponding to the voice signal, wherein the calculation efficiency of the first neural network is higher than that of the second neural network, and the accuracy of the second feature information output by the second neural network is higher than the accuracy of the first feature information output by the first neural network;
  • target feature information used to characterize the semantics in the speech signal is determined.
  • an embodiment of the present invention provides a speech encoder, including:
  • the third acquisition module is used to acquire the voice signal to be processed
  • the third processing module is used to process the voice signal by using the first neural network and the second neural network to obtain first feature information and second feature information corresponding to the voice signal, wherein the calculation efficiency of the first neural network is higher than that of the second neural network, and the accuracy of the second feature information output by the second neural network is higher than the accuracy of the first feature information output by the first neural network;
  • the third determining module is configured to determine target feature information used to characterize the semantics in the voice signal according to the first feature information and the second feature information.
  • an embodiment of the present invention provides an electronic device, including: a memory and a processor; wherein the memory is used to store one or more computer instructions, and the one or more computer instructions, when executed by the processor, implement the voice processing method in the tenth aspect.
  • an embodiment of the present invention provides a computer storage medium for storing a computer program that, when executed by a computer, enables the computer to implement the voice processing method in the tenth aspect.
  • an embodiment of the present invention provides a voice recognition system, including:
  • the voice encoder described in the eleventh aspect is used to perform data dimensionality reduction processing on the acquired voice signal to obtain voice feature information corresponding to the voice signal.
  • the speech processing method, speech encoder, speech decoder, and speech recognition system provided by the embodiments use a first neural network to process the acquired speech signal to obtain the first feature information, and use a second neural network to process the speech signal to obtain the second feature information. Because the first neural network and the second neural network are different, the obtained first feature information and second feature information are complementary in the efficiency and quality of speech processing. Then, according to the first feature information and the second feature information, the target feature information used to characterize the semantics in the speech signal is determined, which effectively guarantees the quality of the target feature information, further improves the quality and efficiency of speech signal processing, and ensures the practicability of the method.
  • FIG. 1 is a schematic flowchart of a voice processing method provided by an embodiment of the present invention
  • FIG. 2 is a schematic diagram of an application scenario of a voice processing method provided by an embodiment of the present invention
  • FIG. 3 is a schematic flowchart of processing the voice signal by using a first neural network to obtain first characteristic information corresponding to the voice signal according to an embodiment of the present invention
  • FIG. 4 is a schematic diagram of a process of processing the voice feature information based on the self-attention mechanism to obtain the first feature information according to an embodiment of the present invention
  • FIG. 5 is a schematic diagram of a process for obtaining fusion conversion information corresponding to the search term feature, the keyword feature, and the value feature according to an embodiment of the present invention
  • FIG. 6 is a schematic flowchart of obtaining first characteristic information corresponding to the voice signal according to the number of attention mechanisms and fusion conversion information according to an embodiment of the present invention
  • FIG. 7 is a schematic flowchart of processing the voice signal by using a second neural network to obtain second characteristic information corresponding to the voice signal according to an embodiment of the present invention
  • FIG. 8 is a schematic flowchart of processing the value feature by using a static memory neural network to obtain the second feature information according to an embodiment of the present invention
  • FIG. 9 is a schematic diagram of a voice processing method provided by an application embodiment of the present invention.
  • FIG. 10 is a schematic flowchart of another voice processing method provided by an embodiment of the present invention.
  • FIG. 11 is a schematic diagram of another voice processing method provided by an embodiment of the present invention.
  • FIG. 12 is a schematic diagram of yet another voice processing method provided by an embodiment of the present invention.
  • FIG. 13 is a schematic structural diagram of a speech encoder provided by an embodiment of the present invention.
  • FIG. 14 is a schematic structural diagram of an electronic device corresponding to the speech encoder provided by the embodiment shown in FIG. 13;
  • FIG. 15 is a schematic structural diagram of a speech decoder provided by an embodiment of the present invention.
  • FIG. 16 is a schematic structural diagram of an electronic device corresponding to the speech decoder provided by the embodiment shown in FIG. 15;
  • Figure 17 is a schematic structural diagram of another speech encoder provided by an embodiment of the present invention.
  • FIG. 18 is a schematic structural diagram of an electronic device corresponding to the speech encoder provided in the embodiment shown in FIG. 17;
  • FIG. 19 is a schematic structural diagram of a speech recognition system provided by an embodiment of the present invention.
  • FIG. 20 is a schematic diagram of the application of a speech recognition system provided by an embodiment of the present invention.
  • depending on the context, the word “if” as used herein can be interpreted as “when”, “upon”, “in response to determining”, or “in response to detecting”;
  • similarly, the phrase “if determined” or “if detected (a stated condition or event)” can be interpreted as “when determined”, “in response to determining”, “when detected (the stated condition or event)”, or “in response to detecting (the stated condition or event)”.
  • the neural networks used in existing end-to-end speech recognition systems include: recurrent neural networks based on long short-term memory units (LSTM-RNN), the Transformer model based on the self-attention mechanism, the Deep Feed-forward Sequential Memory Network (DFSMN), and so on.
  • DFSMN is an improved network structure built on the earlier Feedforward Sequential Memory Network (FSMN).
  • when the Transformer model is used to construct a speech recognition system, it can use a text-related self-attention mechanism to model the long-term dependence of speech to obtain a speech recognition model, so that speech recognition operations can be realized through the established model; here, the long-term correlation of speech refers to the correlation between the current speech signal and the speech signal content at historical moments and at future moments.
  • the voice recognition efficiency of the Transformer model is higher and the effect is better than that of the LSTM-RNN model.
  • when the DFSMN model is used to construct a speech recognition system, it can use some text-independent filters to model the long-term relevance of speech to obtain a speech recognition model, so as to realize the speech recognition operation through the established speech recognition model.
  • DFSMN can achieve better performance than the Transformer on some clean speech and has lower complexity, but on some speech of poorer quality, the Transformer has the performance advantage.
  • Fig. 1 is a schematic flowchart of a voice processing method provided by an embodiment of the present invention
  • Fig. 2 is a schematic diagram of an application scenario of a voice processing method provided by an embodiment of the present invention
  • this embodiment provides a voice processing method.
  • the execution body of the method may be a voice processing device. It is understandable that the voice processing device may be implemented as software or a combination of software and hardware.
  • the speech processing device may be a speech encoder, which can process the speech signal to obtain characteristic information used to characterize the semantics of the speech signal.
  • the voice processing method may include:
  • Step S101 Acquire a voice signal to be processed.
  • Step S102 Use the first neural network to process the voice signal to obtain first feature information corresponding to the voice signal, where the first feature information is used to identify the semantics in the voice signal.
  • Step S103 Use the second neural network to process the voice signal to obtain second feature information corresponding to the voice signal.
  • the second feature information is used to identify the semantics in the voice signal, and the second feature information is different from the first feature information.
  • Step S104 According to the first feature information and the second feature information, determine target feature information used to characterize the semantics in the speech signal.
  • Step S101 Acquire a voice signal to be processed.
  • the voice signal to be processed refers to a signal that requires voice recognition or voice processing.
  • the above voice signal may be voice information directly input by the user.
  • the voice processing device can directly collect the voice information input by the user, so as to obtain the voice signal to be processed.
  • alternatively, the above-mentioned voice signal may be voice information sent by other equipment; for example, the voice information input by the user is collected through a voice collection unit, and the voice processing device is in communication with the voice collection unit. At this time, the voice processing device can obtain the voice signal to be processed through the voice collection unit.
  • Step S102 Use the first neural network to process the voice signal to obtain first feature information corresponding to the voice signal, where the first feature information is used to identify the semantics in the voice signal.
  • the first neural network may include any one of the following: a self-attention mechanism or a static memory neural network (Static Memory Network, SMN for short). It is understandable that the first neural network is not limited to the types of networks exemplified above; those skilled in the art can also set the first neural network to other types of neural networks according to specific application requirements and design requirements, as long as the first neural network can process the speech signal to obtain feature information used to identify the semantics in the speech signal, which will not be repeated here.
  • the voice signal includes a first signal used to identify the semantics of the voice and a second signal used to identify the characteristics of the user.
  • the second signal is used to identify the characteristics of the user who inputs the voice signal, for example: the user's tone information, accent information, language type, age information, and so on.
  • in order to identify the semantics in the voice signal, the first neural network can be used to process the voice signal, so that the first feature information corresponding to the voice signal can be obtained; the first feature information can be used to identify the semantics included in the speech signal.
  • Step S103 Use the second neural network to process the voice signal to obtain second feature information corresponding to the voice signal.
  • the second feature information is used to identify the semantics in the voice signal, and the second feature information is different from the first feature information.
  • the second neural network may include any one of the following: a self-attention mechanism and a static memory neural network. It is understandable that in order to make the second feature information different from the first feature information, the second neural network can be made different from the first neural network.
  • specifically, when the first neural network includes a self-attention mechanism, the second neural network may include a static memory neural network; when the first neural network includes a static memory neural network, the second neural network may include a self-attention mechanism.
  • of course, the second neural network is not limited to the types of networks exemplified above, and those skilled in the art can also set the second neural network to other types of neural networks according to specific application requirements and design requirements, as long as it can be guaranteed that the second neural network is different from the first neural network and can process the speech signal to obtain the characteristic information used to identify the semantics in the speech signal, which will not be repeated here.
  • the voice signal includes the first signal for identifying the semantics of the voice and the second signal for identifying the user characteristics
  • the voice signal can be processed by using the second neural network, so that second feature information corresponding to the voice signal can be obtained, and the second feature information can be used to identify the semantics included in the voice signal. Since the second neural network is different from the first neural network, the second feature information obtained through the second neural network and the first feature information obtained through the first neural network are complementary in the quality and efficiency of speech recognition.
  • Step S104 According to the first feature information and the second feature information, determine target feature information used to characterize the semantics in the speech signal.
  • the first feature information and the second feature information can be analyzed and processed to determine the target feature information used to characterize the semantics of the speech signal.
  • determining the target feature information used to characterize the semantics of the speech signal may include:
  • Step S1041 Determine the sum of the first feature information and the second feature information as the target feature information.
  • since the first neural network and the second neural network are different, the efficiency and quality of speech signal processing using them are complementary. After acquiring the first feature information and the second feature information, the sum of the complementary first feature information and second feature information is determined as the target feature information; because the target feature information at this point fuses the first feature information and the second feature information, the quality and efficiency of recognizing voice signals are effectively improved.
  • the voice processing method provided in this embodiment uses the first neural network to process the acquired voice signal to obtain the first feature information, and uses the second neural network to process the voice signal to obtain the second feature information. Since the first neural network and the second neural network are different, the obtained first feature information and second feature information are complementary in the efficiency and quality of speech processing; determining the target feature information used to characterize the semantics in the speech signal according to them effectively guarantees the quality of the target feature information, further improves the quality and efficiency of processing the speech signal, and ensures the practicability of the method.
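  • to make the fusion step concrete, the following minimal Python (numpy) sketch illustrates steps S102 to S104 under the assumption that both branches emit feature sequences of the same shape; the function name and all shapes are illustrative and are not taken from the patent:

    import numpy as np

    def fuse_branch_outputs(first_feature: np.ndarray,
                            second_feature: np.ndarray) -> np.ndarray:
        # Element-wise sum of the two branch outputs (step S1041).
        # first_feature:  output of the first neural network (e.g. self-attention)
        # second_feature: output of the second neural network (e.g. static memory)
        # Both are assumed to have shape (time_steps, feature_dim).
        assert first_feature.shape == second_feature.shape
        return first_feature + second_feature

    # Toy usage: two complementary 4-frame, 8-dimensional feature sequences.
    first = np.random.randn(4, 8)    # stands in for the first feature information
    second = np.random.randn(4, 8)   # stands in for the second feature information
    target = fuse_branch_outputs(first, second)  # target feature information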
  • FIG. 3 is a schematic flowchart of processing a voice signal using a first neural network to obtain first characteristic information corresponding to the voice signal according to an embodiment of the present invention; on the basis of the foregoing embodiment, continue to refer to FIG. 3
  • this embodiment does not limit its specific processing implementation. Those skilled in the art can set it according to specific application requirements and design requirements.
  • using the first neural network to process the voice signal to obtain the first characteristic information corresponding to the voice signal may include:
  • Step S301 Determine the voice feature information corresponding to the voice signal.
  • the voice feature information includes at least one of the following: a search term feature, a keyword feature, and a value feature.
  • Step S302 Process the voice feature information based on the self-attention mechanism to obtain first feature information.
  • the voice signal can be converted and processed, so that the voice feature information corresponding to the voice signal can be obtained.
  • specifically, the voice feature information may include at least one of the following: search term features (query), keyword features (key), and value features (value). It is understandable that when different voice feature information is acquired, the process of converting the voice signal is also different.
  • when the voice feature information includes only the search term feature, the step of obtaining the voice feature information may include: obtaining first conversion information corresponding to the search term feature, where the first conversion information may be a conversion matrix, and using the first conversion information to perform conversion processing on the speech signal, so that the search term feature can be obtained.
  • when the voice feature information includes the search term feature and the keyword feature, the step of obtaining the voice feature information may include: respectively obtaining first conversion information corresponding to the search term feature and second conversion information corresponding to the keyword feature. The above-mentioned first conversion information and second conversion information can both be conversion matrices; it should be noted that the first conversion information is different from the second conversion information. The first conversion information is then used to perform conversion processing on the speech signal to obtain the search term feature, and the second conversion information is used to perform conversion processing on the voice signal to obtain the keyword feature.
  • when the voice feature information includes the search term feature, the keyword feature, and the value feature, the step of acquiring the voice feature information may include: respectively acquiring first conversion information corresponding to the search term feature, second conversion information corresponding to the keyword feature, and third conversion information corresponding to the value feature. The above-mentioned first conversion information, second conversion information, and third conversion information can all be conversion matrices; it should be noted that they are different from one another. The first conversion information is then used to convert the voice signal to obtain the search term feature, the second conversion information is used to convert the voice signal to obtain the keyword feature, and the third conversion information is used to convert the voice signal to obtain the value feature.
  • after the voice feature information is obtained, the self-attention mechanism can be used to process the voice feature information, so that the first feature information used to identify the semantics in the voice signal can be obtained. It is understandable that the more feature information the voice feature information includes, the better the quality and efficiency of obtaining the first feature information.
  • in this embodiment, by determining the voice feature information corresponding to the voice signal and then processing it based on the self-attention mechanism, the first feature information can be obtained accurately and effectively; moreover, because the voice feature information can include at least one of the search term feature, the keyword feature, and the value feature, the ways of obtaining the first feature information are effectively increased, improving the flexibility and reliability of the method.
  • Figure 4 is a schematic diagram of the process of processing speech feature information based on the self-attention mechanism to obtain first feature information according to an embodiment of the present invention
  • the specific implementation method for obtaining the first feature information is not limited, and those skilled in the art can set it according to specific application requirements and design requirements.
  • the voice feature information includes: search term feature, keyword feature, and value feature;
  • the voice feature information is processed based on the self-attention mechanism, and obtaining the first feature information may include:
  • Step S401 Obtain fusion conversion information corresponding to the search term feature, the keyword feature, and the value feature, where the fusion conversion information includes the conversion information corresponding to the search term feature, the conversion information corresponding to the keyword feature, and the conversion information corresponding to the value feature.
  • obtaining the fusion conversion information corresponding to the search term feature, keyword feature, and value feature may include:
  • Step S4011 Obtain the first conversion information, the second conversion information, and the third conversion information corresponding to the search term feature, the keyword feature, and the value feature, respectively;
  • Step S4012 Perform splicing processing on the first conversion information, the second conversion information, and the third conversion information to obtain the fusion conversion information.
  • specifically, the first conversion information, the second conversion information, and the third conversion information can be determined based on the voice signal; the first conversion information is used to perform conversion processing on the voice signal to obtain the search term feature, the second conversion information is used to perform conversion processing on the voice signal to obtain the keyword feature, and the third conversion information is used to perform conversion processing on the voice signal to obtain the value feature.
  • the voice signal can be processed using a preset voice recognition algorithm or voice recognition model, so that the search term feature, keyword feature, and value feature corresponding to the voice signal can be obtained.
  • the aforementioned speech recognition algorithm or speech recognition model includes first conversion information, second conversion information, and third conversion information corresponding to the search term feature, keyword feature, and value feature, respectively.
  • the fusion conversion information includes the above three pieces of conversion information.
  • for example, suppose the voice signal is I, the search term feature is Q, the keyword feature is K, the value feature is V, the first conversion information is the conversion matrix W_Q, the second conversion information is the conversion matrix W_K, and the third conversion information is the conversion matrix W_V; then:

    Q = W_Q * I,  K = W_K * I,  V = W_V * I

  • after that, the conversion matrix W_Q, the conversion matrix W_K, and the conversion matrix W_V can be spliced together to obtain the fusion conversion information W_O; the fusion conversion information is also matrix information.
  • Step S402 Use the self-attention mechanism to process the search term feature, keyword feature, and value feature, and determine the number of attention mechanisms corresponding to the speech signal.
  • for different application scenarios, the number of attention mechanisms can differ, for example: in relatively simple application scenarios, the number of attention mechanisms can be smaller; in more complex application scenarios, the number of attention mechanisms can be larger.
  • the self-attention mechanism can be used to process the above-mentioned features, so that the number of attention mechanisms corresponding to the speech signal can be determined.
  • using the self-attention mechanism to process search term features, keyword features, and value features, and determining the number of attention mechanisms corresponding to the speech signal may include:
  • Step S4021 Use the following formula to obtain the attention mechanisms corresponding to the speech signal:

    head_i = Attention(Q * W_i^Q, K * W_i^K, V * W_i^V)

    where head_i is the i-th attention mechanism, Attention is the self-attention mechanism, Q is the search term feature, K is the keyword feature, V is the value feature, and W_i^Q, W_i^K, and W_i^V are the conversion matrices associated with the i-th attention mechanism.
  • the number of attention mechanisms corresponding to the voice signal can be quickly and effectively determined by the above formula, which facilitates the rapid and accurate analysis and processing of the voice signal based on the number of attention mechanisms.
  • Step S403 According to the number of attention mechanisms and the fusion conversion information, first feature information corresponding to the speech signal is obtained.
  • the number of attention mechanisms and the fusion conversion information described above can be analyzed and processed to determine the first characteristic information corresponding to the speech signal. Specifically, referring to FIG. 6, according to the number of attention mechanisms and the fusion conversion information, the first characteristic information corresponding to the speech signal is obtained, including:
  • Step S4031 Use the connection function to combine all of the attention mechanisms to obtain combined information corresponding to the attention mechanisms, where the connection function is used to concatenate character strings;
  • Step S4032 Determine the product of the combined information and the fusion conversion information as the first feature information corresponding to the speech signal.
  • the concatenation function concat is used to combine text from multiple regions and/or character strings. After the attention mechanisms are obtained, the connection function can be used to combine all of them to obtain the combined information corresponding to the attention mechanisms.
  • in this embodiment, the fusion conversion information corresponding to the search term feature, the keyword feature, and the value feature is obtained and the attention mechanisms corresponding to the speech signal are determined; the first feature information corresponding to the voice signal is then obtained according to the attention mechanisms and the fusion conversion information, which effectively guarantees the accuracy and reliability of acquiring the first feature information and further improves the quality and efficiency of recognizing the voice signal.
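  • the following Python sketch ties steps S4021 to S4032 together: scaled dot-product attention per attention mechanism, concatenation with concat, and a final product with the fusion conversion information W_O. Splitting Q, K, and V by slicing and drawing W_O at random are simplifying assumptions made for the example; the patent obtains W_O by splicing the three conversion matrices:

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
        d_k = Q.shape[-1]
        return softmax(Q @ K.T / np.sqrt(d_k)) @ V

    def multi_head(Q, K, V, h, W_O):
        # h attention mechanisms (head_i), concatenated and projected by W_O.
        d = Q.shape[-1] // h
        heads = [attention(Q[:, i*d:(i+1)*d], K[:, i*d:(i+1)*d],
                           V[:, i*d:(i+1)*d]) for i in range(h)]
        combined = np.concatenate(heads, axis=-1)   # the concat step (S4031)
        return combined @ W_O                       # first feature information (S4032)

    T, d_model, h = 4, 8, 2
    Q, K, V = (np.random.randn(T, d_model) for _ in range(3))
    W_O = np.random.randn(d_model, d_model)         # fusion conversion information
    c = multi_head(Q, K, V, h, W_O)                 # shape (T, d_model)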
  • FIG. 7 is a schematic diagram of a process of processing a voice signal using a second neural network to obtain second characteristic information corresponding to the voice signal according to an embodiment of the present invention; on the basis of the above embodiment, continue to refer to FIG. 7
  • using the second neural network to process the voice signal to obtain the second characteristic information corresponding to the voice signal includes:
  • Step S701 Determine the value feature corresponding to the voice signal.
  • Step S702 Use the static memory neural network to process the value feature to obtain second feature information.
  • after the voice signal is acquired, it can be converted so that the value feature (V) corresponding to the voice signal can be obtained. Specifically, the conversion information corresponding to the voice signal (the aforementioned third conversion information W_V) is determined first, and the conversion information is then used to convert the voice signal, so that the value feature can be obtained.
  • the static memory neural network can be used to process the value feature to obtain the second feature information.
  • specifically, using the static memory neural network to process the value feature to obtain the second feature information can include:
  • Step S7021 Obtain filter parameters corresponding to the static memory neural network.
  • Step S7022 Determine the characterization information corresponding to the value feature.
  • Step S7023 Analyze and process the characterization information by using the static memory neural network and filter parameters to obtain second feature information corresponding to the voice signal.
  • a set of initial filter parameters can be pre-configured to use the initial filter parameters to implement data processing.
  • the corresponding filter parameters can be learnable or trainable; that is, as the static memory neural network continuously learns from and optimizes over the data, the filter parameters can change.
  • the value feature can be analyzed and processed to determine the characterization information corresponding to the value feature. After the filter parameters and the characterization information are obtained, the static memory neural network and the filter parameters can be used to analyze and process the characterization information to obtain the second feature information corresponding to the voice signal. Specifically, this can include:
  • Step S70231 Use the following formula to obtain the second feature information corresponding to the voice signal:

    m_t = h_t + Σ_i a_i ⊙ h_{t - s1*i} + Σ_j b_j ⊙ h_{t + s2*j}

    where m_t is the second feature information, h_t is the characterization information of the value feature at time t, a_i and b_j are learnable filter parameters, ⊙ is the dot product, h_{t - s1*i} and h_{t + s2*j} are the characterization information of the value feature at times t - s1*i and t + s2*j, s1 and s2 are preset stride factors, and i and j are accumulation index parameters.
  • in this embodiment, since the second feature information is acquired differently from the first feature information, analyzing the obtained second feature information together with the first feature information can effectively improve the accuracy of acquiring the target feature information, which further improves the quality and efficiency of speech signal recognition.
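  • the following Python sketch implements the static memory computation as reconstructed above, with two look-back and two look-ahead filter taps; the tap counts, stride values, and exact formula layout are assumptions based on the FSMN family of models, since the patent's formula image is not reproduced in this text:

    import numpy as np

    def static_memory(h, a, b, s1=1, s2=1):
        # m_t = h_t + sum_i a_i * h_{t-s1*i} + sum_j b_j * h_{t+s2*j}
        # h: (T, d) characterization of the value feature over time.
        # a: (N1, d) learnable look-back filter parameters.
        # b: (N2, d) learnable look-ahead filter parameters.
        # s1, s2: preset stride factors; '*' is the element-wise product.
        T, _ = h.shape
        m = h.copy()
        for t in range(T):
            for i, a_i in enumerate(a, start=1):      # history taps
                if t - s1 * i >= 0:
                    m[t] += a_i * h[t - s1 * i]
            for j, b_j in enumerate(b, start=1):      # future taps
                if t + s2 * j < T:
                    m[t] += b_j * h[t + s2 * j]
        return m

    T, d = 6, 4
    h = np.random.randn(T, d)      # characterization information of the value feature
    a = np.random.randn(2, d)      # 2 look-back taps (learned during training)
    b = np.random.randn(2, d)      # 2 look-ahead taps
    m = static_memory(h, a, b)     # second feature information, shape (T, d)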
  • the method in this embodiment may further include:
  • Step S901 Send the target characteristic information to the decoder, so that the decoder analyzes and processes the target characteristic information to obtain text information corresponding to the voice signal.
  • in this embodiment, the speech processing device may be a speech encoder, which can encode the acquired speech signal to obtain the target feature information used to identify the semantics in the speech signal. In order to realize the analysis and recognition of the speech signal, the target feature information can be sent to the decoder, so that after obtaining the target feature information, the decoder can analyze and process it to obtain the text information corresponding to the voice signal, enabling the machine to recognize the text information corresponding to the voice signal.
  • in a specific application embodiment, the execution body of the voice processing method may be a voice encoder based on a Dynamic and Static Memory Network (DSMN for short); the dynamic and static memory neural network combines a dynamic self-attention mechanism with a static memory neural network, so that the DSMN has stronger speech recognition ability than the existing Transformer model and DFSMN model, and a speech encoder or speech processing system based on the DSMN model can obtain better recognition performance.
  • the voice processing method may include the following steps:
  • Step 1 Obtain the voice signal input by the user.
  • the voice information can be processed (for example: framing processing, filtering processing, noise reduction processing, etc.), so as to obtain the voice signal input by the user.
  • specifically, the voice signal may be a voice acoustic feature sequence; it can be understood that the voice acoustic feature sequence includes a feature sequence for representing semantic information and a feature sequence for identifying user characteristics.
  • Step 2 Determine the voice features (the search term feature, the keyword feature, and the value feature) corresponding to the voice signal.
  • Step 3 Use the self-attention mechanism to analyze and process the search term features, keyword features, and value features to obtain the first feature information used to identify the semantics of the speech signal.
  • specifically, the first feature information can be obtained with the following formula:

    Attention(Q, K, V) = softmax(Q * K^T / √d_k) * V

    where Attention is the self-attention mechanism, Q is the search term feature, K is the keyword feature, V is the value feature, softmax is the row-wise normalized exponential function, K^T is the transpose of the keyword feature, and d_k is a preset dimension parameter.
  • Step 4 Use the self-attention mechanism to process the search term features, keyword features, and value features to determine the number of attention mechanisms corresponding to the speech signal.
  • specifically:

    head_i = Attention(Q * W_i^Q, K * W_i^K, V * W_i^V)

    where head_i is the i-th attention mechanism, Attention is the self-attention mechanism, Q is the search term feature, K is the keyword feature, and V is the value feature.
  • Step 5 Determine the first feature information corresponding to the speech signal according to the number of attention mechanisms.
  • specifically, the fusion conversion information W_O corresponding to the search term feature, the keyword feature, and the value feature is acquired, where the fusion conversion information W_O includes the conversion information W_Q corresponding to the search term feature, the conversion information W_K corresponding to the keyword feature, and the conversion information W_V corresponding to the value feature. Then, according to the attention mechanisms and the fusion conversion information, the first feature information corresponding to the speech signal is obtained with the following formula:

    c_t = concat(head_1, ..., head_h) * W_O

    where c_t is the first feature information, concat() is the connection function, head_1 is the first attention mechanism, head_h is the h-th attention mechanism, and W_O is the fusion conversion information.
  • Step 6 Use the static memory neural network and filter parameters to analyze and process the value features to obtain the second feature information corresponding to the speech signal.
  • specifically, the characterization information corresponding to the value feature is obtained, and the static memory neural network processes the characterization information according to the following formula to obtain the second feature information:

    m_t = h_t + Σ_i a_i ⊙ h_{t - s1*i} + Σ_j b_j ⊙ h_{t + s2*j}

    where m_t is the second feature information, h_t is the characterization information of the value feature at time t, a_i and b_j are learnable filter parameters, ⊙ is the dot product, h_{t - s1*i} and h_{t + s2*j} are the characterization information of the value feature at times t - s1*i and t + s2*j, s1 and s2 are preset stride factors, and i and j are accumulation index parameters.
  • Step 7 Determine the sum of the first feature information and the second feature information as the target feature information, and output the target feature information to the speech decoder.
  • the target feature information is used to identify the semantic information included in the voice signal.
  • the target characteristic information can be output to the speech decoder, so that the speech decoder can perform a speech recognition operation based on the target characteristic information.
  • the speech processing method provided by this application embodiment processes the input speech signal through the dynamic and static memory neural network to obtain the target feature information used to identify the semantics in the speech signal, so that the speech signal can be further processed based on the obtained target feature information (for example: speech recognition processing, speech synthesis processing, and so on). Because the target feature information is obtained through two neural networks with complementary performance, the quality of the target feature information is effectively guaranteed, which effectively improves the quality and efficiency of speech signal processing and further improves the stability and reliability of the method.
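  • reusing the multi_head() and static_memory() functions sketched earlier, one hypothetical DSMN encoder step covering steps 2 to 7 could look as follows; the wiring (feeding the value feature V to the static branch and summing the two branches) follows the description above, while every shape and initialisation is an assumption:

    import numpy as np
    # Assumes multi_head() and static_memory() as defined in the earlier sketches.

    def dsmn_encoder_step(I, W_Q, W_K, W_V, W_O, a, b, h=2):
        # One dynamic-and-static-memory (DSMN) encoder step.
        # Dynamic branch: multi-head self-attention over Q/K/V (steps 3-5).
        # Static branch:  FSMN-style memory over the value feature V (step 6).
        # Output:         element-wise sum of the two branches (step 7).
        Q, K, V = I @ W_Q, I @ W_K, I @ W_V      # step 2: voice features
        first = multi_head(Q, K, V, h, W_O)      # first feature information
        second = static_memory(V, a, b)          # second feature information
        return first + second                    # target feature information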
  • Fig. 10 is a schematic flowchart of another voice processing method provided by an embodiment of the present invention
  • Fig. 11 is a schematic diagram of another voice processing method provided by an embodiment of the present invention
  • a voice processing method is provided.
  • the execution body of the method may be a voice processing device. It is understandable that the voice processing device may be implemented as software or a combination of software and hardware.
  • specifically, the voice processing device may be a voice decoder, which may be communicatively connected to the voice encoder and is used to receive the voice feature signal sent by the voice encoder and process it to obtain the text information corresponding to the voice feature signal.
  • the voice processing method may include:
  • Step S1001 Receive target feature information sent by the encoder, where the target feature information corresponds to a voice signal.
  • Step S1002 Obtain historical prediction information.
  • Step S1003 Use the multi-head attention mechanism and historical prediction information to process the target feature information to obtain text information corresponding to the voice signal.
  • the historical prediction information may be a voice recognition result obtained by performing a voice recognition operation at a historical moment. It is understandable that at the initial moment, the historical prediction information may be blank information.
  • the voice decoder can obtain the target feature information sent by the voice encoder, and the target feature information is used to identify semantic information in the voice signal.
  • after the voice decoder obtains the target feature information, it can obtain the historical prediction information, which can be stored in a preset area, and then use the multi-head attention mechanism and the historical prediction information to process the target feature information to obtain the text information corresponding to the voice signal.
  • for example, assume the semantics corresponding to the current speech signal is "you are beautiful". The historical prediction information can include: the probability P1 that the output following the voice signal "you" is "men", the probability P2 that it is "good", the probability P3 that it is "in", the probability P4 that it is "are", and so on; that is, the historical prediction information includes information corresponding to the following candidate semantics: "you", "hello", "you are", and "you are in".
  • when there is only one piece of candidate semantic text information, it can be directly determined as the final text information; when there are multiple pieces of candidate semantic text information, the probability information corresponding to each piece can be obtained, and the semantic text information with the largest probability is determined as the final text information corresponding to the speech signal, which can effectively improve the accuracy and reliability of speech signal recognition.
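  • the following Python sketch illustrates the decoding loop described in this section: the decoder consumes the target feature information together with its own historical prediction information and keeps the hypothesis with the largest probability at each step. step_fn stands in for the multi-head-attention decoder step, whose internals this section does not spell out; all names are illustrative:

    import numpy as np

    def greedy_decode(target_features, step_fn, bos_id, eos_id, max_len=50):
        # step_fn(target_features, history) -> probability vector over tokens;
        # `history` holds the previously emitted tokens (the historical
        # prediction information, empty except for the start symbol initially).
        history = [bos_id]
        while len(history) < max_len:
            probs = step_fn(target_features, history)
            token = int(np.argmax(probs))    # keep the most probable hypothesis
            if token == eos_id:
                break
            history.append(token)
        return history[1:]                   # token ids of the recognised text

    # Toy usage with a dummy decoder step that emits token 2 twice, then eos (0).
    dummy = lambda feats, hist: np.eye(4)[2] if len(hist) < 3 else np.eye(4)[0]
    print(greedy_decode(target_features=None, step_fn=dummy, bos_id=1, eos_id=0))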
  • FIG. 12 is a schematic diagram of yet another voice processing method provided by an embodiment of the present invention. Referring to FIG. 12, this embodiment provides yet another voice processing method; the execution body of the method may be a voice processing device, and it can be understood that the voice processing device can be implemented as software or a combination of software and hardware.
  • the speech processing device may be a speech encoder, which can process the speech signal to obtain characteristic information used to characterize the semantics of the speech signal.
  • the voice processing method may include:
  • Step S1201 Obtain a voice signal to be processed.
  • Step S1202 Use the first neural network and the second neural network to process the speech signal to obtain the first characteristic information and the second characteristic information corresponding to the speech signal.
  • the calculation efficiency of the first neural network is higher than that of the second neural network, and the accuracy of the second feature information output by the second neural network is higher than the accuracy of the first feature information output by the first neural network.
  • Step S1203 According to the first feature information and the second feature information, determine target feature information used to characterize the semantics in the voice signal.
  • in this embodiment, the first neural network may include any one of the following: a self-attention mechanism or a static memory neural network (Static Memory Network, SMN for short), and the second neural network may include any one of the following: a self-attention mechanism or a static memory neural network.
  • the computational efficiency of the first neural network is higher than that of the second neural network, and the accuracy of the second feature information output by the second neural network is higher than the accuracy of the first feature information output by the first neural network.
  • the above-mentioned first neural network and the second neural network have their own advantages, that is, the first neural network has advantages in terms of computational efficiency, and the second neural network has advantages in terms of the accuracy of the output characteristic information.
  • of course, the first neural network is not limited to the types of networks exemplified above; those skilled in the art can also set the first neural network to other types of neural networks according to specific application requirements and design requirements, as long as the first neural network can process the speech signal to obtain feature information used to identify the semantics in the speech signal, which will not be repeated here.
  • likewise, the second neural network is not limited to the types of networks exemplified above; those skilled in the art can also set the second neural network to other types of neural networks according to specific application requirements and design requirements, as long as it can be guaranteed that the second neural network is different from the first neural network and can process the speech signal to obtain the characteristic information used to identify the semantics in the speech signal, which will not be repeated here.
  • the first neural network and the second neural network may not be limited to the implementations defined in the above embodiments.
  • for example, the calculation efficiency of the second neural network may be higher than that of the first neural network, while the accuracy of the first feature information output by the first neural network is higher than the accuracy of the second feature information output by the second neural network.
  • at this time, different neural networks can be selected according to different application scenarios. For example, in application scenarios that need to ensure computing efficiency, the first neural network can be selected to process the voice information; in application scenarios that prioritize the accuracy of the feature information, the second neural network can be selected to process the voice information.
  • in addition, different combinations of the first neural network and the second neural network can be selected according to different application scenarios, so that users can choose different neural network combinations to suit their scenarios, which further improves the flexibility and reliability of the method.
  • the voice signal includes the first signal for identifying the semantics of the voice and the second signal for identifying the user characteristics
  • the voice signal can be processed by using the second neural network, so that second feature information corresponding to the voice signal can be obtained, and the second feature information can be used to identify the semantics included in the voice signal. Since the second neural network is different from the first neural network, the second feature information obtained through the second neural network and the first feature information obtained through the first neural network are complementary in the quality and efficiency of speech recognition.
  • the first feature information and the second feature information can be analyzed and processed to determine the target feature information used to characterize the semantics in the speech signal. Since the first neural network and the second neural network are different, the efficiency and quality of speech signal processing using them are complementary. After acquiring the first feature information and the second feature information, the sum of the complementary first feature information and second feature information is determined as the target feature information; because the target feature information fuses the first feature information and the second feature information, the quality and efficiency of recognizing voice signals are effectively improved.
  • the voice processing method provided in this embodiment uses the first neural network to process the acquired voice signal to obtain the first feature information, and uses the second neural network to process the voice signal to obtain the second feature information. Since the first neural network and the second neural network are different, the obtained first feature information and second feature information are complementary in the efficiency and quality of speech processing; determining the target feature information used to characterize the semantics in the speech signal according to them effectively guarantees the quality of obtaining the target feature information, further improves the quality and efficiency of speech signal processing, and ensures the practicability of the method.
  • further, using the first neural network to process the voice signal to obtain the first feature information corresponding to the voice signal may include: determining the voice feature information corresponding to the voice signal, where the voice feature information includes at least one of the following: search term features, keyword features, and value features; and processing the voice feature information based on the self-attention mechanism to obtain the first feature information.
  • further, when the voice feature information includes the search term feature, the keyword feature, and the value feature, processing the voice feature information based on the self-attention mechanism to obtain the first feature information may include: acquiring the fusion conversion information corresponding to the search term feature, the keyword feature, and the value feature, where the fusion conversion information includes the conversion information corresponding to the search term feature, the conversion information corresponding to the keyword feature, and the conversion information corresponding to the value feature; using the self-attention mechanism to process the search term feature, the keyword feature, and the value feature to determine the attention mechanisms corresponding to the voice signal; and obtaining the first feature information corresponding to the voice signal according to the attention mechanisms and the fusion conversion information.
  • further, obtaining the first feature information corresponding to the speech signal may include: using the connection function to combine all of the attention mechanisms to obtain the combined information corresponding to the attention mechanisms, where the connection function is used to concatenate character strings; and determining the product of the combined information and the fusion conversion information as the first feature information corresponding to the speech signal.
  • acquiring the fusion conversion information corresponding to the query, key, and value features may include: obtaining the first conversion information, the second conversion information, and the third conversion information corresponding to the query feature, the key feature, and the value feature, respectively; and splicing the first conversion information, the second conversion information, and the third conversion information to obtain the fusion conversion information.
  • using the second neural network to process the voice signal to obtain the second feature information corresponding to the voice signal may include: determining the value feature corresponding to the voice signal; and using the static memory neural network to process the value feature to obtain the second feature information.
  • using the static memory neural network to process the value feature to obtain the second feature information may include: obtaining filter parameters corresponding to the static memory neural network; determining the characterization information corresponding to the value feature; and analyzing the characterization information using the static memory neural network and the filter parameters to obtain the second feature information corresponding to the voice signal.
  • according to the first feature information and the second feature information, determining the target feature information used to characterize the semantics in the speech signal may include: determining the sum of the first feature information and the second feature information as the target feature information, as illustrated in the sketch below.
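A toy sketch of this fusion rule, with the two branches stubbed out; the point is only that the two complementary outputs are summed element-wise, so they must share a shape. The stub functions and constants are assumptions made for illustration.

```python
import numpy as np

def attention_branch(x):   # stand-in for the self-attention path sketched above
    return 0.5 * x

def memory_branch(x):      # stand-in for the static-memory path sketched further below
    return 0.25 * x

x = np.ones((8, 16))                          # encoded voice features (illustrative)
y = attention_branch(x) + memory_branch(x)    # target feature information y_t = c_t + m_t
print(float(y[0, 0]))                         # 0.75
```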
  • the method in this embodiment may further include: sending the target feature information to the decoder, so that the decoder can analyze the target feature information to obtain the text information corresponding to the voice signal.
  • the execution process, implementation manner, and technical effects of the method in this embodiment are similar to those of the methods in the embodiments shown in FIG. 1 to FIG. 11. For the parts not described in detail in this embodiment, refer to the related description of the embodiments shown in FIG. 1 to FIG. 9, which will not be repeated here.
  • Fig. 13 is a schematic structural diagram of a speech encoder provided by an embodiment of the present invention. Referring to Fig. 13, this embodiment provides a speech encoder that can perform the speech processing method shown in Fig. 1 above. The speech encoder may include: a first acquiring unit 11, a first processing unit 12, and a first determining unit 13. Specifically,
  • the first acquiring unit 11 is configured to acquire a voice signal to be processed
  • the first processing unit 12 is configured to process the voice signal by using the first neural network to obtain first feature information corresponding to the voice signal, and the first feature information is used to identify the semantics in the voice signal;
  • the first processing unit 12 is also used to process the voice signal with the second neural network to obtain second feature information corresponding to the voice signal, the second feature information being used to identify the semantics in the voice signal, where the second feature information is different from the first feature information;
  • the first determining unit 13 is configured to determine target feature information used to characterize the semantics of the speech signal according to the first feature information and the second feature information.
  • the first neural network includes a self-attention mechanism; the second neural network includes a static memory neural network.
  • when the first processing unit 12 uses the first neural network to process the voice signal to obtain the first feature information corresponding to the voice signal, the first processing unit 12 may be used to: determine the voice feature information corresponding to the voice signal, where the voice feature information includes at least one of the query feature, the key feature, and the value feature; and process the voice feature information based on the self-attention mechanism to obtain the first feature information.
  • when the voice feature information includes the query feature, the key feature, and the value feature, and the first processing unit 12 processes the voice feature information based on the self-attention mechanism to obtain the first feature information, the first processing unit 12 may be used to: acquire the fusion conversion information corresponding to the query, key, and value features, where the fusion conversion information includes the conversion information corresponding to the query feature, the conversion information corresponding to the key feature, and the conversion information corresponding to the value feature; use the self-attention mechanism to process the query, key, and value features to determine the number of attention heads corresponding to the speech signal; and obtain the first feature information corresponding to the voice signal according to the number of attention heads and the fusion conversion information.
  • when the first processing unit 12 obtains the first feature information corresponding to the speech signal according to the number of attention heads and the fusion conversion information, the first processing unit 12 may be used to: combine all of the attention heads with a connection function to obtain the combined information corresponding to the attention heads, where the connection function is used to concatenate character strings; and determine the product of the combined information and the fusion conversion information as the first feature information corresponding to the speech signal.
  • when the first processing unit 12 acquires the fusion conversion information corresponding to the query, key, and value features, the first processing unit 12 may be used to: obtain the first conversion information, the second conversion information, and the third conversion information corresponding to the query feature, the key feature, and the value feature, respectively; and splice the first conversion information, the second conversion information, and the third conversion information to obtain the fusion conversion information.
  • when the first processing unit 12 uses the self-attention mechanism to process the query, key, and value features to determine the number of attention heads corresponding to the speech signal, the first processing unit 12 may be used to compute each head with the following formula:

    head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)

    where head_i is the i-th attention head, Attention is the self-attention function, Q is the query feature, K is the key feature, V is the value feature, and W_i^Q, W_i^K, and W_i^V are the first, second, and third conversion information corresponding to the i-th head, respectively.
  • when the first processing unit 12 uses the second neural network to process the voice signal to obtain the second feature information corresponding to the voice signal, the first processing unit 12 may be used to: determine the value feature corresponding to the voice signal; and use the static memory neural network to process the value feature to obtain the second feature information.
  • when the first processing unit 12 uses the static memory neural network to process the value feature to obtain the second feature information, the first processing unit 12 may be used to: obtain the filter parameters corresponding to the static memory neural network; determine the characterization information corresponding to the value feature; and analyze the characterization information using the static memory neural network and the filter parameters to obtain the second feature information corresponding to the voice signal.
  • when the first processing unit 12 analyzes the characterization information using the static memory neural network and the filter parameters to obtain the second feature information corresponding to the speech signal, the first processing unit 12 may be used to compute the second feature information with the following formula:

    m_t = h_t + Σ_i α_t ⊙ h_{t−s_1·i} + Σ_j b_t ⊙ h_{t−s_2·j}

    where m_t is the second feature information, h_t is the characterization information of the value feature at time t, α_t and b_t are learnable filter parameters, ⊙ is the element-wise (dot) product, h_{t−s_1·i} and h_{t−s_2·j} are the characterization information of the value feature at times t−s_1·i and t−s_2·j, s_1 and s_2 are preset stride factors, and i and j are accumulated index parameters.
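A sketch of the memory computation just given, with both groups of taps drawn from past frames as the formula states; the tap counts N1 and N2, the per-tap shape of the filter parameters, and zero-padding at the sequence start are assumptions made so the example runs.

```python
import numpy as np

def static_memory(H, alpha, b, s1=1, s2=2, N1=4, N2=4):
    """m_t = h_t + sum_i alpha ⊙ h[t - s1*i] + sum_j b ⊙ h[t - s2*j], zero-padded."""
    M = H.copy()
    for t in range(len(H)):
        for i in range(1, N1 + 1):       # taps at stride s1
            if t - s1 * i >= 0:
                M[t] += alpha[i - 1] * H[t - s1 * i]
        for j in range(1, N2 + 1):       # taps at stride s2
            if t - s2 * j >= 0:
                M[t] += b[j - 1] * H[t - s2 * j]
    return M

rng = np.random.default_rng(2)
H = rng.normal(size=(8, 16))             # characterization info of the value feature
alpha = rng.normal(size=(4, 16))         # learnable filter parameters, one per tap
b = rng.normal(size=(4, 16))
m_t = static_memory(H, alpha, b)         # second feature information
print(m_t.shape)                         # (8, 16)
```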
  • when the first determining unit 13 determines the target feature information used to characterize the semantics of the speech signal according to the first feature information and the second feature information, the first determining unit 13 may be used to: determine the sum of the first feature information and the second feature information as the target feature information.
  • after the target feature information is determined, the first processing unit 12 in this embodiment may also be used to: send the target feature information to the decoder, so that the decoder can analyze the target feature information to obtain the text information corresponding to the voice signal.
  • the device shown in FIG. 13 can execute the methods of the embodiments shown in FIG. 1 to FIG. 9. For the parts not described in detail in this embodiment, refer to the related description of the embodiments shown in FIG. 1 to FIG. 9.
  • in one possible design, the structure of the speech encoder shown in FIG. 13 can be implemented as an electronic device, which can be any of various devices such as a mobile phone, a tablet computer, or a server. As shown in FIG. 14, the electronic device may include: a first processor 21 and a first memory 22, where the first memory 22 is used to store a program by which the electronic device executes the voice processing method provided in the embodiments shown in FIG. 1 to FIG. 9, and the first processor 21 is configured to execute the program stored in the first memory 22.
  • the program includes one or more computer instructions, and when the one or more computer instructions are executed by the first processor 21, the following steps can be implemented: acquiring a voice signal to be processed; processing the voice signal with the first neural network to obtain first feature information corresponding to the voice signal, the first feature information being used to identify the semantics in the voice signal; processing the voice signal with the second neural network to obtain second feature information corresponding to the voice signal, the second feature information being used to identify the semantics in the voice signal, where the second feature information is different from the first feature information; and determining, according to the first feature information and the second feature information, the target feature information used to characterize the semantics of the speech signal.
  • the first processor 21 is also configured to execute all or part of the steps in the embodiment shown in FIG. 1 to FIG. 9 above.
  • the structure of the electronic device may also include a first communication interface 23 for the electronic device to communicate with other devices or a communication network.
  • an embodiment of the present invention provides a computer storage medium for storing computer software instructions used by an electronic device, which includes a program for executing the voice processing method in the method embodiment shown in FIGS. 1-9.
  • FIG. 15 is a schematic structural diagram of a speech decoder provided by an embodiment of the present invention. Referring to FIG. 15, this embodiment provides a speech decoder that can perform the speech processing method shown in FIG. 10 above. The speech decoder may include: a second receiving module 31, a second acquiring module 32, and a second processing module 33. Specifically,
  • the second receiving module 31 is configured to receive target feature information sent by the encoder, where the target feature information corresponds to a voice signal;
  • the second acquiring module 32 is used to acquire historical prediction information;
  • the second processing module 33 is used to process the target feature information using the multi-head attention mechanism and the historical prediction information, to obtain the text information corresponding to the voice signal. A sketch of this decoding loop follows.
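To make the decoder's use of historical prediction information concrete, the sketch below runs a greedy decoding loop in which each step conditions on the encoder's target feature information plus all previously predicted tokens. The attend-and-predict internals are stubbed out, and the 5-symbol vocabulary and begin-of-sentence convention are illustrative assumptions; the embodiment only fixes that multi-head attention combines the two inputs.

```python
import numpy as np

def decode_step(target_features, history_ids):
    """Stub for: embed history -> unidirectional memory -> multi-head cross-attention.
    Returns a distribution over an illustrative 5-symbol vocabulary."""
    rng = np.random.default_rng(len(history_ids))     # deterministic stand-in
    logits = rng.normal(size=5) + target_features.mean()
    e = np.exp(logits - logits.max())
    return e / e.sum()

target_features = 0.1 * np.ones((8, 16))   # target feature information from the encoder
history = [0]                              # token 0 = begin-of-sentence (assumed)
for _ in range(4):                         # greedily predict a few output tokens
    probs = decode_step(target_features, history)
    history.append(int(np.argmax(probs)))  # most probable symbol extends the history
print(history[1:])
```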
  • the device shown in FIG. 15 can execute the methods of the embodiments shown in FIG. 10 and FIG. 11. For the parts not described in detail in this embodiment, refer to the related description of the embodiments shown in FIG. 10 and FIG. 11. For the execution process and technical effects of this technical solution, refer to the description in the embodiments shown in FIG. 10 to FIG. 11, which will not be repeated here.
  • in one possible design, the structure of the speech decoder shown in FIG. 15 can be implemented as an electronic device, which can be any of various devices such as a mobile phone, a tablet computer, or a server. As shown in FIG. 16, the electronic device may include: a second processor 41 and a second memory 42, where the second memory 42 is used to store a program by which the electronic device executes the voice processing method provided in the embodiments shown in FIG. 10 to FIG. 11, and the second processor 41 is configured to execute the program stored in the second memory 42.
  • the program includes one or more computer instructions, and when the one or more computer instructions are executed by the second processor 41, the following steps can be implemented: receiving the target feature information sent by the encoder, the target feature information corresponding to a voice signal; acquiring historical prediction information; and processing the target feature information using the multi-head attention mechanism and the historical prediction information to obtain the text information corresponding to the speech signal.
  • the second processor 41 is also configured to execute all or part of the steps in the embodiment shown in FIG. 10 to FIG. 11.
  • the structure of the electronic device may further include a second communication interface 43 for the electronic device to communicate with other devices or a communication network.
  • an embodiment of the present invention provides a computer storage medium for storing computer software instructions used by an electronic device, which includes a program for executing the voice processing method in the method embodiment shown in FIGS. 10-11.
  • FIG. 17 is a schematic structural diagram of another speech encoder provided by an embodiment of the present invention. Referring to FIG. 17, this embodiment provides another speech encoder, which can perform the speech processing method shown in FIG. 12 above. The speech encoder may include: a third acquisition module 51, a third processing module 52, and a third determining module 53. Specifically,
  • the third acquisition module 51 is configured to acquire the voice signal to be processed
  • the third processing module 52 is used to process the voice signal with the first neural network and the second neural network respectively, to obtain the first feature information and the second feature information corresponding to the voice signal, where the computational efficiency of the first neural network is higher than that of the second neural network, and the accuracy of the second feature information output by the second neural network is higher than the accuracy of the first feature information output by the first neural network;
  • the third determining module 53 is configured to determine the target feature information used to represent the semantics in the voice signal according to the first feature information and the second feature information.
  • the first neural network includes a self-attention mechanism; the second neural network includes a static memory neural network.
  • when the third processing module 52 uses the first neural network to process the voice signal to obtain the first feature information corresponding to the voice signal, the third processing module 52 may be used to: determine the voice feature information corresponding to the voice signal, where the voice feature information includes at least one of the query feature, the key feature, and the value feature; and process the voice feature information based on the self-attention mechanism to obtain the first feature information.
  • when the voice feature information includes the query feature, the key feature, and the value feature, and the third processing module 52 processes the voice feature information based on the self-attention mechanism to obtain the first feature information, the third processing module 52 may be used to: acquire the fusion conversion information corresponding to the query, key, and value features, where the fusion conversion information includes the conversion information corresponding to the query feature, the conversion information corresponding to the key feature, and the conversion information corresponding to the value feature; use the self-attention mechanism to process the query, key, and value features to determine the number of attention heads corresponding to the speech signal; and obtain the first feature information corresponding to the voice signal according to the number of attention heads and the fusion conversion information.
  • when the third processing module 52 obtains the first feature information corresponding to the voice signal according to the number of attention heads and the fusion conversion information, the third processing module 52 may be used to: combine all of the attention heads with a connection function to obtain the combined information corresponding to the attention heads, where the connection function is used to concatenate character strings; and determine the product of the combined information and the fusion conversion information as the first feature information corresponding to the speech signal.
  • when the third processing module 52 acquires the fusion conversion information corresponding to the query, key, and value features, the third processing module 52 may be used to: obtain the first conversion information, the second conversion information, and the third conversion information corresponding to the query feature, the key feature, and the value feature, respectively; and splice the first conversion information, the second conversion information, and the third conversion information to obtain the fusion conversion information.
  • when the third processing module 52 uses the second neural network to process the voice signal to obtain the second feature information corresponding to the voice signal, the third processing module 52 may be used to: determine the value feature corresponding to the voice signal; and use the static memory neural network to process the value feature to obtain the second feature information.
  • when the third processing module 52 uses the static memory neural network to process the value feature to obtain the second feature information, the third processing module 52 may be used to: obtain the filter parameters corresponding to the static memory neural network; determine the characterization information corresponding to the value feature; and analyze the characterization information using the static memory neural network and the filter parameters to obtain the second feature information corresponding to the voice signal.
  • when the third determining module 53 determines the target feature information used to characterize the semantics of the speech signal according to the first feature information and the second feature information, the third determining module 53 may be used to: determine the sum of the first feature information and the second feature information as the target feature information.
  • after the target feature information is determined, the third processing module 52 in this embodiment may also be used to: send the target feature information to the decoder, so that the decoder can analyze the target feature information to obtain the text information corresponding to the voice signal.
  • the device shown in FIG. 17 can execute the method of the embodiment shown in FIG. 12. For the parts not described in detail in this embodiment, refer to the related description of the embodiment shown in FIG. 12. For the execution process and technical effects of this technical solution, refer to the description in the embodiment shown in FIG. 12, which will not be repeated here.
  • in one possible design, the structure of the speech encoder shown in FIG. 17 can be implemented as an electronic device, which can be any of various devices such as a mobile phone, a tablet computer, or a server. As shown in FIG. 18, the electronic device may include: a third processor 61 and a third memory 62, where the third memory 62 is used to store a program by which the electronic device executes the voice processing method provided in the embodiment shown in FIG. 12, and the third processor 61 is configured to execute the program stored in the third memory 62.
  • the program includes one or more computer instructions, and when the one or more computer instructions are executed by the third processor 61, the following steps can be implemented: acquiring a voice signal to be processed; processing the voice signal with the first neural network and the second neural network respectively, to obtain the first feature information and the second feature information corresponding to the speech signal, where the computational efficiency of the first neural network is higher than that of the second neural network, and the accuracy of the second feature information output by the second neural network is higher than the accuracy of the first feature information output by the first neural network; and determining, according to the first feature information and the second feature information, the target feature information used to characterize the semantics of the speech signal.
  • the third processor 61 is also configured to execute all or part of the steps in the embodiment shown in FIG. 12.
  • the structure of the electronic device may further include a third communication interface 63 for the electronic device to communicate with other devices or a communication network.
  • an embodiment of the present invention provides a computer storage medium for storing computer software instructions used by an electronic device, which includes a program for executing the voice processing method in the method embodiment shown in FIG. 12.
  • Fig. 19 is a schematic structural diagram of a speech recognition system provided by an embodiment of the present invention, and Fig. 20 is a schematic diagram of an application of the speech recognition system provided by an embodiment of the present invention. Referring to Figs. 19 and 20, this embodiment provides a speech recognition system that can recognize a speech signal input by a user, so as to obtain the text information corresponding to the speech signal. Specifically, the speech recognition system may include:
  • the voice encoder 71 shown in FIG. 13 or FIG. 17 above, which can be used to perform data dimensionality-reduction processing on the acquired voice signal to obtain the voice feature information corresponding to the voice signal, the voice feature information being used to identify the semantic information in the speech signal.
  • in some examples, the system may also include: the voice decoder 72, configured to receive the voice feature information sent by the voice encoder 71 and output the text information corresponding to the voice signal based on the voice feature information.
  • when the speech decoder 72 outputs the text information corresponding to the speech signal based on the speech feature information, the speech decoder 72 can be used to: acquire historical prediction information; and process the voice feature information using the multi-head attention mechanism and the historical prediction information, to obtain the text information corresponding to the voice signal.
  • specifically, referring to Figs. 19 and 20, the steps by which the voice recognition system performs voice recognition may include the following process: the voice encoder 71 obtains the voice signal S input by the user, uses a preset feedforward network to filter out the redundant signals included in the voice signal to obtain the voice signal S1, and then uses a bidirectional DSMN network to process the voice signal S1, so that the feature information S2 used to identify the semantic information in the voice signal S can be obtained. The bidirectional DSMN network can process the voice signal S1 in combination with data at historical moments and future moments, so as to obtain the feature information S2. After the feature information S2 is acquired, it can undergo data regularization, after which the feedforward network can process it so that the redundant signals included in the feature information S2 are removed, yielding the feature signal S3; the feature signal S3 is then subjected to data regularization again, so that the target feature signal S4 corresponding to the speech signal S can be obtained and sent to the speech decoder 72. A sketch of this encoder pipeline follows.
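The encoder pipeline just described (feedforward filtering, a bidirectional DSMN stage, and two "data regularization" stages) could be sketched as below. Treating data regularization as layer normalization, the layer widths, and the toy stand-in for the bidirectional DSMN (which simply mixes each frame with one past and one future neighbor) are all assumptions for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):   # stand-in for the data regularization stages
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def feed_forward(x, W1, W2):   # ReLU feedforward network filtering redundancy
    return np.maximum(x @ W1, 0.0) @ W2

def bidirectional_dsmn(x):     # toy stand-in: mixes past and future neighbor frames
    y = x.copy()
    y[1:] += 0.25 * x[:-1]     # contribution from the historical moment
    y[:-1] += 0.25 * x[1:]     # contribution from the future moment
    return y

rng = np.random.default_rng(3)
T, d = 8, 16
S = rng.normal(size=(T, d))                  # voice signal features S
W1, W2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))

S1 = feed_forward(S, W1, W2)                 # filter out redundant signals -> S1
S2 = bidirectional_dsmn(S1)                  # semantic feature information -> S2
S3 = feed_forward(layer_norm(S2), W1, W2)    # regularize, then filter again -> S3
S4 = layer_norm(S3)                          # target feature signal sent to decoder
print(S4.shape)
```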
  • the speech decoder 72 obtains the target feature signal S4 sent by the speech encoder 71, then obtains the historical prediction information M and encodes it to obtain the historical prediction information M1; the feedforward network then filters out the redundant signals included in the historical prediction information M1 to obtain the historical prediction information M2, after which a unidirectional DSMN network processes the historical prediction information M2 to obtain the corresponding historical prediction information M3. The unidirectional DSMN network can process the historical prediction information M2 in combination with data at historical moments only, so as to obtain the historical prediction information M3; the historical prediction information M3 is then subjected to data regularization, so that the historical prediction information M4 corresponding to the historical prediction information M can be obtained and sent to the multi-head attention mechanism network. The causal variant of the memory computation is sketched below.
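The unidirectional DSMN differs from the bidirectional memory sketched earlier only in that its taps come exclusively from past positions, which keeps the decoder causal; a minimal illustration (tap count and weights assumed):

```python
import numpy as np

def unidirectional_memory(H, alpha, N=4):
    # Causal variant: each output mixes h_t with up to N past taps only.
    M = H.copy()
    for t in range(len(H)):
        for i in range(1, min(N, t) + 1):
            M[t] += alpha[i - 1] * H[t - i]
    return M

H = np.arange(6, dtype=float).reshape(6, 1)   # toy characterization sequence
alpha = np.full((4, 1), 0.1)                  # toy filter parameters
print(unidirectional_memory(H, alpha).ravel())
```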
  • after the multi-head attention mechanism network obtains the historical prediction information M4 and the target feature signal S4, it can analyze the target feature signal S4 in combination with the historical prediction information M4, so that the text information W corresponding to the target feature signal S4 can be obtained.
  • after the text information W is obtained, in order to improve the quality and efficiency of speech recognition, the text information W can also undergo data regularization to obtain the text information W1; the feedforward network then filters out the redundant signals included in the text information W1, and a normalization function processes the result, so that the target text information W2 corresponding to the speech signal S can be obtained.
  • in the speech recognition system provided in this embodiment, the speech signal to be recognized is acquired through the speech encoder 71, the target feature information corresponding to the speech signal is determined, and the target feature information is then sent to the speech decoder 72; after the speech decoder 72 obtains the target feature information, it performs speech recognition on the target feature information through the multi-head attention mechanism, so that the text information corresponding to the speech signal can be obtained. This not only effectively implements the speech recognition operation, but also improves the quality and efficiency of speech signal processing, further improving the stability and reliability of the speech recognition system.
  • the device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the objectives of the solutions of the embodiments, and those of ordinary skill in the art can understand and implement them without creative work.
  • through the description of the above implementation manners, those skilled in the art can clearly understand that each implementation manner can be implemented by means of software plus a necessary general hardware platform, and of course also by a combination of hardware and software. Based on this understanding, the above technical solutions, in essence or in the parts contributing to the prior art, can be embodied in the form of a computer product; the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
  • The present invention is described with reference to flowcharts and/or block diagrams of the methods, devices (systems), and computer program products according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations thereof, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable device to produce a machine, so that the instructions executed by the processor produce a device for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, the instruction device implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram. These computer program instructions can also be loaded onto a computer or other programmable device, so that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • In a typical configuration, the computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • The memory may include non-persistent memory in computer-readable media, in forms such as random access memory (RAM) and/or non-volatile memory, for example read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
  • Computer-readable media include permanent and non-permanent, removable and non-removable media, and can implement information storage by any method or technology.
  • the information can be computer-readable instructions, data structures, program modules, or other data.
  • Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by computing devices. According to the definition in this document, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.


Abstract

A speech processing method, a speech encoder (71), a speech decoder (72), and a speech recognition system. The method comprises: acquiring a speech signal to be processed (S1201); processing the speech signal using a first neural network and a second neural network respectively, to obtain first feature information and second feature information corresponding to the speech signal, wherein the computational efficiency of the first neural network is higher than that of the second neural network, and the accuracy of the second feature information output by the second neural network is higher than that of the first feature information output by the first neural network (S1202); and determining, according to the first feature information and the second feature information, target feature information used to characterize the semantics in the speech signal (S1203). Two pieces of feature information are obtained through two different neural networks; since the two are complementary in the efficiency and quality of speech processing, the accuracy and reliability of acquiring the target feature information are improved.

Description

Speech processing method, speech encoder, speech decoder, and speech recognition system
This application claims priority to Chinese Patent Application No. 202010219957.0, filed on March 25, 2020 and entitled "Speech Processing Method, Speech Encoder, Speech Decoder and Speech Recognition System", the entire contents of which are incorporated herein by reference.
技术领域
本发明涉及数据处理技术领域,尤其涉及一种语音处理方法、语音编码器、语音解码器及语音识别系统。
背景技术
语音识别技术可以实现将人所说的语音波形转成机器可以识别的文本,对于语音识别技术而言,语音识别率是评估语音识别性能的一个重要指标。在2017年,谷歌提出了一种可以进行语音识别的Transformer模型,具体的,Transformer模型可以采用与文本相关的自注意力机制对语音的长时相关性进行语音建模,获得语音识别模型,而后通过建立的语音识别模型实现语音识别操作。
然而,在Transformer模型采用与文本相关的自注意力机制对语音的长时相关性进行语音建模时,由于与文本相关的参数较多,构建语音识别模型的复杂度较高,并且也增加了对语音识别模型进行优化的困难程度,从而极大地影响了对语音信号进行识别的质量和效率。
发明内容
本发明实施例提供了一种语音处理方法、语音编码器、语音解码器及语音识别系统,不仅能够降低对语音信号进行处理的复杂程度,并且也提高了对语音信号进行识别的质量和效率。
第一方面,本发明实施例提供了一种语音处理方法,包括:
获取待处理的语音信号;
利用第一神经网络对所述语音信号进行处理,获得与所述语音信号相对应的第一特征信息,所述第一特征信息用于标识所述语音信号中的语义;
利用第二神经网络对所述语音信号进行处理,获得与所述语音信号相对应的第二特征信息,所述第二特征信息用于标识所述语音信号中的语义,其中,所述第二特征信息 与所述第一特征信息不同;
根据所述第一特征信息和所述第二特征信息,确定用于表征所述语音信号中语义的目标特征信息。
第二方面,本发明实施例提供了一种语音编码器,包括:
第一获取单元,用于获取待处理的语音信号;
第一处理单元,用于利用第一神经网络对所述语音信号进行处理,获得与所述语音信号相对应的第一特征信息,所述第一特征信息用于标识所述语音信号中的语义;
所述第一处理单元,还用于利用第二神经网络对所述语音信号进行处理,获得与所述语音信号相对应的第二特征信息,所述第二特征信息用于标识所述语音信号中的语义,其中,所述第二特征信息与所述第一特征信息不同;
第一确定单元,用于根据所述第一特征信息和所述第二特征信息,确定用于表征所述语音信号中语义的目标特征信息。
第三方面,本发明实施例提供了一种电子设备,包括:存储器、处理器;其中,所述存储器用于存储一条或多条计算机指令,其中,所述一条或多条计算机指令被所述处理器执行时实现上述第一方面中的语音处理方法。
第四方面,本发明实施例提供了一种计算机存储介质,用于储存计算机程序,所述计算机程序使计算机执行时实现上述第一方面中的语音处理方法。
第五方面,本发明实施例提供了一种语音处理方法,包括:
接收编码器发送的目标特征信息,所述目标特征信息与一语音信号相对应;
获取历史预测信息;
利用多头注意力机制和所述历史预测信息对所述目标特征信息进行处理,获得与所述语音信号相对应的文本信息。
第六方面,本发明实施例提供了一种语音解码器,包括:
第二接收模块,用于接收编码器发送的目标特征信息,所述目标特征信息与一语音信号相对应;
第二获取模块,用于获取历史预测信息;
第二处理模块,用于利用多头注意力机制和所述历史预测信息对所述目标特征信息进行处理,获得与所述语音信号相对应的文本信息。
第七方面,本发明实施例提供了一种电子设备,包括:存储器、处理器;其中,所述存储器用于存储一条或多条计算机指令,其中,所述一条或多条计算机指令被所述处 理器执行时实现上述第五方面中的语音处理方法。
第八方面,本发明实施例提供了一种计算机存储介质,用于储存计算机程序,所述计算机程序使计算机执行时实现上述第五方面中的语音处理方法。
第九方面,本发明实施例提供了一种语音识别系统,包括:
上述第二方面所述的语音编码器,用于对所获取到的语音信号进行数据降维处理,获得与所述语音信号相对应的语音特征信息。
第十方面,本发明实施例提供了一种数据处理方法,包括:
获取待处理的语音信号;
分别利用第一神经网络、第二神经网络对所述语音信号进行处理,获得与所述语音信号相对应的第一特征信息、第二特征信息,其中,所述第一神经网络的计算效率高于所述第二神经网络的计算效率,所述第二神经网络输出的第二特征信息的准确性高于所述第一神经网络输出的第一特征信息的准确性;
根据所述第一特征信息和所述第二特征信息,确定用于表征所述语音信号中语义的目标特征信息。
第十一方面,本发明实施例提供了一种语音编码器,包括:
第三获取模块,用于获取待处理的语音信号;
第三处理模块,用于分别利用第一神经网络、第二神经网络对所述语音信号进行处理,获得与所述语音信号相对应的第一特征信息、第二特征信息,其中,所述第一神经网络的计算效率高于所述第二神经网络的计算效率,所述第二神经网络输出的第二特征信息的准确性高于所述第一神经网络输出的第一特征信息的准确性;
第三确定模块,用于根据所述第一特征信息和所述第二特征信息,确定用于表征所述语音信号中语义的目标特征信息。
第十二方面,本发明实施例提供了一种电子设备,包括:存储器、处理器;其中,所述存储器用于存储一条或多条计算机指令,其中,所述一条或多条计算机指令被所述处理器执行时实现上述第十方面中的语音处理方法。
第十二方面,本发明实施例提供了一种计算机存储介质,用于储存计算机程序,所述计算机程序使计算机执行时实现上述第十方面中的语音处理方法。
第十三方面,本发明实施例提供了一种语音识别系统,包括:
上述第十一方面所述的语音编码器,用于对所获取到的语音信号进行数据降维处理,获得与所述语音信号相对应的语音特征信息。
本实施例提供的语音处理方法、语音编码器、语音解码器及语音识别系统,利用第一神经网络对所获取的语音信号进行处理,获得第一特征信息,并利用第二神经网络对所获取的语音信号进行处理,获得第二特征信息,由于第一神经网络和第二神经网络不同,因此,所获得的第一特征信息和第二特征信息在语音处理的效率和质量上具有互补性,而后根据第一特征信息和第二特征信息来确定用于表征所述语音信号中语义的目标特征信息,有效地保证了对目标特征信息进行获取的质量,进一步提高了对语音信号进行处理的质量和效率,保证了该方法的实用性。
附图说明
为了更清楚地说明本发明实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作一简单地介绍,显而易见地,下面描述中的附图是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1为本发明实施例提供的一种语音处理方法的流程示意图;
图2为本发明实施例提供的一种语音处理方法的应用场景示意图;
图3为本发明实施例提供的利用第一神经网络对所述语音信号进行处理,获得与所述语音信号相对应的第一特征信息的流程示意图;
图4为本发明实施例提供的基于所述自注意力机制对所述语音特征信息进行处理,获得所述第一特征信息的流程示意图;
图5为本发明实施例提供的获取与所述检索词特征、所述关键字特征和值特征相对应的融合转换信息的流程示意图;
图6为本发明实施例提供的根据所述注意力机制的数量和融合转换信息,获得与所述语音信号相对应的第一特征信息的流程示意图;
图7为本发明实施例提供的利用第二神经网络对所述语音信号进行处理,获得与所述语音信号相对应的第二特征信息的流程示意图;
图8为本发明实施例提供的利用静态记忆神经网络对所述值特征进行处理,获得所述第二特征信息的流程示意图;
图9为本发明应用实施例提供的一种语音处理方法的示意图;
图10为本发明实施例提供的另一种语音处理方法的流程示意图;
图11为本发明实施例提供的另一种语音处理方法的示意图;
图12为本发明实施例提供的又一种语音处理方法的示意图;
图13为本发明实施例提供的一种语音编码器的结构示意图;
图14为与图13所示实施例提供的语音编码器对应的电子设备的结构示意图;
图15为本发明实施例提供的一种语音解码器的结构示意图;
图16为与图15所示实施例提供的语音解码器对应的电子设备的结构示意图;
图17为本发明实施例提供的另一种语音编码器的结构示意图;
图18为与图17所示实施例提供的语音编码器对应的电子设备的结构示意图;
图19为本发明实施例提供的一种语音识别系统的结构示意图;
图20为本发明实施例提供的语音识别系统的应用示意图。
具体实施方式
为使本发明实施例的目的、技术方案和优点更加清楚,下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。
在本发明实施例中使用的术语是仅仅出于描述特定实施例的目的,而非旨在限制本发明。在本发明实施例和所附权利要求书中所使用的单数形式的“一种”、“所述”和“该”也旨在包括多数形式,除非上下文清楚地表示其他含义,“多种”一般包含至少两种,但是不排除包含至少一种的情况。
应当理解,本文中使用的术语“和/或”仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,本文中字符“/”,一般表示前后关联对象是一种“或”的关系。
取决于语境,如在此所使用的词语“如果”、“若”可以被解释成为“在……时”或“当……时”或“响应于确定”或“响应于检测”。类似地,取决于语境,短语“如果确定”或“如果检测(陈述的条件或事件)”可以被解释成为“当确定时”或“响应于确定”或“当检测(陈述的条件或事件)时”或“响应于检测(陈述的条件或事件)”。
还需要说明的是,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的商品或者系统不仅包括那些要素,而且还包括没有 明确列出的其他要素,或者是还包括为这种商品或者系统所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的商品或者系统中还存在另外的相同要素。
另外,下述各方法实施例中的步骤时序仅为一种举例,而非严格限定。
为了便于理解本申请的技术方案,下面对现有技术进行简要说明:
现有的端到端语音识别系统所采用的神经网络包括:基于长短时记忆单元的循环神经网络(LSTM-RNN)、基于自注意力机制的Transformer模型、深度前馈序列记忆神经网络(Deep-Feed-forward Sequential Memory Network简称DFSMN)等等,其中,DFSMN是在之前的前馈序列记忆神经网络(Feedforward Sequential Memory Networks,简称FSMN)的基础上,提出的一种改进的FSMN网络结构。
具体的,在Transformer模型构建语音识别系统时,Transformer模型可以采用与文本相关的自注意力机制对语音的长时相关性(long-term dependence)进行语音建模,获得语音识别模型,以便通过所建立的语音识别模型实现语音识别操作;其中,语音的长时相关性是指当前语音信号与历史时刻的语音信号内容和未来时刻的语音信号内容之间所存在的关联性。具体应用时,Transformer模型的语音识别效率相对于LSTM-RNN模型的语音识别效率更高,效果也更好。
在DFSMN模型构建语音识别系统时,DFSMN模型可以采用一些与文本无关的滤波器对语音的长时相关性进行语音建模,获得语音识别模型,以便通过建立的语音识别模型实现语音识别操作。实际应用的实验表明在一些干净的语音上,DFSMN可以获得比Transformer更优的性能,而且复杂度更低,但是对于一些质量比较差的语音,Transformer性能上具有优势。
然而,在利用Transformer模型对语音的长时相关性进行语音建模时,由于与文本相关的参数较多,从而极大地增加了构建语音识别模型的复杂度以及对语音识别模型进行优化的困难程度。在利用DFSMN模型对语音的长时相关性进行语音建模时,由于与文本相关的参数较少,因此,极大地降低了构建语音识别模型的复杂度和对语音识别模型进行优化的困难程度,同时也降低了进行语音识别的鲁棒性。
下面结合附图,对本发明的一些实施方式作详细说明。在各实施例之间不冲突的情况下,下述的实施例及实施例中的特征可以相互组合。
图1为本发明实施例提供的一种语音处理方法的流程示意图;图2为本发明实施例提供的一种语音处理方法的应用场景示意图;参考附图1-图2所示,为了解决上述技术 问题,本实施例提供了一种语音处理方法,该方法的执行主体可以为语音处理装置,可以理解的是,该语音处理装置可以实现为软件、或者软件和硬件的组合。具体应用时,该语音处理装置可以为语音编码器,该语音编码器可以实现对语音信号进行处理,获得用于表征语音信号中语义的特征信息。具体的,该语音处理方法可以包括:
步骤S101:获取待处理的语音信号。
步骤S102:利用第一神经网络对语音信号进行处理,获得与语音信号相对应的第一特征信息,第一特征信息用于标识语音信号中的语义。
步骤S103:利用第二神经网络对语音信号进行处理,获得与语音信号相对应的第二特征信息,第二特征信息用于标识语音信号中的语义,其中,第二特征信息与第一特征信息不同。
步骤S104:根据第一特征信息和第二特征信息,确定用于表征语音信号中语义的目标特征信息。
下面针对上述各个步骤进行详细阐述:
步骤S101:获取待处理的语音信号。
其中,待处理的语音信号是指需要进行语音识别或者语音处理的信号,可以理解的是,上述的语音信号可以是用户直接输入的语音信息,例如:语音处理装置可以直接对用户输入的语音信息进行采集,从而可以获得待处理的语音信号。或者,上述的语音信号可以是其他设备发送的语音信息,例如:通过语音采集单元对用户输入的语音信息进行采集,语音处理装置与语音采集单元通讯连接,此时,语音处理装置可以通过语音采集单元获得待处理的语音信号。
步骤S102:利用第一神经网络对语音信号进行处理,获得与语音信号相对应的第一特征信息,第一特征信息用于标识语音信号中的语义。
其中,第一神经网络可以包括以下任意之一:自注意力机制、静态记忆神经网络(Static Memory Nework,简称SMN)。可以理解的是,第一神经网络并不限于上述所例举的类型网络,本领域技术人员也可以根据具体的应用需求和设计需求将第一神经网络设置为其他类型的神经网络,只要能够使得第一神经网络对语音信号进行处理,获得用于标识语音信号中语义的特征信息即可,在此不再赘述。
另外,对于所接收到的语音信号而言,语音信号中包括用于标识语音语义的第一信号和用于标识用户特征的第二信号,具体的,第二信号用于标识输入语音信号的用户音色信息、用户的口音信息、用户语言类型、用户年龄信息等等。为了提高对语音信号进 行处理的质量和效率,在获取到语音信号之后,可以利用第一神经网络对语音信号进行处理,从而可以获得与语音信号相对应的第一特征信息,该第一特征信息可以用于标识语音信号中所包括的语义。
步骤S103:利用第二神经网络对语音信号进行处理,获得与语音信号相对应的第二特征信息,第二特征信息用于标识语音信号中的语义,其中,第二特征信息与第一特征信息不同。
其中,第二神经网络可以包括以下任意之一:自注意力机制、静态记忆神经网络。可以理解的是,为了能够使得第二特征信息与第一特征信息不同,可以使得第二神经网络与第一神经网络不同,例如:在第一神经网络包括自注意力机制时,第二神经网络可以包括静态记忆神经网络;在第一神经网络包括静态记忆神经网络时,第二神经网络可以包括自注意力机制。
可以理解的是,第二神经网络并不限于上述所例举的类型网络,本领域技术人员也可以根据具体的应用需求和设计需求将第二神经网络设置为其他类型的神经网络,只要能够保证第二神经网络和第一神经网络不同,并且,能够使得第二神经网络对语音信号进行处理,获得用于标识语音信号中语义的特征信息即可,在此不再赘述。
相类似的,由于语音信号中包括用于标识语音语义的第一信号和用于标识用户特征的第二信号,因此,为了提高对语音信号进行处理的质量和效率,在获取到语音信号之后,可以利用第二神经网络对语音信号进行处理,从而可以获得与语音信号相对应的第二特征信息,该第二特征信息可以用于标识语音信号中所包括的语义。由于第二神经网络与第一神经网络不同,因此,通过第二神经网络所获得的第二特征信息与通过第一神经网络所获得的第一特征信息在语音识别的质量和效率上具有互补性。
步骤S104:根据第一特征信息和第二特征信息,确定用于表征语音信号中语义的目标特征信息。
在获取到第一特征信息和第二特征信息之后,可以对第一特征信息和第二特征信息进行分析处理,以确定用于表征语音信号中语义的目标特征信息。具体的,根据第一特征信息和第二特征信息,确定用于表征语音信号中语义的目标特征信息可以包括:
步骤S1041:将第一特征信息与第二特征信息的和值确定为目标特征信息。
由于第一神经网络和第二神经网络不同,因此,在利用第一神经网络和第二神经网络对语音信号进行处理的效率和质量具有互补性。在获取到第一特征信息和第二特征信息之后,将具有互补性的第一特征信息与第二特征信息的和值确定为目标特征信息,由 于此时的目标特征信息融合有第一特征信息和第二特征信息,进而有效地提高了对语音信号进行识别的质量和效率。
本实施例提供的语音处理方法,利用第一神经网络对所获取的语音信号进行处理,获得第一特征信息,并利用第二神经网络对所获取的语音信号进行处理,获得第二特征信息,由于第一神经网络和第二神经网络不同,因此,所获得的第一特征信息和第二特征信息在语音处理的效率和质量上具有互补性,而后根据第一特征信息和第二特征信息来确定用于表征所述语音信号中语义的目标特征信息,有效地保证了对目标特征信息进行获取的质量,进一步提高了对语音信号进行处理的质量和效率,保证了该方法的实用性。
图3为本发明实施例提供的利用第一神经网络对语音信号进行处理,获得与语音信号相对应的第一特征信息的流程示意图;在上述实施例的基础上,继续参考附图3所示,在利用第一神经网络对语音信号进行处理时,本实施例对于其具体的处理实现方式不做限定,本领域技术人员可以根据具体的应用需求和设计需求进行设置,较为优选的,本实施例中的利用第一神经网络对语音信号进行处理,获得与语音信号相对应的第一特征信息可以包括:
步骤S301:确定与语音信号相对应的语音特征信息,语音特征信息包括以下至少之一:检索词特征、关键字特征、值特征。
步骤S302:基于自注意力机制对语音特征信息进行处理,获得第一特征信息。
具体的,在获取到语音信号之后,可以对语音信号进行转换处理,从而可以获得语音信号相对应的语音特征信息,该语音特征信息可以包括以下至少之一:检索词特征(query)、关键字特征(key)和值特征(value)。可以理解的是,在获取不同的语音特征信息时,对语音信号进行转换的处理过程也不同。
举例来说,在语音特征信息包括检索词特征时,获取该语音特征信息的步骤可以包括:获取与检索词特征相对应的第一转换信息,该第一转换信息可以为转换矩阵,利用第一转换信息对语音信号进行转换处理,从而可以获得检索词特征。
在语音特征信息包括检索词特征和关键字特征时,获取该语音特征信息的步骤可以包括:分别获取与检索词特征相对应的第一转换信息和与关键字特征相对应的第二转换信息,上述的第一转换信息和第二转换信息均可以为转换矩阵,需要注意的是,第一转换信息与第二转换信息不同,而后利用第一转换信息对语音信号进行转换处理,从而可以获得检索词特征,利用第二转换信息对语音信号进行转换处理,从而可以获得关键字特征。
同理的,在语音特征信息包括检索词特征、关键字特征和值特征时,获取该语音特征信息的步骤可以包括:分别获取与检索词特征相对应的第一转换信息、与关键字特征相对应的第二转换信息以及与值特征相对应的第三转换信息,上述的第一转换信息、第二转换信息和第三转换信息均可以为转换矩阵,需要注意的是,第一转换信息、第二转换信息和第三转换信息各不相同,而后利用第一转换信息对语音信号进行转换处理,从而可以获得检索词特征,利用第二转换信息对语音信号进行转换处理,从而可以获得关键字特征,利用第三转换信息对语音信号进行转换处理,从而可以获得值特征。
在获取到语音特征信息之后,可以利用自注意力机制对语音特征信息进行处理,从而可以获得用于标识语音信号中语义的第一特征信息,可以理解的是,语音特征信息所包括的特征信息越多,所获得的第一特征信息的质量和效率更好。
本实施例中,通过确定与语音信号相对应的语音特征信息,而后基于自注意力机制对语音特征信息进行处理,不仅能够准确、有效地获得第一特征信息,并且,由于语音特征信息可以包括检索词特征、关键字特征、值特征中的至少之一,因此,有效地增加了对第一特征信息进行获取的实现方式,进而提高了该方法使用的灵活可靠性。
图4为本发明实施例提供的基于自注意力机制对语音特征信息进行处理,获得第一特征信息的流程示意图;在上述实施例的基础上,继续参考附图4所示,本实施例对于获取第一特征信息的具体实现方式不做限定,本领域技术人员可以根据具体的应用需求和设计需求进行设置,其中,在语音特征信息包括:检索词特征、关键字特征和值特征时;本实施例中的基于自注意力机制对语音特征信息进行处理,获得第一特征信息可以包括:
步骤S401:获取与检索词特征、关键字特征和值特征相对应的融合转换信息,融合转换信息中包括与检索词特征相对应的转换信息、与关键字特征相对应的转换信息以及与值特征相对应的转换信息。
具体的,参考附图5所示,获取与检索词特征、关键字特征和值特征相对应的融合转换信息可以包括:
步骤S4011:分别获取与检索词特征、关键字特征和值特征相对应的第一转换信息、第二转换信息和第三转换信息;
步骤S4012:对第一转换信息、第二转换信息和第三转换信息进行拼接处理,获得融合转换信息。
其中,在获取到语音信号之后,可以基于语音信号确定第一转换信息、第二转换信 息和第三转换信息,上述的第一转换信息用于对语音信号进行转换处理,从而可以获得检索词特征,第二转换信息用于对语音信号进行转换处理,从而可以获得关键字特征,第三转换信息用于对语音信号进行转换处理,从而可以获得值特征。具体应用时,在获取到语音信号之后,可以利用预设的语音识别算法或者语音识别模型对语音信号进行处理,从而可以获得与语音信号相对应的检索词特征、关键字特征和值特征,而上述的语音识别算法或者语音识别模型中包括有分别与检索词特征、关键字特征和值特征相对应的第一转换信息、第二转换信息和第三转换信息。
在获取到第一转换信息、第二转换信息和第三转换信息之后,可以对第一转换信息、第二转换信息和第三转换信息进行拼接处理,从而可以获得融合转换信息,该融合转换信息中包括有上述的三个转换信息。举例来说,语音信号为I,检索词特征为Q,关键字特征为K,值特征为V,第一转换信息为转换矩阵W Q,第二转换信息为转换矩阵W k,第三转换信息为转换矩阵W V,上述转换矩阵与语音信号之间的关系为:Q=W Q*I,K=W K*I,V=W V*I。而在获取到上述的转换关系之后,可以对转换矩阵W Q、转换矩阵W K和转换矩阵W V进行拼接处理,从而可以获得融合转换信息W O,该融合转换信息也为矩阵信息。
步骤S402:利用自注意力机制对检索词特征、关键字特征和值特征进行处理,确定与语音信号相对应的注意力机制的数量。
其中,在不同的应用场景下,注意力机制的数量可以不同,例如:在比较简单的应用场景下,注意力机制的数量可以较少;在比较复杂的应用场景下,注意力机制的数量可以较多。一般情况下,在获取到检索词特征、关键字特征和值特征之后,可以利用自注意力机制对上述特征进行处理,从而可以确定与语音信号相对应的注意力机制的数量。具体的,利用自注意力机制对检索词特征、关键字特征和值特征进行处理,确定与语音信号相对应的注意力机制的数量可以包括:
步骤S4021:利用以下公式,获得与语音信号相对应的注意力机制的数量:
Figure PCTCN2021081457-appb-000001
其中,head i为第i个注意力机制,Attention为自注意力机制,Q为检索词特征、K为关键字特征、V为值特征,
Figure PCTCN2021081457-appb-000002
为与第i个检索词特征相对应的第一转换信息,
Figure PCTCN2021081457-appb-000003
为与第i个关键字特征相对应的第二转换信息,
Figure PCTCN2021081457-appb-000004
为与第i个值特征相对应的第三转换信 息。
本步骤中,通过上述公式可以快速、有效地确定出与语音信号相对应的注意力机制的数量,从而便于基于注意力机制的数量对语音信号进行快速、准确地分析处理。
步骤S403:根据注意力机制的数量和融合转换信息,获得与语音信号相对应的第一特征信息。
在获取到注意力机制的数量和融合转换信息之后,可以对上述的注意力机制的数量和融合转换信息进行分析处理,以确定语音信号相对应的第一特征信息。具体的,参考附图6所示,根据注意力机制的数量和融合转换信息,获得与语音信号相对应的第一特征信息,包括:
步骤S4031:利用连接函数将所有数量的注意力机制进行组合,获得与注意力机制相对应的组合信息,其中,连接函数用于连接字符串;
步骤S4032:将组合信息与融合转换信息的乘积,确定为与语音信号相对应的第一特征信息。
其中,连接函数concat用于将多个区域和/或字符串中的文本组合起来。在获取到注意力机制的数量之后,可以利用连接函数将所有数量的注意力机制进行组合连接,获得与注意力机制相对应的组合信息,具体如下公式:H=concat(head 1,...,head h),其中,H为与注意力机制相对应的组合信息,concat()为连接函数,head 1为第一个注意力机制,head h为第h个注意力机制。
在获取到组合信息之后,可以将组合信息与融合转换信息的乘积确定为第一特征信息,即MutliHead(Q,K,V)=c t=concat(head 1,...,head h)W O,其中,c t为第一特征信息,W O为融合转换信息,从而准确、有效地获取到与语音信号相对应的第一特征信息。
本实施例中,通过获取与检索词特征、关键字特征和值特征相对应的融合转换信息,确定与语音信号相对应的注意力机制的数量,而后根据注意力机制的数量和融合转换信息,获得与语音信号相对应的第一特征信息,有效地保证了对第一特征信息进行获取的准确可靠性,进一步提高了对语音信号进行识别的质量和效率。
图7为本发明实施例提供的利用第二神经网络对语音信号进行处理,获得与语音信号相对应的第二特征信息的流程示意图;在上述实施例的基础上,继续参考附图7所示,本实施例中,利用第二神经网络对语音信号进行处理,获得与语音信号相对应的第二特 征信息,包括:
步骤S701:确定与语音信号相对应的值特征。
步骤S702:利用静态记忆神经网络对值特征进行处理,获得第二特征信息。
在获取到语音信号之后,可以对语音信号进行转换处理,从而可以获得与语音信号相对应的值特征(V),具体的,先确定与语音信号相对应的转换信息(与上述的第三转换信息W V),利用转换信息对语音信号进行转换,从而可以获得值特征。
在获取到值特征之后,可以利用静态记忆神经网络对值特征进行处理,以获得第二特征信息,具体的,参考附图8所示,利用静态记忆神经网络对值特征进行处理,获得第二特征信息可以包括:
步骤S7021:获取与静态记忆神经网络相对应的滤波参数。
步骤S7022:确定与值特征相对应的表征信息。
步骤S7023:利用静态记忆神经网络和滤波参数对表征信息进行分析处理,获得与语音信号相对应的第二特征信息。
其中,对于静态记忆神经网络而言,可以预先配置一组初始的滤波参数,以利用该初始的滤波参数实现数据处理。而为了提高静态记忆神经网络进行数据处理的质量和效率,其所对应的滤波参数可以是可学习或者可训练的,即随着静态记忆神经网络对数据的不断学习优化,滤波参数是可以发生变化的。
另外,在确定值特征之后,可以对值特征进行分析处理,以确定与至特征相对应的表征信息;在获取到滤波参数和表征信息之后,可以利用静态记忆神经网络和滤波参数对表征信息进行分析处理,以获得与语音信号相对应的第二特征信息,具体的,利用静态记忆神经网络和滤波参数对表征信息进行分析处理,获得与语音信号相对应的第二特征信息,包括:
步骤S70231:利用以下公式,获得与语音信号相对应的第二特征信息:
Figure PCTCN2021081457-appb-000005
其中,m t为第二特征信息,h t为值特征在t时刻的表征信息,α t、b t分别为可学习的滤波参数,⊙为点乘积,
Figure PCTCN2021081457-appb-000006
为值特征在t-s 1*i时刻的表征信息,
Figure PCTCN2021081457-appb-000007
为值特征在t-s 2*j时刻的表征信息,s 1*i、s 2*j分别为预设的步幅因子,i和j为累加的索引参数。
本实施例中,通过确定与语音信号相对应的值特征,而后利用静态记忆神经网络对值特征进行处理,获得第二特征信息,不仅有效地保证了对第二特征信息进行获取的准确可靠性,并且,由于第二特征信息的获取方式与第一特征信息的获取方式不同,因此,通过对所获得的第二特征信息和第一特征信息进行分析,可以有效地提高对目标特征信息进行获取的准确性,进一步提高了对语音信号进行识别的质量和效率。
在上述任意一个实施例的基础上,在确定用于表征语音信号中语义的目标特征信息之后,本实施例中的方法还可以包括:
步骤S901:将目标特征信息发送至解码器,以使解码器对目标特征信息进行分析处理,获得与语音信号相对应的文本信息。
其中,语音处理装置可以为语音编码器,该语音编码器可以将所获取的语音信号进行编码处理,从而可以获取用于标识语音信号中语义的目标特征信息,为了能够实现对语义信号进行分析识别,在语音编码器获取到目标特征信息之后,可以将目标特征信息发送至解码器,以使得解码器在获取到目标特征信息之后,可以对目标特征信息进行分析处理,从而可以获得与语音信号相对应的文本信息,进而使得机器可以识别出语音信号所对应的文本信息。
具体应用时,参考附图9所示,本应用实施例提供了一种语音处理方法,该语音处理方法的执行主体可以为语音编码器,该语音编码器是基于一种动静记忆神经网络(Dynamic and Static Memory Nework,简称DSMN)实现的,而上述的动静记忆神经网络结合了动态自注意力机制和静态记忆神经网络,从而使得DSMN相比于现有的Transformer模型和DFSMN模型而言,具有更强的语音识别能力,从而使得基于DSMN模型所构建的语音编码器或者语音处理系统可以获得更优的识别性能。
具体的,该语音处理方法可以包括以下步骤:
步骤1:获取用户输入的语音信号。
具体的,在用户输入的语音信息之后,可以对语音信息进行处理(例如:分帧处理、滤波处理、降噪处理等等),从而可以获得用户输入的语音信号,该语音信号可以是语音声学特征序列,可以理解的是,该语音声学特征序列中包括有用于表示语义信息的特征序列和用于标识用户特征的特征序列。
步骤2:确定与语音信号相对应的语音特征。
具体的,先获取第一转换信息W Q、第二转换信息W K和第三转换信息W V,而后利用第一转换信息对语音信号I进行转换处理,获得与语音信号相对应的检索词特征Q, 即Q=W Q*I;利用第二转换信息对语音信号I进行转换处理,获得与语音信号相对应的关键字特征K,即K=W K*I;利用第三转换信息对语音信号I进行转换处理,获得与语音信号相对应的值特征V,即V=W V*I。
步骤3:利用自注意力机制对检索词特征、关键字特征和值特征进行分析处理,获得用于标识语音信号中语义的第一特征信息。
具体的,先利用归一化函数对检索词特征和关键字特征进行归一化处理,而后利用自注意力机制对归一化处理后的数据和值特征按照以下公式进行处理,从而可以获得第一注意力信息。
Figure PCTCN2021081457-appb-000008
其中,Attention为自注意力机制,Q为检索词特征、K为关键字特征、V为值特征,softmax为归一化行数,K T为关键字特征的转置信息,d k为预设的维度参数。
步骤4:利用自注意力机制对检索词特征、关键字特征和值特征进行处理,确定与语音信号相对应的注意力机制的数量。
利用以下公式,获得与语音信号相对应的注意力机制的数量:
Figure PCTCN2021081457-appb-000009
其中,head i为第i个注意力机制,Attention为自注意力机制,Q为检索词特征、K为关键字特征、V为值特征,
Figure PCTCN2021081457-appb-000010
为与第i个检索词特征相对应的第一转换信息,
Figure PCTCN2021081457-appb-000011
为与第i个关键字特征相对应的第二转换信息,
Figure PCTCN2021081457-appb-000012
为与第i个值特征相对应的第三转换信息。
步骤5:根据注意力机制的数量,确定与语音信号相对应的第一特征信息。
具体的,获取与检索词特征、关键字特征和值特征相对应的融合转换信息W O,融合转换信息W O中包括与检索词特征相对应的转换信息W Q、与关键字特征相对应的转换信息W K以及与值特征相对应的转换信息W V。而后根据注意力机制的数量和融合转换信息,按照以下公式获得与语音信号相对应的第一特征信息:
c t=concat(head 1,...,head h)W o
其中,c t为第一特征信息,concat()为连接函数,head 1为第一个注意力机制,head h为第h个注意力机制,W O为融合转换信息。
步骤6:利用静态记忆神经网络和滤波参数对值特征进行分析处理,获得与语音信号相对应的第二特征信息。
具体的,获取与值特征相对应的表征信息,利用静态记忆神经网络对表征信息按照以下公式进行处理,获得第二特征信息:
Figure PCTCN2021081457-appb-000013
其中,m t为第二特征信息,h t为值特征在t时刻的表征信息,α t、b t分别为可学习的滤波参数,⊙为点乘积,
Figure PCTCN2021081457-appb-000014
为值特征在t-s 1*i时刻的表征信息,
Figure PCTCN2021081457-appb-000015
为值特征在t-s 2*j时刻的表征信息,s 1*i、s 2*j分别为预设的步幅因子,i和j为累加的索引参数。
步骤7:将第一特征信息与第二特征信息的和值确定为目标特征信息,并可以输出目标特征信息至语音解码器。
其中,目标特征信息用于标识语音信号中所包括的语义信息,具体的,目标特征信息为第一特征信息与第二特征信息的和值,即y t=c t+m t,上述的y t为目标特征信息,第一特征信息为c t,第二特征信息为m t。在获取到目标特征信息之后,可以输出该目标特征信息至语音解码器,以使得语音解码器可以基于该目标特征信息进行语音识别操作。
本应用实施例提供的语音处理方法,通过动静记忆神经网络实现对输入的语音信号进行处理,获得用于标识语音信号中语义的目标特征信息,从而可以基于所获得的目标特征信息对语音信号进行处理,例如:语音识别处理、语音合成处理等等,由于目标特征信息是通过具有互补性能的两种神经网络所获得的,因此,有效地保证了目标特征信息获取的质量,从而有效地提高了对语音信号进行处理的质量和效率,进一步提高了该方法使用的稳定可靠性。
图10为本发明实施例提供的另一种语音处理方法的流程示意图;图11为本发明实施例提供的另一种语音处理方法的示意图;参考附图10-图11所示,本实施例提供了一种语音处理方法,该方法的执行主体可以为语音处理装置,可以理解的是,该语音处理装置可以实现为软件、或者软件和硬件的组合。具体应用时,该语音处理装置可以为语音解码器,该语音解码器可以与语音编码器通信连接,用于接收语音编码器所发送的语 音特征信号,并对语音特征信号进行处理,获得与语音特征信号相对应的文本信息。具体的,该语音处理方法可以包括:
步骤S1001:接收编码器发送的目标特征信息,目标特征信息与一语音信号相对应。
步骤S1002:获取历史预测信息。
步骤S1003:利用多头注意力机制和历史预测信息对目标特征信息进行处理,获得与语音信号相对应的文本信息。
其中,历史预测信息可以是语音编码器在历史时刻进行语音识别操作所获得的语音识别结果,可以理解的是,在初始时刻时,历史预测信息可以为空白信息。在语音编码器将目标特征信息发送至语音解码器之后,语音解码器可以获取到语音编码器发送的目标特征信息,该目标特征信息用于标识语音信号中的语义信息。在语音解码器获取到目标特征信息之后,可以获取历史预测信息,该历史预测信息可以存储在预设区域中,而后利用多头注意力机制和历史预测信息对目标特征信息进行处理,获得与语音信号相对应的文本信息。
举例来说,现在的语音信号所对应的语义为“你好漂亮”,在获取到与上述语音信号“你”相对应的目标特征信息之后,可以获取历史预测信息,假设,历史预测信息可以包括:在语音信号“你”之后的输出信息为“们”的概率为P1,在语音信号“你”之后的输出信息为“好”的概率为P2,在语音信号“你”之后的输出信息为“在”的概率为P3,在语音信号“你”之后的输出信息为“是”的概率为P4等等,此时的历史预测信息中包括有与下列语义相对应的信息:“你们”、“你好”、“你是”、“你在”。
在获取到历史预测信息之后,利用多头注意力机制和历史预测信息对目标特征信息进行分析识别,进而可以准确地获取到与语音信号相对应的至少一个语义文本信息以及每个语义文本信息所对应的概率信息。可以理解的是,在至少一个语义文本信息的个数为一个时,则可以直接将该语义文本信息确定为最终的语义文本信息;在至少一个语义文本信息的个数为多个时,则可以获取与每个语义文本信息相对应的概率信息,而后将概率信息最大的语义文本信息确定为与语音信号相对应的最终文本信息,这样可以有效地提高对语音信号进行识别的准确可靠性。
图12为本发明实施例提供的又一种语音处理方法的示意图;参考附图12所示,本实施例提供了又一种语音处理方法,该方法的执行主体可以为语音处理装置,可以理解的是,该语音处理装置可以实现为软件、或者软件和硬件的组合。具体应用时,该语音处理装置可以为语音编码器,该语音编码器可以实现对语音信号进行处理,获得用于表 征语音信号中语义的特征信息。具体的,该语音处理方法可以包括:
步骤S1201:获取待处理的语音信号。
步骤S1202:分别利用第一神经网络、第二神经网络对语音信号进行处理,获得与语音信号相对应的第一特征信息、第二特征信息,其中,第一神经网络的计算效率高于第二神经网络的计算效率,第二神经网络输出的第二特征信息的准确性高于第一神经网络输出的第一特征信息的准确性。
步骤S1203:根据第一特征信息和第二特征信息,确定用于表征语音信号中语义的目标特征信息。
其中,第一神经网络可以包括以下任意之一:自注意力机制、静态记忆神经网络(Static Memory Nework,简称SMN),第二神经网络可以包括以下任意之一:自注意力机制、静态记忆神经网络。需要注意的是,第一神经网络的计算效率高于第二神经网络的计算效率,第二神经网络输出的第二特征信息的准确性高于第一神经网络输出的第一特征信息的准确性,上述的第一神经网络与第二神经网络之间有各自的优点,即第一神经网络在计算效率方便比较具有优势,第二神经网络在输出的特征信息的准确性方便比较具有优势。
可以理解的是,第一神经网络并不限于上述所例举的类型网络,本领域技术人员也可以根据具体的应用需求和设计需求将第一神经网络设置为其他类型的神经网络,只要能够使得第一神经网络对语音信号进行处理,获得用于标识语音信号中语义的特征信息即可,在此不再赘述。
相类似的,第二神经网络并不限于上述所例举的类型网络,本领域技术人员也可以根据具体的应用需求和设计需求将第二神经网络设置为其他类型的神经网络,只要能够保证第二神经网络和第一神经网络不同,并且,能够使得第二神经网络对语音信号进行处理,获得用于标识语音信号中语义的特征信息即可,在此不再赘述。
需要注意的是,第一神经网络和第二神经网络可以并不限于上述实施例限定的实现方式,例如:第二神经网络的计算效率高于第一神经网络的计算效率,第一神经网络输出的第一特征信息的准确性高于第二神经网络输出的第二特征信息的准确性。或者,在具体应用时,可以根据不同的应用场景来选择不同的神经网络来实现,例如:在需要保证计算效率的应用场景时,可以选择第一神经网络来对语音信息进行处理;在需要保证特征信息的准确性的应用场景时,可以选择第二神经网络来对语音信息进行处理。或者,在具体应用时,还可以根据不同的应用场景来选择第一神经网络和第二神经网络的不同 组合,从而实现了用户可以根据不同的应用场景来选择而不同的神经网络组合来确定用于表征语音信号中语义的目标特征信息,进一步提高了该方法使用的灵活可靠性。
相类似的,由于语音信号中包括用于标识语音语义的第一信号和用于标识用户特征的第二信号,因此,为了提高对语音信号进行处理的质量和效率,在获取到语音信号之后,可以利用第二神经网络对语音信号进行处理,从而可以获得与语音信号相对应的第二特征信息,该第二特征信息可以用于标识语音信号中所包括的语义。由于第二神经网络与第一神经网络不同,因此,通过第二神经网络所获得的第二特征信息与通过第一神经网络所获得的第一特征信息在语音识别的质量和效率上具有互补性。
在获取到第一特征信息和第二特征信息之后,可以对第一特征信息和第二特征信息进行分析处理,以确定用于表征语音信号中语义的目标特征信息。由于第一神经网络和第二神经网络不同,因此,在利用第一神经网络和第二神经网络对语音信号进行处理的效率和质量具有互补性。在获取到第一特征信息和第二特征信息之后,将具有互补性的第一特征信息与第二特征信息的和值确定为目标特征信息,由于此时的目标特征信息融合有第一特征信息和第二特征信息,进而有效地提高了对语音信号进行识别的质量和效率。
本实施例提供的语音处理方法,利用第一神经网络对所获取的语音信号进行处理,获得第一特征信息,并利用第二神经网络对所获取的语音信号进行处理,获得第二特征信息,由于第一神经网络和第二神经网络不同,因此,所获得的第一特征信息和第二特征信息在语音处理的效率和质量上具有互补性,而后根据第一特征信息和第二特征信息来确定用于表征语音信号中语义的目标特征信息,有效地保证了对目标特征信息进行获取的质量,进一步提高了对语音信号进行处理的质量和效率,保证了该方法的实用性。
在一些实例中,分别利用第一神经网络对语音信号进行处理,获得与语音信号相对应的第一特征信息可以包括:确定与语音信号相对应的语音特征信息,语音特征信息包括以下至少之一:检索词特征、关键字特征、值特征;基于自注意力机制对语音特征信息进行处理,获得第一特征信息。
在一些实例中,在语音特征信息包括:检索词特征、关键字特征和值特征时;基于自注意力机制对语音特征信息进行处理,获得第一特征信息可以包括:获取与检索词特征、关键字特征和值特征相对应的融合转换信息,融合转换信息中包括与检索词特征相对应的转换信息、与关键字特征相对应的转换信息以及与值特征相对应的转换信息;利用自注意力机制对检索词特征、关键字特征和值特征进行处理,确定与语音信号相对应 的注意力机制的数量;根据注意力机制的数量和融合转换信息,获得与语音信号相对应的第一特征信息。
在一些实例中,根据注意力机制的数量和融合转换信息,获得与语音信号相对应的第一特征信息可以包括:利用连接函数将所有数量的注意力机制进行组合,获得与注意力机制相对应的组合信息,其中,连接函数用于连接字符串;将组合信息与融合转换信息的乘积,确定为与语音信号相对应的第一特征信息。
在一些实例中,获取与检索词特征、关键字特征和值特征相对应的融合转换信息可以包括:分别获取与检索词特征、关键字特征和值特征相对应的第一转换信息、第二转换信息和第三转换信息;对第一转换信息、第二转换信息和第三转换信息进行拼接处理,获得融合转换信息。
在一些实例中,利用第二神经网络对语音信号进行处理,获得与语音信号相对应的第二特征信息可以包括:确定与语音信号相对应的值特征;利用静态记忆神经网络对值特征进行处理,获得第二特征信息。
在一些实例中,利用静态记忆神经网络对值特征进行处理,获得第二特征信息可以包括:获取与静态记忆神经网络相对应的滤波参数;确定与值特征相对应的表征信息;利用静态记忆神经网络和滤波参数对表征信息进行分析处理,获得与语音信号相对应的第二特征信息。
在一些实例中,根据第一特征信息和第二特征信息,确定用于表征语音信号中语义的目标特征信息可以包括:
将第一特征信息与第二特征信息的和值确定为目标特征信息。
在一些实例中,在确定用于表征语音信号中语义的目标特征信息之后,本实施例中的方法还可以包括:将目标特征信息发送至解码器,以使解码器对目标特征信息进行分析处理,获得与语音信号相对应的文本信息。
本实施例中的方法的执行过程、实现方式和技术效果与上述图1-图11所示实施例的方法的执行过程、实现方式和技术效果相类似,本实施例未详细描述的部分,可参考对图1-图9所示实施例的相关说明,在此不再赘述。
图13为本发明实施例提供的一种语音编码器的结构示意图;参考附图13所示,本实施例提供了一种语音编码器,该语音编码器可以执行上述图1所示的语音处理方法。该语音编码器可以包括:第一获取单元11、第一处理单元12和第一确定单元13,具体的,
第一获取单元11,用于获取待处理的语音信号;
第一处理单元12,用于利用第一神经网络对语音信号进行处理,获得与语音信号相对应的第一特征信息,第一特征信息用于标识语音信号中的语义;
第一处理单元12,还用于利用第二神经网络对语音信号进行处理,获得与语音信号相对应的第二特征信息,第二特征信息用于标识语音信号中的语义,其中,第二特征信息与第一特征信息不同;
第一确定单元13,用于根据第一特征信息和第二特征信息,确定用于表征语音信号中语义的目标特征信息。
在一些实例中,第一神经网络包括自注意力机制;第二神经网络包括静态记忆神经网络。
在一些实例中,在第一处理单元12利用第一神经网络对语音信号进行处理,获得与语音信号相对应的第一特征信息时,该第一处理单元12可以用于执行:确定与语音信号相对应的语音特征信息,语音特征信息包括以下至少之一:检索词特征、关键字特征、值特征;基于自注意力机制对语音特征信息进行处理,获得第一特征信息。
在一些实例中,在语音特征信息包括:检索词特征、关键字特征和值特征时;在第一处理单元12基于自注意力机制对语音特征信息进行处理,获得第一特征信息时,该第一处理单元12可以用于执行:获取与检索词特征、关键字特征和值特征相对应的融合转换信息,融合转换信息中包括与检索词特征相对应的转换信息、与关键字特征相对应的转换信息以及与值特征相对应的转换信息;利用自注意力机制对检索词特征、关键字特征和值特征进行处理,确定与语音信号相对应的注意力机制的数量;根据注意力机制的数量和融合转换信息,获得与语音信号相对应的第一特征信息。
在一些实例中,在第一处理单元12根据注意力机制的数量和融合转换信息,获得与语音信号相对应的第一特征信息时,该第一处理单元12可以用于执行:利用连接函数将所有数量的注意力机制进行组合,获得与注意力机制相对应的组合信息,其中,连接函数用于连接字符串;将组合信息与融合转换信息的乘积,确定为与语音信号相对应的第一特征信息。
在一些实例中,在第一处理单元12获取与检索词特征、关键字特征和值特征相对应的融合转换信息时,该第一处理单元12可以用于执行:分别获取与检索词特征、关键字特征和值特征相对应的第一转换信息、第二转换信息和第三转换信息;对第一转换信息、第二转换信息和第三转换信息进行拼接处理,获得融合转换信息。
在一些实例中,在第一处理单元12利用自注意力机制对检索词特征、关键字特征和值特征进行处理,确定与语音信号相对应的注意力机制的数量时,该第一处理单元12可以用于执行:利用以下公式,获得与语音信号相对应的注意力机制的数量:
Figure PCTCN2021081457-appb-000016
其中,head i为第i个注意力机制,Attention为自注意力机制,Q为检索词特征、K为关键字特征、V为值特征,
Figure PCTCN2021081457-appb-000017
为与第i个检索词特征相对应的第一转换信息,
Figure PCTCN2021081457-appb-000018
为与第i个关键字特征相对应的第二转换信息,
Figure PCTCN2021081457-appb-000019
为与第i个值特征相对应的第三转换信息。
在一些实例中,在第一处理单元12利用第二神经网络对语音信号进行处理,获得与语音信号相对应的第二特征信息时,该第一处理单元12可以用于执行:确定与语音信号相对应的值特征;利用静态记忆神经网络对值特征进行处理,获得第二特征信息。
在一些实例中,在第一处理单元12利用静态记忆神经网络对值特征进行处理,获得第二特征信息时,该第一处理单元12可以用于执行:获取与静态记忆神经网络相对应的滤波参数;确定与值特征相对应的表征信息;利用静态记忆神经网络和滤波参数对表征信息进行分析处理,获得与语音信号相对应的第二特征信息。
在一些实例中,在第一处理单元12利用静态记忆神经网络和滤波参数对表征信息进行分析处理,获得与语音信号相对应的第二特征信息时,该第一处理单元12可以用于执行:利用以下公式,获得与语音信号相对应的第二特征信息:
Figure PCTCN2021081457-appb-000020
其中,m t为第二特征信息,h t为值特征在t时刻的表征信息,α t、b t分别为可学习的滤波参数,⊙为点乘积,
Figure PCTCN2021081457-appb-000021
为值特征在t-s 1*i时刻的表征信息,
Figure PCTCN2021081457-appb-000022
为值特征在t-s 2*j时刻的表征信息,s 1*i、s 2*j分别为预设的步幅因子,i和j为累加的索引参数。
在一些实例中,在第一确定单元13根据第一特征信息和第二特征信息,确定用于表征语音信号中语义的目标特征信息时,该第一确定单元13可以用于执行:将第一特征信息与第二特征信息的和值确定为目标特征信息。
在一些实例中,在确定用于表征语音信号中语义的目标特征信息之后,本实施例中 的第一处理单元12还可以用于执行:将目标特征信息发送至解码器,以使解码器对目标特征信息进行分析处理,获得与语音信号相对应的文本信息。
图13所示装置可以执行图1-图9所示实施例的方法,本实施例未详细描述的部分,可参考对图1-图9所示实施例的相关说明。该技术方案的执行过程和技术效果参见图1-图9所示实施例中的描述,在此不再赘述。
在一个可能的设计中,图13所示语音编码器的结构可实现为一电子设备,该电子设备可以是手机、平板电脑、服务器等各种设备。如图14所示,该电子设备可以包括:第一处理器21和第一存储器22。其中,第一存储器22用于存储相对应电子设备执行上述图1-图9所示实施例中提供的语音处理方法的程序,第一处理器21被配置为用于执行第一存储器22中存储的程序。
程序包括一条或多条计算机指令,其中,一条或多条计算机指令被第一处理器21执行时能够实现如下步骤:
获取待处理的语音信号;
利用第一神经网络对语音信号进行处理,获得与语音信号相对应的第一特征信息,第一特征信息用于标识语音信号中的语义;
利用第二神经网络对语音信号进行处理,获得与语音信号相对应的第二特征信息,第二特征信息用于标识语音信号中的语义,其中,第二特征信息与第一特征信息不同;
根据第一特征信息和第二特征信息,确定用于表征语音信号中语义的目标特征信息。
进一步的,第一处理器21还用于执行前述图1-图9所示实施例中的全部或部分步骤。
其中,电子设备的结构中还可以包括第一通信接口23,用于电子设备与其他设备或通信网络通信。
另外,本发明实施例提供了一种计算机存储介质,用于储存电子设备所用的计算机软件指令,其包含用于执行上述图1-图9所示方法实施例中语音处理方法所涉及的程序。
图15为本发明实施例提供的一种语音解码器的结构示意图;参考附图15所示,本实施例提供了一种语音编码器,该语音编码器可以执行上述图10所示的语音处理方法。该语音编码器可以包括:第二接收模块31、第二获取模块32和第二处理模块33,具体的,
第二接收模块31,用于接收编码器发送的目标特征信息,目标特征信息与一语音信号相对应;
第二获取模块32,用于获取历史预测信息;
第二处理模块33,用于利用多头注意力机制和历史预测信息对目标特征信息进行处理,获得与语音信号相对应的文本信息。
图15所示装置可以执行图10-图11所示实施例的方法,本实施例未详细描述的部分,可参考对图10-图11所示实施例的相关说明。该技术方案的执行过程和技术效果参见图10-图11所示实施例中的描述,在此不再赘述。
在一个可能的设计中,图15所示语音编码器的结构可实现为一电子设备,该电子设备可以是手机、平板电脑、服务器等各种设备。如图16所示,该电子设备可以包括:第二处理器41和第二存储器42。其中,第二存储器42用于存储相对应电子设备执行上述图10-图11所示实施例中提供的语音处理方法的程序,第二处理器41被配置为用于执行第二存储器42中存储的程序。
程序包括一条或多条计算机指令,其中,一条或多条计算机指令被第二处理器41执行时能够实现如下步骤:
接收编码器发送的目标特征信息,目标特征信息与一语音信号相对应;
获取历史预测信息;
利用多头注意力机制和历史预测信息对目标特征信息进行处理,获得与语音信号相对应的文本信息。
进一步的,第二处理器41还用于执行前述图10-图11所示实施例中的全部或部分步骤。
其中,电子设备的结构中还可以包括第二通信接口43,用于电子设备与其他设备或通信网络通信。
另外,本发明实施例提供了一种计算机存储介质,用于储存电子设备所用的计算机软件指令,其包含用于执行上述图10-图11所示方法实施例中语音处理方法所涉及的程序。
图17为本发明实施例提供的另一种语音编码器的结构示意图;参考附图17所示,本实施例提供了另一种语音编码器,该语音编码器可以执行上述图12所示的语音处理方法。该语音编码器可以包括:第三获取单元51、第三处理单元52和第三确定单元53,具体的,
第三获取模块51,用于获取待处理的语音信号;
第三处理模块52,用于分别利用第一神经网络、第二神经网络对语音信号进行处理,获得与语音信号相对应的第一特征信息、第二特征信息,其中,第一神经网络的计算效 率高于第二神经网络的计算效率,第二神经网络输出的第二特征信息的准确性高于第一神经网络输出的第一特征信息的准确性;
第三确定模块53,用于根据第一特征信息和第二特征信息,确定用于表征语音信号中语义的目标特征信息。
在一些实例中,第一神经网络包括自注意力机制;第二神经网络包括静态记忆神经网络。
在一些实例中,在第三处理模块52分别利用第一神经网络对语音信号进行处理,获得与语音信号相对应的第一特征信息时,该第三处理模块52可以用于执行:确定与语音信号相对应的语音特征信息,语音特征信息包括以下至少之一:检索词特征、关键字特征、值特征;基于自注意力机制对语音特征信息进行处理,获得第一特征信息。
在一些实例中,在语音特征信息包括:检索词特征、关键字特征和值特征时;在第三处理模块52基于自注意力机制对语音特征信息进行处理,获得第一特征信息时,该第三处理模块52可以用于执行:获取与检索词特征、关键字特征和值特征相对应的融合转换信息,融合转换信息中包括与检索词特征相对应的转换信息、与关键字特征相对应的转换信息以及与值特征相对应的转换信息;利用自注意力机制对检索词特征、关键字特征和值特征进行处理,确定与语音信号相对应的注意力机制的数量;根据注意力机制的数量和融合转换信息,获得与语音信号相对应的第一特征信息。
在一些实例中,在第三处理模块52根据注意力机制的数量和融合转换信息,获得与语音信号相对应的第一特征信息时,该第三处理模块52可以用于执行:利用连接函数将所有数量的注意力机制进行组合,获得与注意力机制相对应的组合信息,其中,连接函数用于连接字符串;将组合信息与融合转换信息的乘积,确定为与语音信号相对应的第一特征信息。
在一些实例中,在第三处理模块52获取与检索词特征、关键字特征和值特征相对应的融合转换信息时,该第三处理模块52可以用于执行:分别获取与检索词特征、关键字特征和值特征相对应的第一转换信息、第二转换信息和第三转换信息;对第一转换信息、第二转换信息和第三转换信息进行拼接处理,获得融合转换信息。
在一些实例中,在第三处理模块52利用第二神经网络对语音信号进行处理,获得与语音信号相对应的第二特征信息时,该第三处理模块52可以用于执行:确定与语音信号相对应的值特征;利用静态记忆神经网络对值特征进行处理,获得第二特征信息。
在一些实例中,在第三处理模块52利用静态记忆神经网络对值特征进行处理,获得 第二特征信息时,该第三处理模块52可以用于执行:获取与静态记忆神经网络相对应的滤波参数;确定与值特征相对应的表征信息;利用静态记忆神经网络和滤波参数对表征信息进行分析处理,获得与语音信号相对应的第二特征信息。
In some examples, when the third determining module 53 determines, according to the first feature information and the second feature information, the target feature information used to represent the semantics in the voice signal, the third determining module 53 may be configured to: determine the sum of the first feature information and the second feature information as the target feature information.

In some examples, after the target feature information used to represent the semantics in the voice signal is determined, the third processing module 52 in this embodiment may be further configured to: send the target feature information to a decoder, so that the decoder analyzes and processes the target feature information to obtain text information corresponding to the voice signal.

The apparatus shown in FIG. 17 can execute the method of the embodiment shown in FIG. 12. For the parts of this embodiment that are not described in detail, reference may be made to the relevant description of the embodiment shown in FIG. 12. For the execution process and technical effects of this technical solution, refer to the description in the embodiment shown in FIG. 12, which is not repeated here.

In one possible design, the structure of the voice encoder shown in FIG. 17 may be implemented as an electronic device, which may be any of various devices such as a mobile phone, a tablet computer, or a server. As shown in FIG. 18, the electronic device may include a third processor 61 and a third memory 62, where the third memory 62 is configured to store a program for the electronic device to execute the voice processing method provided in the embodiment shown in FIG. 12, and the third processor 61 is configured to execute the program stored in the third memory 62.

The program includes one or more computer instructions, and when the one or more computer instructions are executed by the third processor 61, the following steps can be implemented:

acquiring a voice signal to be processed;

processing the voice signal by using a first neural network and a second neural network respectively to obtain first feature information and second feature information corresponding to the voice signal, where the computational efficiency of the first neural network is higher than that of the second neural network, and the accuracy of the second feature information output by the second neural network is higher than that of the first feature information output by the first neural network;

determining, according to the first feature information and the second feature information, target feature information used to represent the semantics in the voice signal.

Further, the third processor 61 is also configured to execute all or some of the steps in the foregoing embodiment shown in FIG. 12.

The structure of the electronic device may further include a third communication interface 63, configured for the electronic device to communicate with other devices or a communication network.

In addition, an embodiment of the present invention provides a computer storage medium for storing the computer software instructions used by the electronic device, which include the program involved in executing the voice processing method in the method embodiment shown in FIG. 12.

FIG. 19 is a schematic structural diagram of a voice recognition system according to an embodiment of the present invention, and FIG. 20 is a schematic diagram of an application of the voice recognition system according to an embodiment of the present invention. Referring to FIG. 19 and FIG. 20, this embodiment provides a voice recognition system that can recognize and process a voice signal input by a user, so that text information corresponding to the voice signal can be obtained. Specifically, the voice recognition system may include:

the voice encoder 71 shown in FIG. 13 or FIG. 17, which can be configured to perform data dimensionality reduction processing on the acquired voice signal to obtain voice feature information corresponding to the voice signal, the voice feature information being used to identify the semantic information in the voice signal.

In some examples, the system may further include:

a voice decoder 72, configured to receive the voice feature information sent by the voice encoder 71 and to output, based on the voice feature information, text information corresponding to the voice signal.

In some examples, when the voice decoder 72 outputs, based on the voice feature information, the text information corresponding to the voice signal, the voice decoder 72 may be configured to: acquire historical prediction information; and process the voice feature information by using a multi-head attention mechanism and the historical prediction information to obtain the text information corresponding to the voice signal.

Specifically, referring to FIG. 19 and FIG. 20, the voice recognition process performed by the voice recognition system may include the following steps:

The voice encoder 71 acquires a voice signal S input by a user, filters out the redundant signals included in the voice signal by using a preset feed-forward network to obtain a voice signal S1, and then processes the voice signal S1 by using a bidirectional DSMN network to obtain feature information S2 used to identify the semantic information in the voice signal S. The bidirectional DSMN network can process the voice signal S1 in combination with data at historical moments and future moments, so that the feature information S2 can be obtained.

After the feature information S2 is obtained, data normalization processing may be performed on the feature information S2, and the feature information S2 may then be processed by a feed-forward network to remove the redundant signals included in it and obtain feature information S3. Data normalization processing is then performed on the feature information S3 again, so that target feature information S4 corresponding to the voice signal S can be obtained and sent to the voice decoder 72.

The voice decoder 72 acquires the target feature information S4 sent by the voice encoder 71, then acquires historical prediction information M and encodes it to obtain historical prediction information M1. A feed-forward network is used to filter out the redundant signals included in the historical prediction information M1 to obtain historical prediction information M2, which is then processed by a unidirectional DSMN network to obtain historical prediction information M3 corresponding to the historical prediction information M2. The unidirectional DSMN network can process the historical prediction information M2 in combination with data at historical moments, so that the historical prediction information M3 can be obtained. Data normalization processing is then performed on the historical prediction information M3, so that historical prediction information M4 corresponding to the historical prediction information M can be obtained and sent to the multi-head attention network.

After the multi-head attention network obtains the historical prediction information M4 and the target feature information S4, it can analyze and process the target feature information S4 in combination with the historical prediction information M4, so that text information W corresponding to the target feature information S4 can be obtained.

After the text information W is obtained, in order to improve the quality and efficiency of voice recognition, data normalization processing may further be performed on the obtained text information W to obtain text information W1. A feed-forward network is then used to filter out the redundant signals included in the text information W1, and a normalization function is used to process the processed text information, so that target text information W2 corresponding to the voice signal S can be obtained.
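For orientation only, the hypothetical modules sketched earlier can be wired together as follows; the dimensions, vocabulary size, and token ids are arbitrary assumptions, and the classes are the illustrative ones defined in the sketches above, not the patented implementation:

    import torch

    dim, heads, vocab = 256, 4, 5000
    encoder = DualBranchEncoderLayer(
        FusedQKVSelfAttention(dim, heads),
        StaticMemoryBlock(dim, n_back=4, n_fwd=4, s1=2, s2=2),
    )
    decoder = AttentionDecoderStep(vocab, dim, heads)

    frames = torch.randn(1, 120, dim)      # acoustic feature frames (cf. S1)
    target = encoder(frames)               # target feature information (cf. S4)
    history = torch.tensor([[1, 17, 42]])  # historical prediction info (cf. M)
    next_logits = decoder(history, target)  # distribution over the next token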
In the voice recognition system provided by this embodiment, the voice encoder 71 acquires the voice signal to be recognized and determines the target feature information corresponding to the voice signal, and then sends the target feature information to the voice decoder 72. After the voice decoder 72 acquires the target feature information, it performs a voice recognition operation on the target feature information through the multi-head attention mechanism, so that the text information corresponding to the voice signal can be obtained. This not only effectively implements the voice recognition operation, but also improves the quality and efficiency of processing the voice signal, further improving the stability and reliability of the voice recognition system.

The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the solutions of the embodiments. Those of ordinary skill in the art can understand and implement them without creative effort.

Through the description of the above embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of a necessary general-purpose hardware platform, or by a combination of hardware and software. Based on this understanding, the above technical solutions, in essence or in the part contributing to the prior art, can be embodied in the form of a computer product. The present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present invention. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable device to produce a machine, so that the instructions executed by the processor of the computer or other programmable device produce an apparatus for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable device, so that a series of operational steps are executed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.

In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface, and memory.

The memory may include non-persistent storage in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include persistent and non-persistent, removable and non-removable media, and information storage can be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions recorded in the foregoing embodiments, or make equivalent replacements for some of the technical features therein, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (31)

  1. A data processing method, comprising:
    acquiring a voice signal to be processed;
    processing the voice signal by using a first neural network and a second neural network respectively to obtain first feature information and second feature information corresponding to the voice signal, wherein the computational efficiency of the first neural network is higher than the computational efficiency of the second neural network, and the accuracy of the second feature information output by the second neural network is higher than the accuracy of the first feature information output by the first neural network; and
    determining, according to the first feature information and the second feature information, target feature information used to represent semantics in the voice signal.
  2. The method according to claim 1, wherein
    the first neural network comprises a self-attention mechanism, and the second neural network comprises a static memory neural network.
  3. The method according to claim 2, wherein processing the voice signal by using the first neural network to obtain the first feature information corresponding to the voice signal comprises:
    determining voice feature information corresponding to the voice signal, the voice feature information comprising at least one of the following: a query feature, a key feature, and a value feature; and
    processing the voice feature information based on the self-attention mechanism to obtain the first feature information.
  4. The method according to claim 3, wherein, when the voice feature information comprises a query feature, a key feature, and a value feature, processing the voice feature information based on the self-attention mechanism to obtain the first feature information comprises:
    acquiring fused transformation information corresponding to the query feature, the key feature, and the value feature, the fused transformation information comprising transformation information corresponding to the query feature, transformation information corresponding to the key feature, and transformation information corresponding to the value feature;
    processing the query feature, the key feature, and the value feature by using the self-attention mechanism to determine the number of attention mechanisms corresponding to the voice signal; and
    obtaining the first feature information corresponding to the voice signal according to the number of attention mechanisms and the fused transformation information.
  5. The method according to claim 4, wherein obtaining the first feature information corresponding to the voice signal according to the number of attention mechanisms and the fused transformation information comprises:
    combining all of the attention mechanisms by using a concatenation function to obtain combined information corresponding to the attention mechanisms, wherein the concatenation function is used to join strings; and
    determining the product of the combined information and the fused transformation information as the first feature information corresponding to the voice signal.
  6. The method according to claim 4, wherein acquiring the fused transformation information corresponding to the query feature, the key feature, and the value feature comprises:
    acquiring first transformation information, second transformation information, and third transformation information corresponding to the query feature, the key feature, and the value feature respectively; and
    splicing the first transformation information, the second transformation information, and the third transformation information to obtain the fused transformation information.
  7. The method according to claim 2, wherein processing the voice signal by using the second neural network to obtain the second feature information corresponding to the voice signal comprises:
    determining a value feature corresponding to the voice signal; and
    processing the value feature by using the static memory neural network to obtain the second feature information.
  8. The method according to claim 7, wherein processing the value feature by using the static memory neural network to obtain the second feature information comprises:
    acquiring filter parameters corresponding to the static memory neural network;
    determining representation information corresponding to the value feature; and
    analyzing and processing the representation information by using the static memory neural network and the filter parameters to obtain the second feature information corresponding to the voice signal.
  9. The method according to any one of claims 1 to 8, wherein determining, according to the first feature information and the second feature information, the target feature information used to represent the semantics in the voice signal comprises:
    determining the sum of the first feature information and the second feature information as the target feature information.
  10. The method according to any one of claims 1 to 8, wherein, after the target feature information used to represent the semantics in the voice signal is determined, the method further comprises:
    sending the target feature information to a decoder, so that the decoder analyzes and processes the target feature information to obtain text information corresponding to the voice signal.
  11. A voice processing method, comprising:
    acquiring a voice signal to be processed;
    processing the voice signal by using a first neural network to obtain first feature information corresponding to the voice signal, the first feature information being used to identify semantics in the voice signal;
    processing the voice signal by using a second neural network to obtain second feature information corresponding to the voice signal, the second feature information being used to identify the semantics in the voice signal, wherein the second feature information is different from the first feature information; and
    determining, according to the first feature information and the second feature information, target feature information used to represent the semantics in the voice signal.
  12. The method according to claim 11, wherein
    the first neural network comprises a self-attention mechanism, and the second neural network comprises a static memory neural network.
  13. The method according to claim 12, wherein processing the voice signal by using the first neural network to obtain the first feature information corresponding to the voice signal comprises:
    determining voice feature information corresponding to the voice signal, the voice feature information comprising at least one of the following: a query feature, a key feature, and a value feature; and
    processing the voice feature information based on the self-attention mechanism to obtain the first feature information.
  14. The method according to claim 13, wherein, when the voice feature information comprises a query feature, a key feature, and a value feature, processing the voice feature information based on the self-attention mechanism to obtain the first feature information comprises:
    acquiring fused transformation information corresponding to the query feature, the key feature, and the value feature, the fused transformation information comprising transformation information corresponding to the query feature, transformation information corresponding to the key feature, and transformation information corresponding to the value feature;
    processing the query feature, the key feature, and the value feature by using the self-attention mechanism to determine the number of attention mechanisms corresponding to the voice signal; and
    obtaining the first feature information corresponding to the voice signal according to the number of attention mechanisms and the fused transformation information.
  15. The method according to claim 14, wherein obtaining the first feature information corresponding to the voice signal according to the number of attention mechanisms and the fused transformation information comprises:
    combining all of the attention mechanisms by using a concatenation function to obtain combined information corresponding to the attention mechanisms, wherein the concatenation function is used to join strings; and
    determining the product of the combined information and the fused transformation information as the first feature information corresponding to the voice signal.
  16. The method according to claim 14, wherein acquiring the fused transformation information corresponding to the query feature, the key feature, and the value feature comprises:
    acquiring first transformation information, second transformation information, and third transformation information corresponding to the query feature, the key feature, and the value feature respectively; and
    splicing the first transformation information, the second transformation information, and the third transformation information to obtain the fused transformation information.
  17. The method according to claim 12, wherein processing the voice signal by using the second neural network to obtain the second feature information corresponding to the voice signal comprises:
    determining a value feature corresponding to the voice signal; and
    processing the value feature by using the static memory neural network to obtain the second feature information.
  18. The method according to claim 17, wherein processing the value feature by using the static memory neural network to obtain the second feature information comprises:
    acquiring filter parameters corresponding to the static memory neural network;
    determining representation information corresponding to the value feature; and
    analyzing and processing the representation information by using the static memory neural network and the filter parameters to obtain the second feature information corresponding to the voice signal.
  19. The method according to any one of claims 11 to 18, wherein determining, according to the first feature information and the second feature information, the target feature information used to represent the semantics in the voice signal comprises:
    determining the sum of the first feature information and the second feature information as the target feature information.
  20. The method according to any one of claims 11 to 18, wherein, after the target feature information used to represent the semantics in the voice signal is determined, the method further comprises:
    sending the target feature information to a decoder, so that the decoder analyzes and processes the target feature information to obtain text information corresponding to the voice signal.
  21. A voice processing method, comprising:
    receiving target feature information sent by an encoder, the target feature information corresponding to a voice signal;
    acquiring historical prediction information; and
    processing the target feature information by using a multi-head attention mechanism and the historical prediction information to obtain text information corresponding to the voice signal.
  22. A voice encoder, comprising:
    a first acquiring unit, configured to acquire a voice signal to be processed;
    a first processing unit, configured to process the voice signal by using a first neural network to obtain first feature information corresponding to the voice signal, the first feature information being used to identify semantics in the voice signal;
    the first processing unit being further configured to process the voice signal by using a second neural network to obtain second feature information corresponding to the voice signal, the second feature information being used to identify the semantics in the voice signal, wherein the second feature information is different from the first feature information; and
    a first determining unit, configured to determine, according to the first feature information and the second feature information, target feature information used to represent the semantics in the voice signal.
  23. An electronic device, comprising a memory and a processor, wherein the memory is configured to store one or more computer instructions, and when the one or more computer instructions are executed by the processor, the voice processing method according to any one of claims 11 to 20 is implemented.
  24. A voice decoder, comprising:
    a second receiving module, configured to receive target feature information sent by an encoder, the target feature information corresponding to a voice signal;
    a second acquiring module, configured to acquire historical prediction information; and
    a second processing module, configured to process the target feature information by using a multi-head attention mechanism and the historical prediction information to obtain text information corresponding to the voice signal.
  25. An electronic device, comprising a memory and a processor, wherein the memory is configured to store one or more computer instructions, and when the one or more computer instructions are executed by the processor, the voice processing method according to claim 21 is implemented.
  26. A voice recognition system, comprising:
    the voice encoder according to claim 22, configured to perform data dimensionality reduction processing on an acquired voice signal to obtain voice feature information corresponding to the voice signal.
  27. The system according to claim 26, further comprising:
    a voice decoder, configured to receive the voice feature information sent by the voice encoder, and to output, based on the voice feature information, text information corresponding to the voice signal.
  28. The system according to claim 27, wherein the voice decoder is further configured to:
    acquire historical prediction information; and
    process the voice feature information by using a multi-head attention mechanism and the historical prediction information to obtain the text information corresponding to the voice signal.
  29. A voice encoder, comprising:
    a third acquiring module, configured to acquire a voice signal to be processed;
    a third processing module, configured to process the voice signal by using a first neural network and a second neural network respectively to obtain first feature information and second feature information corresponding to the voice signal, wherein the computational efficiency of the first neural network is higher than the computational efficiency of the second neural network, and the accuracy of the second feature information output by the second neural network is higher than the accuracy of the first feature information output by the first neural network; and
    a third determining module, configured to determine, according to the first feature information and the second feature information, target feature information used to represent the semantics in the voice signal.
  30. An electronic device, comprising a memory and a processor, wherein the memory is configured to store one or more computer instructions, and when the one or more computer instructions are executed by the processor, the data processing method according to any one of claims 1 to 10 is implemented.
  31. A voice recognition system, comprising:
    the voice encoder according to claim 29, configured to perform data dimensionality reduction processing on an acquired voice signal to obtain voice feature information corresponding to the voice signal.