WO2021139772A1 - Audio information processing method, apparatus, electronic device, and storage medium - Google Patents
Audio information processing method, apparatus, electronic device, and storage medium
- Publication number: WO2021139772A1 (application PCT/CN2021/070879)
- Authority: WIPO (PCT)
- Prior art keywords: audio, feature, information, audio feature, decoded
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L15/00—Speech recognition
- G10L15/005—Language recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/16—Speech classification or search using artificial neural networks
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L19/008—Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the extracted parameters being spectral information of each sub-band
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
- G10L2015/223—Execution procedure of a spoken command
- G10L2015/225—Feedback of the input speech
Definitions
- This application relates to the field of computer technology, in particular to an audio information processing method, device, electronic equipment, and storage medium.
- In recent years, end-to-end speech recognition methods have received increasing attention in the field of speech recognition.
- An end-to-end speech recognition method unifies the acoustic model and the language model of traditional speech recognition, and can obtain the text information corresponding to audio information directly from the audio information, thereby simplifying the speech recognition process.
- Existing end-to-end speech recognition methods are mainly based on neural networks such as RNNs (Recurrent Neural Networks) or CNNs (Convolutional Neural Networks).
- the present application provides an audio information processing method, device, electronic equipment, and storage medium, so as to reduce the computational complexity in the audio information processing process and improve the efficiency of audio information processing.
- This application provides an audio information processing method, including:
- obtaining a first audio feature corresponding to audio information; encoding the audio feature at a specified time according to the audio feature at the specified time in the first audio feature and the audio features adjacent to it, to obtain a second audio feature corresponding to the audio information; obtaining decoded text information corresponding to the audio information; and obtaining text information corresponding to the audio information according to the second audio feature and the decoded text information.
- Optionally, encoding the audio feature at the specified time according to the audio feature at the specified time in the first audio feature and the audio features adjacent to it includes: selecting audio features at multiple target times from among the audio features adjacent to the audio feature at the specified time; and encoding the audio feature at the specified time according to the audio feature at the specified time and the audio features at the multiple target times.
- Optionally, encoding the audio feature at the specified time according to the audio feature at the specified time and the audio features at the multiple target times includes: encoding the audio feature at the specified time a first time according to the audio feature at the specified time and the audio features at the multiple target times, to obtain a first encoded audio feature corresponding to the first audio feature;
- encoding the audio feature at the specified time a second time according to the first encoded audio feature corresponding to the audio feature at the specified time and the first encoded audio features corresponding to the audio features at the multiple target times, to obtain a second encoded audio feature corresponding to the first audio feature;
- performing the above steps in sequence until the number of encodings reaches the specified number, completing the encoding of the audio feature at the specified time;
- using the final encoded audio feature corresponding to the first audio feature as the second audio feature.
- Optionally, encoding the audio feature at the specified time a first time according to the audio feature at the specified time and the audio features at the multiple target times, to obtain the first encoded audio feature corresponding to the first audio feature, includes: performing the first encoding according to the linear audio feature at the specified time, the nonlinear audio feature at the specified time, the linear audio features at the multiple target times, and the nonlinear audio features at the multiple target times, to obtain the first encoded audio feature corresponding to the first audio feature.
- Optionally, encoding the audio feature at the specified time a second time, to obtain the second encoded audio feature corresponding to the first audio feature, includes: performing the second encoding according to the first encoded linear audio feature corresponding to the audio feature at the specified time, the first encoded nonlinear audio feature corresponding to the audio feature at the specified time, the first encoded linear audio features corresponding to the audio features at the multiple target times, and the first encoded nonlinear audio features corresponding to the audio features at the multiple target times, to obtain the second encoded audio feature corresponding to the first audio feature.
- Optionally, the method further includes: performing linear rectification on the first encoded linear audio feature corresponding to the first audio feature to obtain the first encoded nonlinear audio feature corresponding to the first audio feature.
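Stated as a hedged formula (the notation below is ours, not the application's): writing $h_t^{(n)}$ for the $n$-th encoded audio feature at specified time $t$, $\mathcal{T}(t)$ for the set of selected target times, $l_t^{(n)}$ for the linear (projected) feature, and $g_t^{(n)} = \max(0, l_t^{(n)})$ for its ReLU rectification, the iterative encoding described above can be summarized as:

$$h_t^{(n)} = f^{(n)}\big(l_t^{(n-1)},\; g_t^{(n-1)},\; \{\, l_\tau^{(n-1)},\, g_\tau^{(n-1)} : \tau \in \mathcal{T}(t) \,\}\big), \qquad n = 1, \dots, N,$$

where $h^{(0)}$ is the first audio feature, $f^{(n)}$ is the $n$-th encoding transformation, and the final $h^{(N)}$ is used as the second audio feature.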
- Optionally, selecting the audio features at the multiple target times from among the audio features adjacent to the audio feature at the specified time includes: determining a range of audio features adjacent to the audio feature at the specified time; and selecting the audio features at the multiple target times from among the adjacent audio features according to that range.
- Optionally, determining the range of audio features adjacent to the audio feature at the specified time includes: determining a first range of adjacent audio features before the audio feature at the specified time, and determining a second range of adjacent audio features after the audio feature at the specified time;
- selecting the audio features at the multiple target times according to that range then includes: selecting, according to the first range and the second range, the audio features at the multiple target times from among the audio features adjacent to the audio feature at the specified time.
- Optionally, selecting the audio features at the multiple target times according to the first range and the second range includes: obtaining a stride factor, which indicates the time interval at which audio features at target times are selected from among the audio features adjacent to the audio feature at the specified time; and selecting, according to the stride factor, the first range, and the second range, the audio features at the multiple target times from among the adjacent audio features.
- Optionally, the stride factor includes a first stride factor, and the method includes: selecting audio features at target times from among the adjacent audio features according to the first stride factor and the first range.
- Optionally, the stride factor includes a second stride factor, and the method includes: selecting audio features at target times from among the adjacent audio features according to the second stride factor and the second range.
- Optionally, obtaining the text information corresponding to the audio information according to the second audio feature and the decoded text information includes: decoding the to-be-decoded audio information corresponding to the second audio feature according to the second audio feature and the decoded text information, to obtain the text information corresponding to the audio information.
- Optionally, decoding the to-be-decoded audio information corresponding to the second audio feature according to the second audio feature and the decoded text information, to obtain the text information corresponding to the audio information, includes: obtaining first to-be-decoded audio information corresponding to the second audio feature; decoding the first to-be-decoded audio information according to the second audio feature and the decoded text information to obtain first decoded text information; updating the first decoded text information as decoded information; decoding second to-be-decoded audio information according to the second audio feature and the decoded text information to obtain second decoded text information; and performing the above steps in sequence until all to-be-decoded audio information corresponding to the second audio feature has been decoded, obtaining the text information corresponding to the audio information.
- Optionally, the decoded text information includes: instruction information used to instruct decoding of the to-be-decoded audio information corresponding to the second audio feature.
- Optionally, decoding the first to-be-decoded audio information according to the second audio feature and the decoded text information to obtain the first decoded text information includes: decoding the first to-be-decoded audio information to obtain the text information corresponding to it;
- specifically, this includes: obtaining, according to the second audio feature and the decoded text information, a predicted value of the text unit corresponding to the first to-be-decoded audio information; obtaining a probability distribution over text units; and taking the text unit with the largest probability value as the text information corresponding to the first to-be-decoded audio information.
- Optionally, obtaining the first audio feature corresponding to the audio information includes: obtaining the audio information; and performing feature extraction on the audio information to obtain the first audio feature.
- Optionally, performing feature extraction on the audio information to obtain the first audio feature includes: performing feature extraction on the audio information to obtain a first audio feature sequence corresponding to the audio information.
- This application also provides an audio information processing device, including:
- the first audio feature obtaining unit is configured to obtain the first audio feature corresponding to the audio information
- the second audio feature obtaining unit is configured to encode the audio feature at a specified time according to the audio feature at the specified time in the first audio feature and the audio features adjacent to it, to obtain the second audio feature corresponding to the audio information;
- the decoded text information obtaining unit is configured to obtain the decoded text information corresponding to the audio information; and the text information obtaining unit is configured to obtain the text information corresponding to the audio information according to the second audio feature and the decoded text information.
- This application also provides an electronic device, including: a processor; and
- a memory, configured to store a program of the audio information processing method; after the device is powered on and the program is run by the processor, the following steps are executed:
- obtaining a first audio feature corresponding to audio information; encoding the audio feature at a specified time according to the audio feature at the specified time in the first audio feature and the audio features adjacent to it, to obtain a second audio feature corresponding to the audio information; obtaining decoded text information corresponding to the audio information; and obtaining text information corresponding to the audio information according to the second audio feature and the decoded text information.
- This application also provides a storage device that stores a program of the audio information processing method; when the program is run by a processor, the following steps are executed: obtaining a first audio feature corresponding to audio information; encoding the audio feature at a specified time according to the audio feature at the specified time in the first audio feature and the audio features adjacent to it, to obtain a second audio feature corresponding to the audio information; obtaining decoded text information corresponding to the audio information; and obtaining text information corresponding to the audio information according to the second audio feature and the decoded text information.
- This application also provides a smart speaker, including: an audio collection device and an audio recognition device, wherein the audio recognition device includes: an audio feature extraction module, an audio feature encoding module, a decoded text storage module, and an audio feature decoding module; the audio collection device is configured to obtain audio information;
- the audio feature extraction module is configured to obtain the first audio feature corresponding to the audio information;
- the audio feature encoding module is configured to encode the audio feature at a specified time according to the audio feature at the specified time in the first audio feature and the audio features adjacent to it, to obtain the second audio feature corresponding to the audio information;
- the decoded text storage module is configured to obtain the decoded text information corresponding to the audio information; and the audio feature decoding module is configured to obtain the text information corresponding to the audio information according to the second audio feature and the decoded text information.
- This application also provides a vehicle-mounted intelligent voice interaction device, including: an audio collection device, an audio recognition device, and an execution device, wherein the audio recognition device includes: an audio feature extraction module, an audio feature encoding module, a decoded text storage module, and an audio feature decoding module;
- the audio collection device is configured to obtain audio information;
- the audio feature extraction module is configured to obtain the first audio feature corresponding to the audio information;
- the audio feature encoding module is configured to encode the audio feature at a specified time according to the audio feature at the specified time in the first audio feature and the audio features adjacent to it, to obtain the second audio feature corresponding to the audio information;
- the decoded text storage module is configured to obtain the decoded text information corresponding to the audio information;
- the audio feature decoding module is configured to obtain the text information corresponding to the audio information according to the second audio feature and the decoded text information;
- the execution device is configured to execute corresponding instructions according to the text information corresponding to the audio information.
- This application also provides an audio information processing system, including: a client and a server; the client is configured to obtain audio information and send it to the server;
- the server is configured to: obtain a first audio feature corresponding to the audio information; encode the audio feature at a specified time according to the audio feature at the specified time in the first audio feature and the audio features adjacent to it, to obtain a second audio feature corresponding to the audio information; obtain decoded text information corresponding to the audio information; obtain text information corresponding to the audio information according to the second audio feature and the decoded text information; and provide the text information corresponding to the audio information to the client.
- The audio information processing method provided in this application first obtains the first audio feature corresponding to the audio information; second, it encodes the audio feature at the specified time according to the audio feature at the specified time in the first audio feature and the audio features adjacent to it, to obtain the second audio feature corresponding to the audio information; third, it obtains the decoded text information corresponding to the audio information; finally, it obtains the text information corresponding to the audio information according to the second audio feature and the decoded text information.
- The audio information processing method provided by this application can encode the audio feature at the specified time based on the audio feature at the specified time in the first audio feature and the audio features adjacent to it, to obtain the second audio feature corresponding to the audio information, and further obtain the text information corresponding to the audio information according to the second audio feature and the decoded text information.
- In the process of obtaining the second audio feature and of obtaining the text information corresponding to the audio information according to the second audio feature and the decoded text information, the method needs fewer parameters, thereby reducing the computational complexity of audio information processing and improving its efficiency.
- FIG. 1 is a schematic diagram of a first application scenario embodiment of an audio information processing method provided by this application.
- FIG. 2 is a schematic diagram of a second application scenario embodiment of the audio information processing method provided by this application.
- FIG. 3 is a flowchart of an audio information processing method provided in the first embodiment of this application.
- FIG. 4 is a flowchart of a method for encoding the audio feature at a specified time provided in the first embodiment of this application.
- FIG. 5 is a flowchart of a method for selecting audio features at multiple target times provided in the first embodiment of this application.
- FIG. 6 is a flowchart of a method for obtaining text information corresponding to audio information provided in the first embodiment of this application.
- FIG. 7 is a schematic diagram of an audio information processing device provided in the second embodiment of this application.
- FIG. 8 is a schematic diagram of an electronic device provided in an embodiment of the application.
- FIG. 1 is a schematic diagram of a first application scenario embodiment of the audio information processing method provided in this application.
- Here, the application of the audio information processing method provided in this application to a simultaneous translation headset is taken as an example to describe the method in detail.
- the audio information is the user's voice information.
- When the user converses using the simultaneous translation headset, the headset collects the target user's voice information through its own sound collection device. After the voice information of the target user is collected, the headset first identifies the language of the voice and further determines whether that language is the to-be-translated language preset by the user. If it is, the headset processes the user's voice information, recognizing and translating it.
- The specific process by which the simultaneous translation headset recognizes the target user's voice information is as follows: first, noise reduction is performed on the voice information; after the noise reduction processing, acoustic feature extraction is performed on the voice information to obtain the first voice feature corresponding to the voice information.
- The first voice feature is specifically a voice feature sequence, that is, the voice features of the voice information over N voice frames; the voice features include phoneme features of the voice, spectrum features of the voice, and so on.
- After the first voice feature is obtained, the coding unit of the simultaneous translation headset encodes the voice feature at the specified time according to the voice feature at the specified time in the first voice feature and the voice features adjacent to it, to obtain the second voice feature corresponding to the voice information.
- The specified time is determined according to the preset number of encodings and the audio length. Specifically, the encoding time interval is calculated from the audio length and the preset number of encodings, and a moment is selected as the starting time; each specified time can then be obtained from the starting time, the number of encodings, and the time interval, as in the sketch below.
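A minimal sketch of this schedule (the helper name and the choice of 0 as the starting time are assumptions for illustration, not values from the application):

```python
def specified_times(audio_len_s: float, num_encodings: int, start_s: float = 0.0) -> list[float]:
    """Derive the specified times from the audio length and preset number of encodings.

    The encoding time interval is the audio length divided by the number of
    encodings; each specified time is the starting time plus k intervals.
    """
    interval = audio_len_s / num_encodings
    return [start_s + k * interval for k in range(num_encodings)]

# Example: 6 s of audio with 100 encodings yields times 0.00, 0.06, 0.12, ...
print(specified_times(6.0, 100)[:3])
```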
- After the second voice feature is obtained, the decoding unit of the simultaneous translation headset obtains the second voice feature and the decoded text information corresponding to the voice information; the decoded text information may be the text information already decoded from the voice information before the current moment.
- The decoded text information may also be instruction information used to instruct decoding of the to-be-decoded voice information corresponding to the second voice feature.
- the decoding unit of the simultaneous translation headset will obtain the text information corresponding to the voice information according to the second voice feature and the decoded text information.
- Specifically, the process by which the coding unit of the simultaneous translation headset encodes the voice feature at the specified time, according to the voice feature at the specified time in the first voice feature and the voice features adjacent to it, to obtain the second voice feature corresponding to the voice information, is as follows: first, multiple voice features at target times are selected from the voice features adjacent to the voice feature at the specified time;
- then, the voice feature at the specified time is encoded a first time according to the voice feature at the specified time and the voice features at the multiple target times, to obtain the first encoded voice feature corresponding to the first voice feature; next, according to the first encoded voice feature corresponding to the voice feature at the specified time and the first encoded voice features corresponding to the voice features at the multiple target times, the second encoded voice feature corresponding to the first voice feature is obtained; the above steps are performed in sequence until the number of encodings reaches the specified number, completing the encoding of the voice feature at the specified time; the final encoded voice feature corresponding to the first voice feature is used as the second voice feature.
- The specific process of obtaining the first encoded voice feature corresponding to the first voice feature is: performing the first encoding according to the linear voice feature at the specified time, the nonlinear voice feature at the specified time, the linear voice features at the multiple target times, and the nonlinear voice features at the multiple target times, to obtain the first encoded voice feature corresponding to the first voice feature.
- The specific process of obtaining the N-th encoded voice feature corresponding to the first voice feature is: performing the N-th encoding according to the (N-1)-th encoded linear voice feature corresponding to the voice feature at the specified time, the (N-1)-th encoded nonlinear voice feature corresponding to the voice feature at the specified time, the (N-1)-th encoded linear voice features corresponding to the voice features at the multiple target times, and the (N-1)-th encoded nonlinear voice features corresponding to the voice features at the multiple target times, to obtain the N-th encoded voice feature corresponding to the first voice feature.
- Here, N is the preset number of encodings.
- The specific process of obtaining the text information corresponding to the voice information is: after the decoding unit of the simultaneous translation headset obtains the second voice feature and the decoded text information, it obtains the first to-be-decoded voice information corresponding to the second voice feature; it decodes the first to-be-decoded voice information according to the second voice feature and the decoded text information to obtain the first decoded text information; it obtains the second to-be-decoded voice information corresponding to the second voice feature and updates the first decoded text information as decoded information; it decodes the second to-be-decoded voice information according to the second voice feature and the decoded text information to obtain the second decoded text information; and the above steps are performed in sequence until all to-be-decoded voice information corresponding to the second voice feature has been decoded, yielding the text information corresponding to the voice information.
- When decoding the first to-be-decoded voice information to obtain the first decoded text information, it is necessary to first obtain, based on the second voice feature and the decoded text information, the predicted value of the text unit corresponding to the first to-be-decoded voice information; then, the probability distribution over text units is obtained; finally, the text unit with the largest probability value is taken as the text information corresponding to the first to-be-decoded voice information.
- After obtaining the text information corresponding to the voice information, the decoding unit of the simultaneous translation headset provides it to the translation module unit of the headset. The translation module unit translates the text information corresponding to the voice information into text information in the preset language, and the translated text information is then converted into voice information in the preset language and output.
- The audio information processing method provided in this application can also be applied to a speech-to-text scenario, as shown in FIG. 2, which is a schematic diagram of the second application scenario embodiment of the audio information processing method provided in this application.
- Here, the application of the audio information processing method provided in this application to converting voice into text in social software is taken as an example to describe the method in detail.
- the audio information is voice information.
- When the social software converts received voice information into text information, it first sends the voice information to the voice recognition system, which recognizes the voice information.
- the speech recognition system includes a speech feature extraction module 201, an encoding module 202, and a decoding module 203.
- the process of recognizing voice information through the voice recognition system is as follows:
- the voice feature extraction module 201 performs feature extraction on the voice information to obtain the first voice feature corresponding to the voice information, and further provides the first voice feature to the encoding module 202.
- After the encoding module 202 obtains the first voice feature, it linearly transforms the first voice feature through the linear projection layer 202-1 in the encoding module 202 to obtain the linear voice feature of the first voice feature, and then performs linear rectification on that linear voice feature through the linear rectification layer 202-2 to obtain the nonlinear voice feature of the first voice feature.
- The N encoding layers 202-3 in the encoding module 202 encode the voice feature at the specified time according to the voice feature at the specified time in the first voice feature and the voice features adjacent to it, to obtain the second voice feature corresponding to the voice information.
- the decoding module 203 obtains the decoded text information and the second voice feature corresponding to the voice information, and obtains the text information corresponding to the voice information according to the second voice feature and the decoded text information.
- The first embodiment of the present application provides an audio information processing method, which is described below with reference to FIGS. 1 to 6.
- FIG. 3 is a flowchart of an audio information processing method provided in the first embodiment of this application.
- In step S301, the first audio feature corresponding to the audio information is obtained.
- Audio features include audio phoneme features, audio frequency spectrum features, and so on.
- The audio information in the first embodiment of the present application is generally voice information uttered by a person or sound emitted by an audio device, such as singing.
- the specific steps of obtaining the first audio feature corresponding to the audio information are: obtaining the audio information; performing feature extraction on the audio information to obtain the first audio feature.
- Performing feature extraction on the audio information to obtain the first audio feature includes: performing feature extraction on the audio information to obtain the first audio feature sequence corresponding to the audio information, that is, the audio features of the audio information over N speech frames.
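As a rough illustration of what a first audio feature sequence over N frames can look like, here is a minimal numpy sketch that frames an audio signal and computes a magnitude-spectrum feature per frame; the frame length, hop size, and the choice of feature are assumptions for illustration, not values prescribed by the application.

```python
import numpy as np

def first_audio_features(signal: np.ndarray, sr: int = 16000,
                         frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    """Split audio into overlapping frames and return one spectral feature per frame.

    Assumes len(signal) >= one frame. Returns an (N, F) array: N frames
    (one every hop_ms, matching the "one frame every 10 ms" convention used
    later in this embodiment), each an F-dimensional magnitude spectrum.
    """
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + (len(signal) - frame) // hop
    frames = np.stack([signal[i * hop : i * hop + frame] for i in range(n_frames)])
    window = np.hanning(frame)
    return np.abs(np.fft.rfft(frames * window, axis=-1))  # shape (N, frame // 2 + 1)
```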
- In step S302, the audio feature at the specified time is encoded according to the audio feature at the specified time in the first audio feature and the audio features adjacent to it, to obtain the second audio feature corresponding to the audio information.
- The process of encoding the audio feature at the specified time is to input the audio feature at the specified time and the audio features adjacent to it into the encoder for linear and nonlinear transformation, thereby performing feature-dimension reduction on the first audio feature to obtain a new audio feature representation.
- the second audio feature information is the audio feature information obtained after encoding the first audio feature.
- Specifically, the process of encoding the audio feature at the specified time is: selecting audio features at multiple target times from among the audio features adjacent to the audio feature at the specified time; and encoding the audio feature at the specified time according to the audio feature at the specified time and the audio features at the multiple target times.
- Please refer to FIG. 4, which is a flowchart of the method for encoding the audio feature at a specified time provided in the first embodiment of this application.
- In step S401, the audio feature at the specified time is encoded a first time according to the audio feature at the specified time and the audio features at the multiple target times, to obtain the first encoded audio feature corresponding to the first audio feature.
- Specifically, encoding the audio feature at the specified time a first time, to obtain the first encoded audio feature corresponding to the first audio feature, includes: performing the first encoding according to the linear audio feature at the specified time, the nonlinear audio feature at the specified time, the linear audio features at the multiple target times, and the nonlinear audio features at the multiple target times, to obtain the first encoded audio feature corresponding to the first audio feature.
- In step S402, according to the first encoded audio feature corresponding to the audio feature at the specified time and the first encoded audio features corresponding to the audio features at the multiple target times, the audio feature at the specified time is encoded a second time to obtain the second encoded audio feature corresponding to the first audio feature; the above steps are executed in sequence until the number of encodings reaches the specified number, completing the encoding of the audio feature at the specified time.
- the number of encoding times is related to the audio length.
- Specifically, one frame of audio features is usually extracted every 10 ms; for example, 600 frames of audio features can be extracted from 6 s of audio information, thereby obtaining the first audio feature corresponding to the audio information.
- To obtain the second audio feature corresponding to the audio information, the 600 frames of audio features in the first audio feature are subjected to adjacent-frame splicing and sampling, so that the 600 frames of audio features are further converted into 100 frames of spliced audio features.
- The audio feature at each specified time is then encoded, and the number of encodings is likewise 100; a sketch of the splicing follows.
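A minimal sketch of the adjacent-frame splicing and sampling described above, assuming a splicing factor of 6 so that 600 input frames become 100 spliced frames; the factor value and the helper name are illustrative assumptions.

```python
import numpy as np

def splice_frames(features: np.ndarray, factor: int = 6) -> np.ndarray:
    """Concatenate each group of `factor` adjacent frames into one spliced frame.

    features: (N, F) array of per-frame audio features.
    Returns an (N // factor, F * factor) array, e.g. (600, 80) -> (100, 480).
    """
    n, f = features.shape
    n_spliced = n // factor
    # Row-major reshape groups `factor` consecutive frames into each row.
    return features[: n_spliced * factor].reshape(n_spliced, factor * f)

# Example: 600 frames of 80-dim features -> 100 spliced 480-dim frames.
print(splice_frames(np.zeros((600, 80))).shape)  # (100, 480)
```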
- The processes of the second through N-th encodings of the audio feature at the specified time are similar; therefore, in the first embodiment of this application, only the second encoding of the audio feature at the specified time is described.
- Encoding the audio feature at the specified time a second time, to obtain the second encoded audio feature corresponding to the first audio feature, includes: performing the second encoding according to the first encoded linear audio feature corresponding to the audio feature at the specified time, the first encoded nonlinear audio feature corresponding to the audio feature at the specified time, the first encoded linear audio features corresponding to the audio features at the multiple target times, and the first encoded nonlinear audio features corresponding to the audio features at the multiple target times, to obtain the second encoded audio feature corresponding to the first audio feature.
- Here the second encoding of the audio feature at the specified time is taken as an example for description.
- Before the second encoding, the first encoded audio feature corresponding to the first audio feature needs to be linearly transformed to obtain the first encoded linear audio feature corresponding to the first audio feature; the first encoded linear audio feature corresponding to the first audio feature is then linearly rectified to obtain the first encoded nonlinear audio feature corresponding to the first audio feature.
- Linear rectification of linear audio features to obtain nonlinear audio features is generally implemented through a ReLU (Rectified Linear Unit) function.
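The following is a minimal numpy sketch of one such encoding pass, under loudly stated assumptions: the projection matrices `W_lin` and `W_enc`, and combining specified-time and target-time features by concatenation followed by a projection, are illustrative choices rather than the application's actual layer; only the overall shape (linear transform, ReLU rectification, then encoding from the specified-time and target-time linear and nonlinear features) follows the text above.

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    """Linear rectification: elementwise max(0, x)."""
    return np.maximum(0.0, x)

def encode_once(h: np.ndarray, t: int, target_idx: list[int],
                W_lin: np.ndarray, W_enc: np.ndarray) -> np.ndarray:
    """One encoding pass for the audio feature at specified time t.

    h:          (N, F) features from the previous pass (or the first audio feature).
    target_idx: indices of the selected target-time features adjacent to t.
    W_enc's input size depends on how many target times are selected.
    """
    lin = h @ W_lin        # linear audio features, shape (N, D)
    nonlin = relu(lin)     # nonlinear audio features via ReLU
    # Combine the specified-time linear/nonlinear features with those of the
    # selected target times (concatenation is an illustrative choice).
    parts = [lin[t], nonlin[t]] + [np.concatenate([lin[i], nonlin[i]]) for i in target_idx]
    return np.concatenate(parts) @ W_enc   # encoded feature for time t
```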
- In step S302, it is necessary to select audio features at multiple target times from among the audio features adjacent to the audio feature at the specified time.
- For the selection process, please refer to FIG. 5, which is a flowchart of the method for selecting audio features at multiple target times provided in the first embodiment of this application.
- In step S501, the range of the audio features adjacent to the audio feature at the specified time is determined.
- Determining the range of the audio features adjacent to the audio feature at the specified time includes: determining the first range of adjacent audio features before the audio feature at the specified time, and determining the second range of adjacent audio features after the audio feature at the specified time.
- In step S502, according to the range of the audio features adjacent to the audio feature at the specified time, audio features at multiple target times are selected from among the adjacent audio features.
- Specifically, selecting the audio features at the multiple target times according to the first range and the second range includes: selecting the audio features at the multiple target times from among the audio features adjacent to the audio feature at the specified time within those ranges.
- Before the selection, a stride factor is obtained, which is used to indicate the time interval at which audio features at target times are selected from among the audio features adjacent to the audio feature at the specified time; then, according to the stride factor, the first range, and the second range, the audio features at the multiple target times are selected from among the adjacent audio features.
- If the stride factor includes a first stride factor, audio features at target times are selected from among the adjacent audio features according to the first stride factor and the first range.
- If the stride factor includes a second stride factor, audio features at target times are selected from among the adjacent audio features according to the second stride factor and the second range. A sketch of this selection follows.
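A minimal sketch of this selection, assuming the ranges and stride factors are expressed in frames and that the function name and example values are hypothetical: target times before the specified time are taken every `stride_before` frames within the first range, and target times after it every `stride_after` frames within the second range.

```python
def select_target_times(t: int, n_frames: int,
                        range_before: int, range_after: int,
                        stride_before: int, stride_after: int) -> list[int]:
    """Pick target-time indices adjacent to specified time t.

    range_before / range_after bound how far the adjacent features extend
    before and after t; the stride factors set the sampling interval.
    """
    before = list(range(max(0, t - range_before), t, stride_before))
    after = list(range(t + stride_after, min(n_frames, t + range_after + 1), stride_after))
    return before + after

# Example: around t=50 in 100 frames, look 12 frames back with stride 3
# and 8 frames ahead with stride 2.
print(select_target_times(50, 100, 12, 8, 3, 2))  # [38, 41, 44, 47, 52, 54, 56, 58]
```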
- In step S403, the final encoded audio feature corresponding to the first audio feature is used as the second audio feature.
- In step S303, the decoded text information corresponding to the audio information is obtained.
- the decoded text information may be the text information corresponding to the audio information obtained before the current time.
- The decoded text information may also be instruction information used to instruct decoding of the to-be-decoded audio information corresponding to the second audio feature.
- In step S304, the text information corresponding to the audio information is obtained according to the second audio feature and the decoded text information.
- Obtaining the text information corresponding to the audio information specifically includes: decoding, according to the second audio feature and the decoded text information, the to-be-decoded audio information corresponding to the second audio feature, to obtain the text information corresponding to the audio information.
- FIG. 6 is a flowchart of a method for obtaining text information corresponding to audio information provided in the first embodiment of this application.
- In step S601, the first to-be-decoded audio information corresponding to the second audio feature is obtained.
- The decoding process is a process of inputting the decoding result of the previous moment, together with the encoded representation produced by the encoder, into a decoder to obtain the corresponding decoding output.
- In step S602, the first to-be-decoded audio information is decoded according to the second audio feature and the decoded text information, to obtain the first decoded text information.
- The specific process of decoding the first to-be-decoded audio information to obtain the first decoded text information is: decoding the first to-be-decoded audio information to obtain the text information corresponding to it. That is, according to the second audio feature and the decoded text information, the predicted value of the text unit corresponding to the first to-be-decoded audio information is obtained; the probability distribution over text units is obtained; and the text unit with the largest probability value is taken as the text information corresponding to the first to-be-decoded audio information.
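A minimal sketch of this prediction step, assuming the decoder produces one score (a predicted value) per text unit in some vocabulary; the softmax normalization and greedy argmax mirror the text above, while the decoder itself is a hypothetical callable, not the application's actual model.

```python
import numpy as np

def decode_step(decoder, second_audio_feature: np.ndarray,
                decoded_text: list[int], vocab: list[str]) -> str:
    """One decoding step: predicted values -> probability distribution -> argmax.

    `decoder` is a hypothetical callable mapping (encoded features, decoded
    history) to one score per text unit in `vocab`.
    """
    logits = decoder(second_audio_feature, decoded_text)  # predicted values
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                  # probability distribution
    return vocab[int(np.argmax(probs))]                   # most probable text unit
```

Repeating this step and appending each result to the decoded history yields the loop over steps S601 to S604 described here.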
- In step S603, the first decoded text information is updated as decoded information.
- In step S604, the second to-be-decoded audio information is decoded according to the second audio feature and the decoded text information to obtain the second decoded text information, and the above steps are performed in sequence until all to-be-decoded audio information corresponding to the second audio feature has been decoded, obtaining the text information corresponding to the audio information.
- The audio information processing method provided in this application first obtains the first audio feature corresponding to the audio information; second, it encodes the audio feature at the specified time according to the audio feature at the specified time in the first audio feature and the audio features adjacent to it, to obtain the second audio feature corresponding to the audio information; third, it obtains the decoded text information corresponding to the audio information; finally, it obtains the text information corresponding to the audio information according to the second audio feature and the decoded text information.
- The audio information processing method provided by this application can encode the audio feature at the specified time based on the audio feature at the specified time in the first audio feature and the audio features adjacent to it, to obtain the second audio feature corresponding to the audio information, and further obtain the text information corresponding to the audio information according to the second audio feature and the decoded text information.
- In the process of obtaining the second audio feature and of obtaining the text information corresponding to the audio information according to the second audio feature and the decoded text information, the method needs fewer parameters, thereby reducing the computational complexity of audio information processing and improving its efficiency.
- the audio information processing method provided in the first embodiment of the present application further includes: outputting text information corresponding to the audio information.
- the second embodiment of the present application provides an audio information processing device. Since the device embodiment is basically similar to the first embodiment of the method, the description is relatively simple, and for related parts, please refer to the part of the description of the method embodiment.
- the device embodiments described below are only illustrative.
- FIG. 7 is a schematic diagram of the audio information processing device provided in the second embodiment of this application.
- the audio information processing device includes:
- the first audio feature obtaining unit 701 is configured to obtain the first audio feature corresponding to the audio information
- the second audio feature obtaining unit 702 is configured to encode the audio feature at the specified time according to the audio feature at the specified time in the first audio feature and the audio feature adjacent to the audio feature at the specified time to obtain The second audio feature corresponding to the audio information;
- the decoded text information obtaining unit 703 is configured to obtain decoded text information corresponding to the audio information
- the text information obtaining unit 704 is configured to obtain text information corresponding to the audio information according to the second audio feature and the decoded text information.
- Optionally, the second audio feature obtaining unit 702 is specifically configured to: select audio features at multiple target times from among the audio features adjacent to the audio feature at the specified time; and encode the audio feature at the specified time according to the audio feature at the specified time and the audio features at the multiple target times.
- Optionally, encoding the audio feature at the specified time according to the audio feature at the specified time and the audio features at the multiple target times includes: encoding the audio feature at the specified time a first time according to the audio feature at the specified time and the audio features at the multiple target times, to obtain a first encoded audio feature corresponding to the first audio feature;
- encoding the audio feature at the specified time a second time according to the first encoded audio feature corresponding to the audio feature at the specified time and the first encoded audio features corresponding to the audio features at the multiple target times, to obtain a second encoded audio feature corresponding to the first audio feature;
- performing the above steps in sequence until the number of encodings reaches the specified number, completing the encoding of the audio feature at the specified time;
- using the final encoded audio feature corresponding to the first audio feature as the second audio feature.
- Optionally, encoding the audio feature at the specified time a first time according to the audio feature at the specified time and the audio features at the multiple target times, to obtain the first encoded audio feature corresponding to the first audio feature, includes: performing the first encoding according to the linear audio feature at the specified time, the nonlinear audio feature at the specified time, the linear audio features at the multiple target times, and the nonlinear audio features at the multiple target times, to obtain the first encoded audio feature corresponding to the first audio feature.
- Optionally, encoding the audio feature at the specified time a second time, to obtain the second encoded audio feature corresponding to the first audio feature, includes: performing the second encoding according to the first encoded linear audio feature corresponding to the audio feature at the specified time, the first encoded nonlinear audio feature corresponding to the audio feature at the specified time, the first encoded linear audio features corresponding to the audio features at the multiple target times, and the first encoded nonlinear audio features corresponding to the audio features at the multiple target times, to obtain the second encoded audio feature corresponding to the first audio feature.
- Optionally, the device is further configured to: perform linear rectification on the first encoded linear audio feature corresponding to the first audio feature to obtain the first encoded nonlinear audio feature corresponding to the first audio feature.
- Optionally, selecting the audio features at the multiple target times from among the audio features adjacent to the audio feature at the specified time includes: determining a range of audio features adjacent to the audio feature at the specified time; and selecting the audio features at the multiple target times from among the adjacent audio features according to that range.
- Optionally, determining the range of audio features adjacent to the audio feature at the specified time includes: determining a first range of adjacent audio features before the audio feature at the specified time, and determining a second range of adjacent audio features after the audio feature at the specified time;
- selecting the audio features at the multiple target times according to that range then includes: selecting, according to the first range and the second range, the audio features at the multiple target times from among the audio features adjacent to the audio feature at the specified time.
- Optionally, selecting the audio features at the multiple target times according to the first range and the second range includes: obtaining a stride factor, which indicates the time interval at which audio features at target times are selected from among the audio features adjacent to the audio feature at the specified time; and selecting, according to the stride factor, the first range, and the second range, the audio features at the multiple target times from among the adjacent audio features.
- Optionally, the stride factor includes a first stride factor, and the device selects audio features at target times from among the adjacent audio features according to the first stride factor and the first range.
- Optionally, the stride factor includes a second stride factor, and the device selects audio features at target times from among the adjacent audio features according to the second stride factor and the second range.
- Optionally, the text information obtaining unit 704 is specifically configured to decode, according to the second audio feature and the decoded text information, the to-be-decoded audio information corresponding to the second audio feature, to obtain the text information corresponding to the audio information.
- Optionally, decoding the to-be-decoded audio information corresponding to the second audio feature according to the second audio feature and the decoded text information, to obtain the text information corresponding to the audio information, includes: obtaining first to-be-decoded audio information corresponding to the second audio feature; decoding the first to-be-decoded audio information according to the second audio feature and the decoded text information to obtain first decoded text information; updating the first decoded text information as decoded information; decoding second to-be-decoded audio information according to the second audio feature and the decoded text information to obtain second decoded text information; and performing the above steps in sequence until all to-be-decoded audio information corresponding to the second audio feature has been decoded, obtaining the text information corresponding to the audio information.
- Optionally, the decoded text information includes: instruction information used to instruct decoding of the to-be-decoded audio information corresponding to the second audio feature.
- Optionally, decoding the first to-be-decoded audio information according to the second audio feature and the decoded text information to obtain the first decoded text information includes: decoding the first to-be-decoded audio information to obtain the text information corresponding to it;
- specifically, this includes: obtaining, according to the second audio feature and the decoded text information, a predicted value of the text unit corresponding to the first to-be-decoded audio information; obtaining a probability distribution over text units; and taking the text unit with the largest probability value as the text information corresponding to the first to-be-decoded audio information.
- the first audio feature obtaining unit 701 is specifically configured to obtain the audio information; perform feature extraction on the audio information to obtain the first audio feature.
- Optionally, performing feature extraction on the audio information to obtain the first audio feature includes: performing feature extraction on the audio information to obtain a first audio feature sequence corresponding to the audio information.
- the audio information processing device further includes: a text information output unit configured to output text information corresponding to the audio information.
- The audio information processing device provided in the second embodiment of the present application first obtains the first audio feature corresponding to the audio information; second, it encodes the audio feature at the specified time according to the audio feature at the specified time in the first audio feature and the audio features adjacent to it, to obtain the second audio feature corresponding to the audio information; third, it obtains the decoded text information corresponding to the audio information; finally, it obtains the text information corresponding to the audio information according to the second audio feature and the decoded text information.
- The audio information processing device provided in the present application can encode the audio feature at the specified time based on the audio feature at the specified time in the first audio feature and the audio features adjacent to it, to obtain the second audio feature corresponding to the audio information, and further obtain the text information corresponding to the audio information according to the second audio feature and the decoded text information.
- In the process of obtaining the second audio feature and of obtaining the text information corresponding to the audio information according to the second audio feature and the decoded text information, the device needs fewer parameters, thereby reducing the computational complexity of audio information processing and improving its efficiency.
- an electronic device is provided in the third embodiment of the present application.
- FIG. 8 is a schematic diagram of an electronic device provided in an embodiment of this application.
- The electronic device includes: a processor 801; and
- a memory 802, configured to store a computer program; after the device is powered on and the computer program is run by the processor, the electronic device executes the audio information processing method described in the first embodiment of this application.
- The electronic device provided in the third embodiment of the present application first obtains the first audio feature corresponding to the audio information; second, it encodes the audio feature at the specified time according to the audio feature at the specified time in the first audio feature and the audio features adjacent to it, to obtain the second audio feature corresponding to the audio information; third, it obtains the decoded text information corresponding to the audio information; finally, it obtains the text information corresponding to the audio information according to the second audio feature and the decoded text information.
- The electronic device provided in the present application can encode the audio feature at the specified time based on the audio feature at the specified time in the first audio feature and the audio features adjacent to it, to obtain the second audio feature corresponding to the audio information, and further obtain the text information corresponding to the audio information according to the second audio feature and the decoded text information.
- In the process of obtaining the second audio feature and of obtaining the text information corresponding to the audio information according to the second audio feature and the decoded text information, the electronic device needs fewer parameters, thereby reducing the computational complexity of audio information processing and improving its efficiency.
- The fourth embodiment of the present application provides a storage medium that stores a computer program; when the computer program is run by a processor, it executes the audio information processing method described in the first embodiment of this application.
- The storage medium provided in the fourth embodiment of the present application first obtains the first audio feature corresponding to the audio information; second, it encodes the audio feature at the specified time according to the audio feature at the specified time in the first audio feature and the audio features adjacent to it, to obtain the second audio feature corresponding to the audio information; third, it obtains the decoded text information corresponding to the audio information; finally, it obtains the text information corresponding to the audio information according to the second audio feature and the decoded text information.
- The audio information processing storage medium provided in the present application can encode the audio feature at the specified time in the first audio feature based on that audio feature and the audio features adjacent to it in the first audio feature, to obtain the second audio feature corresponding to the audio information, and can further obtain the text information corresponding to the audio information according to the second audio feature and the decoded text information.
- In obtaining the second audio feature and in obtaining the text information corresponding to the audio information according to the second audio feature and the decoded text information, the program on the storage medium uses relatively few parameters, thereby reducing the computational complexity of audio information processing and improving the efficiency of audio information processing.
- Corresponding to the audio information processing method provided in the first embodiment above, a fifth embodiment of the present application provides a smart speaker.
- The smart speaker provided in the fifth embodiment of the present application includes an audio collection device and an audio recognition device, wherein the audio recognition device includes an audio feature extraction module, an audio feature encoding module, a decoded text storage module, and an audio feature decoding module; the audio collection device is configured to obtain audio information;
- the audio feature extraction module is configured to obtain the first audio feature corresponding to the audio information;
- the audio feature encoding module is configured to encode the audio feature at a specified time in the first audio feature according to that audio feature and the audio features adjacent to it, to obtain the second audio feature corresponding to the audio information;
- the decoded text storage module is configured to obtain the decoded text information corresponding to the audio information; the audio feature decoding module is configured to obtain the text information corresponding to the audio information according to the second audio feature and the decoded text information.
- Corresponding to the audio information processing method provided in the first embodiment above, a sixth embodiment of the present application provides an in-vehicle intelligent voice interaction device.
- The in-vehicle intelligent voice interaction device provided in the sixth embodiment of the present application includes an audio collection device, an audio recognition device, and an execution device, wherein the audio recognition device includes an audio feature extraction module, an audio feature encoding module, a decoded text storage module, and an audio feature decoding module;
- the audio collection device is configured to obtain audio information;
- the audio feature extraction module is configured to obtain the first audio feature corresponding to the audio information;
- the audio feature encoding module is configured to encode the audio feature at a specified time in the first audio feature according to that audio feature and the audio features adjacent to it, to obtain the second audio feature corresponding to the audio information;
- the decoded text storage module is configured to obtain the decoded text information corresponding to the audio information;
- the audio feature decoding module is configured to obtain the text information corresponding to the audio information according to the second audio feature and the decoded text information;
- the execution device is configured to execute corresponding instructions according to the text information corresponding to the audio information.
- Corresponding to the audio information processing method provided in the first embodiment above, a seventh embodiment of the present application provides an audio information processing system.
- The audio information processing system includes a client and a server. The client is configured to obtain audio information and send the audio information to the server. The server is configured to obtain the first audio feature corresponding to the audio information; encode the audio feature at the specified time according to the audio feature at the specified time in the first audio feature and the audio features adjacent to it, to obtain the second audio feature corresponding to the audio information; obtain the decoded text information corresponding to the audio information; obtain the text information corresponding to the audio information according to the second audio feature and the decoded text information; and provide the text information corresponding to the audio information to the client.
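To make the client/server division of labor concrete, here is a minimal runnable sketch of the round trip just described. Every name in it (AudioServer, Client, recognize, and the stand-in bodies of extract_features, encode, and decode) is a hypothetical illustration, not an interface defined by this application.

```python
class AudioServer:
    """Hypothetical server: feature extraction -> encoding -> decoding."""

    def recognize(self, audio):
        first = self.extract_features(audio)        # first audio feature
        second = self.encode(first)                 # second audio feature
        return self.decode(second, decoded_text="") # text for the client

    def extract_features(self, audio):
        return list(audio)                          # stand-in feature sequence

    def encode(self, feats):
        return feats                                # stand-in encoder

    def decode(self, second, decoded_text):
        return "hello"                              # stand-in decoder output


class Client:
    def __init__(self, server):
        self.server = server

    def send(self, audio):
        # client sends audio, server returns the recognized text
        return self.server.recognize(audio)


print(Client(AudioServer()).send(b"\x00\x01"))
```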
- In a typical configuration, the computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
- The memory may include forms of computer-readable media such as volatile memory, random access memory (RAM), and/or non-volatile memory, e.g., read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
- Computer-readable media include permanent and non-permanent, removable and non-removable media; information storage can be implemented by any method or technology.
- The information can be computer-readable instructions, data structures, program modules, or other data.
- Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage media, or any other non-transmission media that can be used to store information accessible by computing devices.
- As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
- Those skilled in the art should understand that the embodiments of this application can be provided as a method, a system, or a computer program product. Therefore, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Acoustics & Sound (AREA)
- Health & Medical Sciences (AREA)
- Signal Processing (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
An audio information processing method, apparatus, electronic device, and storage medium. The method includes: obtaining a first audio feature corresponding to audio information; encoding the audio feature at a specified time according to the audio feature at the specified time in the first audio feature and the audio features adjacent to it, to obtain a second audio feature corresponding to the audio information; obtaining decoded text information corresponding to the audio information; and obtaining text information corresponding to the audio information according to the second audio feature and the decoded text information. The method reduces the computational complexity of audio information processing and improves the efficiency of audio information processing.
Description
This application claims priority to Chinese patent application No. 202010026971.9, entitled "Audio Information Processing Method, Apparatus, Electronic Device, and Storage Medium" and filed on January 10, 2020, the entire contents of which are incorporated herein by reference.

This application relates to the field of computer technology, and in particular to an audio information processing method, an audio information processing apparatus, an electronic device, and a storage medium.

With the development of computer technology and Internet-of-Things technology, more and more smart devices support human-machine voice interaction. During such interaction, a smart device needs to collect the voice information related to a user instruction and then give corresponding feedback according to that instruction, thereby completing the interaction. How the smart device recognizes the voice information related to the user instruction is therefore the key to completing human-machine voice interaction. Traditional speech recognition methods are generally based on ASR (Automatic Speech Recognition) technology; their training pipeline is cumbersome and requires a great deal of manually specified prior knowledge. In addition, traditional speech recognition methods must train the acoustic model and the language model separately, and thus cannot obtain the benefits of joint optimization.

In recent years, end-to-end speech recognition methods have received increasing attention in the speech recognition field. End-to-end methods unify the acoustic model and the language model of traditional speech recognition, and can obtain the text information corresponding to audio information directly from the audio information, thereby simplifying the speech recognition process. Existing end-to-end speech recognition methods are mainly based on RNN (Recurrent Neural Network) or CNN (Convolutional Neural Network) architectures. However, RNN- or CNN-based end-to-end methods often suffer from low recognition efficiency caused by high computational complexity.

Summary of the Invention

This application provides an audio information processing method, apparatus, electronic device, and storage medium, so as to reduce the computational complexity of audio information processing and improve the efficiency of audio information processing.
This application provides an audio information processing method, including:

obtaining a first audio feature corresponding to audio information;

encoding the audio feature at a specified time according to the audio feature at the specified time in the first audio feature and the audio features adjacent to the audio feature at the specified time, to obtain a second audio feature corresponding to the audio information;

obtaining decoded text information corresponding to the audio information;

obtaining text information corresponding to the audio information according to the second audio feature and the decoded text information.

Optionally, the encoding the audio feature at the specified time according to the audio feature at the specified time in the first audio feature and the audio features adjacent to the audio feature at the specified time includes:

selecting audio features at multiple target times from the audio features adjacent to the audio feature at the specified time;

encoding the audio feature at the specified time according to the audio feature at the specified time and the audio features at the multiple target times.

Optionally, the encoding the audio feature at the specified time according to the audio feature at the specified time and the audio features at the multiple target times includes:

encoding the audio feature at the specified time a first time according to the audio feature at the specified time and the audio features at the multiple target times, to obtain a first encoded audio feature corresponding to the first audio feature;

encoding the audio feature at the specified time a second time according to the first encoded audio feature corresponding to the audio feature at the specified time and the first encoded audio features corresponding to the audio features at the multiple target times, to obtain a second encoded audio feature corresponding to the first audio feature, and performing the above steps in sequence until the number of encodings reaches the specified number of encodings, thereby completing the encoding of the audio feature at the specified time;

taking the final encoded audio feature corresponding to the first audio feature as the second audio feature. Optionally, the encoding the audio feature at the specified time a first time according to the audio feature at the specified time and the audio features at the multiple target times, to obtain the first encoded audio feature corresponding to the first audio feature, includes: performing the first encoding pass according to the linear audio feature at the specified time, the non-linear audio feature at the specified time, the linear audio features at the multiple target times, and the non-linear audio features at the multiple target times, to obtain the first encoded audio feature corresponding to the first audio feature.

Optionally, the encoding the audio feature at the specified time a second time according to the first encoded audio feature corresponding to the audio feature at the specified time and the first encoded audio features corresponding to the audio features at the multiple target times, to obtain the second encoded audio feature corresponding to the first audio feature, includes: performing the second encoding pass according to the first encoded linear audio feature corresponding to the audio feature at the specified time, the first encoded non-linear audio feature corresponding to the audio feature at the specified time, the first encoded linear audio features corresponding to the audio features at the multiple target times, and the first encoded non-linear audio features corresponding to the audio features at the multiple target times, to obtain the second encoded audio feature corresponding to the first audio feature.

Optionally, the method further includes:

performing a linear transformation on the first encoded audio feature corresponding to the first audio feature, to obtain a first encoded linear audio feature corresponding to the first audio feature;

performing linear rectification on the first encoded linear audio feature corresponding to the first audio feature, to obtain a first encoded non-linear audio feature corresponding to the first audio feature.

Optionally, the selecting audio features at multiple target times from the audio features adjacent to the audio feature at the specified time includes:

determining the range of the audio features adjacent to the audio feature at the specified time;

selecting the audio features at the multiple target times from the audio features adjacent to the audio feature at the specified time according to that range.

Optionally, the determining the range of the audio features adjacent to the audio feature at the specified time includes: determining a first range of audio features that precede and are adjacent to the audio feature at the specified time, and determining a second range of audio features that follow and are adjacent to the audio feature at the specified time;

the selecting the audio features at the multiple target times from the audio features adjacent to the audio feature at the specified time according to the range includes: selecting the audio features at the multiple target times from the audio features adjacent to the audio feature at the specified time according to the first range and the second range.

Optionally, the selecting the audio features at the multiple target times from the audio features adjacent to the audio feature at the specified time according to the first range and the second range includes:

determining a stride factor, the stride factor being the sampling time interval used when selecting the audio features at the multiple target times from the audio features adjacent to the audio feature at the specified time;

selecting the audio features at the multiple target times from the audio features adjacent to the audio feature at the specified time according to the stride factor, the first range, and the second range.

Optionally, the selecting the audio features at the multiple target times from the audio features adjacent to the audio feature at the specified time according to the stride factor, the first range, and the second range includes: selecting the audio features at the multiple target times from the audio features adjacent to the audio feature at the specified time according to a first stride factor and the first range.

Optionally, the selecting the audio features at the multiple target times from the audio features adjacent to the audio feature at the specified time according to the stride factor, the first range, and the second range includes: selecting the audio features at the multiple target times from the audio features adjacent to the audio feature at the specified time according to a second stride factor and the second range.
Optionally, the obtaining the text information corresponding to the audio information according to the second audio feature and the decoded text information includes: decoding the to-be-decoded audio information corresponding to the second audio information according to the second audio feature and the decoded text information, to obtain the text information corresponding to the audio information.

Optionally, the decoding the to-be-decoded audio information corresponding to the second audio information according to the second audio feature and the decoded text information, to obtain the text information corresponding to the audio information, includes:

obtaining first to-be-decoded audio information corresponding to the second audio feature;

decoding the first to-be-decoded audio information according to the second audio feature and the decoded text information, to obtain first decoded text information;

obtaining second to-be-decoded audio information corresponding to the second audio feature; updating the first decoded text information to be the decoded information;

decoding the second to-be-decoded audio information according to the second audio feature and the decoded text information, to obtain second decoded text information, and performing the above steps in sequence until all the to-be-decoded audio information corresponding to the second audio information has been decoded, to obtain the text information corresponding to the audio information.

Optionally, the decoded information includes: indication information used to indicate that the to-be-decoded audio information corresponding to the second audio information is to be decoded.

Optionally, the decoding the first to-be-decoded audio information according to the second audio feature and the decoded text information, to obtain the first decoded text information, includes:

decoding the first to-be-decoded audio information according to the second audio feature and the decoded text information, to obtain the text information corresponding to the first to-be-decoded audio information;

obtaining the first decoded text information according to the text information corresponding to the first to-be-decoded audio information and the decoded text information.

Optionally, the decoding the first to-be-decoded audio information according to the second audio feature and the decoded text information, to obtain the text information corresponding to the first to-be-decoded audio information, includes:

obtaining predicted values of the text units corresponding to the first to-be-decoded audio information according to the second audio feature and the decoded text information;

obtaining the probability distribution of the text units;

taking the text unit with the highest probability value as the text information corresponding to the first to-be-decoded audio information. Optionally, the obtaining the first audio feature corresponding to the audio information includes:

obtaining the audio information;

performing feature extraction on the audio information to obtain the first audio feature.

Optionally, the performing feature extraction on the audio information to obtain the first audio feature includes: performing feature extraction on the audio information to obtain a first audio feature sequence corresponding to the audio information.

Optionally, the method further includes: outputting the text information corresponding to the audio information. In another aspect, this application further provides an audio information processing apparatus, including:

a first audio feature obtaining unit configured to obtain a first audio feature corresponding to audio information;

a second audio feature obtaining unit configured to encode the audio feature at a specified time according to the audio feature at the specified time in the first audio feature and the audio features adjacent to the audio feature at the specified time, to obtain a second audio feature corresponding to the audio information;

a decoded text information obtaining unit configured to obtain decoded text information corresponding to the audio information; and a text information obtaining unit configured to obtain text information corresponding to the audio information according to the second audio feature and the decoded text information.
In another aspect, this application further provides an electronic device, including: a processor;

and a memory configured to store a program of the audio information processing method; after the device is powered on and the program of the audio information processing method is run by the processor, the following steps are performed:

obtaining a first audio feature corresponding to audio information;

encoding the audio feature at a specified time according to the audio feature at the specified time in the first audio feature and the audio features adjacent to the audio feature at the specified time, to obtain a second audio feature corresponding to the audio information;

obtaining decoded text information corresponding to the audio information;

obtaining text information corresponding to the audio information according to the second audio feature and the decoded text information.

In another aspect, this application further provides a storage device storing a program of the audio information processing method; when the program is run by a processor, the following steps are performed: obtaining a first audio feature corresponding to audio information;

encoding the audio feature at a specified time according to the audio feature at the specified time in the first audio feature and the audio features adjacent to the audio feature at the specified time, to obtain a second audio feature corresponding to the audio information;

obtaining decoded text information corresponding to the audio information;

obtaining text information corresponding to the audio information according to the second audio feature and the decoded text information.
In another aspect, this application further provides a smart speaker, including an audio collection device and an audio recognition device, wherein the audio recognition device includes an audio feature extraction module, an audio feature encoding module, a decoded text storage module, and an audio feature decoding module; the audio collection device is configured to obtain audio information;

the audio feature extraction module is configured to obtain a first audio feature corresponding to the audio information;

the audio feature encoding module is configured to encode the audio feature at a specified time in the first audio feature according to that audio feature and the audio features adjacent to it, to obtain a second audio feature corresponding to the audio information;

the decoded text storage module is configured to obtain decoded text information corresponding to the audio information; the audio feature decoding module is configured to obtain text information corresponding to the audio information according to the second audio feature and the decoded text information.

In another aspect, this application further provides an in-vehicle intelligent voice interaction device, including an audio collection device, an audio recognition device, and an execution device, wherein the audio recognition device includes an audio feature extraction module, an audio feature encoding module, a decoded text storage module, and an audio feature decoding module;

the audio collection device is configured to obtain audio information;

the audio feature extraction module is configured to obtain a first audio feature corresponding to the audio information;

the audio feature encoding module is configured to encode the audio feature at a specified time in the first audio feature according to that audio feature and the audio features adjacent to it, to obtain a second audio feature corresponding to the audio information;

the decoded text storage module is configured to obtain decoded text information corresponding to the audio information; the audio feature decoding module is configured to obtain text information corresponding to the audio information according to the second audio feature and the decoded text information;

the execution device is configured to execute corresponding instructions according to the text information corresponding to the audio information.

In another aspect, this application further provides an audio information processing system, including a client and a server; the client is configured to obtain audio information and send the audio information to the server;

the server is configured to obtain a first audio feature corresponding to the audio information; encode the audio feature at a specified time according to the audio feature at the specified time in the first audio feature and the audio features adjacent to it, to obtain a second audio feature corresponding to the audio information; obtain decoded text information corresponding to the audio information; obtain text information corresponding to the audio information according to the second audio feature and the decoded text information; and provide the text information corresponding to the audio information to the client.
Compared with the prior art, this application has the following advantages:

The audio information processing method provided in this application first obtains the first audio feature corresponding to the audio information; second, it encodes the audio feature at the specified time according to the audio feature at the specified time in the first audio feature and the audio features adjacent to it, to obtain the second audio feature corresponding to the audio information; third, it obtains the decoded text information corresponding to the audio information; finally, it obtains the text information corresponding to the audio information according to the second audio feature and the decoded text information. The method can encode the audio feature at the specified time in the first audio feature based on that audio feature and the audio features adjacent to it in the first audio feature, obtain the second audio feature corresponding to the audio information, and further obtain the text information corresponding to the audio information according to the second audio feature and the decoded text information. In obtaining the second audio feature and in obtaining the text information corresponding to the audio information according to the second audio feature and the decoded text information, the method uses relatively few parameters, thereby reducing the computational complexity of audio information processing and improving the efficiency of audio information processing.
FIG. 1 is a schematic diagram of a first application scenario embodiment of the audio information processing method provided in this application.

FIG. 2 is a schematic diagram of a second application scenario embodiment of the audio information processing method provided in this application.

FIG. 3 is a flowchart of an audio information processing method provided in the first embodiment of this application.

FIG. 4 is a flowchart of a method for encoding the audio feature at a specified time provided in the first embodiment of this application.

FIG. 5 is a flowchart of a method for selecting audio features at multiple target times provided in the first embodiment of this application.

FIG. 6 is a flowchart of a method for obtaining the text information corresponding to audio information provided in the first embodiment of this application.

FIG. 7 is a schematic diagram of an audio information processing apparatus provided in the second embodiment of this application.

FIG. 8 is a schematic diagram of an electronic device provided in an embodiment of this application.
Many specific details are set forth in the following description to facilitate a full understanding of the present invention. However, the present invention can be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from the essence of the present invention; the present invention is therefore not limited by the specific implementations disclosed below.

To present the audio information processing method provided in this application more clearly, its application scenarios are introduced first. The method can be applied to machine translation scenarios, as shown in FIG. 1, which is a schematic diagram of the first application scenario embodiment of the audio information processing method provided in this application. The first scenario embodiment takes the application of the method to simultaneous-interpretation earphones as an example to describe the method in detail. When the method is applied to simultaneous-interpretation earphones, the audio information is the user's voice information.

When a user converses through the simultaneous-interpretation earphones, the earphones collect the target user's voice information through a built-in sound collection device. After collecting the target user's voice information, the earphones first identify the language of the voice and further determine whether that language is the to-be-translated language preset by the user; if so, the earphones process the user's voice information, recognizing and translating it.

The specific process by which the simultaneous-interpretation earphones recognize the target user's voice information is as follows. First, the voice information is denoised, and after denoising, acoustic feature extraction is performed on it to obtain the first voice feature corresponding to the voice information. The first voice feature is specifically a voice feature sequence, that is, the voice features of the voice information over N voice frames; voice features include phoneme features, spectral features, and the like. Second, the encoding unit of the earphones encodes the voice feature at a specified time according to the voice feature at the specified time in the first voice feature and the voice features adjacent to it, to obtain the second voice feature corresponding to the voice information. In this scenario embodiment, the specified times are determined from the preset number of encodings and the audio length; specifically, the encoding time interval is derived from the audio length and the preset number of encodings, a time is selected as the start time, and each specified time is then obtained from the start time, the number of encodings, and the time interval. Third, the decoding unit of the earphones obtains the second voice feature and the decoded text information corresponding to the voice information as inputs for decoding the voice information, where the decoded information may be indication information used to indicate that the to-be-decoded voice information corresponding to the second voice information is to be decoded. Finally, the decoding unit of the earphones obtains the text information corresponding to the voice information according to the second voice feature and the decoded text information.
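The derivation of the specified times from the audio length and the preset number of encodings can be illustrated with a small sketch; the start time of 0.0 s and the uniform spacing are assumptions for illustration, as the text only requires that the interval be derived from the length and the encoding count.

```python
def specified_times(audio_len_s, num_encodings, start_s=0.0):
    """Derive each specified time from the audio length and the preset
    number of encodings: interval = length / count, then step forward
    from the chosen start time."""
    interval = audio_len_s / num_encodings
    return [start_s + i * interval for i in range(num_encodings)]

print(specified_times(6.0, 10))  # 10 specified times across 6 s of audio
```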
It should be noted that the specific process by which the encoding unit of the earphones encodes the voice feature at the specified time, according to the voice feature at the specified time in the first voice feature and the voice features adjacent to it, to obtain the second voice feature corresponding to the voice information, is as follows. First, voice features at multiple target times are selected from the voice features adjacent to the voice feature at the specified time. Second, the voice feature at the specified time is encoded a first time according to the voice feature at the specified time and the voice features at the multiple target times, to obtain the first encoded voice feature corresponding to the first voice feature; the second encoded voice feature corresponding to the first voice feature is then obtained according to the first encoded voice feature corresponding to the voice feature at the specified time and the first encoded voice features corresponding to the voice features at the multiple target times; these steps are performed in sequence until the number of encodings reaches the specified number, completing the encoding of the voice feature at the specified time; and the final encoded voice feature corresponding to the first voice feature is taken as the second voice feature.

The specific process of encoding the voice feature at the specified time a first time, to obtain the first encoded voice feature corresponding to the first voice feature, is: performing the first encoding pass according to the linear voice feature at the specified time, the non-linear voice feature at the specified time, the linear voice features at the multiple target times, and the non-linear voice features at the multiple target times, to obtain the first encoded voice feature corresponding to the first voice feature.

The specific process of encoding the voice feature at the specified time an N-th time, to obtain the N-th encoded voice feature corresponding to the first voice feature, is: performing the N-th encoding pass according to the (N-1)-th encoded linear voice feature corresponding to the voice feature at the specified time, the (N-1)-th encoded non-linear voice feature corresponding to the voice feature at the specified time, the (N-1)-th encoded linear voice features corresponding to the voice features at the multiple target times, and the (N-1)-th encoded non-linear voice features corresponding to the voice features at the multiple target times, to obtain the N-th encoded voice feature corresponding to the first voice feature, where N is the preset number of encodings.

It should be noted that the specific process of obtaining the text information corresponding to the voice information according to the second voice feature and the decoded text information is as follows. After obtaining the second voice feature and the decoded text information, the decoding unit of the earphones obtains the first to-be-decoded voice information corresponding to the second voice feature; decodes the first to-be-decoded voice information according to the second voice feature and the decoded text information, to obtain the first decoded text information; obtains the second to-be-decoded voice information corresponding to the second voice feature; updates the first decoded text information to be the decoded information; decodes the second to-be-decoded voice information according to the second voice feature and the decoded text information, to obtain the second decoded text information; and performs these steps in sequence until all the to-be-decoded voice information corresponding to the second voice information has been decoded, obtaining the text information corresponding to the voice information. When decoding the first to-be-decoded voice information according to the second voice feature and the decoded text information to obtain the first decoded text information, it is necessary first to obtain the predicted values of the text units corresponding to the first to-be-decoded voice information according to the second voice feature and the decoded text information; then to obtain the probability distribution of the text units; and finally to take the text unit with the highest probability value as the text information corresponding to the first to-be-decoded voice information.

After the text information corresponding to the voice information is obtained, the decoding unit of the earphones provides it to the translation module unit, which translates the text information corresponding to the voice information into text information in the preset language, converts that text information into voice information in the preset language, and outputs it.
The audio information processing method provided in this application can also be applied to speech-to-text scenarios, as shown in FIG. 2, which is a schematic diagram of the second application scenario embodiment of the audio information processing method provided in this application. The second scenario embodiment takes the application of the method to converting speech into text in social software as an example to describe the method in detail. In this scenario embodiment, the audio information is voice information.

When social software converts received voice information into text information, it first sends the voice information to a speech recognition system, which performs speech recognition on it. Specifically, the speech recognition system includes a voice feature extraction module 201, an encoding module 202, and a decoding module 203. The process of recognizing the voice information through the speech recognition system is as follows:

First, the voice feature extraction module 201 performs feature extraction on the voice information to obtain the first voice feature corresponding to the voice information, and further provides the first voice feature to the encoding module 202.

Second, after the encoding module 202 obtains the first voice feature, the linear projection layer 202-1 in the encoding module 202 performs a linear transformation on the first voice feature to obtain the linear voice feature of the first voice feature, and the linear rectification layer 202-2 performs linear rectification on the linear voice feature of the first voice feature to obtain the non-linear voice feature of the first voice feature.

Third, the N encoding layers 202-3 in the encoding module 202 encode the voice feature at the specified time according to the voice feature at the specified time in the first voice feature and the voice features adjacent to it, to obtain the second voice feature corresponding to the voice information.

Finally, the decoding module 203 obtains the decoded text information corresponding to the voice information and the second voice feature, and obtains the text information corresponding to the voice information according to the second voice feature and the decoded text information.

It should be noted that the above two application scenarios are merely two embodiments of the application scenarios of the audio information processing method provided in this application; they are provided to facilitate understanding of the method and are not intended to limit it. The first embodiment of this application provides an audio information processing method, which is described below with reference to FIG. 1 through FIG. 6.
Please refer to FIG. 3, which is a flowchart of an audio information processing method provided in the first embodiment of this application.

In step S301, a first audio feature corresponding to audio information is obtained.

Audio features include phoneme features, spectral features, and the like. The audio information in the first embodiment of this application is generally voice information uttered by a person or voice information emitted by an audio device, such as singing.

The specific steps of obtaining the first audio feature corresponding to the audio information are: obtaining the audio information, and performing feature extraction on the audio information to obtain the first audio feature. Performing feature extraction on the audio information to obtain the first audio feature includes: performing feature extraction on the audio information to obtain a first audio feature sequence corresponding to the audio information, that is, obtaining the audio features of the audio information over N voice frames.
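As one illustration of extracting a first audio feature sequence from raw audio, the following numpy sketch frames a waveform and computes a simple log-magnitude spectrum per frame. The sample rate, window, and hop sizes are assumed values chosen for illustration; the application does not prescribe a particular acoustic feature.

```python
import numpy as np

def first_audio_feature(waveform, sr=16000, win_ms=25, hop_ms=10):
    """Frame the waveform and compute a log-magnitude spectrum per frame,
    yielding a feature sequence (one feature vector per frame)."""
    win = int(sr * win_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    frames = [waveform[i:i + win]
              for i in range(0, len(waveform) - win + 1, hop)]
    feats = [np.log(np.abs(np.fft.rfft(f * np.hanning(win))) + 1e-8)
             for f in frames]
    return np.stack(feats)               # shape: (num_frames, feature_dim)

x = np.random.randn(16000)               # 1 s of synthetic audio
print(first_audio_feature(x).shape)      # (98, 201)
```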
In step S302, the audio feature at a specified time is encoded according to the audio feature at the specified time in the first audio feature and the audio features adjacent to it, to obtain a second audio feature corresponding to the audio information.

In the first embodiment of this application, encoding the audio feature at the specified time means inputting the audio feature at the specified time and the audio features adjacent to it into an encoder for linear and non-linear transformations, thereby reducing the feature dimensionality of the first audio feature and obtaining a new representation of the audio feature. In the first embodiment, the second audio feature information is the audio feature information obtained by encoding the first audio feature.

The process of encoding the audio feature at the specified time according to the audio feature at the specified time in the first audio feature and the audio features adjacent to it is: selecting audio features at multiple target times from the audio features adjacent to the audio feature at the specified time, and encoding the audio feature at the specified time according to the audio feature at the specified time and the audio features at the multiple target times. For the process of encoding the audio feature at the specified time according to the audio feature at the specified time and the audio features at the multiple target times, please refer to FIG. 4, which is a flowchart of a method for encoding the audio feature at a specified time provided in the first embodiment of this application.
In step S401, the audio feature at the specified time is encoded a first time according to the audio feature at the specified time and the audio features at the multiple target times, to obtain a first encoded audio feature corresponding to the first audio feature.

Encoding the audio feature at the specified time a first time according to the audio feature at the specified time and the audio features at the multiple target times, to obtain the first encoded audio feature corresponding to the first audio feature, includes: performing the first encoding pass according to the linear audio feature at the specified time, the non-linear audio feature at the specified time, the linear audio features at the multiple target times, and the non-linear audio features at the multiple target times, to obtain the first encoded audio feature corresponding to the first audio feature.

In step S402, the audio feature at the specified time is encoded a second time according to the first encoded audio feature corresponding to the audio feature at the specified time and the first encoded audio features corresponding to the audio features at the multiple target times, to obtain a second encoded audio feature corresponding to the first audio feature; the above steps are performed in sequence until the number of encodings reaches the specified number of encodings, completing the encoding of the audio feature at the specified time.

The number of encodings is related to the audio length. In the first embodiment of this application, when obtaining the first audio feature corresponding to the audio information, an audio feature frame is usually extracted every 10 ms; for example, 600 frames of audio features can be extracted from 6 s of audio information, yielding the first audio feature corresponding to the audio information. Because the 600 frames of audio features in the first audio feature are features of individual, non-spliced frames, when obtaining the second audio feature corresponding to the audio information, the 600 frames of audio features in the first audio feature are spliced with their adjacent frames and sampled; if the sampling rate is 6, the 600 frames of audio features are further converted into 100 spliced frames of audio features. When the 600 frames of audio features are converted into 100 spliced frames of audio features, the audio feature at each specified time is encoded, and the number of encodings is likewise 100.
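A minimal sketch of the adjacent-frame splicing and sampling just described, using the numbers from the example (600 frames at 10 ms, sampling rate 6, hence 100 spliced frames); the context width of two frames on each side is an assumption for illustration only.

```python
import numpy as np

def splice_and_downsample(feats, context=2, rate=6):
    """Splice each frame with its neighbouring frames, then keep every
    `rate`-th spliced frame. Only the rate of 6 comes from the example
    above (600 frames -> 100 spliced frames); `context=2` is assumed."""
    n, d = feats.shape
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    spliced = np.stack([padded[i:i + 2 * context + 1].reshape(-1)
                        for i in range(n)])       # (n, d * (2*context + 1))
    return spliced[::rate]                         # (n // rate, ...)

feats = np.random.randn(600, 80)     # 6 s of 10 ms frames, 80-dim each
print(splice_and_downsample(feats).shape)   # (100, 400)
```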
In the first embodiment of this application, the second through N-th encoding passes of the audio feature at the specified time are similar, so only the second encoding pass is described in detail. Encoding the audio feature at the specified time a second time according to the first encoded audio feature corresponding to the audio feature at the specified time and the first encoded audio features corresponding to the audio features at the multiple target times, to obtain the second encoded audio feature corresponding to the first audio feature, includes: performing the second encoding pass according to the first encoded linear audio feature corresponding to the audio feature at the specified time, the first encoded non-linear audio feature corresponding to the audio feature at the specified time, the first encoded linear audio features corresponding to the audio features at the multiple target times, and the first encoded non-linear audio features corresponding to the audio features at the multiple target times, to obtain the second encoded audio feature corresponding to the first audio feature.

Because linear and non-linear audio features are needed in every encoding pass, the first embodiment of this application takes the second encoding pass as an illustration: before the second encoding pass, a linear transformation is performed on the first encoded audio feature corresponding to the first audio feature, to obtain the first encoded linear audio feature corresponding to the first audio feature, and linear rectification is performed on that first encoded linear audio feature, to obtain the first encoded non-linear audio feature corresponding to the first audio feature.

In the first embodiment of this application, linear rectification of a linear audio feature to obtain a non-linear audio feature is generally implemented through the ReLU (Rectified Linear Unit) function.
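The following numpy sketch shows one possible reading of these repeated encoding passes: each pass derives a linear feature (a projection) and a non-linear feature (ReLU rectification of the linear feature) for every frame, and combines the specified-time frame with its neighbours. The neighbour averaging, the random weights, and the wrap-around at sequence edges are assumptions for illustration, not the architecture fixed by this application.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)            # linear rectification (ReLU)

def encode(feats, weights):
    h = feats
    for w in weights:                    # one iteration = one encoding pass
        linear = h @ w                   # linear transformation of each frame
        nonlinear = relu(linear)         # non-linear features
        mixed = np.concatenate([linear, nonlinear], axis=-1)
        prev_ = np.roll(mixed, 1, axis=0)    # neighbouring frames; wrapping
        next_ = np.roll(mixed, -1, axis=0)   # at the edges is an assumption
        h = (mixed + prev_ + next_) / 3.0    # combine specified + target times
    return h                             # final pass -> second audio feature

rng = np.random.default_rng(0)
d = 8
weights = [rng.normal(size=(d, d))] + [rng.normal(size=(2 * d, d))
                                       for _ in range(2)]
second = encode(rng.normal(size=(100, d)), weights)
print(second.shape)                      # (100, 16)
```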
In performing step S302, audio features at multiple target times need to be selected from the audio features adjacent to the audio feature at the specified time. For the steps of selecting audio features at multiple target times from the audio features adjacent to the audio feature at the specified time in the first embodiment of this application, please refer to FIG. 5, which is a flowchart of a method for selecting audio features at multiple target times provided in the first embodiment of this application.

In step S501, the range of the audio features adjacent to the audio feature at the specified time is determined.

Determining the range of the audio features adjacent to the audio feature at the specified time includes: determining a first range of audio features that precede and are adjacent to the audio feature at the specified time, and determining a second range of audio features that follow and are adjacent to the audio feature at the specified time.

In step S502, the audio features at the multiple target times are selected from the audio features adjacent to the audio feature at the specified time according to that range.

Selecting the audio features at the multiple target times from the audio features adjacent to the audio feature at the specified time according to the range includes: selecting them according to the first range and the second range. Specifically, when selecting the audio features at the multiple target times according to the first range and the second range, it is necessary first to determine a stride factor, the stride factor being the sampling time interval used when selecting the audio features at the multiple target times from the audio features adjacent to the audio feature at the specified time, and then to select the audio features at the multiple target times from the adjacent audio features according to the stride factor, the first range, and the second range.

It should be noted that selecting the audio features at the multiple target times from the adjacent audio features according to the stride factor, the first range, and the second range includes: selecting the audio features at the multiple target times from the adjacent audio features according to a first stride factor and the first range.

It should also be noted that selecting the audio features at the multiple target times from the adjacent audio features according to the stride factor, the first range, and the second range includes: selecting the audio features at the multiple target times from the adjacent audio features according to a second stride factor and the second range.
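A small sketch of selecting target-time indices around a specified time t, using a first range (before t) with a first stride factor and a second range (after t) with a second stride factor; the concrete numbers and the clipping at sequence boundaries are assumptions for illustration.

```python
def target_indices(t, first_range, second_range, stride1, stride2, n):
    """Walk back through `first_range` frames before t with stride
    `stride1`, and forward through `second_range` frames after t with
    stride `stride2`; clip to the valid index range [0, n)."""
    before = list(range(t - 1, max(t - 1 - first_range, -1), -stride1))
    after = list(range(t + 1, min(t + 1 + second_range, n), stride2))
    return sorted(i for i in before + after if 0 <= i < n)

# e.g. 12 frames before and 6 after, sampled every 3rd / 2nd frame
print(target_indices(t=50, first_range=12, second_range=6,
                     stride1=3, stride2=2, n=100))
# -> [40, 43, 46, 49, 51, 53, 55]
```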
In step S403, the final encoded audio feature corresponding to the first audio feature is taken as the second audio feature.

After the second audio feature is obtained, the text information corresponding to the audio information needs to be further obtained according to the second audio feature.

In step S303, the decoded text information corresponding to the audio information is obtained.

In the first embodiment of this application, the decoded text information may be the text information corresponding to the audio information that has already been obtained before the current time; when no text information corresponding to the audio information has been obtained before the current time, the decoded text information may also be indication information used to indicate that the to-be-decoded audio information corresponding to the second audio information is to be decoded.

In step S304, the text information corresponding to the audio information is obtained according to the second audio feature and the decoded text information.
Obtaining the text information corresponding to the audio information according to the second audio feature and the decoded text information specifically includes: decoding the to-be-decoded audio information corresponding to the second audio information according to the second audio feature and the decoded text information, to obtain the text information corresponding to the audio information. For the specific process, please refer to FIG. 6, which is a flowchart of a method for obtaining the text information corresponding to audio information provided in the first embodiment of this application.

In step S601, the first to-be-decoded audio information corresponding to the second audio feature is obtained.

In the first embodiment of this application, decoding is the process of inputting the decoding result of the previous time and the encoded representation from the encoder into a decoder to obtain the corresponding decoded output.

In step S602, the first to-be-decoded audio information is decoded according to the second audio feature and the decoded text information, to obtain the first decoded text information.

The specific process of decoding the first to-be-decoded audio information according to the second audio feature and the decoded text information, to obtain the first decoded text information, is as follows:

First, the first to-be-decoded audio information is decoded according to the second audio feature and the decoded text information, to obtain the text information corresponding to the first to-be-decoded audio information.

Then, the first decoded text information is obtained according to the text information corresponding to the first to-be-decoded audio information and the decoded text information. That is, the predicted values of the text units corresponding to the first to-be-decoded audio information are obtained according to the second audio feature and the decoded text information; the probability distribution of the text units is obtained; and the text unit with the highest probability value is taken as the text information corresponding to the first to-be-decoded audio information.
In step S603, the first decoded text information is updated to be the decoded information.

In step S604, the second to-be-decoded audio information is decoded according to the second audio feature and the decoded text information, to obtain the second decoded text information; the above steps are performed in sequence until all the to-be-decoded audio information corresponding to the second audio information has been decoded, obtaining the text information corresponding to the audio information.

For the process of decoding the second through M-th to-be-decoded audio information in this application, please refer to the process of decoding the first to-be-decoded audio information in step S602.
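This decoding loop can be illustrated with a minimal greedy decoder: at each step, predicted values for the candidate text units are computed from the second audio feature and the text decoded so far, converted into a probability distribution, and the unit with the highest probability value is appended. The score_fn, the vocabulary, and the start/end symbols are hypothetical stand-ins for the decoder described here.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()                 # probability distribution of text units

def greedy_decode(second_feature, score_fn, vocab, max_len=10, sos="<s>"):
    decoded = [sos]                    # the instruction to start decoding
    for _ in range(max_len):
        # predicted values of the text units, from the second audio
        # feature and the decoded text information so far
        probs = softmax(score_fn(second_feature, decoded))
        unit = vocab[int(np.argmax(probs))]   # highest-probability text unit
        if unit == "</s>":
            break
        decoded.append(unit)           # becomes part of the decoded text
    return "".join(decoded[1:])

vocab = ["a", "b", "</s>"]
rng = np.random.default_rng(1)
fake_scores = lambda feat, dec: rng.normal(size=len(vocab))  # stand-in decoder
print(greedy_decode(np.zeros(8), fake_scores, vocab))
```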
The audio information processing method provided in this application first obtains the first audio feature corresponding to the audio information; second, it encodes the audio feature at the specified time according to the audio feature at the specified time in the first audio feature and the audio features adjacent to it, to obtain the second audio feature corresponding to the audio information; third, it obtains the decoded text information corresponding to the audio information; finally, it obtains the text information corresponding to the audio information according to the second audio feature and the decoded text information. The method can encode the audio feature at the specified time in the first audio feature based on that audio feature and the audio features adjacent to it in the first audio feature, obtain the second audio feature corresponding to the audio information, and further obtain the text information corresponding to the audio information according to the second audio feature and the decoded text information. In obtaining the second audio feature and in obtaining the text information corresponding to the audio information according to the second audio feature and the decoded text information, the method uses relatively few parameters, thereby reducing the computational complexity of audio information processing and improving the efficiency of audio information processing.

The audio information processing method provided in the first embodiment of this application further includes: outputting the text information corresponding to the audio information.
Second Embodiment

Corresponding to the audio information processing method provided in the first embodiment of this application, the second embodiment of this application provides an audio information processing apparatus. Since the apparatus embodiment is substantially similar to the first method embodiment, it is described relatively simply; for relevant details, refer to the description of the method embodiment. The apparatus embodiment described below is merely illustrative.

FIG. 7 is a schematic diagram of an audio information processing apparatus provided in the second embodiment of this application.

The audio information processing apparatus includes:

a first audio feature obtaining unit 701, configured to obtain a first audio feature corresponding to audio information;

a second audio feature obtaining unit 702, configured to encode the audio feature at a specified time according to the audio feature at the specified time in the first audio feature and the audio features adjacent to the audio feature at the specified time, to obtain a second audio feature corresponding to the audio information;

a decoded text information obtaining unit 703, configured to obtain decoded text information corresponding to the audio information;

a text information obtaining unit 704, configured to obtain text information corresponding to the audio information according to the second audio feature and the decoded text information.
Optionally, the second audio feature obtaining unit 702 is specifically configured to select audio features at multiple target times from the audio features adjacent to the audio feature at the specified time, and to encode the audio feature at the specified time according to the audio feature at the specified time and the audio features at the multiple target times. Optionally, the encoding the audio feature at the specified time according to the audio feature at the specified time and the audio features at the multiple target times includes:

encoding the audio feature at the specified time a first time according to the audio feature at the specified time and the audio features at the multiple target times, to obtain a first encoded audio feature corresponding to the first audio feature;

encoding the audio feature at the specified time a second time according to the first encoded audio feature corresponding to the audio feature at the specified time and the first encoded audio features corresponding to the audio features at the multiple target times, to obtain a second encoded audio feature corresponding to the first audio feature, and performing the above steps in sequence until the number of encodings reaches the specified number of encodings, thereby completing the encoding of the audio feature at the specified time;

taking the final encoded audio feature corresponding to the first audio feature as the second audio feature. Optionally, the encoding the audio feature at the specified time a first time according to the audio feature at the specified time and the audio features at the multiple target times, to obtain the first encoded audio feature corresponding to the first audio feature, includes: performing the first encoding pass according to the linear audio feature at the specified time, the non-linear audio feature at the specified time, the linear audio features at the multiple target times, and the non-linear audio features at the multiple target times, to obtain the first encoded audio feature corresponding to the first audio feature.

Optionally, the encoding the audio feature at the specified time a second time according to the first encoded audio feature corresponding to the audio feature at the specified time and the first encoded audio features corresponding to the audio features at the multiple target times, to obtain the second encoded audio feature corresponding to the first audio feature, includes: performing the second encoding pass according to the first encoded linear audio feature corresponding to the audio feature at the specified time, the first encoded non-linear audio feature corresponding to the audio feature at the specified time, the first encoded linear audio features corresponding to the audio features at the multiple target times, and the first encoded non-linear audio features corresponding to the audio features at the multiple target times, to obtain the second encoded audio feature corresponding to the first audio feature.

Optionally, the apparatus further performs:

performing a linear transformation on the first encoded audio feature corresponding to the first audio feature, to obtain a first encoded linear audio feature corresponding to the first audio feature;

performing linear rectification on the first encoded linear audio feature corresponding to the first audio feature, to obtain a first encoded non-linear audio feature corresponding to the first audio feature.

Optionally, the selecting audio features at multiple target times from the audio features adjacent to the audio feature at the specified time includes:

determining the range of the audio features adjacent to the audio feature at the specified time;

selecting the audio features at the multiple target times from the audio features adjacent to the audio feature at the specified time according to that range.

Optionally, the determining the range of the audio features adjacent to the audio feature at the specified time includes: determining a first range of audio features that precede and are adjacent to the audio feature at the specified time, and determining a second range of audio features that follow and are adjacent to the audio feature at the specified time;

the selecting the audio features at the multiple target times from the audio features adjacent to the audio feature at the specified time according to the range includes: selecting the audio features at the multiple target times from the audio features adjacent to the audio feature at the specified time according to the first range and the second range.

Optionally, the selecting the audio features at the multiple target times from the audio features adjacent to the audio feature at the specified time according to the first range and the second range includes:

determining a stride factor, the stride factor being the sampling time interval used when selecting the audio features at the multiple target times from the audio features adjacent to the audio feature at the specified time;

selecting the audio features at the multiple target times from the audio features adjacent to the audio feature at the specified time according to the stride factor, the first range, and the second range.

Optionally, the selecting the audio features at the multiple target times from the audio features adjacent to the audio feature at the specified time according to the stride factor, the first range, and the second range includes: selecting the audio features at the multiple target times from the audio features adjacent to the audio feature at the specified time according to a first stride factor and the first range.

Optionally, the selecting the audio features at the multiple target times from the audio features adjacent to the audio feature at the specified time according to the stride factor, the first range, and the second range includes: selecting the audio features at the multiple target times from the audio features adjacent to the audio feature at the specified time according to a second stride factor and the second range.
Optionally, the text information obtaining unit 704 is specifically configured to decode the to-be-decoded audio information corresponding to the second audio information according to the second audio feature and the decoded text information, to obtain the text information corresponding to the audio information.

Optionally, the decoding the to-be-decoded audio information corresponding to the second audio information according to the second audio feature and the decoded text information, to obtain the text information corresponding to the audio information, includes:

obtaining first to-be-decoded audio information corresponding to the second audio feature;

decoding the first to-be-decoded audio information according to the second audio feature and the decoded text information, to obtain first decoded text information;

obtaining second to-be-decoded audio information corresponding to the second audio feature; updating the first decoded text information to be the decoded information;

decoding the second to-be-decoded audio information according to the second audio feature and the decoded text information, to obtain second decoded text information, and performing the above steps in sequence until all the to-be-decoded audio information corresponding to the second audio information has been decoded, to obtain the text information corresponding to the audio information.

Optionally, the decoded information includes: indication information used to indicate that the to-be-decoded audio information corresponding to the second audio information is to be decoded.

Optionally, the decoding the first to-be-decoded audio information according to the second audio feature and the decoded text information, to obtain the first decoded text information, includes:

decoding the first to-be-decoded audio information according to the second audio feature and the decoded text information, to obtain the text information corresponding to the first to-be-decoded audio information;

obtaining the first decoded text information according to the text information corresponding to the first to-be-decoded audio information and the decoded text information.

Optionally, the decoding the first to-be-decoded audio information according to the second audio feature and the decoded text information, to obtain the text information corresponding to the first to-be-decoded audio information, includes:

obtaining predicted values of the text units corresponding to the first to-be-decoded audio information according to the second audio feature and the decoded text information;

obtaining the probability distribution of the text units;

taking the text unit with the highest probability value as the text information corresponding to the first to-be-decoded audio information. Optionally, the first audio feature obtaining unit 701 is specifically configured to obtain the audio information and perform feature extraction on the audio information to obtain the first audio feature.

Optionally, the performing feature extraction on the audio information to obtain the first audio feature includes: performing feature extraction on the audio information to obtain a first audio feature sequence corresponding to the audio information.

Optionally, the audio information processing apparatus further includes: a text information output unit configured to output the text information corresponding to the audio information.
The audio information processing apparatus provided in the second embodiment of this application first obtains the first audio feature corresponding to the audio information; second, it encodes the audio feature at the specified time according to the audio feature at the specified time in the first audio feature and the audio features adjacent to it, to obtain the second audio feature corresponding to the audio information; third, it obtains the decoded text information corresponding to the audio information; finally, it obtains the text information corresponding to the audio information according to the second audio feature and the decoded text information. The apparatus can encode the audio feature at the specified time in the first audio feature based on that audio feature and the audio features adjacent to it in the first audio feature, obtain the second audio feature corresponding to the audio information, and further obtain the text information corresponding to the audio information according to the second audio feature and the decoded text information. In obtaining the second audio feature and in obtaining the text information corresponding to the audio information according to the second audio feature and the decoded text information, the apparatus uses relatively few parameters, thereby reducing the computational complexity of audio information processing and improving the efficiency of audio information processing.
Third Embodiment

Corresponding to the audio information processing method provided in the first embodiment of this application, the third embodiment of this application provides an electronic device.

As shown in FIG. 8, which is a schematic diagram of an electronic device provided in an embodiment of this application, the electronic device includes:

a processor 801; and

a memory 802 configured to store a computer program, wherein after the device is powered on and the computer program is run by the processor, the device executes the audio information processing method described in the first embodiment of this application.

The electronic device provided in the third embodiment of this application first obtains the first audio feature corresponding to the audio information; second, it encodes the audio feature at the specified time according to the audio feature at the specified time in the first audio feature and the audio features adjacent to it, to obtain the second audio feature corresponding to the audio information; third, it obtains the decoded text information corresponding to the audio information; finally, it obtains the text information corresponding to the audio information according to the second audio feature and the decoded text information. The electronic device can encode the audio feature at the specified time in the first audio feature based on that audio feature and the audio features adjacent to it in the first audio feature, obtain the second audio feature corresponding to the audio information, and further obtain the text information corresponding to the audio information according to the second audio feature and the decoded text information; in doing so it uses relatively few parameters, thereby reducing the computational complexity of audio information processing and improving the efficiency of audio information processing.

It should be noted that for a detailed description of the audio information processing method executed by the electronic device provided in the third embodiment of this application, refer to the relevant description of the first embodiment of this application, which is not repeated here.
Fourth Embodiment

Corresponding to the audio information processing method provided in the first embodiment of this application, the fourth embodiment of this application provides a storage medium that stores a computer program; when the computer program is run by a processor, it executes the audio information processing method described in the first embodiment of this application.

The storage medium provided in the fourth embodiment of this application first obtains the first audio feature corresponding to the audio information; second, it encodes the audio feature at the specified time according to the audio feature at the specified time in the first audio feature and the audio features adjacent to it, to obtain the second audio feature corresponding to the audio information; third, it obtains the decoded text information corresponding to the audio information; finally, it obtains the text information corresponding to the audio information according to the second audio feature and the decoded text information. The audio information processing storage medium provided in this application can encode the audio feature at the specified time in the first audio feature based on that audio feature and the audio features adjacent to it in the first audio feature, obtain the second audio feature corresponding to the audio information, and further obtain the text information corresponding to the audio information according to the second audio feature and the decoded text information; in doing so, relatively few parameters are used, thereby reducing the computational complexity of audio information processing and improving the efficiency of audio information processing.

It should be noted that for a detailed description of the storage medium provided in the fourth embodiment of this application, refer to the relevant description of the first embodiment of this application, which is not repeated here.
Fifth Embodiment

Corresponding to the audio information processing method provided in the first embodiment above, the fifth embodiment of this application provides a smart speaker.

The smart speaker provided in the fifth embodiment of this application includes an audio collection device and an audio recognition device, wherein the audio recognition device includes an audio feature extraction module, an audio feature encoding module, a decoded text storage module, and an audio feature decoding module. The audio collection device is configured to obtain audio information;

the audio feature extraction module is configured to obtain a first audio feature corresponding to the audio information; the audio feature encoding module is configured to encode the audio feature at a specified time in the first audio feature according to that audio feature and the audio features adjacent to it, to obtain a second audio feature corresponding to the audio information;

the decoded text storage module is configured to obtain decoded text information corresponding to the audio information; the audio feature decoding module is configured to obtain text information corresponding to the audio information according to the second audio feature and the decoded text information.
Sixth Embodiment

Corresponding to the audio information processing method provided in the first embodiment above, the sixth embodiment of this application provides an in-vehicle intelligent voice interaction device.

The in-vehicle intelligent voice interaction device provided in the sixth embodiment of this application includes an audio collection device, an audio recognition device, and an execution device, wherein the audio recognition device includes an audio feature extraction module, an audio feature encoding module, a decoded text storage module, and an audio feature decoding module;

the audio collection device is configured to obtain audio information;

the audio feature extraction module is configured to obtain a first audio feature corresponding to the audio information; the audio feature encoding module is configured to encode the audio feature at a specified time in the first audio feature according to that audio feature and the audio features adjacent to it, to obtain a second audio feature corresponding to the audio information;

the decoded text storage module is configured to obtain decoded text information corresponding to the audio information; the audio feature decoding module is configured to obtain text information corresponding to the audio information according to the second audio feature and the decoded text information;

the execution device is configured to execute corresponding instructions according to the text information corresponding to the audio information.
Seventh Embodiment

Corresponding to the audio information processing method provided in the first embodiment above, the seventh embodiment of this application provides an audio information processing system.

The audio information processing system provided in the seventh embodiment of this application includes a client and a server. The client is configured to obtain audio information and send the audio information to the server. The server is configured to obtain a first audio feature corresponding to the audio information; encode the audio feature at a specified time according to the audio feature at the specified time in the first audio feature and the audio features adjacent to it, to obtain a second audio feature corresponding to the audio information; obtain decoded text information corresponding to the audio information; obtain text information corresponding to the audio information according to the second audio feature and the decoded text information; and provide the text information corresponding to the audio information to the client.
Although this application is disclosed above with preferred embodiments, they are not intended to limit this application. Any person skilled in the art may make possible changes and modifications without departing from the spirit and scope of this application; therefore, the protection scope of this application shall be subject to the scope defined by the claims of this application.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include forms of computer-readable media such as volatile memory, random access memory (RAM), and/or non-volatile memory, e.g., read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media; information storage can be implemented by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage media, or any other non-transmission media that can be used to store information accessible by computing devices. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

Those skilled in the art should understand that the embodiments of this application can be provided as a method, a system, or a computer program product. Therefore, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
Claims (25)
- An audio information processing method, comprising: obtaining a first audio feature corresponding to audio information; encoding the audio feature at a specified time according to the audio feature at the specified time in the first audio feature and the audio features adjacent to the audio feature at the specified time, to obtain a second audio feature corresponding to the audio information; obtaining decoded text information corresponding to the audio information; and obtaining text information corresponding to the audio information according to the second audio feature and the decoded text information.
- The audio information processing method according to claim 1, wherein the encoding the audio feature at the specified time according to the audio feature at the specified time in the first audio feature and the audio features adjacent to the audio feature at the specified time comprises: selecting audio features at multiple target times from the audio features adjacent to the audio feature at the specified time; and encoding the audio feature at the specified time according to the audio feature at the specified time and the audio features at the multiple target times.
- The audio information processing method according to claim 2, wherein the encoding the audio feature at the specified time according to the audio feature at the specified time and the audio features at the multiple target times comprises: encoding the audio feature at the specified time a first time according to the audio feature at the specified time and the audio features at the multiple target times, to obtain a first encoded audio feature corresponding to the first audio feature; encoding the audio feature at the specified time a second time according to the first encoded audio feature corresponding to the audio feature at the specified time and the first encoded audio features corresponding to the audio features at the multiple target times, to obtain a second encoded audio feature corresponding to the first audio feature, and performing the above steps in sequence until the number of encodings reaches the specified number of encodings, thereby completing the encoding of the audio feature at the specified time; and taking the final encoded audio feature corresponding to the first audio feature as the second audio feature.
- The audio information processing method according to claim 3, wherein the encoding the audio feature at the specified time a first time according to the audio feature at the specified time and the audio features at the multiple target times, to obtain the first encoded audio feature corresponding to the first audio feature, comprises: performing the first encoding pass according to the linear audio feature at the specified time, the non-linear audio feature at the specified time, the linear audio features at the multiple target times, and the non-linear audio features at the multiple target times, to obtain the first encoded audio feature corresponding to the first audio feature.
- The audio information processing method according to claim 3, wherein the encoding the audio feature at the specified time a second time according to the first encoded audio feature corresponding to the audio feature at the specified time and the first encoded audio features corresponding to the audio features at the multiple target times, to obtain the second encoded audio feature corresponding to the first audio feature, comprises: performing the second encoding pass according to the first encoded linear audio feature corresponding to the audio feature at the specified time, the first encoded non-linear audio feature corresponding to the audio feature at the specified time, the first encoded linear audio features corresponding to the audio features at the multiple target times, and the first encoded non-linear audio features corresponding to the audio features at the multiple target times, to obtain the second encoded audio feature corresponding to the first audio feature.
- The audio information processing method according to claim 5, further comprising: performing a linear transformation on the first encoded audio feature corresponding to the first audio feature, to obtain a first encoded linear audio feature corresponding to the first audio feature; and performing linear rectification on the first encoded linear audio feature corresponding to the first audio feature, to obtain a first encoded non-linear audio feature corresponding to the first audio feature.
- The audio information processing method according to claim 2, wherein the selecting audio features at multiple target times from the audio features adjacent to the audio feature at the specified time comprises: determining the range of the audio features adjacent to the audio feature at the specified time; and selecting the audio features at the multiple target times from the audio features adjacent to the audio feature at the specified time according to that range.
- The audio information processing method according to claim 7, wherein the determining the range of the audio features adjacent to the audio feature at the specified time comprises: determining a first range of audio features that precede and are adjacent to the audio feature at the specified time, and determining a second range of audio features that follow and are adjacent to the audio feature at the specified time; and the selecting the audio features at the multiple target times from the audio features adjacent to the audio feature at the specified time according to the range comprises: selecting the audio features at the multiple target times from the audio features adjacent to the audio feature at the specified time according to the first range and the second range.
- The audio information processing method according to claim 8, wherein the selecting the audio features at the multiple target times from the audio features adjacent to the audio feature at the specified time according to the first range and the second range comprises: determining a stride factor, the stride factor being the sampling time interval used when selecting the audio features at the multiple target times from the audio features adjacent to the audio feature at the specified time; and selecting the audio features at the multiple target times from the audio features adjacent to the audio feature at the specified time according to the stride factor, the first range, and the second range.
- The audio information processing method according to claim 9, wherein the selecting the audio features at the multiple target times from the audio features adjacent to the audio feature at the specified time according to the stride factor, the first range, and the second range comprises: selecting the audio features at the multiple target times from the audio features adjacent to the audio feature at the specified time according to a first stride factor and the first range.
- The audio information processing method according to claim 9, wherein the selecting the audio features at the multiple target times from the audio features adjacent to the audio feature at the specified time according to the stride factor, the first range, and the second range comprises: selecting the audio features at the multiple target times from the audio features adjacent to the audio feature at the specified time according to a second stride factor and the second range.
- The audio information processing method according to claim 1, wherein the obtaining the text information corresponding to the audio information according to the second audio feature and the decoded text information comprises: decoding the to-be-decoded audio information corresponding to the second audio information according to the second audio feature and the decoded text information, to obtain the text information corresponding to the audio information.
- The audio information processing method according to claim 12, wherein the decoding the to-be-decoded audio information corresponding to the second audio information according to the second audio feature and the decoded text information, to obtain the text information corresponding to the audio information, comprises: obtaining first to-be-decoded audio information corresponding to the second audio feature; decoding the first to-be-decoded audio information according to the second audio feature and the decoded text information, to obtain first decoded text information; obtaining second to-be-decoded audio information corresponding to the second audio feature; updating the first decoded text information to be the decoded information; decoding the second to-be-decoded audio information according to the second audio feature and the decoded text information, to obtain second decoded text information; and performing the above steps in sequence until all the to-be-decoded audio information corresponding to the second audio information has been decoded, to obtain the text information corresponding to the audio information.
- The audio information processing method according to claim 13, wherein the decoded information comprises: indication information used to indicate that the to-be-decoded audio information corresponding to the second audio information is to be decoded.
- The audio information processing method according to claim 13, wherein the decoding the first to-be-decoded audio information according to the second audio feature and the decoded text information, to obtain the first decoded text information, comprises: decoding the first to-be-decoded audio information according to the second audio feature and the decoded text information, to obtain the text information corresponding to the first to-be-decoded audio information; and obtaining the first decoded text information according to the text information corresponding to the first to-be-decoded audio information and the decoded text information.
- The audio information processing method according to claim 15, wherein the decoding the first to-be-decoded audio information according to the second audio feature and the decoded text information, to obtain the text information corresponding to the first to-be-decoded audio information, comprises: obtaining predicted values of the text units corresponding to the first to-be-decoded audio information according to the second audio feature and the decoded text information; obtaining the probability distribution of the text units; and taking the text unit with the highest probability value as the text information corresponding to the first to-be-decoded audio information.
- The audio information processing method according to claim 1, wherein the obtaining the first audio feature corresponding to the audio information comprises: obtaining the audio information; and performing feature extraction on the audio information to obtain the first audio feature.
- The audio information processing method according to claim 17, wherein the performing feature extraction on the audio information to obtain the first audio feature comprises: performing feature extraction on the audio information to obtain a first audio feature sequence corresponding to the audio information.
- The audio information processing method according to claim 1, further comprising: outputting the text information corresponding to the audio information.
- An audio information processing apparatus, comprising: a first audio feature obtaining unit configured to obtain a first audio feature corresponding to audio information; a second audio feature obtaining unit configured to encode the audio feature at a specified time according to the audio feature at the specified time in the first audio feature and the audio features adjacent to the audio feature at the specified time, to obtain a second audio feature corresponding to the audio information; a decoded text information obtaining unit configured to obtain decoded text information corresponding to the audio information; and a text information obtaining unit configured to obtain text information corresponding to the audio information according to the second audio feature and the decoded text information.
- An electronic device, comprising: a processor; and a memory configured to store a program of an audio information processing method, wherein after the device is powered on and the program of the audio information processing method is run by the processor, the following steps are performed: obtaining a first audio feature corresponding to audio information; encoding the audio feature at a specified time according to the audio feature at the specified time in the first audio feature and the audio features adjacent to the audio feature at the specified time, to obtain a second audio feature corresponding to the audio information; obtaining decoded text information corresponding to the audio information; and obtaining text information corresponding to the audio information according to the second audio feature and the decoded text information.
- A storage device storing a program of an audio information processing method, wherein when the program is run by a processor, the following steps are performed: obtaining a first audio feature corresponding to audio information; encoding the audio feature at a specified time according to the audio feature at the specified time in the first audio feature and the audio features adjacent to the audio feature at the specified time, to obtain a second audio feature corresponding to the audio information; obtaining decoded text information corresponding to the audio information; and obtaining text information corresponding to the audio information according to the second audio feature and the decoded text information.
- A smart speaker, comprising: an audio collection device and an audio recognition device, wherein the audio recognition device includes an audio feature extraction module, an audio feature encoding module, a decoded text storage module, and an audio feature decoding module; the audio collection device is configured to obtain audio information; the audio feature extraction module is configured to obtain a first audio feature corresponding to the audio information; the audio feature encoding module is configured to encode the audio feature at a specified time in the first audio feature according to that audio feature and the audio features adjacent to it, to obtain a second audio feature corresponding to the audio information; the decoded text storage module is configured to obtain decoded text information corresponding to the audio information; and the audio feature decoding module is configured to obtain text information corresponding to the audio information according to the second audio feature and the decoded text information.
- An in-vehicle intelligent voice interaction device, comprising: an audio collection device, an audio recognition device, and an execution device, wherein the audio recognition device includes an audio feature extraction module, an audio feature encoding module, a decoded text storage module, and an audio feature decoding module; the audio collection device is configured to obtain audio information; the audio feature extraction module is configured to obtain a first audio feature corresponding to the audio information; the audio feature encoding module is configured to encode the audio feature at a specified time in the first audio feature according to that audio feature and the audio features adjacent to it, to obtain a second audio feature corresponding to the audio information; the decoded text storage module is configured to obtain decoded text information corresponding to the audio information; the audio feature decoding module is configured to obtain text information corresponding to the audio information according to the second audio feature and the decoded text information; and the execution device is configured to execute corresponding instructions according to the text information corresponding to the audio information.
- An audio information processing system, comprising: a client and a server, wherein the client is configured to obtain audio information and send the audio information to the server; and the server is configured to obtain a first audio feature corresponding to the audio information; encode the audio feature at a specified time according to the audio feature at the specified time in the first audio feature and the audio features adjacent to it, to obtain a second audio feature corresponding to the audio information; obtain decoded text information corresponding to the audio information; obtain text information corresponding to the audio information according to the second audio feature and the decoded text information; and provide the text information corresponding to the audio information to the client.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/789,055 US20230047378A1 (en) | 2020-01-10 | 2021-01-08 | Processing accelerator architectures |
EP21738888.3A EP4089671A4 (en) | 2020-01-10 | 2021-01-08 | AUDIO INFORMATION PROCESSING METHOD AND APPARATUS, ELECTRONIC DEVICE AND STORAGE MEDIUM |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010026971.9 | 2020-01-10 | ||
CN202010026971.9A CN113112993B (zh) | 2020-01-10 | 2020-01-10 | 一种音频信息处理方法、装置、电子设备以及存储介质 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021139772A1 true WO2021139772A1 (zh) | 2021-07-15 |
Family
ID=76708744
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/070879 WO2021139772A1 (zh) | 2020-01-10 | 2021-01-08 | 一种音频信息处理方法、装置、电子设备以及存储介质 |
Country Status (4)
Country | Link |
---|---|
US (1) | US20230047378A1 (zh) |
EP (1) | EP4089671A4 (zh) |
CN (1) | CN113112993B (zh) |
WO (1) | WO2021139772A1 (zh) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20240066398A1 (en) * | 2022-08-27 | 2024-02-29 | Courtney Robinson | Gaming headset |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5640490A (en) * | 1994-11-14 | 1997-06-17 | Fonix Corporation | User independent, real-time speech recognition system and method |
CN101178897A (zh) * | 2007-12-05 | 2008-05-14 | 浙江大学 | 利用基频包络剔除情感语音的说话人识别方法 |
CN101740030A (zh) * | 2008-11-04 | 2010-06-16 | 北京中星微电子有限公司 | 语音信号的发送及接收方法、及其装置 |
CN103123787A (zh) * | 2011-11-21 | 2013-05-29 | 金峰 | 一种移动终端与媒体同步与交互的方法 |
CN103236260A (zh) * | 2013-03-29 | 2013-08-07 | 京东方科技集团股份有限公司 | 语音识别系统 |
CN107170453A (zh) * | 2017-05-18 | 2017-09-15 | 百度在线网络技术(北京)有限公司 | 基于人工智能的跨语种语音转录方法、设备及可读介质 |
CN108417202A (zh) * | 2018-01-19 | 2018-08-17 | 苏州思必驰信息科技有限公司 | 语音识别方法及系统 |
CN110197658A (zh) * | 2019-05-30 | 2019-09-03 | 百度在线网络技术(北京)有限公司 | 语音处理方法、装置以及电子设备 |
CN110444223A (zh) * | 2019-06-26 | 2019-11-12 | 平安科技(深圳)有限公司 | 基于循环神经网络和声学特征的说话人分离方法及装置 |
Family Cites Families (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
ES2224121T3 (es) * | 1994-04-01 | 2005-03-01 | Sony Corporation | Metodo y dispositivo para codificar y descodificar informacion. |
US6549147B1 (en) * | 1999-05-21 | 2003-04-15 | Nippon Telegraph And Telephone Corporation | Methods, apparatuses and recorded medium for reversible encoding and decoding |
JP3620787B2 (ja) * | 2000-02-28 | 2005-02-16 | カナース・データー株式会社 | 音声データの符号化方法 |
EP1288911B1 (en) * | 2001-08-08 | 2005-06-29 | Nippon Telegraph and Telephone Corporation | Emphasis detection for automatic speech summary |
EP1497935B1 (en) * | 2002-04-05 | 2008-02-20 | International Business Machines Corporation | Feature-based audio content identification |
JP4792703B2 (ja) * | 2004-02-26 | 2011-10-12 | 株式会社セガ | 音声解析装置、音声解析方法及び音声解析プログラム |
KR100903958B1 (ko) * | 2006-08-02 | 2009-06-25 | 엠텍비젼 주식회사 | 디지털 오디오 데이터의 복호화 방법, 디지털 오디오데이터의 복호화 장치 및 디지털 오디오 데이터의 복호화방법을 수행하는 기록매체 |
WO2011048815A1 (ja) * | 2009-10-21 | 2011-04-28 | パナソニック株式会社 | オーディオ符号化装置、復号装置、方法、回路およびプログラム |
CN102810314B (zh) * | 2011-06-02 | 2014-05-07 | 华为终端有限公司 | 音频编码方法及装置、音频解码方法及装置、编解码系统 |
CN102708862B (zh) * | 2012-04-27 | 2014-09-24 | 苏州思必驰信息科技有限公司 | 触控辅助的实时语音识别系统及其同步解码方法 |
CN106559588B (zh) * | 2015-09-30 | 2021-01-26 | 中兴通讯股份有限公司 | 一种文本信息上传的方法及装置 |
JP6517670B2 (ja) * | 2015-11-13 | 2019-05-22 | 日本電信電話株式会社 | 音声認識装置、音声認識方法及び音声認識プログラム |
CN107871497A (zh) * | 2016-09-23 | 2018-04-03 | 北京眼神科技有限公司 | 语音识别方法和装置 |
KR20180071029A (ko) * | 2016-12-19 | 2018-06-27 | 삼성전자주식회사 | 음성 인식 방법 및 장치 |
CN109661664B (zh) * | 2017-06-22 | 2021-04-27 | 腾讯科技(深圳)有限公司 | 一种信息处理的方法及相关装置 |
CN109509475B (zh) * | 2018-12-28 | 2021-11-23 | 出门问问信息科技有限公司 | 语音识别的方法、装置、电子设备及计算机可读存储介质 |
-
2020
- 2020-01-10 CN CN202010026971.9A patent/CN113112993B/zh active Active
-
2021
- 2021-01-08 EP EP21738888.3A patent/EP4089671A4/en active Pending
- 2021-01-08 WO PCT/CN2021/070879 patent/WO2021139772A1/zh unknown
- 2021-01-08 US US17/789,055 patent/US20230047378A1/en active Pending
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5640490A (en) * | 1994-11-14 | 1997-06-17 | Fonix Corporation | User independent, real-time speech recognition system and method |
CN101178897A (zh) * | 2007-12-05 | 2008-05-14 | 浙江大学 | 利用基频包络剔除情感语音的说话人识别方法 |
CN101740030A (zh) * | 2008-11-04 | 2010-06-16 | 北京中星微电子有限公司 | 语音信号的发送及接收方法、及其装置 |
CN103123787A (zh) * | 2011-11-21 | 2013-05-29 | 金峰 | 一种移动终端与媒体同步与交互的方法 |
CN103236260A (zh) * | 2013-03-29 | 2013-08-07 | 京东方科技集团股份有限公司 | 语音识别系统 |
CN107170453A (zh) * | 2017-05-18 | 2017-09-15 | 百度在线网络技术(北京)有限公司 | 基于人工智能的跨语种语音转录方法、设备及可读介质 |
CN108417202A (zh) * | 2018-01-19 | 2018-08-17 | 苏州思必驰信息科技有限公司 | 语音识别方法及系统 |
CN110197658A (zh) * | 2019-05-30 | 2019-09-03 | 百度在线网络技术(北京)有限公司 | 语音处理方法、装置以及电子设备 |
CN110444223A (zh) * | 2019-06-26 | 2019-11-12 | 平安科技(深圳)有限公司 | 基于循环神经网络和声学特征的说话人分离方法及装置 |
Non-Patent Citations (1)
Title |
---|
See also references of EP4089671A4 * |
Also Published As
Publication number | Publication date |
---|---|
CN113112993A (zh) | 2021-07-13 |
CN113112993B (zh) | 2024-04-02 |
EP4089671A4 (en) | 2024-02-21 |
US20230047378A1 (en) | 2023-02-16 |
EP4089671A1 (en) | 2022-11-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11132172B1 (en) | Low latency audio data pipeline | |
US20240028841A1 (en) | Speech translation method, device, and storage medium | |
JP2021086154A (ja) | 音声認識方法、装置、機器及びコンピュータ読み取り可能な記憶媒体 | |
WO2023222088A1 (zh) | 语音识别与分类方法和装置 | |
CN112767954B (zh) | 音频编解码方法、装置、介质及电子设备 | |
US11355097B2 (en) | Sample-efficient adaptive text-to-speech | |
EP3255633B1 (en) | Audio content recognition method and device | |
WO2021051765A1 (zh) | 一种语音合成方法及装置、存储介质 | |
US11996084B2 (en) | Speech synthesis method and apparatus, device and computer storage medium | |
US20230127787A1 (en) | Method and apparatus for converting voice timbre, method and apparatus for training model, device and medium | |
US20120201386A1 (en) | Automatic Generation of Metadata for Audio Dominance Effects | |
WO2023116660A2 (zh) | 一种模型训练以及音色转换方法、装置、设备及介质 | |
CN103413553B (zh) | 音频编码方法、音频解码方法、编码端、解码端和系统 | |
WO2024055752A1 (zh) | 语音合成模型的训练方法、语音合成方法和相关装置 | |
CN111798821A (zh) | 声音转换方法、装置、可读存储介质及电子设备 | |
CN113053357A (zh) | 语音合成方法、装置、设备和计算机可读存储介质 | |
US8868419B2 (en) | Generalizing text content summary from speech content | |
CN114495977B (zh) | 语音翻译和模型训练方法、装置、电子设备以及存储介质 | |
WO2021139772A1 (zh) | 一种音频信息处理方法、装置、电子设备以及存储介质 | |
JP2019525233A (ja) | 音声認識方法及び装置 | |
CN113409756B (zh) | 语音合成方法、系统、设备及存储介质 | |
US20230015112A1 (en) | Method and apparatus for processing speech, electronic device and storage medium | |
US12136428B1 (en) | Audio watermarking | |
US20240274120A1 (en) | Speech synthesis method and apparatus, electronic device, and readable storage medium | |
WO2024082928A1 (zh) | 语音处理方法、装置、设备和介质 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21738888 Country of ref document: EP Kind code of ref document: A1 |
NENP | Non-entry into the national phase |
Ref country code: DE |
ENP | Entry into the national phase |
Ref document number: 2021738888 Country of ref document: EP Effective date: 20220810 |