WO2020102979A1 - Voice information processing method and apparatus, storage medium, and electronic device - Google Patents

Voice information processing method and apparatus, storage medium, and electronic device

Info

Publication number
WO2020102979A1
WO2020102979A1 (application PCT/CN2018/116447 / CN2018116447W)
Authority
WO
WIPO (PCT)
Prior art keywords
target
spectrogram
voice information
preset
model
Prior art date
Application number
PCT/CN2018/116447
Other languages
English (en)
French (fr)
Inventor
陈岩 (Chen Yan)
Original Assignee
深圳市欢太科技有限公司
Oppo广东移动通信有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市欢太科技有限公司 and Oppo广东移动通信有限公司
Priority to PCT/CN2018/116447
Priority to CN201880098316.5A (CN112771608A)
Publication of WO2020102979A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/065 - Adaptation
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique

Definitions

  • the present application relates to the technical field of electronic equipment, and in particular, to a voice information processing method, device, storage medium, and electronic equipment.
  • When a mobile phone is in a call, it can collect the voice information in the current call environment in real time, analyze a noisiness value from that information, and adjust the call volume according to that value, so that the call volume is adjusted automatically as the noisiness of the call environment changes.
  • However, only the call volume is processed, and only according to the noisiness value in the voice information: the processing is simplistic, poorly targeted to the call scenario, and the processing efficiency of voice information is low.
  • Embodiments of the present application provide a voice information processing method, device, storage medium, and electronic equipment, which can improve the processing efficiency of voice information.
  • An embodiment of the present application provides a method for processing voice information, including: constructing a preset scene model, the preset scene model including a preset number of spectrograms; training the spectrograms in the preset scene model to generate a corresponding scene recognition model; collecting target voice information in the current environment and analyzing it to obtain a target spectrogram corresponding to the target voice information; and inputting the target spectrogram into the scene recognition model to determine the corresponding target preset scene model, and matching corresponding call parameters according to the target preset scene model.
  • an embodiment of the present application provides a voice information processing apparatus, including:
  • a construction unit configured to construct a preset scene model, the preset scene model including a preset number of spectrograms;
  • a training unit configured to train the spectrograms in the preset scene model to generate a corresponding scene recognition model;
  • an analysis unit configured to collect target voice information in the current environment and analyze it to obtain a target spectrogram corresponding to the target voice information;
  • an input unit configured to input the target spectrogram into the scene recognition model to determine the corresponding target preset scene model, and to match corresponding call parameters according to the target preset scene model.
  • A storage medium provided by an embodiment of the present application has a computer program stored thereon; when the computer program runs on a computer, the computer is caused to execute the method for processing voice information provided in any embodiment of the present application.
  • An electronic device provided by an embodiment of the present application includes a processor and a memory storing a computer program, wherein the processor, by calling the computer program, is configured to perform the steps of: constructing a preset scene model, the preset scene model including a preset number of spectrograms; training the spectrograms in the preset scene model to generate a corresponding scene recognition model; collecting target voice information in the current environment and analyzing it to obtain a target spectrogram corresponding to the target voice information; and inputting the target spectrogram into the scene recognition model to determine the corresponding target preset scene model, and matching corresponding call parameters according to the target preset scene model.
  • FIG. 1 is a schematic flowchart of a method for processing voice information provided by an embodiment of the present application.
  • FIG. 2 is another schematic flowchart of a method for processing voice information provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a scenario of a method for processing voice information provided by an embodiment of the present application.
  • FIG. 4 is a schematic block diagram of a device for processing voice information provided by an embodiment of the present application.
  • FIG. 5 is another schematic block diagram of a voice information processing apparatus provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 7 is another schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • The term "module" as used herein may be regarded as a software object executed on the computing system.
  • The various components, modules, engines, and services described herein may be regarded as objects implemented on the computing system.
  • The apparatus and method described herein are preferably implemented in software, but may of course also be implemented in hardware; both fall within the protection scope of the present application.
  • An embodiment of the present application provides a method for processing voice information.
  • The execution subject of the method for processing voice information may be the voice information processing apparatus provided in the embodiments of the present application, or an electronic device integrating that apparatus, where the apparatus may be implemented in hardware or in software.
  • The electronic device may be a smartphone, a tablet computer, a personal digital assistant (PDA), or the like.
  • An embodiment of the present invention provides a method for processing voice information, including: constructing a preset scene model, the preset scene model including a preset number of spectrograms; training the spectrograms in the preset scene model to generate a corresponding scene recognition model; collecting target voice information in the current environment and analyzing it to obtain a target spectrogram corresponding to the target voice information; and inputting the target spectrogram into the scene recognition model to determine the corresponding target preset scene model, and matching corresponding call parameters according to the target preset scene model.
  • In one implementation, the step of constructing a preset scene model may include: collecting a preset amount of voice information at a preset sampling rate; and converting the preset amount of voice information into corresponding spectrograms and constructing a preset scene model from the spectrograms.
  • In one implementation, the step of converting the preset amount of voice information into corresponding spectrograms may include: framing the voice information to obtain first framed data; and performing a Fourier transform on the first framed data to generate the spectrogram corresponding to the voice information.
  • In one implementation, the step of training the spectrograms in the preset scene model to generate a corresponding scene recognition model may include: using a convolutional neural network to train the spectrograms in the preset scene model to generate the corresponding scene recognition model.
  • In one implementation, the step of analyzing the target voice information to obtain a target spectrogram corresponding to the target voice information may include: framing the target voice information to obtain second framed data; and performing a Fourier transform on the second framed data to obtain the target spectrogram corresponding to the target voice information.
  • In one implementation, the step of inputting the target spectrogram into the scene recognition model to determine the corresponding target preset scene model may include: inputting the target spectrogram into the scene recognition model; and performing feature traversal on the target spectrogram through the scene recognition model to determine the target preset scene model corresponding to the target spectrogram.
  • In one implementation, after the step of matching corresponding call parameters according to the target preset scene model, the method may further include: generating corresponding prompt information to prompt the user to adjust the call with the matched call parameters; and, when a confirmation instruction corresponding to the prompt information is received, adjusting the call according to the matched call parameters.
  • FIG. 1 is a schematic flowchart of a voice information processing method according to an embodiment of the present application.
  • the voice information processing method may include the following steps:
  • In step S101, a preset scene model is constructed.
  • The preset scene models correspond to scenes in which a user's call may take place, such as a road scene, a subway scene, a strong-wind scene, a rain scene, or a scene noisy with voices.
  • Different call parameters can be associated with different scene models, such as different noise reduction, equalizer, and sound-smoothing processing for call voices in different scenarios, so that better call parameters are adopted in the corresponding scenario and a better call effect is achieved in that scenario.
  • The electronic device can collect a preset amount of voice information in a specific scene and convert it into corresponding spectrograms. The abscissa of a spectrogram is time, the ordinate is frequency, and the color depth represents the energy of the voice data. Because the spectrogram expresses the characteristics of the voice information in multiple dimensions, a preset scene model can be constructed from the multiple spectrograms.
  • the step of constructing the preset scene model may include:
  • The electronic device can collect, through a microphone at a preset sampling rate such as 44.1 kHz (kilohertz), a preset amount of voice information in a preset scene, and intercept 2 seconds of voice content from each piece of voice information as an input signal. The multiple input signals are converted into corresponding spectrograms, and the converted spectrograms are assembled into a preset scene model; the constructed model contains multiple spectrograms of the corresponding scene and can reflect the voice features of that scene.
  • the step of converting the preset amount of voice information into a corresponding spectrogram may include:
  • The input signal is framed and windowed with a frame length of 1024, an overlap of 128, and a Hamming window, yielding the first framed data. A Fourier transform is then performed on the first framed data to calculate the energy density of the signal and generate the spectrogram. The spectrogram can be converted to grayscale, in which the abscissa is time, the ordinate is frequency, and the grayscale value represents the energy value.
  • In step S102, the spectrograms in the preset scene model are trained to generate a corresponding scene recognition model.
  • Because each preset scene model contains a preset number of spectrograms of the corresponding scene, machine learning can be used to train on those spectrograms and generate a scene recognition model that identifies the scene.
  • In one implementation, a convolutional neural network may be used to train on the spectrograms in the preset scene models, producing a scene recognition model that can automatically identify the distinguishing features of the corresponding scene.
  • In step S103, target voice information in the current environment is collected and analyzed to obtain the target spectrogram corresponding to the target voice information.
  • When the electronic device is in a call, the user usually wants to call with the best call parameters to ensure the best call effect. At present, the user can only select call parameters manually, which is cumbersome, while automatic adjustment can usually only change the loudness of the call according to the noisiness of the environment; both the adjustment behavior and the processing of the voice information are simplistic.
  • When the electronic device is in a call, it automatically collects the target voice information in the current environment through the microphone and converts it into the corresponding target spectrogram, in which the abscissa is time, the ordinate is frequency, and the color depth represents the energy of the voice data. It should be noted that the features of the target spectrogram are of the same kind as those of the spectrograms in the preset scene models.
  • In some implementations, the step of analyzing the target voice information to obtain the corresponding target spectrogram may include the following. The target voice information is framed and windowed with a frame length of 1024, an overlap of 128, and a Hamming window, yielding the second framed data. A Fourier transform is performed on the second framed data to calculate the energy density of the signal and generate the corresponding target spectrogram. The target spectrogram can also be converted to grayscale: its abscissa is time, its ordinate is frequency, and its grayscale value represents the energy value, so it has the same kind of features as the spectrograms in the preset scene models.
  • In step S104, the target spectrogram is input into the scene recognition model to determine the corresponding target preset scene model, and the corresponding call parameters are matched according to the target preset scene model.
  • Because the three-dimensional features of the target spectrogram (abscissa: time; ordinate: frequency; grayscale value: energy) are of the same kind as those of the spectrograms in the preset scene models, the target spectrogram can be input into the scene recognition model. The model traverses the features of the target spectrogram one by one and identifies the target preset scene model corresponding to it, such as a subway scene; the call parameters adapted to that scene model are then matched according to the target preset scene model, so that the call can proceed with parameters suited to the current environment, improving the user's call efficiency.
  • the step of inputting the target spectrogram into the scene recognition model to determine the corresponding target preset scene model may include:
  • Because the features of the target spectrogram are of the same kind as those in the preset scene models, the target spectrogram can be input into the scene recognition model; having been trained, the model performs feature traversal on the target spectrogram, automatically identifies its distinguishing features, and determines the corresponding target preset scene model from those features.
  • In summary, the voice information processing method provided by this embodiment constructs preset scene models, each including a preset number of spectrograms; trains the spectrograms in the preset scene models to generate a corresponding scene recognition model; collects target voice information in the current environment and analyzes it to obtain the corresponding target spectrogram; and inputs the target spectrogram into the scene recognition model to determine the corresponding target preset scene model and match the corresponding call parameters.
  • In this way, a scene recognition model capable of identifying the scene is trained from preset scene models built from a preset number of spectrograms; target voice information in the current environment is collected in real time and converted into a target spectrogram, which is input into the scene recognition model to identify the scene of the current environment; and suitable call parameters are matched according to that scene. This improves the processing efficiency of voice information and makes recognition of the call scene more accurate.
  • FIG. 2 is another schematic flowchart of a method for processing voice information according to an embodiment of the present application.
  • the method includes:
  • In step S201, a preset amount of voice information is collected at a preset sampling rate.
  • An electronic device such as a mobile phone can collect, through a microphone at a sampling rate of 44.1 kHz (kilohertz), 500 pieces of voice information in a preset scene; the duration of each piece can be limited to 2 seconds, and the 2-second voice information is used as a voice input signal.
  • In step S202, the voice information is framed to obtain first framed data.
  • The voice input signal can be framed and windowed with a frame length of 1024, an overlap of 128, and a Hamming window, yielding the first framed data.
  • In step S203, a Fourier transform is performed on the first framed data to generate the spectrogram corresponding to the voice information, and a preset scene model is constructed from the spectrograms.
  • A Fourier transform is performed on the first framed data to calculate the energy density of the signal and generate a grayscale spectrogram. FIG. 3 is a schematic diagram of such a grayscale spectrogram: the abscissa is time, the ordinate is frequency, and the grayscale value represents the energy value. The spectrogram thus reflects the characteristics of the voice signal from multiple dimensions.
  • From the 500 spectrograms of a preset scene, the preset scene model corresponding to that scene is constructed; each preset scene model includes the 500 spectrograms of its scene. For example, a road scene includes 500 spectrograms, a subway scene includes 500 spectrograms, and so on.
  • In step S204, a convolutional neural network is used to train the spectrograms in the preset scene model to generate a corresponding scene recognition model.
  • A convolutional neural network (CNN) is a class of feedforward neural networks that involves convolution computations and has a deep structure; it is one of the representative algorithms of deep learning. The CNN can be trained on the spectrograms in the preset scene models to generate a scene recognition model that identifies distinguishing features; that is, the scene recognition model can automatically identify the distinguishing features in a spectrogram to determine the preset scene model to which the spectrogram belongs.
  • In step S205, target voice information in the current environment is collected and framed to obtain second framed data.
  • When the mobile phone is in a call, the target voice information in the current call environment can be collected through the microphone and framed and windowed, again with a frame length of 1024, an overlap of 128, and a Hamming window, yielding the second framed data.
  • In step S206, a Fourier transform is performed on the second framed data to obtain the target spectrogram corresponding to the target voice information.
  • The mobile phone performs a Fourier transform on the second framed data, calculates the energy density of the signal, and generates the corresponding target spectrogram.
  • The target spectrogram can also be converted to grayscale: its abscissa is time, its ordinate is frequency, and its grayscale value represents the energy value, so it has the same kind of features as the spectrogram of the preset scene model shown in FIG. 3.
  • In step S207, the target spectrogram is input into the scene recognition model, and feature traversal is performed on it through the scene recognition model to determine the target preset scene model corresponding to the target spectrogram.
  • The target spectrogram of the current call environment is input into the scene recognition model, which traverses its features one by one, identifies the corresponding target distinguishing features, and determines from them the target preset scene model to which the target spectrogram belongs.
  • In step S208, the corresponding call parameters are matched according to the target preset scene model.
  • The mobile phone associates different call parameters with each preset scene model so that, in the corresponding preset scene, the call proceeds with the best call parameters; for example, the road scene is associated with a first call parameter and the subway scene with a second call parameter, the two being different. Therefore, when the target preset scene model is the subway scene, the corresponding second call parameter is matched.
  • In step S209, corresponding prompt information is generated to prompt the user to adjust the call with the matched call parameters; when a confirmation instruction corresponding to the prompt information is received, the call is adjusted according to the matched call parameters.
  • When the mobile phone determines the second call parameter, it can generate corresponding prompt information, such as "Call with the call parameters suited to the current scene?"; the user can then select yes or no. When the user selects yes, a confirmation instruction is generated and received, and the call is adjusted according to the matched second call parameter.
  • In summary, the voice information processing method provided by this embodiment collects a preset amount of voice information at a preset sampling rate; frames the voice information to obtain first framed data; performs a Fourier transform on the first framed data to generate the corresponding spectrograms and constructs preset scene models from them; trains the spectrograms in the preset scene models with a convolutional neural network to generate the corresponding scene recognition model; collects target voice information in the current environment and analyzes it to obtain the corresponding target spectrogram; and inputs the target spectrogram into the scene recognition model to determine the corresponding target preset scene model and match the corresponding call parameters.
  • In this way, a scene recognition model capable of identifying the scene is trained from preset scene models built from a preset number of spectrograms; target voice information in the current environment is collected in real time and converted into a target spectrogram, which is input into the scene recognition model to identify the scene of the current environment; and suitable call parameters are matched according to that scene. This improves the processing efficiency of voice information and makes recognition of the call scene more accurate.
  • To facilitate implementation of the voice information processing method described above, the embodiments of the present application further provide an apparatus based on that method. The terms have the same meanings as in the method above; for implementation details, refer to the description in the method embodiments.
  • An embodiment of the present invention provides a voice information processing device, including:
  • a construction unit configured to construct a preset scene model, the preset scene model including a preset number of spectrograms;
  • a training unit configured to train the spectrograms in the preset scene model to generate a corresponding scene recognition model;
  • an analysis unit configured to collect target voice information in the current environment and analyze it to obtain a target spectrogram corresponding to the target voice information;
  • the input unit is configured to input the target spectrogram into a scene recognition model to determine a corresponding target preset scene model, and match corresponding call parameters according to the target preset scene model.
  • In one implementation, the construction unit may include a collection subunit and a conversion subunit. The collection subunit is configured to collect a preset amount of voice information at a preset sampling rate; the conversion subunit is configured to convert the preset amount of voice information into corresponding spectrograms and construct a preset scene model from them.
  • In one implementation, the conversion subunit is specifically configured to: frame the voice information to obtain first framed data; and perform a Fourier transform on the first framed data to generate the spectrogram corresponding to the voice information, a preset scene model being constructed from the spectrograms.
  • In one implementation, the training unit is specifically configured to use a convolutional neural network to train the spectrograms in the preset scene model to generate a corresponding scene recognition model.
  • In one implementation, the analysis unit is specifically configured to: collect target voice information in the current environment and frame it to obtain second framed data; and perform a Fourier transform on the second framed data to obtain the target spectrogram corresponding to the target voice information.
  • FIG. 4 is a schematic block diagram of a voice information processing apparatus provided by an embodiment of the present application.
  • the voice information processing device 300 includes a construction unit 31, a training unit 32, an analysis unit 33, and an input unit 34.
  • the construction unit 31 is configured to construct a preset scene model, and the preset scene model includes a preset number of spectrograms.
  • The construction unit 31 can collect a preset amount of voice information in a specific scene and convert it into corresponding spectrograms, in which the abscissa is time, the ordinate is frequency, and the color depth represents the energy of the voice data. Because the spectrogram expresses the characteristics of the voice information in multiple dimensions, a preset scene model can be constructed from the multiple spectrograms.
  • the training unit 32 is configured to train the spectrogram in the preset scene model to generate a corresponding scene recognition model.
  • Because each preset scene model contains a preset number of spectrograms of the corresponding scene, the training unit 32 can use machine learning to train on those spectrograms and generate a scene recognition model that can recognize the scene.
  • the training unit 32 may learn and train the spectrogram in the preset scene model through a convolutional neural network to generate a scene recognition model that can automatically identify the identifying features of the corresponding scene.
  • The analysis unit 33 is configured to collect target voice information in the current environment and analyze it to obtain the target spectrogram corresponding to the target voice information.
  • When the electronic device is in a call, the analysis unit 33 automatically collects the target voice information in the current environment through the microphone and converts it into the corresponding target spectrogram, in which the abscissa is time, the ordinate is frequency, and the color depth represents the energy of the voice data. It should be noted that the features of the target spectrogram are of the same kind as those of the spectrograms in the preset scene models.
  • In some implementations, the analysis unit 33 is specifically configured to: collect target voice information in the current environment and frame it to obtain second framed data; and perform a Fourier transform on the second framed data to obtain the target spectrogram corresponding to the target voice information.
  • the input unit 34 is configured to input the target spectrogram into a scene recognition model to determine a corresponding target preset scene model, and match corresponding call parameters according to the target preset scene model.
  • Because the three-dimensional features of the target spectrogram (abscissa: time; ordinate: frequency; grayscale value: energy) are of the same kind as those of the spectrograms in the preset scene models, the input unit 34 can input the target spectrogram into the scene recognition model. The model traverses the features of the target spectrogram one by one and identifies the corresponding target preset scene model, such as a subway scene; the call parameters adapted to that scene model are then matched according to the target preset scene model, so that the call can proceed with parameters suited to the current environment, improving the user's call efficiency.
  • In some implementations, the input unit 34 is specifically configured to: input the target spectrogram into the scene recognition model; and perform feature traversal on the target spectrogram through the scene recognition model to determine the corresponding target preset scene model, the corresponding call parameters being matched according to the target preset scene model.
  • FIG. 5 is another schematic block diagram of the voice information processing apparatus provided by an embodiment of the present application. In the apparatus 300, the construction unit 31 may include a collection subunit 311 and a conversion subunit 312.
  • The collection subunit 311 is configured to collect a preset amount of voice information at a preset sampling rate.
  • The conversion subunit 312 is configured to convert the preset amount of voice information into corresponding spectrograms and construct a preset scene model from them.
  • In some implementations, the conversion subunit 312 is specifically configured to: frame the voice information to obtain first framed data; and perform a Fourier transform on the first framed data to generate the spectrogram corresponding to the voice information, a preset scene model being constructed from the spectrograms.
  • An embodiment of the present application further provides an electronic device. Referring to FIG. 6, the electronic device 500 includes a processor 501 and a memory 502.
  • the processor 501 and the memory 502 are electrically connected.
  • The processor 501 is the control center of the electronic device 500. It connects the various parts of the entire electronic device through various interfaces and lines, and, by running or loading the computer program stored in the memory 502 and calling the data stored in the memory 502, executes the various functions of the electronic device 500 and processes data, thereby monitoring the electronic device 500 as a whole.
  • the memory 502 can be used to store software programs and modules.
  • the processor 501 runs computer programs and modules stored in the memory 502 to execute various functional applications and data processing.
  • The memory 502 may mainly include a program storage area and a data storage area. The program storage area may store the operating system and the computer programs required by at least one function (such as a sound playback function or an image playback function); the data storage area may store data created through the use of the electronic device.
  • In addition, the memory 502 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. Accordingly, the memory 502 may further include a memory controller to provide the processor 501 with access to the memory 502.
  • In this embodiment of the application, the processor 501 in the electronic device 500 loads the instructions corresponding to the processes of one or more computer programs into the memory 502 and runs the computer programs stored in the memory 502, thereby implementing various functions as follows:
  • constructing a preset scene model, the preset scene model including a preset number of spectrograms; training the spectrograms in the preset scene model to generate a corresponding scene recognition model; collecting target voice information in the current environment and analyzing it to obtain a target spectrogram corresponding to the target voice information; and inputting the target spectrogram into the scene recognition model to determine the corresponding target preset scene model, and matching corresponding call parameters according to the target preset scene model.
  • In some implementations, when constructing the preset scene model, the processor 501 may specifically perform the steps of: collecting a preset amount of voice information at a preset sampling rate; and converting the preset amount of voice information into corresponding spectrograms and constructing a preset scene model from them.
  • In some implementations, when converting the preset amount of voice information into corresponding spectrograms, the processor 501 may specifically perform the steps of: framing the voice information to obtain first framed data; and performing a Fourier transform on the first framed data to generate the spectrogram corresponding to the voice information.
  • In some implementations, when training the spectrograms in the preset scene model to generate a corresponding scene recognition model, the processor 501 may specifically perform the step of: using a convolutional neural network to train the spectrograms in the preset scene model to generate the corresponding scene recognition model.
  • In some implementations, when analyzing the target voice information to obtain the corresponding target spectrogram, the processor 501 may specifically perform the steps of: framing the target voice information to obtain second framed data; and performing a Fourier transform on the second framed data to obtain the target spectrogram corresponding to the target voice information.
  • In some implementations, when inputting the target spectrogram into the scene recognition model to determine the corresponding target preset scene model, the processor 501 may specifically perform the steps of: inputting the target spectrogram into the scene recognition model; and performing feature traversal on the target spectrogram through the scene recognition model to determine the corresponding target preset scene model.
  • In some implementations, after matching the corresponding call parameters according to the target preset scene model, the processor 501 may further perform the steps of: generating corresponding prompt information to prompt the user to adjust the call with the matched call parameters; and, when a confirmation instruction corresponding to the prompt information is received, adjusting the call according to the matched call parameters.
  • the electronic device 500 may further include: a display 503, a radio frequency circuit 504, an audio circuit 505, and a power supply 506.
  • the display 503, the radio frequency circuit 504, the audio circuit 505, and the power supply 506 are electrically connected to the processor 501, respectively.
  • the display 503 can be used to display information input by the user or provided to the user and various graphical user interfaces, which can be composed of graphics, text, icons, video, and any combination thereof.
  • the display 503 may include a display panel.
  • The display panel may be configured in the form of a liquid crystal display (LCD) or an organic light-emitting diode (OLED) display.
  • The radio frequency circuit 504 may be used to transmit and receive radio frequency signals, so as to establish wireless communication with network devices or other electronic devices and to exchange signals with them.
  • the audio circuit 505 can be used to provide an audio interface between a user and an electronic device through speakers and microphones.
  • the power supply 506 can be used to power various components of the electronic device 500.
  • the power supply 506 may be logically connected to the processor 501 through a power management system, so as to implement functions such as charging, discharging, and power management through the power management system.
  • the electronic device 500 may further include a camera, a Bluetooth module, etc., which will not be repeated here.
  • An embodiment of the present application also provides a storage medium that stores a computer program; when the computer program runs on a computer, the computer is caused to execute the voice information processing method of any of the foregoing embodiments, for example: constructing a preset scene model, the preset scene model including a preset number of spectrograms; training the spectrograms in the preset scene model to generate a corresponding scene recognition model; collecting target voice information in the current environment and analyzing it to obtain a target spectrogram corresponding to the target voice information; and inputting the target spectrogram into the scene recognition model to determine the corresponding target preset scene model, and matching corresponding call parameters according to the target preset scene model.
  • The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
  • The computer program may be stored in a computer-readable storage medium, such as the memory of the electronic device, and executed by at least one processor in the electronic device; the execution process may include the flow of the embodiments of the voice information processing method. The storage medium may be a magnetic disk, an optical disc, a read-only memory, a random access memory, or the like.
  • each functional module may be integrated into one processing chip, or each module may exist alone physically, or two or more modules are integrated into one module.
  • The above integrated modules may be implemented in the form of hardware or in the form of software function modules. If an integrated module is implemented in the form of a software function module and sold or used as an independent product, it may also be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)
  • Telephone Function (AREA)

Abstract

A voice information processing method and apparatus, a storage medium, and an electronic device. The processing method includes: constructing a preset scene model, the preset scene model including a preset number of spectrograms (S101); training the spectrograms in the preset scene model to generate a corresponding scene recognition model (S102); collecting target voice information in the current environment and analyzing the target voice information to obtain a target spectrogram corresponding to the target voice information (S103); and inputting the target spectrogram into the scene recognition model to determine the corresponding target preset scene model and matching corresponding call parameters according to the target preset scene model (S104). The method improves the processing efficiency of voice information.

Description

Voice Information Processing Method and Apparatus, Storage Medium, and Electronic Device
Technical Field
The present application relates to the technical field of electronic devices, and in particular to a voice information processing method and apparatus, a storage medium, and an electronic device.
Background
With the continuous development of electronic technology, electronic devices such as mobile phones have become ever more capable, and users' requirements for voice calls have risen accordingly: users want to flexibly select suitable call parameters in different call scenarios so as to achieve a better call effect.
At present, when a mobile phone is in a call, it can collect the voice information in the current call environment in real time, analyze a noisiness value from that information, and adjust the call volume according to that value, so that the volume is adjusted automatically as the noisiness of the call environment changes. However, only the call volume is processed, and only according to the noisiness value in the voice information; the processing is simplistic, poorly targeted to the call scenario, and the processing efficiency of voice information is low.
Summary
Embodiments of the present application provide a voice information processing method and apparatus, a storage medium, and an electronic device, which can improve the processing efficiency of voice information.
In a first aspect, an embodiment of the present application provides a voice information processing method, including:
constructing a preset scene model, the preset scene model including a preset number of spectrograms;
training the spectrograms in the preset scene model to generate a corresponding scene recognition model;
collecting target voice information in the current environment and analyzing the target voice information to obtain a target spectrogram corresponding to the target voice information; and
inputting the target spectrogram into the scene recognition model to determine the corresponding target preset scene model, and matching corresponding call parameters according to the target preset scene model.
In a second aspect, an embodiment of the present application provides a voice information processing apparatus, including:
a construction unit configured to construct a preset scene model, the preset scene model including a preset number of spectrograms;
a training unit configured to train the spectrograms in the preset scene model to generate a corresponding scene recognition model;
an analysis unit configured to collect target voice information in the current environment and analyze the target voice information to obtain a target spectrogram corresponding to the target voice information; and
an input unit configured to input the target spectrogram into the scene recognition model to determine the corresponding target preset scene model, and to match corresponding call parameters according to the target preset scene model.
In a third aspect, an embodiment of the present application provides a storage medium on which a computer program is stored; when the computer program runs on a computer, the computer is caused to execute the voice information processing method provided in any embodiment of the present application.
In a fourth aspect, an embodiment of the present application provides an electronic device including a processor and a memory storing a computer program, wherein the processor, by calling the computer program, is configured to perform the steps of:
constructing a preset scene model, the preset scene model including a preset number of spectrograms;
training the spectrograms in the preset scene model to generate a corresponding scene recognition model;
collecting target voice information in the current environment and analyzing the target voice information to obtain a target spectrogram corresponding to the target voice information; and
inputting the target spectrogram into the scene recognition model to determine the corresponding target preset scene model, and matching corresponding call parameters according to the target preset scene model.
Brief Description of the Drawings
The technical solution of the present application and its other beneficial effects will become apparent from the following detailed description of specific embodiments of the present application, taken in conjunction with the accompanying drawings.
FIG. 1 is a schematic flowchart of a voice information processing method provided by an embodiment of the present application.
FIG. 2 is another schematic flowchart of the voice information processing method provided by an embodiment of the present application.
FIG. 3 is a schematic scene diagram of the voice information processing method provided by an embodiment of the present application.
FIG. 4 is a schematic block diagram of a voice information processing apparatus provided by an embodiment of the present application.
FIG. 5 is another schematic block diagram of the voice information processing apparatus provided by an embodiment of the present application.
FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
FIG. 7 is another schematic structural diagram of the electronic device provided by an embodiment of the present application.
Detailed Description
Reference is made to the drawings, in which identical reference numerals denote identical components. The principles of the present application are illustrated as implemented in a suitable computing environment. The following description is based on the illustrated specific embodiments of the present application and should not be regarded as limiting other specific embodiments not detailed herein.
The term "module" as used herein may be regarded as a software object executed on the computing system. The various components, modules, engines, and services described herein may be regarded as objects implemented on the computing system. The apparatus and method described herein are preferably implemented in software, but may of course also be implemented in hardware; both fall within the protection scope of the present application.
An embodiment of the present application provides a voice information processing method. The execution subject of the method may be the voice information processing apparatus provided in the embodiments of the present application, or an electronic device integrating that apparatus, where the apparatus may be implemented in hardware or in software. The electronic device may be a smartphone, a tablet computer, a personal digital assistant (PDA), or the like.
A detailed analysis and description follow.
An embodiment of the present invention provides a voice information processing method, including:
constructing a preset scene model, the preset scene model including a preset number of spectrograms;
training the spectrograms in the preset scene model to generate a corresponding scene recognition model;
collecting target voice information in the current environment and analyzing the target voice information to obtain a target spectrogram corresponding to the target voice information; and
inputting the target spectrogram into the scene recognition model to determine the corresponding target preset scene model, and matching corresponding call parameters according to the target preset scene model.
In one implementation, the step of constructing a preset scene model may include: collecting a preset amount of voice information at a preset sampling rate; and converting the preset amount of voice information into corresponding spectrograms and constructing a preset scene model from the spectrograms.
In one implementation, the step of converting the preset amount of voice information into corresponding spectrograms may include: framing the voice information to obtain first framed data; and performing a Fourier transform on the first framed data to generate the spectrogram corresponding to the voice information.
In one implementation, the step of training the spectrograms in the preset scene model to generate a corresponding scene recognition model may include: using a convolutional neural network to train the spectrograms in the preset scene model to generate the corresponding scene recognition model.
In one implementation, the step of analyzing the target voice information to obtain the corresponding target spectrogram may include: framing the target voice information to obtain second framed data; and performing a Fourier transform on the second framed data to obtain the target spectrogram corresponding to the target voice information.
In one implementation, the step of inputting the target spectrogram into the scene recognition model to determine the corresponding target preset scene model may include: inputting the target spectrogram into the scene recognition model; and performing feature traversal on the target spectrogram through the scene recognition model to determine the target preset scene model corresponding to the target spectrogram.
In one implementation, after the step of matching corresponding call parameters according to the target preset scene model, the method may further include: generating corresponding prompt information to prompt the user to adjust the call with the matched call parameters; and, when a confirmation instruction corresponding to the prompt information is received, adjusting the call according to the matched call parameters.
An embodiment of the present application provides a voice information processing method. As shown in FIG. 1, which is a schematic flowchart of the method, the method may include the following steps.
In step S101, a preset scene model is constructed.
It should be noted that a preset scene model corresponds to a scene in which a user's call may take place, such as a road scene, a subway scene, a strong-wind scene, a rain scene, or a scene noisy with voices. Different call parameters can be associated with different scene models, such as different noise reduction, equalizer, and sound-smoothing processing for call voices in different scenarios, so that better call parameters are adopted in the corresponding scenario and a better call effect is achieved in that scenario.
The electronic device can collect a preset amount of voice information in a specific scene and convert it into corresponding spectrograms. The abscissa of a spectrogram is time, the ordinate is frequency, and the color depth represents the energy of the voice data. Because the spectrogram expresses the characteristics of the voice information in multiple dimensions, a preset scene model can be constructed from the multiple spectrograms.
In some implementations, the step of constructing the preset scene model may include:
(1) collecting a preset amount of voice information at a preset sampling rate;
(2) converting the preset amount of voice information into corresponding spectrograms, and constructing a preset scene model from the spectrograms.
The electronic device can collect, through a microphone at a preset sampling rate such as 44.1 kHz (kilohertz), a preset amount of voice information in a preset scene, and intercept 2 seconds of voice content from each piece of voice information as an input signal. The multiple input signals are converted into corresponding spectrograms, and the converted spectrograms are assembled into a preset scene model; the constructed model contains multiple spectrograms of the corresponding scene and can reflect the voice features of that scene.
In some implementations, the step of converting the preset amount of voice information into corresponding spectrograms may include:
(1.1) framing the voice information to obtain first framed data;
(1.2) performing a Fourier transform on the first framed data to generate the spectrogram corresponding to the voice information.
After the corresponding input signal is intercepted, it is framed and windowed with a frame length of 1024, an overlap of 128, and a Hamming window, yielding the first framed data. A Fourier transform is performed on the first framed data to calculate the energy density of the signal and generate the spectrogram. The spectrogram can then be converted to grayscale, in which the abscissa is time, the ordinate is frequency, and the grayscale value represents the energy value.
In step S102, the spectrograms in the preset scene model are trained to generate a corresponding scene recognition model.
Because each preset scene model contains a preset number of spectrograms of the corresponding scene, machine learning can be used to train on those spectrograms and generate a scene recognition model that identifies the scene.
In one implementation, a convolutional neural network may be used to train on the spectrograms in the preset scene models, producing a scene recognition model that can automatically identify the distinguishing features of the corresponding scene.
In step S103, target voice information in the current environment is collected and analyzed to obtain the target spectrogram corresponding to the target voice information.
It should be noted that, when the electronic device is in a call, the user usually wants to call with the best call parameters to ensure the best call effect. At present, however, the user can only select call parameters manually, which is cumbersome, while automatic adjustment can usually only change the loudness of the call according to the noisiness of the environment; both the adjustment behavior and the processing of the voice information are simplistic.
When the electronic device is in a call, it automatically collects the target voice information in the current environment through the microphone and converts it into the corresponding target spectrogram, in which the abscissa is time, the ordinate is frequency, and the color depth represents the energy of the voice data. It should be noted that the features of the target spectrogram are of the same kind as those of the spectrograms in the preset scene models.
In some implementations, the step of analyzing the target voice information to obtain the corresponding target spectrogram may include:
(1) framing the target voice information to obtain second framed data;
(2) performing a Fourier transform on the second framed data to obtain the target spectrogram corresponding to the target voice information.
The target voice information can be framed and windowed, again with a frame length of 1024, an overlap of 128, and a Hamming window, yielding the second framed data. A Fourier transform is performed on the second framed data to calculate the energy density of the signal and generate the corresponding target spectrogram. The target spectrogram can also be converted to grayscale: its abscissa is time, its ordinate is frequency, and its grayscale value represents the energy value, the same kind of features as the spectrograms in the preset scene models.
In step S104, the target spectrogram is input into the scene recognition model to determine the corresponding target preset scene model, and the corresponding call parameters are matched according to the target preset scene model.
Because the three-dimensional features of the target spectrogram (abscissa: time; ordinate: frequency; grayscale value: energy) are of the same kind as those of the spectrograms in the preset scene models, the target spectrogram can be input into the scene recognition model. The model traverses the features of the target spectrogram one by one and identifies the target preset scene model corresponding to it, such as a subway scene; the call parameters adapted to that scene model are then matched according to the target preset scene model, so that the call can proceed with parameters suited to the current environment, improving the user's call efficiency.
In some implementations, the step of inputting the target spectrogram into the scene recognition model to determine the corresponding target preset scene model may include:
(1) inputting the target spectrogram into the scene recognition model;
(2) performing feature traversal on the target spectrogram through the scene recognition model to determine the target preset scene model corresponding to the target spectrogram.
Because the features of the target spectrogram are of the same kind as those in the preset scene models, the target spectrogram can be input into the scene recognition model; having been trained, the model performs feature traversal on the target spectrogram, automatically identifies its distinguishing features, and determines the corresponding target preset scene model from those features.
As can be seen from the above, the voice information processing method provided by this embodiment constructs preset scene models, each including a preset number of spectrograms; trains the spectrograms in the preset scene models to generate a corresponding scene recognition model; collects target voice information in the current environment and analyzes it to obtain the corresponding target spectrogram; and inputs the target spectrogram into the scene recognition model to determine the corresponding target preset scene model and match the corresponding call parameters. In this way, a scene recognition model capable of identifying the scene is trained from preset scene models built from a preset number of spectrograms; target voice information in the current environment is collected in real time and converted into a target spectrogram, which is input into the scene recognition model to identify the scene of the current environment; and suitable call parameters are matched according to that scene. This improves the processing efficiency of voice information and makes recognition of the call scene more accurate.
The method described in the above embodiment is further illustrated in detail below by way of example.
Referring to FIG. 2, which is another schematic flowchart of the voice information processing method provided by an embodiment of the present application, the method specifically includes the following steps.
In step S201, a preset amount of voice information is collected at a preset sampling rate.
An electronic device such as a mobile phone can collect, through a microphone at a sampling rate of 44.1 kHz (kilohertz), 500 pieces of voice information in a preset scene; the duration of each piece can be limited to 2 seconds, and the 2-second voice information is used as a voice input signal.
In step S202, the voice information is framed to obtain first framed data.
The voice input signal can be framed and windowed with a frame length of 1024, an overlap of 128, and a Hamming window, yielding the first framed data.
In step S203, a Fourier transform is performed on the first framed data to generate the spectrogram corresponding to the voice information, and a preset scene model is constructed from the spectrograms.
A Fourier transform is performed on the first framed data to calculate the energy density of the signal and generate a grayscale spectrogram. FIG. 3 is a schematic diagram of such a grayscale spectrogram: the abscissa is time, the ordinate is frequency, and the grayscale value carries the energy value. It can be seen that the spectrogram reflects the characteristics of the voice signal from multiple dimensions. From the 500 spectrograms of a preset scene, the preset scene model corresponding to that scene can be constructed; the model includes the 500 spectrograms of that scene. For example, a road scene includes 500 spectrograms, a subway scene includes 500 spectrograms, and so on.
In step S204, a convolutional neural network is used to train the spectrograms in the preset scene model to generate a corresponding scene recognition model.
A convolutional neural network (CNN) is a class of feedforward neural networks that involves convolution computations and has a deep structure; it is one of the representative algorithms of deep learning. The CNN can be trained on the spectrograms in the preset scene models to generate a scene recognition model that identifies distinguishing features; that is, the scene recognition model can automatically identify the distinguishing features in a spectrogram to determine the preset scene model to which the spectrogram belongs.
In step S205, target voice information in the current environment is collected and framed to obtain second framed data.
When the mobile phone is in a call, the target voice information in the current call environment can be collected through the microphone and framed and windowed, again with a frame length of 1024, an overlap of 128, and a Hamming window, yielding the second framed data.
In step S206, a Fourier transform is performed on the second framed data to obtain the target spectrogram corresponding to the target voice information.
The mobile phone performs a Fourier transform on the second framed data, calculates the energy density of the signal, and generates the corresponding target spectrogram. The target spectrogram can also be converted to grayscale: its abscissa is time, its ordinate is frequency, and its grayscale value represents the energy value, the same kind of features as the spectrogram of the preset scene model shown in FIG. 3.
In step S207, the target spectrogram is input into the scene recognition model, and feature traversal is performed on it through the scene recognition model to determine the target preset scene model corresponding to the target spectrogram.
The target spectrogram of the current call environment is input into the scene recognition model, which traverses its features one by one, identifies the corresponding target distinguishing features, and determines from them the target preset scene model to which the target spectrogram belongs.
In step S208, the corresponding call parameters are matched according to the target preset scene model.
The mobile phone associates different call parameters with each preset scene model so that, in the corresponding preset scene, the call proceeds with the best call parameters; for example, the road scene is associated with a first call parameter and the subway scene with a second call parameter, the two being different. Therefore, when the target preset scene model is the subway scene, the corresponding second call parameter is matched.
In step S209, corresponding prompt information is generated to prompt the user to adjust the call with the matched call parameters; when a confirmation instruction corresponding to the prompt information is received, the call is adjusted according to the matched call parameters.
When the mobile phone determines the second call parameter, it can generate corresponding prompt information, such as "Call with the call parameters suited to the current scene?"; the user can then select yes or no. When the user selects yes, a confirmation instruction is generated and received, and the call is adjusted according to the matched second call parameter.
As can be seen from the above, the voice information processing method provided by this embodiment collects a preset amount of voice information at a preset sampling rate; frames the voice information to obtain first framed data; performs a Fourier transform on the first framed data to generate the corresponding spectrograms and constructs preset scene models from them; trains the spectrograms in the preset scene models with a convolutional neural network to generate the corresponding scene recognition model; collects target voice information in the current environment and analyzes it to obtain the corresponding target spectrogram; and inputs the target spectrogram into the scene recognition model to determine the corresponding target preset scene model and match the corresponding call parameters. In this way, a scene recognition model capable of identifying the scene is trained from preset scene models built from a preset number of spectrograms; target voice information in the current environment is collected in real time and converted into a target spectrogram, which is input into the scene recognition model to identify the scene of the current environment; and suitable call parameters are matched according to that scene. This improves the processing efficiency of voice information and makes recognition of the call scene more accurate.
To facilitate implementation of the voice information processing method provided by the embodiments of the present application, an embodiment of the present application further provides an apparatus based on that method. The terms have the same meanings as in the method above; for implementation details, refer to the description in the method embodiments.
An embodiment of the present invention provides a voice information processing apparatus, including:
a construction unit configured to construct a preset scene model, the preset scene model including a preset number of spectrograms;
a training unit configured to train the spectrograms in the preset scene model to generate a corresponding scene recognition model;
an analysis unit configured to collect target voice information in the current environment and analyze the target voice information to obtain a target spectrogram corresponding to the target voice information; and
an input unit configured to input the target spectrogram into the scene recognition model to determine the corresponding target preset scene model, and to match corresponding call parameters according to the target preset scene model.
In one implementation, the construction unit may include a collection subunit and a conversion subunit. The collection subunit is configured to collect a preset amount of voice information at a preset sampling rate; the conversion subunit is configured to convert the preset amount of voice information into corresponding spectrograms and construct a preset scene model from the spectrograms.
In one implementation, the conversion subunit is specifically configured to: frame the voice information to obtain first framed data; and perform a Fourier transform on the first framed data to generate the spectrogram corresponding to the voice information, a preset scene model being constructed from the spectrograms.
In one implementation, the training unit is specifically configured to use a convolutional neural network to train the spectrograms in the preset scene model to generate a corresponding scene recognition model.
In one implementation, the analysis unit is specifically configured to: collect target voice information in the current environment and frame the target voice information to obtain second framed data; and perform a Fourier transform on the second framed data to obtain the target spectrogram corresponding to the target voice information.
Referring to FIG. 4, which is a schematic block diagram of the voice information processing apparatus provided by an embodiment of the present application, the apparatus 300 includes a construction unit 31, a training unit 32, an analysis unit 33, and an input unit 34.
The construction unit 31 is configured to construct a preset scene model, the preset scene model including a preset number of spectrograms.
The construction unit 31 can collect a preset amount of voice information in a specific scene and convert it into corresponding spectrograms, in which the abscissa is time, the ordinate is frequency, and the color depth represents the energy of the voice data. Because the spectrogram expresses the characteristics of the voice information in multiple dimensions, a preset scene model can be constructed from the multiple spectrograms.
The training unit 32 is configured to train the spectrograms in the preset scene model to generate a corresponding scene recognition model.
Because each preset scene model contains a preset number of spectrograms of the corresponding scene, the training unit 32 can use machine learning to train on those spectrograms and generate a scene recognition model that can recognize the scene.
In one implementation, the training unit 32 may train on the spectrograms in the preset scene models through a convolutional neural network, producing a scene recognition model that can automatically identify the distinguishing features of the corresponding scene.
The analysis unit 33 is configured to collect target voice information in the current environment and analyze the target voice information to obtain the target spectrogram corresponding to the target voice information.
When the electronic device is in a call, the analysis unit 33 automatically collects the target voice information in the current environment through the microphone and converts it into the corresponding target spectrogram, in which the abscissa is time, the ordinate is frequency, and the color depth represents the energy of the voice data. It should be noted that the features of the target spectrogram are of the same kind as those of the spectrograms in the preset scene models.
In some implementations, the analysis unit 33 is specifically configured to: collect target voice information in the current environment and frame the target voice information to obtain second framed data; and perform a Fourier transform on the second framed data to obtain the target spectrogram corresponding to the target voice information.
The input unit 34 is configured to input the target spectrogram into the scene recognition model to determine the corresponding target preset scene model, and to match corresponding call parameters according to the target preset scene model.
Because the three-dimensional features of the target spectrogram (abscissa: time; ordinate: frequency; grayscale value: energy) are of the same kind as those of the spectrograms in the preset scene models, the input unit 34 can input the target spectrogram into the scene recognition model. The model traverses the features of the target spectrogram one by one and identifies the corresponding target preset scene model, such as a subway scene; the call parameters adapted to that scene model are then matched according to the target preset scene model, so that the call can proceed with parameters suited to the current environment, improving the user's call efficiency.
In some implementations, the input unit 34 is specifically configured to: input the target spectrogram into the scene recognition model; and perform feature traversal on the target spectrogram through the scene recognition model to determine the corresponding target preset scene model, the corresponding call parameters being matched according to the target preset scene model.
Reference may also be made to FIG. 5, another schematic block diagram of the voice information processing apparatus provided by an embodiment of the present application. In the apparatus 300, the construction unit 31 may include a collection subunit 311 and a conversion subunit 312.
Further, the collection subunit 311 is configured to collect a preset amount of voice information at a preset sampling rate, and the conversion subunit 312 is configured to convert the preset amount of voice information into corresponding spectrograms and construct a preset scene model from the spectrograms.
In some implementations, the conversion subunit 312 is specifically configured to: frame the voice information to obtain first framed data; and perform a Fourier transform on the first framed data to generate the spectrogram corresponding to the voice information, a preset scene model being constructed from the spectrograms.
An embodiment of the present application further provides an electronic device. Referring to FIG. 6, the electronic device 500 includes a processor 501 and a memory 502, the processor 501 being electrically connected to the memory 502.
The processor 501 is the control center of the electronic device 500. It connects the various parts of the entire electronic device through various interfaces and lines, and, by running or loading the computer program stored in the memory 502 and calling the data stored in the memory 502, executes the various functions of the electronic device 500 and processes data, thereby monitoring the electronic device 500 as a whole.
The memory 502 can be used to store software programs and modules; the processor 501 executes various functional applications and data processing by running the computer programs and modules stored in the memory 502. The memory 502 may mainly include a program storage area and a data storage area: the program storage area may store the operating system and the computer programs required by at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created through the use of the electronic device. In addition, the memory 502 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. Accordingly, the memory 502 may further include a memory controller to provide the processor 501 with access to the memory 502.
In this embodiment of the application, the processor 501 in the electronic device 500 loads the instructions corresponding to the processes of one or more computer programs into the memory 502 as follows, and runs the computer programs stored in the memory 502, thereby implementing various functions:
constructing a preset scene model, the preset scene model including a preset number of spectrograms;
training the spectrograms in the preset scene model to generate a corresponding scene recognition model;
collecting target voice information in the current environment and analyzing the target voice information to obtain a target spectrogram corresponding to the target voice information; and
inputting the target spectrogram into the scene recognition model to determine the corresponding target preset scene model, and matching corresponding call parameters according to the target preset scene model.
In some implementations, when constructing the preset scene model, the processor 501 may specifically perform the following steps:
collecting a preset amount of voice information at a preset sampling rate;
converting the preset amount of voice information into corresponding spectrograms, and constructing a preset scene model from the spectrograms.
In some implementations, when converting the preset amount of voice information into corresponding spectrograms, the processor 501 may specifically perform the following steps:
framing the voice information to obtain first framed data;
performing a Fourier transform on the first framed data to generate the spectrogram corresponding to the voice information.
In some implementations, when training the spectrograms in the preset scene model to generate a corresponding scene recognition model, the processor 501 may specifically perform the following step:
using a convolutional neural network to train the spectrograms in the preset scene model to generate the corresponding scene recognition model.
In some implementations, when analyzing the target voice information to obtain the corresponding target spectrogram, the processor 501 may specifically perform the following steps:
framing the target voice information to obtain second framed data;
performing a Fourier transform on the second framed data to obtain the target spectrogram corresponding to the target voice information.
In some implementations, when inputting the target spectrogram into the scene recognition model to determine the corresponding target preset scene model, the processor 501 may specifically perform the following steps:
inputting the target spectrogram into the scene recognition model;
performing feature traversal on the target spectrogram through the scene recognition model to determine the target preset scene model corresponding to the target spectrogram.
In some implementations, after matching the corresponding call parameters according to the target preset scene model, the processor 501 may further perform the following steps:
generating corresponding prompt information to prompt the user to adjust the call with the matched call parameters;
when a confirmation instruction corresponding to the prompt information is received, adjusting the call according to the matched call parameters.
Referring also to FIG. 7, in some implementations the electronic device 500 may further include a display 503, a radio frequency circuit 504, an audio circuit 505, and a power supply 506, each of which is electrically connected to the processor 501.
The display 503 can be used to display information input by the user or provided to the user, as well as various graphical user interfaces composed of graphics, text, icons, video, and any combination thereof. The display 503 may include a display panel, which in some implementations may be configured in the form of a liquid crystal display (LCD) or an organic light-emitting diode (OLED) display.
The radio frequency circuit 504 can be used to transmit and receive radio frequency signals, so as to establish wireless communication with network devices or other electronic devices and to exchange signals with them.
The audio circuit 505 can be used to provide an audio interface between the user and the electronic device through a speaker and a microphone.
The power supply 506 can be used to power the various components of the electronic device 500. In some embodiments, the power supply 506 may be logically connected to the processor 501 through a power management system, so that functions such as charging, discharging, and power-consumption management are implemented through the power management system.
Although not shown in FIG. 7, the electronic device 500 may further include a camera, a Bluetooth module, and the like, which are not described again here.
An embodiment of the present application further provides a storage medium storing a computer program; when the computer program runs on a computer, the computer is caused to execute the voice information processing method of any of the foregoing embodiments, for example: constructing a preset scene model, the preset scene model including a preset number of spectrograms; training the spectrograms in the preset scene model to generate a corresponding scene recognition model; collecting target voice information in the current environment and analyzing the target voice information to obtain a target spectrogram corresponding to the target voice information; and inputting the target spectrogram into the scene recognition model to determine the corresponding target preset scene model, and matching corresponding call parameters according to the target preset scene model.
In this embodiment of the application, the storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
In the above embodiments, the descriptions each have their own emphasis; for parts not detailed in one embodiment, reference may be made to the related descriptions of other embodiments.
It should be noted that, for the voice information processing method of the embodiments of the present application, a person of ordinary skill in the art can understand that all or part of the flow of implementing the method may be completed by controlling related hardware through a computer program. The computer program may be stored in a computer-readable storage medium, such as the memory of an electronic device, and executed by at least one processor in the electronic device; the execution process may include the flow of the embodiments of the voice information processing method. The storage medium may be a magnetic disk, an optical disc, a read-only memory, a random access memory, or the like.
For the voice information processing apparatus of the embodiments of the present application, the functional modules may be integrated into one processing chip, each module may exist alone physically, or two or more modules may be integrated into one module. The above integrated modules may be implemented in the form of hardware or in the form of software function modules. If an integrated module is implemented in the form of a software function module and sold or used as an independent product, it may also be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc.
The voice information processing method and apparatus, storage medium, and electronic device provided by the embodiments of the present application have been described in detail above. Specific examples have been used herein to explain the principles and implementations of the present application, and the descriptions of the above embodiments are only intended to help understand the method of the present application and its core idea. Meanwhile, those skilled in the art may, following the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (20)

  1. A voice information processing method, comprising:
    constructing a preset scene model, the preset scene model including a preset number of spectrograms;
    training the spectrograms in the preset scene model to generate a corresponding scene recognition model;
    collecting target voice information in the current environment and analyzing the target voice information to obtain a target spectrogram corresponding to the target voice information; and
    inputting the target spectrogram into the scene recognition model to determine the corresponding target preset scene model, and matching corresponding call parameters according to the target preset scene model.
  2. The voice information processing method of claim 1, wherein the step of constructing a preset scene model comprises:
    collecting a preset amount of voice information at a preset sampling rate; and
    converting the preset amount of voice information into corresponding spectrograms, and constructing a preset scene model from the spectrograms.
  3. The voice information processing method of claim 2, wherein the step of converting the preset amount of voice information into corresponding spectrograms comprises:
    framing the voice information to obtain first framed data; and
    performing a Fourier transform on the first framed data to generate the spectrogram corresponding to the voice information.
  4. The voice information processing method of claim 1, wherein the step of training the spectrograms in the preset scene model to generate a corresponding scene recognition model comprises:
    using a convolutional neural network to train the spectrograms in the preset scene model to generate the corresponding scene recognition model.
  5. The voice information processing method of any one of claims 1 to 4, wherein the step of analyzing the target voice information to obtain the target spectrogram corresponding to the target voice information comprises:
    framing the target voice information to obtain second framed data; and
    performing a Fourier transform on the second framed data to obtain the target spectrogram corresponding to the target voice information.
  6. The voice information processing method of claim 5, wherein the step of inputting the target spectrogram into the scene recognition model to determine the corresponding target preset scene model comprises:
    inputting the target spectrogram into the scene recognition model; and
    performing feature traversal on the target spectrogram through the scene recognition model to determine the target preset scene model corresponding to the target spectrogram.
  7. The voice information processing method of claim 1, further comprising, after the step of matching corresponding call parameters according to the target preset scene model:
    generating corresponding prompt information to prompt the user to adjust the call with the matched call parameters; and
    when a confirmation instruction corresponding to the prompt information is received, adjusting the call according to the matched call parameters.
  8. A voice information processing apparatus, comprising:
    a construction unit configured to construct a preset scene model, the preset scene model including a preset number of spectrograms;
    a training unit configured to train the spectrograms in the preset scene model to generate a corresponding scene recognition model;
    an analysis unit configured to collect target voice information in the current environment and analyze the target voice information to obtain a target spectrogram corresponding to the target voice information; and
    an input unit configured to input the target spectrogram into the scene recognition model to determine the corresponding target preset scene model, and to match corresponding call parameters according to the target preset scene model.
  9. The voice information processing apparatus of claim 8, wherein the construction unit comprises:
    a collection subunit configured to collect a preset amount of voice information at a preset sampling rate; and
    a conversion subunit configured to convert the preset amount of voice information into corresponding spectrograms and construct a preset scene model from the spectrograms.
  10. The voice information processing apparatus of claim 9, wherein the conversion subunit is specifically configured to:
    frame the voice information to obtain first framed data; and
    perform a Fourier transform on the first framed data to generate the spectrogram corresponding to the voice information, a preset scene model being constructed from the spectrograms.
  11. The voice information processing apparatus of claim 8, wherein the training unit is specifically configured to:
    use a convolutional neural network to train the spectrograms in the preset scene model to generate a corresponding scene recognition model.
  12. The voice information processing apparatus of any one of claims 8 to 11, wherein the analysis unit is specifically configured to:
    collect target voice information in the current environment and frame the target voice information to obtain second framed data; and
    perform a Fourier transform on the second framed data to obtain the target spectrogram corresponding to the target voice information.
  13. A storage medium on which a computer program is stored, wherein, when the computer program runs on a computer, the computer is caused to execute the voice information processing method of claim 1.
  14. An electronic device, comprising a processor and a memory storing a computer program, wherein the processor, by calling the computer program, is configured to perform the steps of:
    constructing a preset scene model, the preset scene model including a preset number of spectrograms;
    training the spectrograms in the preset scene model to generate a corresponding scene recognition model;
    collecting target voice information in the current environment and analyzing the target voice information to obtain a target spectrogram corresponding to the target voice information; and
    inputting the target spectrogram into the scene recognition model to determine the corresponding target preset scene model, and matching corresponding call parameters according to the target preset scene model.
  15. The electronic device of claim 14, wherein the processor, by calling the computer program, is configured to perform the steps of:
    collecting a preset amount of voice information at a preset sampling rate; and
    converting the preset amount of voice information into corresponding spectrograms, and constructing a preset scene model from the spectrograms.
  16. The electronic device of claim 15, wherein the processor, by calling the computer program, is configured to perform the steps of:
    framing the voice information to obtain first framed data; and
    performing a Fourier transform on the first framed data to generate the spectrogram corresponding to the voice information.
  17. The electronic device of claim 14, wherein the processor, by calling the computer program, is configured to perform the step of:
    using a convolutional neural network to train the spectrograms in the preset scene model to generate a corresponding scene recognition model.
  18. The electronic device of claim 14, wherein the processor, by calling the computer program, is configured to perform the steps of:
    framing the target voice information to obtain second framed data; and
    performing a Fourier transform on the second framed data to obtain the target spectrogram corresponding to the target voice information.
  19. The electronic device of claim 18, wherein the processor, by calling the computer program, is configured to perform the steps of:
    inputting the target spectrogram into the scene recognition model; and
    performing feature traversal on the target spectrogram through the scene recognition model to determine the target preset scene model corresponding to the target spectrogram.
  20. The electronic device of claim 14, wherein the processor, by calling the computer program, is further configured to perform the steps of:
    generating corresponding prompt information to prompt the user to adjust the call with the matched call parameters; and
    when a confirmation instruction corresponding to the prompt information is received, adjusting the call according to the matched call parameters.
PCT/CN2018/116447 2018-11-20 2018-11-20 Voice information processing method and apparatus, storage medium, and electronic device WO2020102979A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2018/116447 WO2020102979A1 (zh) 2018-11-20 2018-11-20 Voice information processing method and apparatus, storage medium, and electronic device
CN201880098316.5A CN112771608A (zh) 2018-11-20 2018-11-20 Voice information processing method and apparatus, storage medium, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/116447 WO2020102979A1 (zh) 2018-11-20 2018-11-20 Voice information processing method and apparatus, storage medium, and electronic device

Publications (1)

Publication Number Publication Date
WO2020102979A1 (zh)

Family

ID=70773731

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/116447 WO2020102979A1 (zh) 2018-11-20 2018-11-20 Voice information processing method and apparatus, storage medium, and electronic device

Country Status (2)

Country Link
CN (1) CN112771608A (zh)
WO (1) WO2020102979A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113370923B (zh) * 2021-07-23 2023-11-03 深圳市元征科技股份有限公司 Vehicle configuration adjustment method and apparatus, electronic device, and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103632682A (zh) * 2013-11-20 2014-03-12 安徽科大讯飞信息科技股份有限公司 Audio feature detection method
CN103903616A (zh) * 2012-12-25 2014-07-02 联想(北京)有限公司 Information processing method and electronic device
CN105845131A (zh) * 2016-04-11 2016-08-10 乐视控股(北京)有限公司 Far-field speech recognition method and device
CN108764304A (zh) * 2018-05-11 2018-11-06 Oppo广东移动通信有限公司 Scene recognition method and apparatus, storage medium, and electronic device

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102360187B (zh) * 2011-05-25 2013-06-05 吉林大学 Driver Chinese speech control system and method based on spectrogram cross-correlation
US9165565B2 (en) * 2011-09-09 2015-10-20 Adobe Systems Incorporated Sound mixture recognition
CN105810197B (zh) * 2014-12-30 2019-07-26 联想(北京)有限公司 Voice processing method, voice processing apparatus, and electronic device
CN105208174A (zh) * 2015-09-06 2015-12-30 上海智臻智能网络科技股份有限公司 Voice communication method, apparatus, and dialing system
CN106558318B (zh) * 2015-09-24 2020-04-28 阿里巴巴集团控股有限公司 Audio recognition method and system
CN106201312A (zh) * 2016-06-30 2016-12-07 北京奇虎科技有限公司 Application processing method, apparatus, and terminal

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103903616A (zh) * 2012-12-25 2014-07-02 联想(北京)有限公司 Information processing method and electronic device
CN103632682A (zh) * 2013-11-20 2014-03-12 安徽科大讯飞信息科技股份有限公司 Audio feature detection method
CN105845131A (zh) * 2016-04-11 2016-08-10 乐视控股(北京)有限公司 Far-field speech recognition method and device
CN108764304A (zh) * 2018-05-11 2018-11-06 Oppo广东移动通信有限公司 Scene recognition method and apparatus, storage medium, and electronic device

Also Published As

Publication number Publication date
CN112771608A (zh) 2021-05-07

Similar Documents

Publication Publication Date Title
CN109087669B Audio similarity detection method and apparatus, storage medium, and computer device
US11798531B2 Speech recognition method and apparatus, and method and apparatus for training speech recognition model
CN107705778B Audio processing method and apparatus, storage medium, and terminal
JP5996783B2 Method and terminal for updating a voiceprint feature model
WO2018219105A1 Speech recognition method and related products
CN105489221A Speech recognition method and device
CN108922525B Voice processing method and apparatus, storage medium, and electronic device
CN108810280B Processing method and apparatus for voice collection frequency, storage medium, and electronic device
CN110265011B Interaction method for an electronic device and the electronic device
US10783884B2 Electronic device-awakening method and apparatus, device and computer-readable storage medium
CN107871494B Speech synthesis method and apparatus, and electronic device
WO2020249038A1 Audio stream processing method and apparatus, mobile terminal, and storage medium
CN111883091A Audio noise reduction method and training method for an audio noise reduction model
CN111739545B Audio processing method and apparatus, and storage medium
WO2020057624A1 Speech recognition method and apparatus
CN111081275B Terminal processing method and apparatus based on sound analysis, storage medium, and terminal
CN110931028A Voice processing method, apparatus, and electronic device
CN108600559B Silent mode control method and apparatus, storage medium, and electronic device
WO2020102979A1 Voice information processing method and apparatus, storage medium, and electronic device
WO2022147692A1 Voice instruction recognition method, electronic device, and non-transitory computer-readable storage medium
CN113611318A Audio data augmentation method and related device
CN110580910B Audio processing method, apparatus, device, and readable storage medium
WO2019242415A1 Position prompting method and apparatus, storage medium, and electronic device
WO2022213943A1 Message sending method, message sending apparatus, electronic device, and storage medium
CN111863006A Audio signal processing method, audio signal processing apparatus, and earphone

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18940994

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18940994

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 29.09.2021)
