WO2020102979A1 - Voice information processing method and apparatus, storage medium, and electronic device - Google Patents
Voice information processing method and apparatus, storage medium, and electronic device
- Publication number
- WO2020102979A1 (PCT/CN2018/116447)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- target
- spectrogram
- voice information
- preset
- model
- Prior art date
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/065—Adaptation
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
Definitions
- the present application relates to the technical field of electronic equipment, and in particular, to a voice information processing method, device, storage medium, and electronic equipment.
- when a mobile phone is in a call state, it can collect the voice information in the current call environment in real time, analyze the noise level in that voice information, and adjust the call volume accordingly, so that the call volume is adjusted automatically as the noisiness of the call environment changes. However, only the call volume is processed, and only according to the noise level of the voice information.
- this processing method is simplistic, poorly targeted to the call scene, and inefficient at processing voice information.
- Embodiments of the present application provide a voice information processing method, device, storage medium, and electronic equipment, which can improve the processing efficiency of voice information.
- an embodiment of the present application provides a method for processing voice information, including:
- constructing a preset scene model, the preset scene model including a preset number of spectrograms;
- training the spectrograms in the preset scene model to generate a corresponding scene recognition model;
- collecting target voice information in the current environment, and analyzing it to obtain a target spectrogram corresponding to the target voice information;
- inputting the target spectrogram into the scene recognition model to determine the corresponding target preset scene model, and matching the corresponding call parameters according to the target preset scene model.
- an embodiment of the present application provides a voice information processing apparatus, including:
- a construction unit configured to construct a preset scene model, the preset scene model including a preset number of spectrograms;
- a training unit configured to train the spectrograms in the preset scene model to generate a corresponding scene recognition model;
- an analysis unit configured to collect target voice information in the current environment and analyze the target voice information to obtain a target spectrogram corresponding to the target voice information;
- an input unit configured to input the target spectrogram into the scene recognition model to determine a corresponding target preset scene model, and match corresponding call parameters according to the target preset scene model.
- a storage medium provided by an embodiment of the present application has a computer program stored thereon; when the computer program runs on a computer, the computer is caused to execute the voice information processing method provided in any embodiment of the present application.
- an electronic device provided by an embodiment of the present application includes a processor and a memory, the memory storing a computer program, wherein the processor is configured to perform the following steps by calling the computer program:
- constructing a preset scene model, the preset scene model including a preset number of spectrograms;
- training the spectrograms in the preset scene model to generate a corresponding scene recognition model;
- collecting target voice information in the current environment, and analyzing it to obtain a target spectrogram corresponding to the target voice information;
- inputting the target spectrogram into the scene recognition model to determine the corresponding target preset scene model, and matching the corresponding call parameters according to the target preset scene model.
- FIG. 1 is a schematic flowchart of a method for processing voice information provided by an embodiment of the present application.
- FIG. 2 is another schematic flowchart of a method for processing voice information provided by an embodiment of the present application.
- FIG. 3 is a schematic diagram of a scenario of a method for processing voice information provided by an embodiment of the present application.
- FIG. 4 is a schematic block diagram of a device for processing voice information provided by an embodiment of the present application.
- FIG. 5 is another schematic block diagram of a voice information processing apparatus provided by an embodiment of the present application.
- FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
- FIG. 7 is another schematic structural diagram of an electronic device provided by an embodiment of the present application.
- the term "module" as used herein may be regarded as a software object executed on the computing system.
- the different components, modules, engines, and services described herein may be regarded as objects implemented on the computing system.
- the apparatus and method described herein are preferably implemented in software, but may of course also be implemented in hardware; both fall within the protection scope of the present application.
- An embodiment of the present application provides a method for processing voice information.
- the execution subject of the voice information processing method may be the voice information processing apparatus provided in the embodiments of the present application, or an electronic device integrated with the voice information processing apparatus, where the voice information processing apparatus may be implemented in hardware or software.
- the electronic device may be a smart phone, a tablet computer, a PDA (Personal Digital Assistant) and so on.
- An embodiment of the present application provides a method for processing voice information, including:
- constructing a preset scene model, the preset scene model including a preset number of spectrograms;
- training the spectrograms in the preset scene model to generate a corresponding scene recognition model;
- collecting target voice information in the current environment, and analyzing it to obtain a target spectrogram corresponding to the target voice information;
- inputting the target spectrogram into the scene recognition model to determine the corresponding target preset scene model, and matching the corresponding call parameters according to the target preset scene model.
- the step of constructing a preset scene model may include: collecting a preset amount of voice information at a preset sampling rate; converting the preset amount of voice information into corresponding spectrograms, and constructing the preset scene model according to the spectrograms.
- the step of converting the preset amount of voice information into corresponding spectrograms may include: framing the voice information to obtain first framed data; and performing a Fourier transform on the first framed data to generate the spectrogram corresponding to the voice information.
- the step of training the spectrograms in the preset scene model to generate a corresponding scene recognition model may include: using a convolutional neural network to train the spectrograms in the preset scene model to generate the corresponding scene recognition model.
- the step of analyzing the target voice information to obtain a target spectrogram corresponding to the target voice information may include: framing the target voice information to obtain second framed data; and performing a Fourier transform on the second framed data to obtain the target spectrogram corresponding to the target voice information.
- the step of inputting the target spectrogram into the scene recognition model to determine the corresponding target preset scene model may include: inputting the target spectrogram into the scene recognition model; and performing feature traversal on the target spectrogram through the scene recognition model to determine the target preset scene model corresponding to the target spectrogram.
- after the step of matching corresponding call parameters according to the target preset scene model, the method may further include: generating corresponding prompt information to prompt the user to adjust the call with the matched call parameters; and, upon receiving a confirmation instruction corresponding to the prompt information, adjusting the call according to the matched call parameters.
- FIG. 1 is a schematic flowchart of a voice information processing method according to an embodiment of the present application.
- the voice information processing method may include the following steps:
- in step S101, a preset scene model is constructed.
- the preset scene model corresponds to a scene in which the user may be making a call, such as a road scene, a subway scene, a strong-wind scene, a rain scene, or a scene noisy with human voices.
- different call parameters can be associated with different scenes, such as different noise reduction, equalizer, and sound-smoothing processing for call voice in different scenes, so that in each scene better call parameters are adopted to achieve a better call effect.
- the electronic device can collect a preset amount of voice information in a specific scene and convert the preset amount of voice information into a corresponding spectrogram.
- the abscissa of the spectrogram is time, the ordinate is frequency, and the depth of color represents the energy of the voice data.
- the spectrogram can thus express the characteristics of the voice information in multiple dimensions, so a preset scene model can be constructed from multiple spectrograms.
- the step of constructing the preset scene model may include:
- the electronic device can collect a preset amount of voice information in a preset scene through a microphone at a preset sampling rate, such as 44.1 kHz, intercept 2 seconds of voice content from each piece of voice information as an input signal, convert the multiple input signals into corresponding spectrograms, and construct the converted spectrograms into a preset scene model. The constructed preset scene model contains multiple spectrograms for the corresponding scene and can reflect the voice features of that scene.
- the step of converting the preset amount of voice information into a corresponding spectrogram may include:
- the input signal is framed and windowed with a frame length of 1024 samples, an overlap of 128 samples, and a Hamming window function to obtain the first framed data; a Fourier transform is then performed on the first framed data to calculate the energy density of the signal and generate the spectrogram.
- the spectrogram can be processed in grayscale: the abscissa of the spectrogram is time, the ordinate is frequency, and the grayscale value represents the energy value.
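- to make the framing and transform steps above concrete, the following is a minimal sketch in Python (NumPy), reading "overlap 128" as 128 overlapping samples between consecutive 1024-sample frames; the function name and the normalization details are illustrative assumptions, not taken from the publication.

```python
import numpy as np

def to_spectrogram(signal, frame_length=1024, overlap=128):
    """Frame, window (Hamming), and Fourier-transform a 1-D speech signal,
    returning a grayscale spectrogram whose gray value encodes energy."""
    hop = frame_length - overlap                 # step between frame starts
    window = np.hamming(frame_length)
    rows = []
    for start in range(0, len(signal) - frame_length + 1, hop):
        frame = signal[start:start + frame_length] * window
        spectrum = np.fft.rfft(frame)            # one-sided Fourier transform
        rows.append(np.abs(spectrum) ** 2)       # energy density per frequency bin
    # abscissa: time (columns), ordinate: frequency (rows), as in the text
    spec = 10.0 * np.log10(np.array(rows).T + 1e-10)
    # scale to 0..255 so the grayscale value represents the energy value
    spec = (spec - spec.min()) / (spec.max() - spec.min() + 1e-10) * 255.0
    return spec.astype(np.uint8)
```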
- in step S102, the spectrograms in the preset scene model are trained to generate a corresponding scene recognition model.
- since each preset scene model contains a preset number of spectrograms for the corresponding scene, machine learning can be used to train and learn on that preset number of spectrograms to generate a scene recognition model that can identify the scene.
- a convolutional neural network may be used to train on the spectrograms in the preset scene model to generate a scene recognition model that can automatically identify the distinguishing features of the corresponding scene.
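- as an illustration of this training step, here is a hedged sketch using PyTorch; the publication does not specify an architecture, so the layer sizes, optimizer, and hyperparameters below are assumptions.

```python
import torch
import torch.nn as nn

class SceneCNN(nn.Module):
    """Small CNN that classifies grayscale spectrograms into preset scenes."""
    def __init__(self, num_scenes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.LazyLinear(num_scenes)  # infers input size on first call

    def forward(self, x):            # x: (batch, 1, freq_bins, time_steps)
        return self.classifier(self.features(x).flatten(1))

def train_scene_model(model, loader, epochs=10):
    """loader yields (spectrogram batch, scene-label batch) pairs."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for spectrograms, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(spectrograms), labels)
            loss.backward()
            optimizer.step()
    return model
```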
- in step S103, target voice information in the current environment is collected, and the target voice information is analyzed to obtain a target spectrogram corresponding to the target voice information.
- when the electronic device is in a call state, the user often wants to talk with the best call parameters to ensure the best call effect.
- at present, the user can only manually select the corresponding call parameters, which is a cumbersome process, and automatic adjustment can often only change the loudness of the call sound according to the noise level of the environment, so both the adjustment behavior and the processing of the voice information are relatively crude.
- when the electronic device is in a call state, it automatically collects the target voice information in the current environment through the microphone and converts the target voice information into the corresponding target spectrogram.
- in the target spectrogram, the abscissa is time, the ordinate is frequency, and the depth of color represents the energy of the voice data. It should be noted that the features of the target spectrogram are of the same kind as the features of the spectrograms in the preset scene model.
- the step of analyzing the target voice information to obtain a target spectrogram corresponding to the target voice information may include:
- the target voice information can be framed and windowed with a frame length of 1024 samples, an overlap of 128 samples, and a Hamming window function to obtain the second framed data; a Fourier transform is then performed on the second framed data, the energy density of the signal is calculated, and the corresponding target spectrogram is generated.
- the target spectrogram can also be processed in grayscale: its abscissa is time, its ordinate is frequency, and the grayscale value represents the energy value, so it has the same kind of features as the spectrograms in the preset scene model.
- in step S104, the target spectrogram is input into the scene recognition model to determine the corresponding target preset scene model, and the corresponding call parameters are matched according to the target preset scene model.
- because the three-dimensional features of the target spectrogram (abscissa: time; ordinate: frequency; gray value: energy) are of the same kind as the features of the spectrograms in the preset scene model, the target spectrogram can be input into the scene recognition model. The scene recognition model traverses the features of the target spectrogram one by one and identifies the target preset scene model corresponding to it, such as a subway scene, and the call parameters adapted to that scene model are matched according to the target preset scene model, so that the call can proceed with call parameters adapted to the current environment, improving the user's call quality and efficiency.
- the step of inputting the target spectrogram into the scene recognition model to determine the corresponding target preset scene model may include:
- the target spectrogram can be input into the scene recognition model; the scene recognition model performs feature traversal, automatically identifies the distinguishing features, and determines the corresponding target preset scene model according to those features.
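- for illustration, a sketch of how the recognized scene could be mapped to call parameters follows; the scene names, parameter fields, and values below are assumptions made for the example, not taken from the publication.

```python
import torch  # `model` is a trained SceneCNN from the earlier sketch

# Hypothetical scene-to-parameter table; real values would be tuned per device.
CALL_PARAMETERS = {
    "road":   {"noise_reduction": "high", "equalizer": "speech",  "smoothing": 0.6},
    "subway": {"noise_reduction": "max",  "equalizer": "low_cut", "smoothing": 0.8},
    "wind":   {"noise_reduction": "high", "equalizer": "wind_cut", "smoothing": 0.7},
}
SCENES = list(CALL_PARAMETERS)  # index order must match the training labels

def match_call_parameters(model, target_spectrogram):
    """Classify the target spectrogram and return the matched call parameters."""
    x = torch.from_numpy(target_spectrogram).float().div(255.0)
    x = x.unsqueeze(0).unsqueeze(0)          # shape (1, 1, freq_bins, time_steps)
    with torch.no_grad():
        scene_index = model(x).argmax(dim=1).item()
    scene = SCENES[scene_index]
    return scene, CALL_PARAMETERS[scene]
```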
- in the voice information processing method provided by this embodiment, a preset scene model including a preset number of spectrograms is constructed; the spectrograms in the preset scene model are trained to generate a corresponding scene recognition model; target voice information in the current environment is collected and analyzed to obtain a target spectrogram corresponding to it; and the target spectrogram is input into the scene recognition model to determine the corresponding target preset scene model, with the corresponding call parameters matched according to that model.
- in this way, a preset scene model built from a preset number of spectrograms is trained into a scene recognition model that can identify the scene; the target voice information in the current environment is collected in real time and converted into a target spectrogram, which is input into the scene recognition model to identify the scene model of the current environment and match the appropriate call parameters, improving the efficiency of voice information processing and making the recognition of the call scene more accurate.
- FIG. 2 is another schematic flowchart of a method for processing voice information according to an embodiment of the present application.
- the method includes:
- in step S201, a preset amount of voice information is collected at a preset sampling rate.
- an electronic device such as a mobile phone can collect 500 voice recordings in a preset scene through a microphone at a sampling rate of 44.1 kHz, and the duration of each recording can be limited to 2 seconds, to be used as voice input signals.
- in step S202, the voice information is framed to obtain first framed data.
- the voice input signal can be framed and windowed with a frame length of 1024 samples, an overlap of 128 samples, and a Hamming window function to obtain the first framed data.
- in step S203, a Fourier transform is performed on the first framed data to generate the spectrogram corresponding to the voice information, and a preset scene model is constructed according to the spectrograms.
- FIG. 3 is a schematic diagram of the grayscale spectrogram: the abscissa is time, the ordinate is frequency, and the gray value is the energy value. It can be seen that the spectrogram reflects the characteristics of the voice signal from multiple dimensions.
- a preset scene model corresponding to the preset scene is constructed.
- each preset scene model includes 500 spectrograms for its preset scene; for example, the road scene includes 500 spectrograms, the subway scene includes another 500 spectrograms, and so on.
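- a sketch of assembling these per-scene spectrogram sets follows, under the figures given in the text (500 clips per scene, 44.1 kHz, 2-second excerpts); the directory layout and the use of the soundfile library are assumptions, and to_spectrogram is the helper sketched earlier.

```python
import os
import numpy as np
import soundfile as sf  # assumed audio loader; any WAV reader would do

SAMPLE_RATE = 44100   # 44.1 kHz sampling rate from the text
CLIP_SECONDS = 2      # 2-second voice input signals

def build_scene_model(scene_dir):
    """Convert every recording for one preset scene into a spectrogram.
    Assumes each file holds at least 2 seconds of audio at 44.1 kHz."""
    spectrograms = []
    for name in sorted(os.listdir(scene_dir)):       # e.g. 500 files per scene
        audio, rate = sf.read(os.path.join(scene_dir, name))
        if audio.ndim > 1:                           # mix stereo down to mono
            audio = audio.mean(axis=1)
        assert rate == SAMPLE_RATE
        clip = audio[: SAMPLE_RATE * CLIP_SECONDS]   # keep a 2-second excerpt
        spectrograms.append(to_spectrogram(clip))
    return np.stack(spectrograms)                    # the "preset scene model"
```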
- in step S204, a convolutional neural network is used to train on the spectrograms in the preset scene model to generate a corresponding scene recognition model.
- a convolutional neural network (CNN) is a type of feedforward neural network that contains convolution computations and has a deep structure; it is one of the representative algorithms of deep learning.
- through the CNN, the spectrograms in the preset scene model can be trained to produce a scene recognition model that can identify distinguishing features; that is, the scene recognition model can automatically identify the distinguishing features in a spectrogram to determine the preset scene model to which the spectrogram belongs.
- in step S205, the target voice information in the current environment is collected, and the target voice information is framed to obtain second framed data.
- during a call, the target voice information in the current call environment can be collected through the microphone, and the target voice information is framed and windowed with a frame length of 1024 samples, an overlap of 128 samples, and a Hamming window function to obtain the second framed data.
- in step S206, a Fourier transform is performed on the second framed data to obtain the target spectrogram corresponding to the target voice information.
- the mobile phone performs a Fourier transform on the second framed data, calculates the energy density of the signal, and generates the corresponding target spectrogram.
- the target spectrogram can also be processed in grayscale.
- the abscissa of the spectrogram is time, the ordinate is frequency, and the gray value represents the energy value, which has the same features as the spectrogram in the preset scene model shown in FIG. 3.
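- using the to_spectrogram sketch from the construction phase, steps S205–S206 reduce to reusing the same pipeline on the live microphone clip; mic_clip below is a hypothetical 1-D array of samples captured during the call.

```python
# Reuse of the framing/windowing/FFT pipeline sketched earlier (assumed helper).
target_spectrogram = to_spectrogram(mic_clip, frame_length=1024, overlap=128)
```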
- in step S207, the target spectrogram is input into the scene recognition model, and feature traversal is performed on the target spectrogram through the scene recognition model to determine the target preset scene model corresponding to the target spectrogram.
- the target spectrogram from the current call environment is input into the scene recognition model; the scene recognition model traverses the features in the target spectrogram one by one, identifies the corresponding distinguishing features, and determines the target preset scene model to which the target spectrogram belongs according to those features.
- in step S208, the corresponding call parameters are matched according to the target preset scene model.
- the mobile phone associates different call parameters with each preset scene model, so that in the corresponding preset scene the call is made with the best call parameters; for example, the road scene is associated with a first call parameter and the subway scene with a second call parameter.
- the first call parameter differs from the second call parameter, so when the target preset scene model is the subway scene, the corresponding second call parameter is matched.
- in step S209, corresponding prompt information is generated to prompt the user to adjust the call with the matched call parameters, and when a confirmation instruction corresponding to the prompt information is received, the call is adjusted according to the matched call parameters.
- for example, prompt information such as "Call with the call parameters suited to the current scene?" can be generated; the user can then choose yes or no, and when the user selects yes, a confirmation instruction is generated and received, and the call is adjusted according to the matched second call parameter.
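- a minimal sketch of this prompt-and-confirm flow follows; prompt_user and apply_call_parameters stand in for platform UI and audio-path APIs and are purely hypothetical names.

```python
def adjust_call(scene, params, prompt_user, apply_call_parameters):
    """Ask the user before switching to the matched call parameters."""
    message = f"Call with the call parameters suited to the current scene ({scene})?"
    if prompt_user(message):            # True acts as the confirmation instruction
        apply_call_parameters(params)   # adjust the call with the matched parameters
        return True
    return False                        # user declined; keep current parameters
```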
- in the voice information processing method described above, a preset amount of voice information is collected at a preset sampling rate and framed to obtain first framed data; the first framed data is Fourier-transformed to generate corresponding spectrograms, from which a preset scene model is constructed; a convolutional neural network is used to train on the spectrograms in the preset scene model to generate a corresponding scene recognition model; target voice information in the current environment is collected and analyzed to obtain a target spectrogram corresponding to it; and the target spectrogram is input into the scene recognition model to determine the corresponding target preset scene model, with the corresponding call parameters matched according to that model.
- in this way, a preset scene model built from a preset number of spectrograms is trained into a scene recognition model that can identify the scene; the target voice information in the current environment is collected in real time and converted into a target spectrogram, which is input into the scene recognition model to identify the scene model of the current environment and match the appropriate call parameters, improving the efficiency of voice information processing and making the recognition of the call scene more accurate.
- the embodiments of the present application further provide an apparatus based on the processing method of voice information described above.
- the meanings of the terms are the same as in the voice information processing method above; for specific implementation details, refer to the description in the method embodiments.
- An embodiment of the present application provides a voice information processing apparatus, including:
- a construction unit configured to construct a preset scene model, the preset scene model including a preset number of spectrograms;
- a training unit configured to train the spectrograms in the preset scene model to generate a corresponding scene recognition model;
- an analysis unit configured to collect target voice information in the current environment and analyze the target voice information to obtain a target spectrogram corresponding to the target voice information;
- an input unit configured to input the target spectrogram into the scene recognition model to determine a corresponding target preset scene model, and match corresponding call parameters according to the target preset scene model.
- the construction unit may include a collection subunit and a conversion subunit: the collection subunit is used to collect a preset amount of voice information at a preset sampling rate, and the conversion subunit is used to convert the preset amount of voice information into corresponding spectrograms and construct a preset scene model according to the spectrograms.
- the conversion subunit is specifically configured to: frame the voice information to obtain first framed data; perform a Fourier transform on the first framed data to generate the spectrogram corresponding to the voice information; and construct a preset scene model according to the spectrogram.
- the training unit is specifically configured to: use a convolutional neural network to train the spectrograms in the preset scene model to generate a corresponding scene recognition model.
- the analysis unit is specifically configured to: collect target voice information in the current environment, frame the target voice information to obtain second framed data, and perform a Fourier transform on the second framed data to obtain the target spectrogram corresponding to the target voice information.
- FIG. 4 is a schematic block diagram of a voice information processing apparatus provided by an embodiment of the present application.
- the voice information processing device 300 includes a construction unit 31, a training unit 32, an analysis unit 33, and an input unit 34.
- the construction unit 31 is configured to construct a preset scene model, and the preset scene model includes a preset number of spectrograms.
- the construction unit 31 can collect a preset amount of voice information in a specific scene and convert the preset amount of voice information into a corresponding spectrogram.
- the abscissa of the spectrogram is time, and the ordinate is frequency.
- the depth of the color represents the energy of the voice data.
- the spectrogram can express the characteristics of the voice information in multiple dimensions. Therefore, a preset scene model can be constructed from the multiple spectrograms.
- the training unit 32 is configured to train the spectrogram in the preset scene model to generate a corresponding scene recognition model.
- the training unit 32 can use a machine learning method to train and learn on the preset number of spectrograms for the scene, to generate a scene recognition model that can recognize the scene.
- the training unit 32 may learn and train the spectrogram in the preset scene model through a convolutional neural network to generate a scene recognition model that can automatically identify the identifying features of the corresponding scene.
- the analysis unit 33 is configured to collect target voice information in the current environment and analyze the target voice information to obtain a target spectrogram corresponding to the target voice information.
- the analysis unit 33 automatically collects the target voice information in the current environment through the microphone and converts it into the corresponding target spectrogram, in which the abscissa is time, the ordinate is frequency, and the depth of color represents the energy of the voice data; the features of the target spectrogram are of the same kind as the features of the spectrograms in the preset scene model.
- the analysis unit 33 is specifically configured to collect target voice information in the current environment, frame the target voice information to obtain second framed data, and perform a Fourier transform on the second framed data to obtain the target spectrogram corresponding to the target voice information.
- the input unit 34 is configured to input the target spectrogram into a scene recognition model to determine a corresponding target preset scene model, and match corresponding call parameters according to the target preset scene model.
- the input unit 34 can input the target spectrogram into the scene recognition model; the scene recognition model traverses the features of the target spectrogram one by one to identify the corresponding target preset scene model, such as a subway scene, and the call parameters adapted to that scene model are matched according to the target preset scene model, so that the call can proceed with call parameters adapted to the current environment, improving the user's call quality and efficiency.
- the input unit 34 is specifically configured to: input the target spectrogram into the scene recognition model; perform feature traversal on the target spectrogram through the scene recognition model to determine the corresponding target preset scene model; and match the corresponding call parameters according to the target preset scene model.
- FIG. 5 is another schematic block diagram of the voice information processing apparatus provided by an embodiment of the present application.
- as shown in FIG. 5, in the apparatus 300 for processing voice information, the construction unit 31 may include a collection subunit 311 and a conversion subunit 312.
- the collection subunit 311 is configured to collect a preset amount of voice information at a preset sampling rate.
- the conversion subunit 312 is configured to convert the preset amount of voice information into a corresponding spectrogram, and construct a preset scene model according to the spectrogram.
- the conversion subunit 312 is specifically configured to frame the voice information to obtain first framed data, perform a Fourier transform on the first framed data to generate the spectrogram corresponding to the voice information, and construct a preset scene model according to the spectrogram.
- as shown in FIG. 6, the electronic device 500 includes a processor 501 and a memory 502.
- the processor 501 and the memory 502 are electrically connected.
- the processor 501 is the control center of the electronic device 500; it uses various interfaces and lines to connect the parts of the entire electronic device, executes or loads the computer program stored in the memory 502, and calls the data stored in the memory 502 to execute the various functions of the electronic device 500 and process data, thereby monitoring the electronic device 500 as a whole.
- the memory 502 can be used to store software programs and modules.
- the processor 501 runs computer programs and modules stored in the memory 502 to execute various functional applications and data processing.
- the memory 502 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, computer programs required by at least one function (such as a sound playback function or an image playback function), and the like, and the data storage area may store data created through the use of the electronic device.
- the memory 502 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 502 may further include a memory controller to provide the processor 501 with access to the memory 502.
- in this embodiment, the processor 501 in the electronic device 500 loads instructions corresponding to the processes of one or more computer programs into the memory 502 and executes the computer programs stored in the memory 502, thereby realizing various functions, as follows:
- constructing a preset scene model, the preset scene model including a preset number of spectrograms;
- training the spectrograms in the preset scene model to generate a corresponding scene recognition model;
- collecting target voice information in the current environment, and analyzing it to obtain a target spectrogram corresponding to the target voice information;
- inputting the target spectrogram into the scene recognition model to determine the corresponding target preset scene model, and matching the corresponding call parameters according to the target preset scene model.
- when constructing the preset scene model, the processor 501 may specifically perform the following steps: collecting a preset amount of voice information at a preset sampling rate; converting the preset amount of voice information into corresponding spectrograms, and constructing the preset scene model according to the spectrograms.
- when converting the preset amount of voice information into corresponding spectrograms, the processor 501 may specifically perform the following steps: framing the voice information to obtain first framed data; performing a Fourier transform on the first framed data to generate the spectrogram corresponding to the voice information.
- when training the spectrograms in the preset scene model to generate a corresponding scene recognition model, the processor 501 may specifically perform the following step: using a convolutional neural network to train the spectrograms in the preset scene model to generate the corresponding scene recognition model.
- when analyzing the target voice information to obtain a target spectrogram corresponding to the target voice information, the processor 501 may specifically perform the following steps: framing the target voice information to obtain second framed data; performing a Fourier transform on the second framed data to obtain the target spectrogram corresponding to the target voice information.
- when inputting the target spectrogram into the scene recognition model to determine the corresponding target preset scene model, the processor 501 may specifically perform the following steps: inputting the target spectrogram into the scene recognition model; performing feature traversal on the target spectrogram through the scene recognition model to determine the target preset scene model corresponding to the target spectrogram.
- after matching the corresponding call parameters according to the target preset scene model, the processor 501 may further specifically perform the following steps: generating corresponding prompt information to prompt the user to adjust the call with the matched call parameters; upon receiving a confirmation instruction corresponding to the prompt information, adjusting the call according to the matched call parameters.
- the electronic device 500 may further include: a display 503, a radio frequency circuit 504, an audio circuit 505, and a power supply 506.
- the display 503, the radio frequency circuit 504, the audio circuit 505, and the power supply 506 are electrically connected to the processor 501, respectively.
- the display 503 can be used to display information input by the user or provided to the user and various graphical user interfaces, which can be composed of graphics, text, icons, video, and any combination thereof.
- the display 503 may include a display panel.
- the display panel may be configured in the form of a liquid crystal display (LCD) or an organic light-emitting diode (OLED) display.
- the radio frequency circuit 504 may be used to transmit and receive radio frequency signals, so as to establish wireless communication with network devices or other electronic devices and exchange signals with them.
- the audio circuit 505 can be used to provide an audio interface between a user and an electronic device through speakers and microphones.
- the power supply 506 can be used to power various components of the electronic device 500.
- the power supply 506 may be logically connected to the processor 501 through a power management system, so as to implement functions such as charging, discharging, and power management through the power management system.
- the electronic device 500 may further include a camera, a Bluetooth module, etc., which will not be repeated here.
- An embodiment of the present application also provides a storage medium storing a computer program; when the computer program runs on a computer, the computer is caused to execute the voice information processing method in any of the foregoing embodiments, for example: constructing a preset scene model, the preset scene model including a preset number of spectrograms; training the spectrograms in the preset scene model to generate a corresponding scene recognition model; collecting target voice information in the current environment and analyzing it to obtain a target spectrogram corresponding to the target voice information; and inputting the target spectrogram into the scene recognition model to determine the corresponding target preset scene model, and matching the corresponding call parameters according to the target preset scene model.
- the storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
- the computer program can be stored in a computer-readable storage medium, such as the memory of the electronic device, and executed by at least one processor in the electronic device; the execution process can include, for example, the flow of the embodiments of the voice information processing method.
- the storage medium may be a magnetic disk, an optical disk, a read-only memory, a random access memory, or the like.
- each functional module may be integrated into one processing chip, or each module may exist alone physically, or two or more modules are integrated into one module.
- the above integrated modules may be implemented in the form of hardware or software function modules. If the integrated module is implemented in the form of a software function module and sold or used as an independent product, it can also be stored in a computer-readable storage medium, such as a read-only memory, magnetic disk, or optical disk.
Claims (20)
- A voice information processing method, comprising: constructing a preset scene model, the preset scene model including a preset number of spectrograms; training the spectrograms in the preset scene model to generate a corresponding scene recognition model; collecting target voice information in the current environment, and analyzing the target voice information to obtain a target spectrogram corresponding to the target voice information; and inputting the target spectrogram into the scene recognition model to determine a corresponding target preset scene model, and matching corresponding call parameters according to the target preset scene model.
- The voice information processing method according to claim 1, wherein the step of constructing a preset scene model comprises: collecting a preset amount of voice information at a preset sampling rate; and converting the preset amount of voice information into corresponding spectrograms, and constructing the preset scene model according to the spectrograms.
- The voice information processing method according to claim 2, wherein the step of converting the preset amount of voice information into corresponding spectrograms comprises: framing the voice information to obtain first framed data; and performing a Fourier transform on the first framed data to generate the spectrogram corresponding to the voice information.
- The voice information processing method according to claim 1, wherein the step of training the spectrograms in the preset scene model to generate a corresponding scene recognition model comprises: using a convolutional neural network to train the spectrograms in the preset scene model to generate the corresponding scene recognition model.
- The voice information processing method according to any one of claims 1 to 4, wherein the step of analyzing the target voice information to obtain a target spectrogram corresponding to the target voice information comprises: framing the target voice information to obtain second framed data; and performing a Fourier transform on the second framed data to obtain the target spectrogram corresponding to the target voice information.
- The voice information processing method according to claim 5, wherein the step of inputting the target spectrogram into the scene recognition model to determine the corresponding target preset scene model comprises: inputting the target spectrogram into the scene recognition model; and performing feature traversal on the target spectrogram through the scene recognition model to determine the target preset scene model corresponding to the target spectrogram.
- The voice information processing method according to claim 1, wherein after the step of matching corresponding call parameters according to the target preset scene model, the method further comprises: generating corresponding prompt information to prompt the user to adjust the call with the matched call parameters; and, upon receiving a confirmation instruction corresponding to the prompt information, adjusting the call according to the matched call parameters.
- A voice information processing apparatus, comprising: a construction unit configured to construct a preset scene model, the preset scene model including a preset number of spectrograms; a training unit configured to train the spectrograms in the preset scene model to generate a corresponding scene recognition model; an analysis unit configured to collect target voice information in the current environment and analyze the target voice information to obtain a target spectrogram corresponding to the target voice information; and an input unit configured to input the target spectrogram into the scene recognition model to determine a corresponding target preset scene model, and match corresponding call parameters according to the target preset scene model.
- The voice information processing apparatus according to claim 8, wherein the construction unit comprises: a collection subunit configured to collect a preset amount of voice information at a preset sampling rate; and a conversion subunit configured to convert the preset amount of voice information into corresponding spectrograms and construct the preset scene model according to the spectrograms.
- The voice information processing apparatus according to claim 9, wherein the conversion subunit is specifically configured to: frame the voice information to obtain first framed data; and perform a Fourier transform on the first framed data to generate the spectrogram corresponding to the voice information, and construct the preset scene model according to the spectrogram.
- The voice information processing apparatus according to claim 8, wherein the training unit is specifically configured to: use a convolutional neural network to train the spectrograms in the preset scene model to generate the corresponding scene recognition model.
- The voice information processing apparatus according to any one of claims 8 to 11, wherein the analysis unit is specifically configured to: collect target voice information in the current environment, and frame the target voice information to obtain second framed data; and perform a Fourier transform on the second framed data to obtain the target spectrogram corresponding to the target voice information.
- A storage medium having a computer program stored thereon, wherein, when the computer program runs on a computer, the computer is caused to execute the voice information processing method according to claim 1.
- An electronic device, comprising a processor and a memory, the memory storing a computer program, wherein the processor is configured to perform, by calling the computer program, the steps of: constructing a preset scene model, the preset scene model including a preset number of spectrograms; training the spectrograms in the preset scene model to generate a corresponding scene recognition model; collecting target voice information in the current environment, and analyzing the target voice information to obtain a target spectrogram corresponding to the target voice information; and inputting the target spectrogram into the scene recognition model to determine a corresponding target preset scene model, and matching corresponding call parameters according to the target preset scene model.
- The electronic device according to claim 14, wherein the processor is configured to perform, by calling the computer program, the steps of: collecting a preset amount of voice information at a preset sampling rate; and converting the preset amount of voice information into corresponding spectrograms, and constructing the preset scene model according to the spectrograms.
- The electronic device according to claim 15, wherein the processor is configured to perform, by calling the computer program, the steps of: framing the voice information to obtain first framed data; and performing a Fourier transform on the first framed data to generate the spectrogram corresponding to the voice information.
- The electronic device according to claim 14, wherein the processor is configured to perform, by calling the computer program, the step of: using a convolutional neural network to train the spectrograms in the preset scene model to generate the corresponding scene recognition model.
- The electronic device according to claim 14, wherein the processor is configured to perform, by calling the computer program, the steps of: framing the target voice information to obtain second framed data; and performing a Fourier transform on the second framed data to obtain the target spectrogram corresponding to the target voice information.
- The electronic device according to claim 18, wherein the processor is configured to perform, by calling the computer program, the steps of: inputting the target spectrogram into the scene recognition model; and performing feature traversal on the target spectrogram through the scene recognition model to determine the target preset scene model corresponding to the target spectrogram.
- The electronic device according to claim 14, wherein the processor is further configured to perform, by calling the computer program, the steps of: generating corresponding prompt information to prompt the user to adjust the call with the matched call parameters; and, upon receiving a confirmation instruction corresponding to the prompt information, adjusting the call according to the matched call parameters.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2018/116447 WO2020102979A1 (zh) | 2018-11-20 | 2018-11-20 | Voice information processing method and apparatus, storage medium, and electronic device |
CN201880098316.5A CN112771608A (zh) | 2018-11-20 | 2018-11-20 | Voice information processing method and apparatus, storage medium, and electronic device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2018/116447 WO2020102979A1 (zh) | 2018-11-20 | 2018-11-20 | Voice information processing method and apparatus, storage medium, and electronic device |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020102979A1 true WO2020102979A1 (zh) | 2020-05-28 |
Family
ID=70773731
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2018/116447 WO2020102979A1 (zh) | 2018-11-20 | 2018-11-20 | 语音信息的处理方法、装置、存储介质及电子设备 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112771608A (zh) |
WO (1) | WO2020102979A1 (zh) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113370923B (zh) * | 2021-07-23 | 2023-11-03 | Shenzhen Launch Tech Co., Ltd. | Vehicle configuration adjustment method and apparatus, electronic device, and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103632682A (zh) * | 2013-11-20 | 2014-03-12 | Anhui USTC iFlytek Co., Ltd. | Audio feature detection method |
CN103903616A (zh) * | 2012-12-25 | 2014-07-02 | Lenovo (Beijing) Co., Ltd. | Information processing method and electronic device |
CN105845131A (zh) * | 2016-04-11 | 2016-08-10 | LeTV Holding (Beijing) Co., Ltd. | Far-field speech recognition method and apparatus |
CN108764304A (zh) * | 2018-05-11 | 2018-11-06 | Guangdong OPPO Mobile Telecommunications Corp., Ltd. | Scene recognition method and apparatus, storage medium, and electronic device |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102360187B (zh) * | 2011-05-25 | 2013-06-05 | Jilin University | Driver Chinese speech control system and method based on spectrogram cross-correlation |
US9165565B2 (en) * | 2011-09-09 | 2015-10-20 | Adobe Systems Incorporated | Sound mixture recognition |
CN105810197B (zh) * | 2014-12-30 | 2019-07-26 | Lenovo (Beijing) Co., Ltd. | Voice processing method, voice processing apparatus, and electronic device |
CN105208174A (zh) * | 2015-09-06 | 2015-12-30 | Shanghai Zhizhen Intelligent Network Technology Co., Ltd. | Voice communication method, apparatus, and dialing system |
CN106558318B (zh) * | 2015-09-24 | 2020-04-28 | Alibaba Group Holding Limited | Audio recognition method and system |
CN106201312A (zh) * | 2016-06-30 | 2016-12-07 | Beijing Qihoo Technology Co., Ltd. | Application processing method, apparatus, and terminal |
2018
- 2018-11-20 WO PCT/CN2018/116447 patent/WO2020102979A1/zh active Application Filing
- 2018-11-20 CN CN201880098316.5A patent/CN112771608A/zh active Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103903616A (zh) * | 2012-12-25 | 2014-07-02 | Lenovo (Beijing) Co., Ltd. | Information processing method and electronic device |
CN103632682A (zh) * | 2013-11-20 | 2014-03-12 | Anhui USTC iFlytek Co., Ltd. | Audio feature detection method |
CN105845131A (zh) * | 2016-04-11 | 2016-08-10 | LeTV Holding (Beijing) Co., Ltd. | Far-field speech recognition method and apparatus |
CN108764304A (zh) * | 2018-05-11 | 2018-11-06 | Guangdong OPPO Mobile Telecommunications Corp., Ltd. | Scene recognition method and apparatus, storage medium, and electronic device |
Also Published As
Publication number | Publication date |
---|---|
CN112771608A (zh) | 2021-05-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109087669B (zh) | Audio similarity detection method and apparatus, storage medium, and computer device | |
US11798531B2 (en) | Speech recognition method and apparatus, and method and apparatus for training speech recognition model | |
CN111883091B (zh) | Audio noise reduction method and training method for an audio noise reduction model | |
JP5996783B2 (ja) | Method and terminal for updating a voiceprint feature model | |
WO2018219105A1 (zh) | Speech recognition method and related products | |
CN108922525B (zh) | Voice processing method and apparatus, storage medium, and electronic device | |
CN105489221A (zh) | Speech recognition method and apparatus | |
CN108810280B (zh) | Method and apparatus for processing voice collection frequency, storage medium, and electronic device | |
CN110265011B (zh) | Interaction method for an electronic device, and electronic device | |
US10783884B2 (en) | Electronic device-awakening method and apparatus, device and computer-readable storage medium | |
CN111739545B (zh) | Audio processing method and apparatus, and storage medium | |
WO2020249038A1 (zh) | Audio stream processing method and apparatus, mobile terminal, and storage medium | |
CN109361995B (zh) | Volume adjustment method and apparatus for an electrical appliance, electrical appliance, and medium | |
WO2022147692A1 (zh) | Voice command recognition method, electronic device, and non-transitory computer-readable storage medium | |
CN108600559B (zh) | Silent mode control method and apparatus, storage medium, and electronic device | |
CN104409081A (zh) | Speech signal processing method and apparatus | |
CN111081275B (zh) | Terminal processing method and apparatus based on sound analysis, storage medium, and terminal | |
CN113611318A (zh) | Audio data augmentation method and related device | |
WO2020102979A1 (zh) | Voice information processing method and apparatus, storage medium, and electronic device | |
CN109215688A (zh) | Same-scene audio processing method and apparatus, computer-readable storage medium, and system | |
WO2020102943A1 (zh) | Gesture recognition model generation method and apparatus, storage medium, and electronic device | |
CN110580910B (zh) | Audio processing method, apparatus, and device, and readable storage medium | |
WO2019242415A1 (zh) | Position prompting method and apparatus, storage medium, and electronic device | |
WO2022213943A1 (zh) | Message sending method and apparatus, electronic device, and storage medium | |
CN114708849A (zh) | Voice processing method and apparatus, computer device, and computer-readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 18940994 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 18940994 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 29.09.2021) |
|