WO2019242414A1 - Voice processing method and apparatus, storage medium and electronic device - Google Patents
Voice processing method and apparatus, storage medium and electronic device
- Publication number
- WO2019242414A1 WO2019242414A1 PCT/CN2019/085543 CN2019085543W WO2019242414A1 WO 2019242414 A1 WO2019242414 A1 WO 2019242414A1 CN 2019085543 W CN2019085543 W CN 2019085543W WO 2019242414 A1 WO2019242414 A1 WO 2019242414A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- voice
- voiceprint feature
- signal
- output
- voice signal
- Prior art date
Links
- 238000003672 processing method Methods 0.000 title claims abstract description 28
- 238000012545 processing Methods 0.000 claims abstract description 21
- 238000004590 computer program Methods 0.000 claims description 16
- 238000012549 training Methods 0.000 claims description 12
- 230000009467 reduction Effects 0.000 claims description 4
- 230000008451 emotion Effects 0.000 description 26
- 238000004458 analytical method Methods 0.000 description 20
- 238000000034 method Methods 0.000 description 17
- 230000008569 process Effects 0.000 description 8
- 238000010586 diagram Methods 0.000 description 7
- 230000002996 emotional effect Effects 0.000 description 7
- 230000003993 interaction Effects 0.000 description 7
- 230000006870 function Effects 0.000 description 6
- 230000007935 neutral effect Effects 0.000 description 6
- 238000005516 engineering process Methods 0.000 description 4
- 239000000203 mixture Substances 0.000 description 4
- 238000013528 artificial neural network Methods 0.000 description 3
- 230000015572 biosynthetic process Effects 0.000 description 3
- 238000006243 chemical reaction Methods 0.000 description 3
- 238000010801 machine learning Methods 0.000 description 3
- 230000003287 optical effect Effects 0.000 description 3
- 210000000056 organ Anatomy 0.000 description 3
- 238000003786 synthesis reaction Methods 0.000 description 3
- 230000008859 change Effects 0.000 description 2
- 238000004891 communication Methods 0.000 description 2
- 239000004973 liquid crystal related substance Substances 0.000 description 2
- 230000033764 rhythmic process Effects 0.000 description 2
- 230000003595 spectral effect Effects 0.000 description 2
- 230000001755 vocal effect Effects 0.000 description 2
- 238000013527 convolutional neural network Methods 0.000 description 1
- 238000003066 decision tree Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 238000007599 discharging Methods 0.000 description 1
- 230000036651 mood Effects 0.000 description 1
- 210000000214 mouth Anatomy 0.000 description 1
- 210000003205 muscle Anatomy 0.000 description 1
- 210000003928 nasal cavity Anatomy 0.000 description 1
- 210000003800 pharynx Anatomy 0.000 description 1
- 238000005070 sampling Methods 0.000 description 1
- 210000001584 soft palate Anatomy 0.000 description 1
- 210000001260 vocal cord Anatomy 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/033—Voice editing, e.g. manipulating the voice of the synthesiser
- G10L13/0335—Pitch control
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/22—Interactive procedures; Man-machine interfaces
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0264—Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
Definitions
- the present application relates to the technical field of electronic devices, and in particular, to a voice processing method, device, storage medium, and electronic device.
- an embodiment of the present application provides a voice processing method, including:
- the voice signal to be output includes a voiceprint feature to be output corresponding to the voiceprint feature, and voice content to be output corresponding to the voice content;
- an embodiment of the present application provides a voice processing apparatus, including:
- a collection module, configured to collect a voice signal from the external environment;
- an acquisition module, configured to acquire the voice content and the voiceprint feature included in the voice signal;
- a generating module, configured to generate a voice signal to be output according to the voice content and the voiceprint feature, where the voice signal to be output includes a voiceprint feature to be output corresponding to the voiceprint feature and voice content to be output corresponding to the voice content;
- an output module, configured to output the voice signal to be output.
- an embodiment of the present application provides a storage medium on which a computer program is stored, and when the computer program is run on a computer, the computer is caused to execute:
- the voice signal to be output includes a voiceprint feature to be output corresponding to the voiceprint feature, and voice content to be output corresponding to the voice content;
- an embodiment of the present application provides an electronic device including a processor and a memory, where the memory has a computer program, and the processor calls the computer program to execute:
- the voice signal to be output includes a voiceprint feature to be output corresponding to the voiceprint feature, and voice content to be output corresponding to the voice content;
- FIG. 1 is a schematic flowchart of a voice processing method according to an embodiment of the present application.
- FIG. 2 is a schematic diagram of an electronic device acquiring voice content from a voice signal in an embodiment of the present application.
- FIG. 3 is a schematic diagram of voice interaction between an electronic device and a user in an embodiment of the present application.
- FIG. 4 is a schematic diagram of an electronic device and a user performing a voice interaction in a conference room scene according to an embodiment of the present application.
- FIG. 5 is another schematic flowchart of a voice processing method according to an embodiment of the present application.
- FIG. 6 is a schematic structural diagram of a voice processing apparatus according to an embodiment of the present application.
- FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
- FIG. 8 is another schematic structural diagram of an electronic device according to an embodiment of the present application.
- an embodiment herein means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application.
- the appearances of this phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are they independent or alternative embodiments that are mutually exclusive with other embodiments. It is explicitly and implicitly understood by those skilled in the art that the embodiments described herein may be combined with other embodiments.
- the embodiment of the present application provides a voice processing method.
- the execution subject of the voice processing method may be a voice processing device provided in the embodiment of the application, or an electronic device integrated with the voice processing device.
- the voice processing device may be implemented in hardware or software.
- the electronic device may be a smart phone, a tablet computer, a palmtop computer, a notebook computer, or a desktop computer.
- An embodiment of the present application provides a voice processing method, including:
- the voice signal to be output includes a voiceprint feature to be output corresponding to the voiceprint feature, and voice content to be output corresponding to the voice content;
- the outputting the voice signal to be output includes:
- the acquiring a voice signal in an external environment includes:
- the acquiring a noise signal during the collection of the noisy voice signal according to the historical noise signal includes:
- the method before the generating a voice signal to be output according to the voice content and the voiceprint feature, the method further includes:
- the voice signal to be output is generated according to the voice content and the voiceprint feature.
- determining whether the voiceprint feature matches a preset voiceprint feature includes:
- the obtaining the similarity between the voiceprint feature and the preset voiceprint feature includes:
- the feature distance is used as a similarity between the voiceprint feature and the preset voiceprint feature.
- the method further includes:
- the acquiring the voice content included in the voice signal includes:
- FIG. 1 is a schematic flowchart of a voice processing method according to an embodiment of the present application. As shown in FIG. 1, the process of the voice processing method provided by the embodiment of the present application may be as follows:
- the electronic device can collect voice signals from the external environment in many different ways. For example, when no external microphone is connected, the electronic device can collect the voice in the external environment through its built-in microphone to obtain the voice signal; for another example, when an external microphone is connected, the electronic device can collect the voice in the external environment through the external microphone to obtain the voice signal.
- when an electronic device collects a voice signal from the external environment through a microphone (which may be a built-in or an external microphone), an analog microphone yields an analog voice signal. In that case, the electronic device needs to sample the analog voice signal to convert it into a digital voice signal, for example at a sampling frequency of 16 kHz. If the microphone is a digital microphone, the electronic device directly collects a digitized voice signal through it, with no conversion needed.
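A minimal sketch of this collection step (illustrative only; the `sounddevice` library and the fixed recording duration are assumptions, not part of the disclosure):

```python
# Sketch: digitize microphone input at a 16 kHz sampling frequency.
# An analog microphone is sampled by the audio hardware; a digital
# microphone would deliver PCM samples directly, with no conversion.
import sounddevice as sd

SAMPLE_RATE = 16000  # 16 kHz, as in the example above

def collect_voice_signal(seconds: float):
    """Record `seconds` of audio and return 16-bit PCM samples."""
    recording = sd.rec(int(seconds * SAMPLE_RATE),
                       samplerate=SAMPLE_RATE,
                       channels=1,
                       dtype="int16")
    sd.wait()  # block until the recording completes
    return recording.squeeze()
```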
- after the electronic device collects a voice signal from the external environment, it determines whether a voice parsing engine exists locally. If one exists, the electronic device inputs the collected voice signal into the local voice parsing engine for voice parsing to obtain the parsed text. Here, parsing the voice signal means converting it from "audio" to "text".
- the electronic device can select a speech parsing engine from the multiple speech parsing engines to perform speech parsing on the speech signal in the following manner:
- the electronic device may randomly select a speech analysis engine from a plurality of local speech analysis engines to perform speech analysis on the collected speech signals.
- the electronic device can select a speech parsing engine with the highest parsing success rate from multiple speech parsing engines to perform speech parsing on the collected speech signals.
- the electronic device can select a speech parsing engine with the shortest analysis time from multiple speech parsing engines to perform speech parsing on the collected speech signals.
- the electronic device may also select, from the multiple speech parsing engines, a speech parsing engine whose parsing success rate reaches a preset success rate and whose parsing time is the shortest, to perform speech parsing on the collected speech signals.
- the electronic device may also select a speech parsing engine in a manner not listed above, or may combine multiple speech parsing engines to perform speech parsing on the speech signal. For example, an electronic device may run two speech parsing engines on the speech signal simultaneously, and when the parsed texts obtained by the two engines are the same, use that text as the parsed text of the speech signal; as another example, an electronic device may run at least three speech parsing engines on the speech signal, and when the parsed texts obtained by at least two of them are the same, use that text as the parsed text of the speech signal.
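The multi-engine agreement strategy can be sketched as follows; the `parse(signal)` method is a hypothetical engine interface, since the description does not name concrete engines:

```python
# Sketch: run several speech parsing engines and keep the parsed text
# that at least two engines agree on.
from collections import Counter

def parse_with_engines(signal, engines):
    texts = [engine.parse(signal) for engine in engines]
    (best_text, votes), = Counter(texts).most_common(1)
    if votes >= 2:
        return best_text  # consensus among at least two engines
    return texts[0]       # no consensus: fallback policy is a choice
```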
- after the electronic device parses the speech signal and obtains the parsed text, it can extract the speech content included in the speech signal from that text. For example, referring to FIG. 2, when a user speaks the voice "What is the weather tomorrow", the electronic device collects the corresponding voice signal, performs voice parsing on it to obtain the corresponding parsed text, and extracts the speech content "What is the weather tomorrow" from that text.
- when the electronic device determines that no voice parsing engine exists locally, it sends the aforementioned voice signal to a server providing a voice parsing service, instructing the server to parse the voice signal and return the resulting parsed text.
- the electronic device can extract the speech content included in the speech signal from the speech analysis text.
- the first factor that determines voiceprint characteristics is the size of the acoustic cavities, including the throat, nasal cavity, and oral cavity.
- the shape, size, and position of these organs determine the vocal cord tension and the range of sound frequencies. Therefore, although different people may say the same thing, the frequency distributions of their voices differ: some voices sound low and deep, others loud and clear.
- the second factor that determines the characteristics of the voiceprint is the manner in which the vocal organs are manipulated.
- the vocal organs include the lips, teeth, tongue, soft palate, and palate muscles, and their interaction produces clear speech. The way they cooperate is learned incidentally through interaction with the people around a person: in the process of learning to speak, by imitating the speech of different people around them, people gradually form their own voiceprint characteristics.
- the mood of the user when speaking can also change the characteristics of the voiceprint.
- in addition to acquiring the voice content included in the collected voice signal, the electronic device also acquires the voiceprint feature included in the collected voice signal.
- the voiceprint features include at least one of: spectral feature components, cepstral feature components, formant feature components, pitch feature components, reflection coefficient feature components, tone feature components, speech rate feature components, emotional feature components, prosodic feature components, and rhythm feature components.
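For illustration, a sketch of extracting a few of the listed components (cepstral, spectral, and pitch) with the librosa library; the choice of toolkit and the averaging into a single fixed-length vector are assumptions:

```python
# Sketch: derive a simple voiceprint feature vector from a voice signal
# (a float waveform array), combining cepstral, spectral, and pitch cues.
import librosa
import numpy as np

def extract_voiceprint_features(signal: np.ndarray, sr: int = 16000):
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)        # cepstral
    centroid = librosa.feature.spectral_centroid(y=signal, sr=sr)  # spectral
    f0 = librosa.yin(signal, fmin=65, fmax=400, sr=sr)             # pitch
    # Average over time to obtain one fixed-length feature vector.
    return np.concatenate([mfcc.mean(axis=1),
                           centroid.mean(axis=1),
                           [np.nanmean(f0)]])
```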
- 103 Generate a voice signal to be output according to the acquired voice content and voiceprint features, where the voice signal to be output includes voiceprint features to be output corresponding to the aforementioned voiceprint features and voice content to be output corresponding to the aforementioned voice content.
- that is, the electronic device obtains the voice content and voiceprint features included in the voice signal, and then obtains the corresponding voice content to be output according to the preset correspondence among voice content, voiceprint features, and voice content to be output, together with the acquired voice content and voiceprint features.
- the correspondence between the speech content, voiceprint features, and speech content to be output can be set by those skilled in the art according to actual needs, where a tone word that does not affect semantics can be added to the speech content to be output.
- taking a voiceprint feature that includes only an emotional feature component as an example,
- when a user says "What's the weather tomorrow" with a neutral emotion, the electronic device will obtain the corresponding content to be output as "Tomorrow will be clear, suitable for going out"; for another example, when the user says "I'm unhappy" with a negative emotion, the electronic device will obtain the corresponding content to be output as "Don't be unhappy, take me out to play."
- the electronic device also obtains a corresponding voiceprint feature to be output according to a preset correspondence relationship between the voiceprint feature and the voiceprint feature to be output, and the obtained voiceprint feature.
- the correspondence between the voiceprint features and the voiceprint features to be output can be set by those skilled in the art according to actual needs, and this application does not specifically limit this.
- the emotions to be output corresponding to the negative emotions can be set as positive emotions
- the emotions to be output corresponding to the neutral emotions are neutral emotions
- the emotions to be output corresponding to the positive emotions are positive emotions.
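A minimal sketch of such preset correspondences as plain lookup tables; the entries below simply restate the examples above and are otherwise illustrative:

```python
# Sketch: preset correspondences between (content, emotion) pairs and
# the content/emotion to be output.
CONTENT_MAP = {
    ("what's the weather tomorrow", "neutral"):
        "Tomorrow will be clear, suitable for going out.",
    ("i'm unhappy", "negative"):
        "Don't be unhappy, take me out to play.",
}

EMOTION_MAP = {
    "negative": "positive",  # comfort a user with negative emotion
    "neutral": "neutral",
    "positive": "positive",
}

def plan_response(content: str, emotion: str):
    out_content = CONTENT_MAP.get((content.lower(), emotion))
    out_emotion = EMOTION_MAP.get(emotion, "neutral")
    return out_content, out_emotion
```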
- after the electronic device obtains the voice content to be output corresponding to the voice content and the voiceprint feature, and the voiceprint feature to be output corresponding to the voiceprint feature, it performs speech synthesis according to the voice content to be output and the voiceprint feature to be output.
- the voice signal to be output is obtained by this synthesis, and it includes the voice content to be output corresponding to the foregoing voice content and voiceprint feature, and the voiceprint feature to be output corresponding to the foregoing voiceprint feature.
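As a rough sketch of this synthesis step, the snippet below uses the off-the-shelf pyttsx3 engine and approximates the emotional component with speaking rate and volume; this proxy mapping is an assumption, not the method fixed by the disclosure:

```python
# Sketch: synthesize the content to be output with a crude emotional tint.
import pyttsx3

def synthesize(text: str, emotion: str):
    engine = pyttsx3.init()
    if emotion == "positive":
        engine.setProperty("rate", 180)   # livelier delivery
        engine.setProperty("volume", 1.0)
    else:
        engine.setProperty("rate", 140)
        engine.setProperty("volume", 0.8)
    engine.say(text)
    engine.runAndWait()
```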
- after the electronic device generates the voice signal to be output, it outputs that signal as speech. For example, referring to FIG. 3 and taking a voiceprint feature that includes only an emotional feature component as an example, when a user says "I am unhappy" with a negative emotion, the electronic device will obtain the corresponding content to be output as "Don't be unhappy, take me out to play", and the corresponding voiceprint feature to be output as "positive emotion".
- the electronic device then performs speech synthesis based on "Don't be unhappy, take me out to play" and "positive emotion" to obtain the voice signal to be output.
- when this voice signal is output, if the electronic device is regarded as a "person", that "person" says "Don't be unhappy, take me out to play" with a positive emotion to comfort the user.
- as can be seen from the above, the electronic device in the embodiment of the present application can collect a voice signal from the external environment, acquire the voice content and voiceprint feature included in the collected voice signal, and then generate a voice signal to be output according to the acquired voice content and voiceprint feature.
- the voice signal to be output includes the voiceprint feature to be output corresponding to the aforementioned voiceprint feature and the voice content to be output corresponding to the aforementioned voice content; finally, the generated voice signal to be output is output.
- in this way, the electronic device can produce an output voice signal carrying a voiceprint feature that corresponds to the voiceprint feature of the input voice signal, realizing voice output in different utterance modes and thereby improving the flexibility of the electronic device's voice interaction.
- "outputting the generated voice signal to be output” includes:
- when the electronic device outputs the generated voice signal to be output, it first obtains the loudness value (or volume value) of the collected voice signal and uses it as the input loudness value; it then determines the output loudness value corresponding to that input loudness value according to the preset correspondence between input and output loudness values, uses this output loudness value as the target loudness value for the voice signal to be output, and finally outputs the generated voice signal at the target loudness value.
- the correspondence between the input loudness value and the output loudness value can be as follows:

  Lout = k × Lin

  where Lout represents the output loudness value, Lin represents the input loudness value, and k is a corresponding coefficient that can be set by those skilled in the art according to actual needs. For example, when k is set to 1, the output loudness value equals the input loudness value; when k is set to less than 1, the output loudness value will be less than the input loudness value.
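A small sketch of this loudness mapping, assuming RMS energy as the loudness measure (the embodiment does not fix one):

```python
# Sketch: derive the target output loudness from the collected signal.
import numpy as np

def target_loudness(signal: np.ndarray, k: float = 1.0) -> float:
    l_in = np.sqrt(np.mean(signal.astype(np.float64) ** 2))  # input RMS
    return k * l_in  # Lout = k * Lin; k < 1 yields a quieter reply
```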
- the target loudness value corresponding to the output voice signal to be output is determined through the collected loudness value of the voice signal, which can make the voice interaction of the electronic device more compatible with the current scene.
- for example, referring to FIG. 4, when the user and the electronic device are in a conference room and the user whispers, the electronic device also replies in a whisper, thereby avoiding the situation where output at a fixed loudness disturbs others.
- "collecting a voice signal in an external environment” includes:
- the embodiments of the present application continue to provide a solution for collecting a voice signal from a noisy environment.
- when the electronic device is in a noisy environment, if the user sends out a voice signal, the electronic device will collect a noisy voice signal from the external environment.
- the noisy voice signal is formed by the combination of the voice signal sent by the user and the noise signal in the external environment; if the user does not send a voice signal, the electronic device will only collect the noise signal in the external environment. The electronic device buffers the collected noisy voice signals and noise signals.
- taking the start time of the noisy voice signal collection as the end point, the electronic device obtains the noise signal buffered during the preset time period before that point as the historical noise signal corresponding to the noisy voice signal (the preset time length can be set by a person skilled in the art according to actual needs; this embodiment does not specifically limit it, and it can be set to 500 milliseconds, for example).
- for example, if the electronic device starts collecting the noisy voice signal at 500 milliseconds past 16:47:56 on June 13, 2018, it uses the 500-millisecond noise signal buffered from 16:47:56 to 16:47:56 and 500 milliseconds on June 13, 2018 as the historical noise signal corresponding to the noisy voice signal.
- after acquiring the historical noise signal corresponding to the noisy speech signal, the electronic device further acquires the noise signal during the collection of the noisy speech signal according to the acquired historical noise signal.
- the electronic device can predict the noise distribution during the acquisition of the noisy speech signal based on the acquired historical noise signal, thereby obtaining the noise signal during the noisy speech signal acquisition.
- the noise variation in continuous time is usually small.
- the electronic device can use the acquired historical noise signal as the noise signal during the noisy speech signal acquisition.
- if the duration of the historical noise signal is greater than or equal to that of the noisy speech signal, a segment with the same duration as the noisy speech signal can be intercepted from the historical noise signal and used as the noise signal during the collection of the noisy speech signal; if the duration of the historical noise signal is less than that of the noisy speech signal, the historical noise signal can be copied and multiple copies spliced to obtain a noise signal with the same duration as the noisy speech signal, which is used as the noise signal during the collection of the noisy speech signal.
- after acquiring the noise signal during the collection of the noisy voice signal, the electronic device first performs phase-inversion processing on the acquired noise signal, and then superimposes the inverted noise signal on the noisy voice signal to cancel the noise component of the noisy voice signal, obtaining a noise-reduced voice signal. The noise-reduced voice signal is then used as the voice signal collected from the external environment for subsequent processing.
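A compact sketch of this noise-reduction path, which trims or splices the historical noise to the length of the noisy signal and superimposes its inverse; it assumes the noise is stationary enough for the historical segment to stand in for the concurrent noise:

```python
# Sketch: cancel the noise component using the buffered historical noise.
import numpy as np

def denoise(noisy: np.ndarray, historical_noise: np.ndarray) -> np.ndarray:
    if len(historical_noise) >= len(noisy):
        noise = historical_noise[:len(noisy)]       # intercept a segment
    else:
        repeats = int(np.ceil(len(noisy) / len(historical_noise)))
        noise = np.tile(historical_noise, repeats)[:len(noisy)]  # splice copies
    return noisy + (-noise)  # superimpose the phase-inverted noise
```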
- obtaining a noise signal during the acquisition of a noisy voice signal according to the historical noise signal includes:
- the electronic device obtains the historical noise signal, uses the historical noise signal as sample data, and performs model training according to a preset training algorithm to obtain a noise prediction model.
- the training algorithm is a machine learning algorithm.
- the machine learning algorithm can predict the data through continuous feature learning.
- the electronic device can predict the current noise distribution based on the historical noise distribution.
- machine learning algorithms can include: decision tree algorithms, regression algorithms, Bayesian algorithms, neural network algorithms (which can include deep neural network algorithms, convolutional neural network algorithms and recursive neural network algorithms, etc.), clustering algorithms, etc. Which training algorithm is selected as a preset training algorithm for model training can be selected by those skilled in the art according to actual needs.
- for example, suppose the preset training algorithm configured for the electronic device is the Gaussian mixture model algorithm (a regression-type algorithm); the electronic device then uses the historical noise signal as sample data and performs model training according to the Gaussian mixture model algorithm,
- obtaining a Gaussian mixture model (which includes multiple Gaussian units for describing the noise distribution) that is used as the noise prediction model.
- the electronic device then takes the start time and end time of the noisy speech signal collection period as inputs to the noise prediction model, which processes them and outputs the noise signal during the noisy speech signal collection period.
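An illustrative sketch of this Gaussian mixture approach using scikit-learn; framing the historical noise into fixed-length vectors and sampling from the fitted mixture to approximate the noise over the collection period are assumptions, since the embodiment leaves these details open:

```python
# Sketch: fit a Gaussian mixture to historical noise and draw from it.
import numpy as np
from sklearn.mixture import GaussianMixture

FRAME = 128  # samples per frame (an assumed framing choice)

def fit_noise_model(historical_noise: np.ndarray) -> GaussianMixture:
    usable = len(historical_noise) // FRAME * FRAME
    frames = historical_noise[:usable].reshape(-1, FRAME)  # one sample per frame
    return GaussianMixture(n_components=4).fit(frames)

def predict_noise(model: GaussianMixture, num_samples: int) -> np.ndarray:
    frames, _ = model.sample(int(np.ceil(num_samples / FRAME)))
    return frames.reshape(-1)[:num_samples]
```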
- before "generating a voice signal to be output based on the acquired voice content and voiceprint features", the method further includes:
- the preset voiceprint feature may be a voiceprint feature previously entered by the owner, or a voiceprint feature previously entered by another user authorized by the owner. Determining whether the aforementioned voiceprint feature (that is, the voiceprint feature of the voice signal collected from the external environment) matches the preset voiceprint feature amounts to determining whether the user who issued the voice signal is the owner. If the aforementioned voiceprint feature does not match the preset voiceprint feature, the electronic device determines that the user who issued the voice signal is not the owner. If it does match, the electronic device determines that the user who issued the voice signal is the owner, and at this time generates the voice signal to be output according to the acquired voice content and the aforementioned voiceprint feature. For details, refer to the foregoing related description, which is not repeated here.
- in other words, the user who issued the voice signal is identified according to the voiceprint feature of the voice signal, and the voice signal to be output is generated from the acquired voice content and the aforementioned voiceprint feature only when that user is the owner. This prevents the electronic device from erroneously responding to people other than the owner, improving the owner's experience.
- determining whether the aforementioned voiceprint feature matches the preset voiceprint feature includes:
- when the electronic device determines whether the aforementioned voiceprint feature matches the preset voiceprint feature, it can obtain the similarity between the aforementioned voiceprint feature and the preset voiceprint feature, and determine whether the acquired similarity is greater than or equal to a first preset similarity (which can be set by those skilled in the art according to actual needs).
- when the acquired similarity is greater than or equal to the first preset similarity, it is determined that the acquired voiceprint feature matches the preset voiceprint feature; when the acquired similarity is less than the first preset similarity, it is determined that the acquired voiceprint feature does not match the preset voiceprint feature.
- the electronic device may obtain the distance between the aforementioned voiceprint feature and the preset voiceprint feature, and use the obtained distance as the similarity between the aforementioned voiceprint feature and the preset voiceprint feature.
- a person skilled in the art may select any one of characteristic distances (such as Euclidean distance, Manhattan distance, Chebyshev distance, etc.) to measure the distance between the aforementioned voiceprint feature and the preset voiceprint feature.
- for example, the cosine distance between the aforementioned voiceprint feature and the preset voiceprint feature can be obtained, specifically with the following formula:

  e = Σ_{i=1..N} (f_i · g_i) / ( √(Σ_{i=1..N} f_i²) · √(Σ_{i=1..N} g_i²) )

  where e represents the cosine distance between the aforementioned voiceprint feature and the preset voiceprint feature, f represents the aforementioned voiceprint feature, g represents the preset voiceprint feature, N represents the dimensionality of the two features (the aforementioned voiceprint feature and the preset voiceprint feature have the same dimensionality), f_i represents the component of the i-th dimension of the aforementioned voiceprint feature, and g_i represents the component of the i-th dimension of the preset voiceprint feature.
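A short NumPy sketch of the match test implied by this formula; the 95% threshold follows the example given later in this description:

```python
# Sketch: cosine similarity between the collected voiceprint feature f
# and the preset voiceprint feature g, compared against a threshold.
import numpy as np

def voiceprint_matches(f: np.ndarray, g: np.ndarray,
                       threshold: float = 0.95) -> bool:
    e = np.dot(f, g) / (np.linalg.norm(f) * np.linalg.norm(g))
    return e >= threshold  # the first preset similarity
```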
- the method further includes:
- since voiceprint characteristics are closely related to the physiological characteristics of the human body, in daily life, if the user catches a cold, his voice will become hoarse and his voiceprint characteristics will change accordingly. In that case, even if the user who issued the voice signal is the owner, the electronic device cannot recognize him. There are many other situations that may cause the electronic device to fail to identify the owner, which are not repeated here.
- therefore, after the electronic device completes the judgment on the similarity of the voiceprint feature, if the similarity between the aforementioned voiceprint feature and the preset voiceprint feature is less than the first preset similarity, it further judges whether the similarity is greater than or equal to a second preset similarity (the second preset similarity is configured to be smaller than the first preset similarity; a suitable value can be chosen by those skilled in the art according to actual needs, for example, when the first preset similarity is set to 95%, the second preset similarity may be set to 75%).
- when the judgment result is yes, that is, when the similarity between the aforementioned voiceprint feature and the preset voiceprint feature is less than the first preset similarity but greater than or equal to the second preset similarity, the electronic device further obtains its current position information.
- the electronic device can identify whether it is currently in an outdoor or indoor environment according to the strength of the received satellite positioning signal (for example, when the strength of the received satellite positioning signal is lower than a preset threshold, it determines that it is in an indoor environment; when the strength is higher than or equal to the preset threshold, it determines that it is in an outdoor environment). When outdoors, the electronic device can use satellite positioning technology to obtain the current position information; when indoors, it can use indoor positioning technology to obtain the current position information.
- after acquiring the current position information, the electronic device determines whether it is currently within a preset position range according to that position information.
- the preset position range can be configured as a common position range of the owner, such as home and company.
- if the electronic device is currently within the preset position range, it determines that the aforementioned voiceprint feature matches the preset voiceprint feature, and that the user who issued the voice signal is the owner.
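The overall decision logic just described can be sketched as follows; `in_preset_range` is a hypothetical predicate standing in for the position check:

```python
# Sketch: two-tier similarity test with a location-based fallback.
def is_owner(similarity: float, in_preset_range: bool,
             first: float = 0.95, second: float = 0.75) -> bool:
    if similarity >= first:
        return True              # direct voiceprint match
    if second <= similarity < first:
        return in_preset_range   # e.g. at home or at the office
    return False
```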
- the voice processing method may include:
- taking the start time of the noisy voice signal collection as the end point, the electronic device obtains the noise signal buffered during the preset time period before that point as the historical noise signal corresponding to the noisy voice signal (the preset time length can be set by a person skilled in the art according to actual needs; this embodiment does not specifically limit it, and it can be set to 500 milliseconds, for example).
- for example, if the electronic device starts collecting the noisy voice signal at 500 milliseconds past 16:47:56 on June 13, 2018, it uses the 500-millisecond noise signal buffered from 16:47:56 to 16:47:56 and 500 milliseconds on June 13, 2018 as the historical noise signal corresponding to the noisy voice signal.
- after acquiring the historical noise signal corresponding to the noisy speech signal, the electronic device further acquires the noise signal during the collection of the noisy speech signal according to the acquired historical noise signal.
- the electronic device can predict the noise distribution during the acquisition of the noisy speech signal based on the acquired historical noise signal, thereby obtaining the noise signal during the noisy speech signal acquisition.
- the noise variation in continuous time is usually small.
- the electronic device can use the acquired historical noise signal as the noise signal during the noisy speech signal acquisition.
- if the duration of the historical noise signal is greater than or equal to that of the noisy speech signal, a segment with the same duration as the noisy speech signal can be intercepted from the historical noise signal and used as the noise signal during the collection of the noisy speech signal; if the duration of the historical noise signal is less than that of the noisy speech signal, the historical noise signal can be copied and multiple copies spliced to obtain a noise signal with the same duration as the noisy speech signal, which is used as the noise signal during the collection of the noisy speech signal.
- after acquiring the noise signal during the collection of the noisy voice signal, the electronic device first performs phase-inversion processing on the acquired noise signal, and then superimposes the inverted noise signal on the noisy voice signal to cancel the noise component of the noisy voice signal, obtaining a noise-reduced voice signal. The noise-reduced voice signal is then used as the voice signal to be processed in subsequent steps.
- after the electronic device obtains the voice signal to be processed, it first determines whether a speech parsing engine exists locally. If one exists, the electronic device inputs the aforementioned voice signal into the local speech parsing engine for speech parsing to obtain the parsed text. Here, parsing the voice signal means converting it from "audio" to "text".
- after the electronic device parses the voice signal and obtains the parsed text, it can extract the voice content included in the voice signal from that text. For example, referring to FIG. 2, when a user speaks the voice "What is the weather tomorrow", the electronic device collects the corresponding voice signal, performs voice parsing on it to obtain the corresponding parsed text, and extracts the speech content "What is the weather tomorrow" from that text.
- when the electronic device determines that no voice parsing engine exists locally, it sends the aforementioned voice signal to a server providing a voice parsing service, instructing the server to parse the voice signal and return the resulting parsed text.
- the electronic device can extract the speech content included in the speech signal from the speech analysis text.
- in addition to acquiring the voice content included in the aforementioned voice signal, the electronic device also acquires the voiceprint feature included in the aforementioned voice signal.
- the voiceprint features include at least one of: spectral feature components, cepstral feature components, formant feature components, pitch feature components, reflection coefficient feature components, tone feature components, speech rate feature components, emotional feature components, prosodic feature components, and rhythm feature components.
- generate a voice signal to be output according to the acquired voice content and voiceprint features, where the voice signal to be output includes the voiceprint feature to be output corresponding to the aforementioned voiceprint feature and the voice content to be output corresponding to the aforementioned voice content.
- that is, the electronic device obtains the voice content and voiceprint features included in the voice signal, and then obtains the corresponding voice content to be output according to the preset correspondence among voice content, voiceprint features, and voice content to be output, together with the acquired voice content and voiceprint features.
- the correspondence between the speech content, voiceprint features, and speech content to be output can be set by those skilled in the art according to actual needs, where a tone word that does not affect semantics can be added to the speech content to be output.
- taking a voiceprint feature that includes only an emotional feature component as an example,
- when a user says "What's the weather tomorrow" with a neutral emotion, the electronic device will obtain the corresponding content to be output as "Tomorrow will be clear, suitable for going out"; for another example, when the user says "I'm unhappy" with a negative emotion, the electronic device will obtain the corresponding content to be output as "Don't be unhappy, take me out to play."
- the electronic device also obtains a corresponding voiceprint feature to be output according to a preset correspondence relationship between the voiceprint feature and the voiceprint feature to be output, and the obtained voiceprint feature.
- the correspondence between the voiceprint features and the voiceprint features to be output can be set by those skilled in the art according to actual needs, and this application does not specifically limit this.
- the emotions to be output corresponding to the negative emotions can be set as positive emotions
- the emotions to be output corresponding to the neutral emotions are neutral emotions
- the emotions to be output corresponding to the positive emotions are positive emotions.
- after the electronic device obtains the voice content to be output corresponding to the voice content and the voiceprint feature, and the voiceprint feature to be output corresponding to the voiceprint feature, it performs speech synthesis according to the voice content to be output and the voiceprint feature to be output.
- the voice signal to be output is obtained by this synthesis, and it includes the voice content to be output corresponding to the foregoing voice content and voiceprint feature, and the voiceprint feature to be output corresponding to the foregoing voiceprint feature.
- after generating the voice signal to be output, the electronic device first obtains the loudness value (or volume value) of the collected voice signal.
- the electronic device uses this loudness value as the input loudness value, then determines the corresponding output loudness value according to the preset correspondence between input and output loudness values, uses that output loudness value as the target loudness value for the voice signal to be output, and then outputs the generated voice signal at the target loudness value.
- the correspondence between the input loudness value and the output loudness value can be as follows:

  Lout = k × Lin

  where Lout represents the output loudness value, Lin represents the input loudness value, and k is a corresponding coefficient that can be set by those skilled in the art according to actual needs. For example, when k is set to 1, the output loudness value equals the input loudness value; when k is set to less than 1, the output loudness value will be less than the input loudness value.
- a voice processing device is also provided.
- FIG. 6 is a schematic structural diagram of a voice processing apparatus 400 according to an embodiment of the present application.
- the voice processing device is applied to an electronic device.
- the voice processing device includes a collection module 401, an acquisition module 402, a generating module 403, and an output module 404, as follows:
- the collection module 401 is configured to collect a voice signal from the external environment.
- the acquisition module 402 is configured to acquire the voice content and voiceprint features included in the collected voice signal.
- the generating module 403 is configured to generate a voice signal to be output according to the acquired voice content and voiceprint features, where the voice signal to be output includes the voiceprint feature to be output corresponding to the aforementioned voiceprint feature and the voice content to be output corresponding to the aforementioned voice content.
- the output module 404 is configured to output the generated voice signal to be output.
- the output module 404 may be configured to:
- a voice signal to be output is output.
- the collection module 401 may be configured to:
- phase-invert the acquired noise signal and superimpose it on the noisy voice signal, using the resulting noise-reduced voice signal as the collected voice signal.
- the collection module 401 may be configured to:
- predict the noise signal during the collection of the noisy speech signal according to a noise prediction model.
- the generating module 403 may be configured to:
- a voice signal to be output is generated according to the acquired voice content and the aforementioned voiceprint feature.
- the generating module 403 may be configured to:
- the generating module 403 may be configured to:
- the feature distance is used as the similarity between the voiceprint feature and the preset voiceprint feature.
- the generating module 403 may be configured to:
- the obtaining module 402 may be configured to:
- the voice processing device 400 may be integrated in an electronic device, such as a mobile phone, a tablet computer, or the like.
- the above modules can be implemented as independent entities, or can be arbitrarily combined, and implemented as the same or several entities.
- the specific implementation of the above units can refer to the previous embodiments, and will not be repeated here.
- an electronic device is also provided.
- the electronic device 500 includes a processor 501 and a memory 502.
- the processor 501 is electrically connected to the memory 502.
- the processor 501 is the control center of the electronic device 500. It connects the various parts of the entire electronic device through various interfaces and lines, and executes the various functions of the electronic device 500 and processes data by running or loading the computer program stored in the memory 502 and calling the data stored in the memory 502.
- the memory 502 may be configured to store software programs and modules.
- the processor 501 executes various functional applications and data processing by running computer programs and modules stored in the memory 502.
- the memory 502 may mainly include a storage program area and a storage data area, where the storage program area may store an operating system and the computer program required for at least one function (such as a sound playback function or an image playback function), and the storage data area may store data created according to the use of the electronic device, etc.
- the memory 502 may include a high-speed random access memory, and may further include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory 502 may further include a memory controller to provide the processor 501 with access to the memory 502.
- the processor 501 in the electronic device 500 loads instructions corresponding to the processes of one or more computer programs into the memory 502, and runs the computer programs stored in the memory 502 to realize various functions, as follows:
- the voice signal to be output includes voiceprint feature to be output corresponding to the aforementioned voiceprint feature, and voice content to be output corresponding to the aforementioned voice content;
- the electronic device 500 may further include a display 503, a radio frequency circuit 504, an audio circuit 505, and a power source 506.
- the display 503, the radio frequency circuit 504, the audio circuit 505, and the power supply 506 are electrically connected to the processor 501, respectively.
- the display 503 may be used to display information input by the user or information provided to the user and various graphical user interfaces. These graphical user interfaces may be composed of graphics, text, icons, videos, and any combination thereof.
- the display 503 may include a display panel.
- the display panel may be configured in the form of a liquid crystal display (Liquid Crystal Display, LCD), or an organic light emitting diode (Organic Light-Emitting Diode, OLED).
- the radio frequency circuit 504 may be used to transmit and receive radio frequency signals to establish wireless communication with a network device or other electronic device through wireless communication, and transmit and receive signals to and from the network device or other electronic device.
- the audio circuit 505 may be used to provide an audio interface between the user and the electronic device through a speaker or a microphone.
- the power source 506 may be used to power various components of the electronic device 500.
- the power supply 506 may be logically connected to the processor 501 through a power management system, so as to implement functions such as management of charging, discharging, and power consumption management through the power management system.
- the electronic device 500 may further include a camera, a Bluetooth module, and the like, and details are not described herein again.
- the processor 501 may execute:
- a voice signal to be output is output.
- when collecting a voice signal in the external environment, the processor 501 may execute:
- phase-inverting the acquired noise signal and superimposing it on the noisy voice signal, using the resulting noise-reduced voice signal as the collected voice signal.
- the processor 501 may execute:
- predicting the noise signal during the collection of the noisy speech signal according to a noise prediction model.
- when generating a voice signal to be output based on the acquired voice content and voiceprint features, the processor 501 may execute:
- a voice signal to be output is generated according to the acquired voice content and the aforementioned voiceprint feature.
- the processor 501 may further execute:
- the processor 501 may execute:
- the feature distance is used as the similarity between the voiceprint feature and the preset voiceprint feature.
- the processor 501 may further execute:
- when acquiring the voice content included in the collected voice signal, the processor 501 may execute:
- An embodiment of the present application further provides a storage medium.
- the storage medium stores a computer program, and when the computer program is run on a computer, the computer is caused to execute the voice processing method in any of the foregoing embodiments, for example: collecting a voice signal from the external environment; acquiring the voice content and voiceprint features included in the collected voice signal; generating a voice signal to be output according to the acquired voice content and voiceprint features, the voice signal to be output including the voiceprint feature to be output corresponding to the aforementioned voiceprint feature and the voice content to be output corresponding to the aforementioned voice content; and outputting the generated voice signal to be output.
- the storage medium may be a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM).
- the computer program may be stored in a computer-readable storage medium, such as the memory of an electronic device, and executed by at least one processor in the electronic device; the execution process may include the flow of an embodiment of the voice processing method.
- the storage medium may be a magnetic disk, an optical disk, a read-only memory, a random access memory, or the like.
- for the voice processing device of the embodiment of the present application, its functional modules may be integrated into one processing chip, each module may exist separately physically, or two or more modules may be integrated into one module.
- the above integrated modules can be implemented in the form of hardware or in the form of software functional modules. If an integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disk.
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Acoustics & Sound (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Quality & Reliability (AREA)
- Child & Adolescent Psychology (AREA)
- General Health & Medical Sciences (AREA)
- Hospice & Palliative Care (AREA)
- Psychiatry (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
A voice processing method is disclosed, comprising: collecting a voice signal from an external environment (101); acquiring the voice content and voiceprint features included in the collected voice signal (102); generating, on the basis of the acquired voice content and voiceprint features, a voice signal to be output, the voice signal to be output comprising voiceprint features to be output corresponding to the aforementioned voiceprint features and voice content to be output corresponding to the aforementioned voice content (103); and outputting the generated voice signal to be output (104). A voice processing apparatus, a storage medium, and an electronic device are also disclosed.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810631577.0A CN108922525B (zh) | 2018-06-19 | 2018-06-19 | Voice processing method and device, storage medium and electronic device
CN201810631577.0 | 2018-06-19 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019242414A1 true WO2019242414A1 (fr) | 2019-12-26 |
Family
ID=64421230
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/085543 WO2019242414A1 (fr) | 2019-05-05 | Voice processing method and apparatus, storage medium and electronic device
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108922525B (fr) |
WO (1) | WO2019242414A1 (fr) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108922525B (zh) * | 2018-06-19 | 2020-05-12 | Oppo广东移动通信有限公司 | Voice processing method and device, storage medium and electronic device |
CN109817196B (zh) * | 2019-01-11 | 2021-06-08 | 安克创新科技股份有限公司 | Noise cancellation method, apparatus, system, device and storage medium |
CN110288989A (zh) * | 2019-06-03 | 2019-09-27 | 安徽兴博远实信息科技有限公司 | Voice interaction method and system |
CN110400571B (zh) * | 2019-08-08 | 2022-04-22 | Oppo广东移动通信有限公司 | Audio processing method and device, storage medium and electronic device |
CN110767229B (zh) * | 2019-10-15 | 2022-02-01 | 广州国音智能科技有限公司 | Voiceprint-based audio output method, apparatus, device and readable storage medium |
CN110634491B (zh) * | 2019-10-23 | 2022-02-01 | 大连东软信息学院 | Series feature extraction system and method for general speech tasks in speech signals |
CN111933138B (zh) * | 2020-08-20 | 2022-10-21 | Oppo(重庆)智能科技有限公司 | Voice control method, apparatus, terminal and storage medium |
CN115497480A (zh) * | 2021-06-18 | 2022-12-20 | 海信集团控股股份有限公司 | Voice cloning method, apparatus, device and medium |
CN114678003A (zh) * | 2022-04-07 | 2022-06-28 | 游密科技(深圳)有限公司 | Speech synthesis method, apparatus, electronic device and storage medium |
CN115273852A (zh) * | 2022-06-21 | 2022-11-01 | 北京小米移动软件有限公司 | Voice response method, apparatus, readable storage medium and chip |
- 2018-06-19: CN CN201810631577.0A patent/CN108922525B/zh not_active Expired - Fee Related
- 2019-05-05: WO PCT/CN2019/085543 patent/WO2019242414A1/fr active Application Filing
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103165131A (zh) * | 2011-12-17 | 2013-06-19 | 富泰华工业(深圳)有限公司 | Voice processing system and voice processing method |
CN103259908A (zh) * | 2012-02-15 | 2013-08-21 | 联想(北京)有限公司 | Mobile terminal and intelligent control method thereof |
CN103838991A (zh) * | 2014-02-20 | 2014-06-04 | 联想(北京)有限公司 | Information processing method and electronic device |
US20170069317A1 (en) * | 2015-09-04 | 2017-03-09 | Samsung Electronics Co., Ltd. | Voice recognition apparatus, driving method thereof, and non-transitory computer-readable recording medium |
CN105488227A (zh) * | 2015-12-29 | 2016-04-13 | 惠州Tcl移动通信有限公司 | Electronic device and method for processing audio files based on voiceprint features |
CN106128467A (zh) * | 2016-06-06 | 2016-11-16 | 北京云知声信息技术有限公司 | Voice processing method and device |
CN207149252U (zh) * | 2017-08-01 | 2018-03-27 | 安徽听见科技有限公司 | Voice processing system |
CN107729433A (zh) * | 2017-09-29 | 2018-02-23 | 联想(北京)有限公司 | Audio processing method and device |
CN108922525A (zh) * | 2018-06-19 | 2018-11-30 | Oppo广东移动通信有限公司 | Voice processing method and device, storage medium and electronic device |
Also Published As
Publication number | Publication date |
---|---|
CN108922525A (zh) | 2018-11-30 |
CN108922525B (zh) | 2020-05-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2019242414A1 (fr) | Voice processing method and apparatus, storage medium and electronic device | |
CN110136692B (zh) | Speech synthesis method, apparatus, device and storage medium | |
US20130211826A1 | Audio Signals as Buffered Streams of Audio Signals and Metadata | |
CN110473546B (zh) | Media file recommendation method and device | |
WO2021008538A1 (fr) | Voice interaction method and related device | |
KR20190042918A (ko) | Electronic device and operation method thereof | |
CN108806684B (zh) | Position prompting method and device, storage medium and electronic device | |
CN107799126A (zh) | Voice endpoint detection method and device based on supervised machine learning | |
CN111583944A (zh) | Voice changing method and device | |
CN108711429B (zh) | Electronic device and device control method | |
CN111508511A (zh) | Real-time voice changing method and device | |
CN112840396A (zh) | Electronic device for processing user utterance and control method thereof | |
CN110265011B (zh) | Interaction method for an electronic device and electronic device thereof | |
CN108962241B (zh) | Position prompting method and device, storage medium and electronic device | |
WO2022147692A1 (fr) | Voice instruction recognition method, electronic device and non-transitory computer-readable storage medium | |
WO2022057759A1 (fr) | Voice conversion method and related device | |
CN117059068A (zh) | Voice processing method and device, storage medium and computer device | |
CN114154636A (zh) | Data processing method, electronic device and computer program product | |
WO2019242415A1 (fr) | Position prompting method and device, storage medium and electronic device | |
WO2019228140A1 (fr) | Instruction execution method and apparatus, storage medium and electronic device | |
CN109064720B (zh) | Position prompting method and device, storage medium and electronic device | |
KR102114365B1 (ko) | Speech recognition method and apparatus | |
CN108989551B (zh) | Position prompting method and device, storage medium and electronic device | |
WO2020102943A1 (fr) | Method and apparatus for generating a gesture recognition model, storage medium and electronic device | |
CN111696566B (zh) | Voice processing method, device and medium | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 19823564 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 19823564 Country of ref document: EP Kind code of ref document: A1 |