WO2021136298A1 - Voice processing method and apparatus, smart device, and storage medium
- Publication number: WO2021136298A1
- Application: PCT/CN2020/141038 (CN2020141038W)
- Authority: WIPO (PCT)
- Prior art keywords: voice, information, recognized, type, segment
Classifications
- G10L17/22 (Speaker identification or verification techniques): Interactive procedures; Man-machine interfaces
- G10L15/02 (Speech recognition): Feature extraction for speech recognition; Selection of recognition unit
- G10L15/16 (Speech recognition; Speech classification or search): Speech classification or search using artificial neural networks
- G10L15/22 (Speech recognition): Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L25/78 (Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00): Detection of presence or absence of voice signals
- G10L2015/225: Feedback of the input speech
- G10L2025/783: Detection of presence or absence of voice signals based on threshold decision
Definitions
- The present invention relates to the technical field of intelligent robots, and in particular to a voice processing method and apparatus, a smart device, and a storage medium.
- Smart devices such as smart robots and smart speakers are often configured to conduct continuous conversations with users. After waking the smart device up, the user can carry out multiple voice interactions with it, with no need to wake it up again between interactions.
- For example, the user can send out the voice message "How is the weather today", and the smart device broadcasts the queried weather conditions to the user. The user can then send out the voice message "Where is the Starbucks", so that the smart device continues by broadcasting the queried Starbucks location to the user. The smart device remains in the wake-up state between the two voice messages "How is the weather today" and "Where is the Starbucks".
- However, when the smart device is awake, it may pick up the voice information it broadcasts itself and respond to it as though it were voice information sent by the user; that is, the smart device can mistake its own machine sound for the user's voice. This produces erroneous "self-questioning and self-answering" behavior, which affects the user experience.
- The purpose of the embodiments of the present invention is to provide a voice processing method, apparatus, smart device, and storage medium, so as to improve the accuracy of recognizing the sound type of voice information.
- The specific technical solutions are as follows:
- In a first aspect, an embodiment of the present invention provides a voice processing method, the method including:
- acquiring the voice information to be recognized collected by the smart device, and the broadcast status information corresponding to each voice segment contained in the voice information to be recognized, where the broadcast status information corresponding to each voice segment indicates whether the smart device was performing a voice broadcast when the voice segment was collected; and
- determining the sound type of the voice information to be recognized based on the acquired broadcast status information.
- Optionally, the step of determining the sound type of the voice information to be recognized based on the acquired broadcast status information includes: if the acquired broadcast status information indicates that the smart device was not performing a voice broadcast when the voice segments were collected, determining that the sound type of the voice information to be recognized is human voice.
- Optionally, the step of determining the sound type of the voice information to be recognized based on the acquired broadcast status information includes: determining, from the acquired broadcast status information, the first quantity of the first type of information, where the first type of information indicates that the smart device was not performing a voice broadcast when the corresponding voice segment was collected; determining the proportion information of the first type of information based on the first quantity of the first type of information; and determining the sound type of the voice information to be recognized according to the relationship between the proportion information and the set threshold.
- Optionally, the step of determining the proportion information of the first type of information based on the first quantity of the first type of information includes: calculating the first ratio of the first quantity to the total quantity of acquired broadcast status information and using the first ratio as the proportion information of the first type of information; or determining, from the acquired broadcast status information, the second quantity of the second type of information, calculating the second ratio of the first quantity to the second quantity, and using the second ratio as the proportion information of the first type of information, where the second type of information indicates that the smart device was performing a voice broadcast when the corresponding voice segment was collected.
- Optionally, the step of determining the sound type according to the relationship between the proportion information and the set threshold includes: if the proportion information is greater than the set threshold, determining that the voice information to be recognized is human voice; or, if the proportion information is not greater than the set threshold and it is determined, based on the detection result of the voiceprint model on the voice information to be recognized, that the voice information is human voice, determining that the voice information to be recognized is human voice; or, if the proportion information is not greater than the set threshold and it is determined, based on the detection result of the voiceprint model on the voice information to be recognized, that the voice information is machine sound, determining that the voice information to be recognized is machine sound.
- Optionally, the method further includes: obtaining the text recognition result corresponding to the voice information to be recognized; and, if it is determined that the voice information to be recognized is human voice, performing semantic recognition based on the text recognition result to determine the response information of the voice information to be recognized.
- In a second aspect, an embodiment of the present invention provides a voice processing device, the device including:
- an information acquisition module, configured to acquire the voice information to be recognized collected by the smart device and the broadcast status information corresponding to each voice segment contained in the voice information to be recognized, where the broadcast status information corresponding to each voice segment indicates whether the smart device was performing a voice broadcast when the voice segment was collected; and
- a type determination module, configured to determine the sound type of the voice information to be recognized based on the acquired broadcast status information.
- Optionally, the type determination module is specifically configured to: if the acquired broadcast status information indicates that the smart device was not performing a voice broadcast when the voice segments were collected, determine that the sound type of the voice information to be recognized is human voice.
- Optionally, the type determination module is specifically configured to: determine, from the acquired broadcast status information, the first quantity of the first type of information, where the first type of information indicates that the smart device was not performing a voice broadcast when the corresponding voice segment was collected; determine the proportion information of the first type of information based on the first quantity of the first type of information; and determine the sound type of the voice information to be recognized according to the relationship between the proportion information and the set threshold.
- Optionally, the type determination module is specifically configured to: calculate the first ratio of the first quantity to the total quantity of acquired broadcast status information and use the first ratio as the proportion information of the first type of information; or determine the second quantity of the second type of information from the acquired broadcast status information, calculate the second ratio of the first quantity to the second quantity, and use the second ratio as the proportion information of the first type of information; and determine the sound type of the voice information to be recognized according to the relationship between the proportion information and the set threshold.
- Optionally, the type determination module is specifically configured to: if the proportion information is greater than the set threshold, determine that the voice information to be recognized is human voice; or, if the proportion information is not greater than the set threshold and the detection result of the voiceprint model on the voice information to be recognized indicates human voice, determine that the voice information to be recognized is human voice; or, if the proportion information is not greater than the set threshold and the detection result of the voiceprint model on the voice information to be recognized indicates machine sound, determine that the voice information to be recognized is machine sound.
- Optionally, the device further includes an information feedback module, configured to, if it is determined that the voice information to be recognized is machine sound, feed back, to the smart device, prompt information indicating that the voice information to be recognized is machine sound.
- Optionally, the device further includes: a result obtaining module, configured to obtain the text recognition result corresponding to the voice information to be recognized; and an information determining module, configured to, if it is determined that the voice information to be recognized is human voice, perform semantic recognition based on the text recognition result and determine the response information of the voice information to be recognized.
- In a third aspect, an embodiment of the present invention provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with each other through the communication bus;
- the memory is configured to store a computer program;
- the processor is configured to implement the steps of any voice processing method provided in the first aspect when executing the program stored in the memory.
- In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium having a computer program stored therein; when the computer program is executed by a processor, the steps of any voice processing method provided in the first aspect are implemented.
- In a fifth aspect, an embodiment of the present invention provides a computer program product, which includes a computer program stored on a computer-readable storage medium; the computer program includes program instructions that, when executed by a processor, implement the steps of any voice processing method provided in the first aspect.
- In the solutions provided by the embodiments of the present invention, the voice information to be recognized collected by the smart device contains at least one voice segment, and for each voice segment it can be determined whether the smart device was performing a voice broadcast when that segment was collected, which yields the broadcast status information corresponding to the segment. In this way, when recognizing the sound type of the voice information to be recognized, the sound type can be determined based on the broadcast status information corresponding to each voice segment; that is, the voice broadcast status information of each voice segment can be used to recognize the sound type of the voice information to be recognized. Because the voice broadcast status information reflects whether the received voice information to be recognized contains machine sound generated by the smart device's own voice broadcast, the accuracy of recognizing the sound type of the voice information can be improved.
- FIG. 1 is a schematic flowchart of a voice processing method provided by an embodiment of the present invention
- FIG. 2 is a schematic flowchart of a specific implementation of S101 in FIG. 1;
- FIG. 3 is a schematic flowchart of another specific implementation of S101 in FIG. 1;
- FIG. 4 is a schematic flowchart of a specific implementation of S102 in FIG. 1;
- FIG. 5 is a schematic flowchart of another specific implementation manner of S102 in FIG. 1;
- FIG. 6 is a schematic flowchart of another specific implementation manner of S102 in FIG. 1;
- FIG. 8 is a schematic structural diagram of a voice processing device provided by an embodiment of the present invention.
- FIG. 9 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.
- In the related art, the smart device uses a preset voiceprint model to detect the voice information and determine its sound type, that is, whether the voice information is human voice or machine sound. Since the voiceprint model is trained on the machine sound of the smart device, and that machine sound has a voice spectrum similar to the voices of some users, the voiceprint model may misjudge those users' voices as machine sound. As a result, part of the human voice receives no response from the smart device, which still affects the user experience. Based on this, how to improve the recognition accuracy of the sound type of voice information is a problem to be solved urgently.
- In order to solve the above problems, an embodiment of the present invention provides a voice processing method. The method includes:
- acquiring the voice information to be recognized collected by the smart device and the broadcast status information corresponding to each voice segment contained in the voice information to be recognized, where the broadcast status information corresponding to each voice segment indicates whether the smart device was performing a voice broadcast when the voice segment was collected; and
- determining the sound type of the voice information to be recognized based on the acquired broadcast status information.
- In the above solution, the voice information to be recognized collected by the smart device contains at least one voice segment, and for each voice segment it can be determined whether the smart device was performing a voice broadcast when that segment was collected, yielding the broadcast status information corresponding to the segment. In this way, when recognizing the sound type of the voice information to be recognized, the sound type can be determined based on the broadcast status information corresponding to each voice segment; that is, the voice broadcast status information of each voice segment can be used to recognize the sound type. Because the voice broadcast status information can reflect whether the received voice information to be recognized may contain machine sound generated by the smart device's voice broadcast, the accuracy of recognizing the sound type of the voice information can be improved.
- It should be noted that the execution subject of the voice processing method provided by the embodiment of the present invention may be the smart device that collects the voice information to be recognized; in that case, the recognition method can be completed offline. The smart device may be any smart electronic device that needs to perform voice processing, for example, a smart robot, a smart speaker, a smart phone, a tablet computer, and so on; the embodiment of the present invention does not specifically limit this. The execution subject may also be a server that provides voice processing for the smart device that collects the voice information to be recognized; in that case, the recognition method can be completed online.
- When the execution subject is the server, the smart device, while collecting the various sound signals in the environment, can process the sound signals locally to obtain the voice information to be recognized and the broadcast status information corresponding to each voice segment contained in the voice information to be recognized, and can then upload the voice information to be recognized and each piece of broadcast status information to the server, so that the server can execute the voice processing method provided by the embodiment of the present invention.
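- As an illustration only, the upload from the smart device to the server might pair each voice segment with its broadcast status flag. The sketch below is hypothetical; the field names (device_id, pcm, tts_idle) and the upload function are assumptions, not details taken from the embodiment:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VoiceSegment:
    pcm: bytes     # raw audio samples of this segment (encoding assumed)
    tts_idle: int  # broadcast status: 1 = device was NOT broadcasting (first type),
                   # 0 = device WAS broadcasting (second type)

@dataclass
class RecognitionRequest:
    device_id: str
    segments: List[VoiceSegment]  # in collection order; together they form
                                  # the voice information to be recognized

def upload(request: RecognitionRequest) -> None:
    """Hypothetical client call: send the segments and their broadcast status
    to the server, which then runs the sound type determination."""
    ...
```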
- FIG. 1 is a schematic flowchart of a voice processing method provided by an embodiment of the present invention. As shown in Figure 1, the method may include the following steps:
- S101 Obtain the voice information to be recognized collected by the smart device and the broadcast status information corresponding to each voice segment included in the voice information to be recognized, where the broadcast status information corresponding to each voice segment indicates whether the smart device was performing a voice broadcast when the voice segment was collected;
- What the electronic device determines is the sound type of the received voice information to be recognized; therefore, the electronic device first needs to obtain the voice information to be recognized. When the types of electronic devices differ, the ways in which they obtain the voice information to be recognized may also differ.
- Likewise, the electronic device uses the broadcast status information corresponding to each voice segment contained in the voice information to be recognized to determine its sound type; therefore, the electronic device also needs to obtain the broadcast status information corresponding to each voice segment. Again, the manner of obtaining this information may depend on the type of electronic device.
- Specifically, when the electronic device is a smart device, it can process the sound signals as it collects them from the environment, so as to obtain the voice information to be recognized and the broadcast status information corresponding to each voice segment contained in it; when the electronic device is a server, it can receive the voice information to be recognized uploaded by the corresponding smart device, together with the broadcast status information corresponding to each voice segment contained in it.
- Step S101 will be described in detail later.
- S102 Determine the sound type of the voice information to be recognized based on the acquired broadcast status information.
- Specifically, after obtaining the broadcast status information corresponding to each voice segment, the electronic device can determine the sound type of the voice information to be recognized based on the acquired broadcast status information. The electronic device can perform the above step S102 in a variety of ways, which is not specifically limited in the embodiment of the present invention; specific implementations of step S102 are described with examples below.
- In the above step, the voice broadcast status information of each voice segment included in the voice information to be recognized can be used to recognize the sound type of the voice information to be recognized. Because the voice broadcast status information can reflect whether the received voice information to be recognized contains machine sound generated by the voice broadcast of the smart device, the accuracy of recognizing the sound type of the voice information can be improved.
- In a specific implementation, as shown in FIG. 2, step S101 may include the following steps:
- S201 Perform voice activity detection on the collected sound signal;
- S202 Upon detecting a voice start signal, divide the sound signal collected from the target moment onward into segments according to a preset division rule, obtaining multiple voice segments in turn, where the target moment is the moment when the voice start signal is collected;
- S203 When collecting each voice segment, detect whether the smart device is performing voice broadcast, and determine the broadcast status information of the voice segment according to the detection result;
- S204 Determine the to-be-recognized voice information based on the multiple voice segments obtained by the division.
- Among them, the broadcast status information corresponding to each voice segment is the broadcast status information of the smart device that is read when the voice segment is collected.
- After the smart device is started, it can collect the sound signals in the environment in real time. The collected sound signals may include voice information sent by the user, voice information sent by the smart device itself, and the sound signals of various noises in the environmental background.
- In this way, for each collected sound signal, the smart device can detect whether it can serve as a voice start signal. When a sound signal is detected as the voice start signal, the sound signals collected after the moment the voice start signal was collected can be taken as voice information included in the voice information to be recognized, and the voice start signal itself can serve as the start information of the voice information to be recognized.
- Correspondingly, the smart device can also detect, one by one, the sound signals collected after the moment the voice start signal was collected, to determine whether a sound signal can serve as a voice termination signal. When a sound signal is detected as the voice termination signal, it can be determined that the voice termination signal is the termination information of the voice information to be recognized.
- In this way, the detected voice start signal, the voice termination signal, and the sound signals located between them constitute the voice information to be recognized.
- That is, the voice start signal can serve as the start information of the voice information to be recognized, and the voice termination signal as its termination information.
- It should be noted that the smart device continuously collects the sound in the environment and generates the corresponding sound signals in turn. Therefore, starting from the target moment at which the voice start signal is collected, the smart device can divide the collected sound signal into segments according to the preset division rule, obtaining multiple voice segments in turn, until the voice termination signal is detected.
- Among them, the detected voice termination signal is included in the last voice segment obtained by the division, and the sound signal contained in that last segment may not satisfy the preset division rule. For example, the preset division rule may be that the duration of the collected sound signal reaches a certain preset value, or that the collected sound signal corresponds to one syllable; this is not described in detail in the embodiment of the present invention.
- Specifically, the voice activity detection may be VAD (Voice Activity Detection, voice endpoint detection).
- In this way, the smart device can use VAD to detect the voice start endpoint and the voice termination endpoint in the sound signal, where the voice start endpoint is the voice start signal of the voice information to be recognized and the voice termination endpoint is its voice termination signal.
- Furthermore, after detecting the voice start endpoint, the smart device can divide the collected sound signal into voice segments according to the preset division rule, until the voice termination endpoint is detected and included in the last voice segment of the voice information to be recognized. The smart device can then determine the voice information to be recognized based on the divided voice segments: since the last sound signal in the last voice segment is the termination information of the voice information to be recognized, the sound signals in the voice segments can be arranged in the order of division, and the sound signal combination formed by this arrangement is the voice information to be recognized.
- For example, suppose the preset division rule is that the duration of the collected sound signal reaches 0.1 second. If the voice start endpoint is detected at the 1st second of collection and the currently collected signal is determined to be the voice start signal, then when the 1.1th second is reached, the sound signal collected between the 1st second and the 1.1th second is divided into the first voice segment; when the 1.2th second is reached, the sound signal collected between the 1.1th second and the 1.2th second is divided into the second voice segment; and so on. If the voice termination endpoint is detected at the 1.75th second, the sound signal combination formed by the sound signals collected from the 1st second to the 1.75th second is the voice information to be recognized.
- In a specific implementation, the broadcast status information may be TTS (Text To Speech) status information.
- In practice, there are two cases of voice broadcast. In one case, the smart device converts the text information to be broadcast into voice information through an offline model and then broadcasts the voice information; in the other case, the server converts the text information to be broadcast into voice information through a cloud model and feeds the converted voice information back to the smart device, which then broadcasts the received voice information. The conversion of the text information to be broadcast into voice information is TTS, and this process can be handled either by an offline model in the smart device or online by a cloud model on the server side.
- When the smart device is not performing a voice broadcast while a voice segment is being collected, the TTS status information corresponding to that segment can be recorded as the TTS idle state, and the TTS idle state can be defined as 1; that is, the first type of information is defined as 1. Correspondingly, when the smart device is performing a voice broadcast while a voice segment is being collected, the TTS status information corresponding to that segment can be recorded as the TTS broadcast state, and the TTS broadcast state can be defined as 0; that is, the second type of information is defined as 0.
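- A minimal sketch of the division and labeling described above, assuming 0.1-second segments (the preset division rule of the earlier example), VAD endpoints already detected, and a device_is_broadcasting() callable standing in for reading the smart device's TTS state; all names are illustrative:

```python
SEGMENT_SECONDS = 0.1  # preset division rule from the example above

def collect_segments(samples, sample_rate, start_idx, end_idx, device_is_broadcasting):
    """Divide the signal between the VAD start and termination endpoints into
    0.1 s voice segments, recording the TTS status of each segment:
    1 = TTS idle (first type of information), 0 = broadcasting (second type)."""
    step = int(SEGMENT_SECONDS * sample_rate)
    segments, tts_status = [], []
    for begin in range(start_idx, end_idx, step):
        end = min(begin + step, end_idx)  # last segment may be shorter (e.g. 1.7 s to 1.75 s)
        segments.append(samples[begin:end])
        # On a real device the TTS state is read while the segment is being
        # collected; polling once per segment here is a simplification.
        tts_status.append(0 if device_is_broadcasting() else 1)
    return segments, tts_status
```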
- When the smart device collects the various sound signals in the environment in real time, noise in the environmental background sound may interfere with the smart device's processing of the voice information to be detected. To avoid this, the collected sound signal may be preprocessed to attenuate the collected noise and enhance the sound signals that can serve as the voice information to be detected. That is, before step S201, the method may further include:
- S200 Perform signal preprocessing on the sound signal according to the sound wave shape of the collected sound signal.
- Specifically, after collecting a sound signal, the smart device can obtain the sound wave shape of the sound signal and perform signal preprocessing accordingly: a sound signal whose wave shape matches the wave shape of noise is attenuated, while a sound signal whose wave shape matches that of a signal usable as the voice information to be recognized is enhanced.
- Correspondingly, the above step S201 performs voice activity detection on the sound signal after signal preprocessing.
- It should be noted that the smart device can pre-collect the sound wave shapes of various noises and the various sound wave shapes of signals that can serve as the voice information to be detected, and then use these wave shapes, together with the label corresponding to each wave shape, to train a sound wave detection model. The label corresponding to each wave shape characterizes whether that wave shape is the wave shape of noise or the wave shape of a sound signal usable as the voice information to be detected.
- Among them, the sound signal that can serve as the voice information to be detected may be a voice signal sent by the user or a voice signal broadcast by the smart device; that is, the sound type of such a signal may be human voice or machine sound.
- In another specific implementation, when the execution subject is a server, the sound type determination is completed online, and the above step S101 may proceed as follows.
- The smart device collects the various sound signals in the environment, obtains the voice information to be recognized from the collected sound signals, determines the broadcast status information corresponding to each voice segment contained in the voice information to be recognized, and sends the voice information to be recognized and each piece of broadcast status information to the server, so that the server executes the voice processing method provided by the embodiment of the present invention to determine the sound type of the voice information to be recognized.
- The smart device can determine the voice information to be recognized and the broadcast status information corresponding to each voice segment through the solutions provided in the embodiments shown in FIG. 2 or FIG. 3 above, and then send both to the server. The specific content sent can be each divided voice segment together with the broadcast status information corresponding to each segment, so that what the server receives comprises each voice segment included in the voice information to be recognized and the broadcast status information corresponding to each segment.
- In this way, the server obtains the voice information to be recognized after receiving, in sequence, each voice segment in the voice information to be recognized; in other words, the entirety of the voice segments received by the server is the voice information to be recognized.
- In a specific implementation, step S102 may include the following steps: the electronic device checks whether the acquired broadcast status information indicates that the smart device was not performing a voice broadcast when the voice segments were collected and, if so, determines that the sound type of the voice information to be recognized is human voice.
- Specifically, after obtaining the broadcast status information corresponding to each voice segment contained in the voice information to be recognized, the electronic device can determine whether the broadcast status information indicates that the smart device did not perform a voice broadcast when the voice segments were collected. If the smart device performed no voice broadcast, the voice information to be recognized can only have been sent by the user, so the electronic device can determine that its sound type is human voice.
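- As a sketch, this first implementation reduces to a single pass over the status flags. The function name and the "undetermined" fall-through are illustrative; the embodiment only states the all-idle branch, leaving the other case to the implementations below:

```python
def sound_type_all_idle(tts_status):
    """If every entry is 1 (TTS idle), no broadcast overlapped the collection,
    so the voice information to be recognized can only have come from the user."""
    return "human" if all(flag == 1 for flag in tts_status) else "undetermined"
```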
- In another specific implementation, step S102 may include the following steps:
- S401 Determine the first quantity of the first type of information from the acquired broadcast status information, where the first type of information indicates that the smart device was not performing a voice broadcast when the corresponding voice segment was collected;
- S402 Determine the proportion information of the first type of information based on the first quantity of the first type of information;
- S403 Determine the sound type of the voice information to be recognized according to the relationship between the proportion information and the set threshold.
- After obtaining the voice information to be recognized and the broadcast status information corresponding to each voice segment included in it, the electronic device can determine the first quantity of the first type of information from the broadcast status information. The first quantity determined in this way represents the number of voice segments, among those included in the voice information to be recognized, whose sound type is human voice. After determining the first quantity of the first type of information, the electronic device can determine the proportion information of the first type of information based on it.
- In a specific implementation, step S402 may include the following steps:
- S402A Calculate the first ratio of the first quantity to the total quantity of the acquired broadcast status information, and use the first ratio as the proportion information of the first type of information.
- If the broadcast status information of a voice segment is the first type of information, the smart device was not performing a voice broadcast when the segment was collected; since the segment forms part of the voice information to be recognized, it can be determined that the segment is voice information sent by the user, and that its sound type is human voice.
- If the broadcast status information of a voice segment is the second type of information, the smart device was performing a voice broadcast when the segment was collected; since the segment forms part of the voice information to be recognized, the segment's voice information must contain voice broadcast by the smart device: the segment may consist entirely of voice broadcast by the smart device, or may contain both voice sent by the user and voice broadcast by the smart device. Either situation can lead to the erroneous "self-questioning and self-answering" behavior of the smart device.
- Based on this, the first ratio of the first quantity to the total quantity of acquired broadcast status information can be calculated and used as the proportion information of the first type of information. This proportion information can be understood as the proportion, among the voice segments contained in the voice information to be recognized, of segments whose sound type is human voice. The higher this ratio, the greater the possibility that the sound type of the voice information to be recognized is human voice: when the first ratio is 0, the sound type is more likely to be machine sound; when the first ratio is 1, it is more likely to be human voice.
- When the broadcast status information is TTS status information, with the TTS broadcast status defined as 0 and the TTS idle status defined as 1, the first ratio calculated above is the ratio of the number of 1s in the acquired TTS status information to the total number of acquired TTS status information. For example, the first ratio may be calculated to be 0.9.
- In another specific implementation, step S402 may include the following steps:
- S402B Determine the second quantity of the second type of information from the acquired broadcast status information, calculate the second ratio of the first quantity to the second quantity, and use the second ratio as the proportion information of the first type of information;
- Among them, the second type of information indicates that the smart device was performing a voice broadcast when the corresponding voice segment was collected.
- In this implementation, after determining the first quantity of the first type of information, the electronic device further determines the second quantity of the second type of information from the acquired broadcast status information, then calculates the second ratio of the first quantity to the second quantity and uses the second ratio as the proportion information of the first type of information.
- As explained above, a voice segment whose broadcast status information is the first type of information was collected while the smart device performed no voice broadcast and can be determined to be voice information sent by the user, that is, human voice; a voice segment whose broadcast status information is the second type of information was collected while the smart device was broadcasting, so its voice information contains voice broadcast by the smart device, which may lead to the erroneous "self-questioning and self-answering" behavior, and its sound type can be determined to be machine sound.
- When the broadcast status information is TTS status information, with the TTS broadcast status defined as 0 and the TTS idle status defined as 1, the second ratio calculated above is the ratio of the number of 1s to the number of 0s in the acquired TTS status information.
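- Both proportion variants reduce to counting the 1s and 0s in the acquired TTS status information. In the sketch below, the sample status list is illustrative; the text above only states that the first ratio works out to 0.9 in its example:

```python
def first_ratio(tts_status):
    """First quantity (count of 1s) over the total number of status entries."""
    return tts_status.count(1) / len(tts_status)

def second_ratio(tts_status):
    """First quantity (count of 1s) over the second quantity (count of 0s)."""
    idle, busy = tts_status.count(1), tts_status.count(0)
    return idle / busy if busy else float("inf")  # all-idle case: no broadcast at all

# Illustrative usage: 9 idle segments out of 10 give a first ratio of 0.9,
# matching the example above; the second ratio for the same data is 9.0.
status = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]
assert first_ratio(status) == 0.9
assert second_ratio(status) == 9.0
```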
- S403 Determine the sound type of the voice information to be recognized according to the relationship between the proportion information and the set threshold.
- Specifically, after determining the proportion information of the first type of information, the electronic device can determine the sound type of the voice information to be recognized according to the relationship between the proportion information and the set threshold.
- In a specific implementation, step S403 may include the following steps:
- if the proportion information is greater than the set threshold, determining that the voice information to be recognized is human voice; or,
- if the proportion information is not greater than the set threshold and the detection result of the voiceprint model on the voice information to be recognized indicates human voice, determining that the voice information to be recognized is human voice; or,
- if the proportion information is not greater than the set threshold and the detection result of the voiceprint model on the voice information to be recognized indicates machine sound, determining that the voice information to be recognized is machine sound.
- It can be understood that the larger the determined proportion information of the first type of information, the greater the possibility that the sound type of the voice information to be recognized is human voice. Thus, when the proportion information is greater than the set threshold, it can be determined that the voice information to be recognized is human voice.
- When the proportion information is not greater than the set threshold, the electronic device can obtain the detection result of the voiceprint model on the voice information to be recognized, and when the detection result is human voice, it can be determined that the voice information to be recognized is human voice. Correspondingly, when the proportion information is not greater than the set threshold and the detection result of the voiceprint model on the voice information to be recognized is machine sound, it can be determined that the voice information to be recognized is machine sound.
- It should be noted that the set thresholds mentioned above may be the same or different.
- It should be noted that the electronic device may run the preset voiceprint model on the voice information to be recognized right after step S101, once the voice information has been received, and store the detection result for direct use in step S403; alternatively, it may run the voiceprint model only when, while performing step S403, it finds that the proportion information is not greater than the set threshold, and then use the freshly obtained detection result.
- Likewise, the two checks can be ordered either way. The electronic device may first judge whether the proportion information is greater than the set threshold and, if so, directly determine that the voice information to be recognized is human voice; otherwise, it obtains the detection result of the voiceprint model on the voice information to be recognized, determining human voice if the detection result is human voice and machine sound if the detection result is machine sound.
- Alternatively, the electronic device may first obtain the voiceprint model's detection result for the voice information to be recognized; when the detection result is human voice, it can directly determine that the voice information to be recognized is human voice. When the detection result is machine sound, it can then judge whether the calculated proportion information is greater than the set threshold: if it is greater, the voice information to be recognized is determined to be human voice; if not, it is determined to be machine sound.
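- Pulling the branches together, a sketch of the threshold-plus-voiceprint decision. Here voiceprint_is_human stands in for the voiceprint model's detection result, and it is invoked lazily (the eager variant discussed above would run the model right after step S101 and reuse the stored result):

```python
def sound_type(tts_status, threshold, voiceprint_is_human):
    """Return 'human' or 'machine' for the voice information to be recognized."""
    proportion = tts_status.count(1) / len(tts_status)  # first-ratio variant of S402
    if proportion > threshold:
        return "human"  # enough segments were collected while TTS was idle
    # Proportion not greater than the threshold: defer to the voiceprint model.
    return "human" if voiceprint_is_human() else "machine"
```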
- In a specific implementation, when it is determined that the voice information to be recognized is machine sound, prompt information indicating that the voice information to be recognized is machine sound can be fed back to the smart device. Specifically, the electronic device can feed this prompt information back to the smart device that collected the voice information to be recognized, so that the smart device does not respond to the voice information, thereby avoiding the "self-questioning and self-answering" behavior. For example, the prompt information may be a preset "error code".
- In addition, when it is determined that the voice information to be recognized is machine sound, the electronic device may skip semantic recognition of the text recognition result of the voice information to be recognized; it may even skip voice recognition of the acquired voice information altogether, that is, not obtain a text recognition result corresponding to the voice information to be recognized.
- In a specific implementation, the method provided by the embodiment of the present invention may further include the following steps: obtaining the text recognition result corresponding to the voice information to be recognized; and, if it is determined that the voice information to be recognized is human voice, performing semantic recognition based on the text recognition result to determine the response information of the voice information to be recognized.
- After obtaining the voice information to be recognized, the electronic device can obtain the corresponding text recognition result. When it is determined that the voice information to be recognized is human voice, the electronic device can conclude that the voice information was sent by the user and therefore requires a response. The electronic device can then perform semantic recognition on the obtained text recognition result to determine the response information of the voice information to be recognized.
- Specifically, the electronic device can input the text recognition result into a semantic model, which analyzes its semantics and determines the response result corresponding to those semantics as the response information of the voice information to be recognized. The semantic model recognizes the semantics of the text recognition result, derives the user need expressed by the voice information to be recognized, and acts on that need, for example by obtaining the result corresponding to the user need from a designated website or storage space, or by executing an action corresponding to the user need; the response result obtained in this way serves as the response information of the voice information to be recognized.
- For example, suppose the text recognition information is: "How is the weather today". The semantic model can recognize the keywords "today" and "weather" in the text recognition information and learn the current geographic location through the positioning system, so it can determine the user need as: today's weather conditions at the current geographic location. The semantic model can then automatically connect to a weather-query website and obtain the current weather conditions at the current geographic location, for example, "the weather in Beijing is 23 degrees Celsius"; the acquired weather conditions can then be determined as the response result corresponding to the semantics, that is, as the response information of the voice information to be recognized.
- For another example, suppose the text recognition information is: "Where is Starbucks". The semantic model can recognize the keywords "Starbucks" and "where" in the text recognition information, so it can determine the user need as: the location of Starbucks. The semantic model can then read the location information of Starbucks from the information preset in a preset storage space, for example, "the northeast corner of the third floor of this commercial building"; the location information obtained can then be determined as the response result corresponding to the semantics, that is, as the response information of the voice information to be recognized.
- For another example, suppose the text recognition information is: "two meters ahead". The semantic model can recognize the keywords "forward" and "two meters" in the text recognition information, so it can determine the user need as: move forward two meters. The semantic model can then generate the corresponding control instruction, and the smart device controls itself to move forward a distance of two meters. In this example, the smart device's forward movement is the response result corresponding to the semantics.
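- A toy keyword-matching sketch in the spirit of the three examples above. A real semantic model is far richer; every handler and return value here is an illustrative stub, not the embodiment's implementation:

```python
def current_location():            # stub: a positioning system would supply this
    return "Beijing"

def query_weather(city):           # stub: a weather-query website would supply this
    return f"The weather in {city} is 23 degrees Celsius."

def lookup_stored_location(name):  # stub: read from information preset in storage
    return "Northeast corner of the third floor of this commercial building."

def move_forward(meters):          # stub: the generated motion-control instruction
    print(f"moving forward {meters} m")

def respond(text):
    """Map a text recognition result to response information by keywords."""
    text = text.lower()
    if "today" in text and "weather" in text:
        return query_weather(current_location())
    if "starbucks" in text and "where" in text:
        return lookup_stored_location("Starbucks")
    if ("forward" in text or "ahead" in text) and "two meters" in text:
        move_forward(2)            # here the action itself is the response result
        return "OK"
    return "Sorry, I did not understand."
```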
- In a specific implementation, the voice information to be recognized acquired by the electronic device includes multiple voice segments. Therefore, in order to ensure the accuracy of the obtained text recognition result, the text recognition result corresponding to the voice information to be recognized can be obtained as follows:
- When the first voice segment is received, perform voice recognition on the first voice segment to obtain a temporary text result; whenever a non-first voice segment is received, perform voice recognition on all the voice segments received so far, based on the temporary text result already obtained, to obtain a new temporary text result; when the last voice segment has been received, the text recognition result corresponding to the voice information to be recognized is obtained.
- That is, when the first voice segment is received, voice recognition is performed on it to obtain the temporary text result of the first segment; when the second voice segment is received, the voice information composed of the first and second segments is recognized based on the temporary text result of the first segment, yielding the temporary text result of the first two segments; when the third voice segment is received, the voice information composed of the first three segments is recognized based on the temporary text result of the first two segments, and so on. When the last voice segment is received, the voice information composed of the first through last segments is recognized based on the temporary text result of the first through penultimate segments; the result obtained at this point is clearly the text recognition result corresponding to the voice information to be recognized.
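- A sketch of the accumulate-and-redecode loop just described, assuming a decode(audio, previous) function standing in for the decoder (its actual interface is not specified in the text) and a show callback for displaying each temporary result:

```python
def recognize_incrementally(segments, decode, show):
    """After each segment arrives, re-decode all audio received so far,
    conditioning on the previous temporary text result; the pass made after
    the last segment yields the text recognition result of the whole utterance."""
    received = b""  # concatenation of all voice segments received so far
    text = ""       # current temporary text result
    for segment in segments:
        received += segment
        text = decode(received, previous=text)  # decoder interface is an assumption
        show(text)                              # e.g. display on the device screen
    return text
```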
- In a specific implementation, a voice recognition model in the electronic device may be used to perform voice recognition on the voice information to be recognized. The voice recognition model can be trained on voice samples, where each voice sample includes voice information and the text information corresponding to that voice information. Through training, the voice recognition model can establish the correspondence between voice information and text information, so that after receiving the voice information to be recognized, the trained model can determine the corresponding text recognition result according to the established correspondence. In the embodiment of the present invention, the voice recognition model may be called a decoder.
- In addition, after obtaining each temporary recognition result, the electronic device may output it to the user. When the electronic device is a server, it sends the temporary recognition result to the smart device so that the smart device outputs it through the display screen; when the electronic device is a smart device, it can directly output the temporary recognition result through the display screen.
- Correspondingly, after obtaining the text recognition result, the electronic device may also output it to the user. When the electronic device is a server, it sends the text recognition result to the smart device that sent the voice information to be recognized, so that the smart device outputs it through the display screen; when the electronic device is a smart device, it can directly output the text recognition result through the display screen.
- Furthermore, after determining the response information of the voice information to be recognized, the electronic device may broadcast the response information to the user. When the electronic device is a server, it sends the response information to the smart device that sent the voice information to be recognized, so that the smart device broadcasts it to the user; when the electronic device is a smart device, it can directly broadcast the response information.
- The following describes the voice processing method provided by the embodiment of the present invention through a specific example in which the above-mentioned electronic device is a server. Specifically:
- The smart device collects each sound signal in the environment in real time and performs signal preprocessing on the sound signal according to the sound wave shape of the collected sound signal. The smart device then performs voice activity detection on the preprocessed sound signal.
- Specifically, VAD can be used to detect the voice start endpoint and the voice termination endpoint in the preprocessed sound signal; after the voice start endpoint is detected, the collected sound signals are divided in sequence into voice segments according to the preset division rule, until the voice termination endpoint is detected.
- Each time the server receives a voice segment, the decoder performs voice recognition on all the voice segments received so far to obtain a temporary recognition result and sends the temporary recognition result to the smart device, so that the smart device outputs it through the display screen.
- When the text recognition result of the voice information to be recognized is obtained, the text recognition result is sent to the smart device, so that the smart device outputs it through the display screen.
- Meanwhile, the voiceprint model performs voiceprint detection on all the voice segments received so far and records the detection result; accordingly, once all the voice segments constituting the voice information to be recognized have been received, voiceprint detection has been performed on the voice information to be recognized and the detection result has been recorded.
- After the server receives the TTS status information corresponding to each of the voice segments constituting the voice information to be recognized, it counts the number of 1s in the received TTS status information, calculates the ratio of that number to the total number of received TTS status information, and determines the relationship between the ratio and the set threshold.
- If the ratio is greater than the set threshold, it is determined that the voice information to be recognized is human voice. If the ratio is not greater than the set threshold and, based on the recorded detection result of the voiceprint model, the voice information to be recognized is determined to be human voice, it is determined to be human voice; if the ratio is not greater than the set threshold and, based on the detection result, it is determined to be machine sound, it is determined to be machine sound.
- After receiving the response information, the smart device can output the response information.
- Corresponding to the above method embodiments, an embodiment of the present invention also provides a voice processing device.
- FIG. 8 is a schematic structural diagram of a voice processing device provided by an embodiment of the present invention. As shown in Figure 8, the voice processing device includes the following modules:
- The information acquisition module 810 is configured to acquire the voice information to be recognized collected by the smart device and the broadcast status information corresponding to each voice segment contained in the voice information to be recognized, where the broadcast status information corresponding to each voice segment indicates whether the smart device was performing a voice broadcast when the voice segment was collected;
- The type determining module 820 is configured to determine the sound type of the voice information to be recognized based on the acquired broadcast status information.
- the type determining module 820 is specifically configured to:
- the type determining module 820 is specifically configured to:
- the first quantity of the first type of information From the acquired broadcast status information, determine the first quantity of the first type of information; wherein, the first type of information indicates that the smart device did not perform voice broadcast when the corresponding voice segment was collected; based on the first type of information
- the first quantity of information determines the proportion information of the first type of information; according to the relationship between the proportion information and the set threshold, the sound type of the voice information to be recognized is determined.
- the type determining module 820 is specifically configured to:
- the proportion information of a type of information From the acquired broadcast status information, determine the first quantity of the first type of information; calculate the first ratio of the first quantity to the total quantity of acquired broadcast status information, and use the first ratio as the first ratio
- the proportion information of a type of information determine the sound type of the voice information to be recognized according to the relationship between the proportion information and the set threshold; or,
- the type determining module 820 is specifically configured to:
- if the proportion information is greater than the set threshold, determine that the voice information to be recognized is a human voice; or,
- if the proportion information is not greater than the set threshold, and the detection result of the voiceprint model on the voice information to be recognized indicates a human voice, determine that the voice information to be recognized is a human voice; or,
- if the proportion information is not greater than the set threshold, and the detection result of the voiceprint model on the voice information to be recognized indicates a machine sound, determine that the voice information to be recognized is a machine sound.
- the device further includes:
- the information feedback module is configured to, if the voice information to be recognized is determined to be a machine sound, feed back to the smart device prompt information indicating that the voice information to be recognized is a machine sound.
- the device further includes:
- the result obtaining module is configured to obtain the text recognition result corresponding to the voice information to be recognized;
- the information determining module is configured to, if the voice information to be recognized is determined to be a human voice, perform semantic recognition based on the text recognition result and determine the response information for the voice information to be recognized (a structural sketch of these modules follows below).
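As a rough illustration of how the modules described above could fit together, here is a sketch that wires the type-determining, information feedback, result obtaining, and information determining responsibilities into one pipeline. Every class and method name is hypothetical; only the division of labor is taken from the description above.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VoiceInfo:
    segments: List[bytes]        # raw voice segments, in arrival order
    broadcast_status: List[int]  # per segment: 1 = device was not broadcasting

class VoiceProcessingDevice:
    """Hypothetical stand-in for modules 810/820 and the auxiliary modules."""

    def __init__(self, threshold: float = 0.5):  # illustrative threshold
        self.threshold = threshold

    def determine_sound_type(self, info: VoiceInfo) -> str:
        # Type determining module 820: first quantity of the first type of
        # information (segments collected with no broadcast) over the total.
        first_quantity = sum(1 for s in info.broadcast_status if s == 1)
        proportion = first_quantity / len(info.broadcast_status)
        return "human voice" if proportion > self.threshold else "machine sound"

    def handle(self, info: VoiceInfo, text_result: str) -> str:
        if self.determine_sound_type(info) == "machine sound":
            # Information feedback module: prompt that the input was machine sound.
            return "prompt: voice information is machine sound"
        # Information determining module: placeholder for semantic recognition
        # over the text recognition result from the result obtaining module.
        return f"response to: {text_result}"

device = VoiceProcessingDevice()
info = VoiceInfo(segments=[b"..."] * 4, broadcast_status=[1, 1, 1, 0])
print(device.handle(info, "what's the weather tomorrow"))
```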
- an embodiment of the present invention also provides an electronic device, as shown in FIG. 9, including a processor 901, a communication interface 902, a memory 903, and a communication bus 904, wherein the processor 901, the communication interface 902, and the memory 903 communicate with each other through the communication bus 904;
- the memory 903 is configured to store a computer program;
- the processor 901 is configured to implement the voice processing method provided in the foregoing embodiments of the present invention when executing the program stored in the memory 903.
- the aforementioned voice processing method includes:
- acquiring the voice information to be recognized collected by the smart device and the broadcast status information corresponding to each voice segment contained in the voice information to be recognized, wherein the broadcast status information corresponding to each voice segment represents whether the smart device was performing a voice broadcast while the voice segment was being collected;
- determining, based on the acquired broadcast status information, the sound type of the voice information to be recognized.
- in this way, the voice broadcast status information of each voice segment in the voice information to be recognized can be used to recognize the sound type of the voice information to be recognized.
- because the voice broadcast status information can reflect whether the received voice information to be recognized contains machine sound generated by the smart device's own voice broadcast, the accuracy of recognizing the sound type of the voice information can be improved.
- the communication interface is used for communication between the above-mentioned electronic device and other devices.
- the memory may include random access memory (Random Access Memory, RAM), and may also include non-volatile memory (Non-Volatile Memory, NVM), for example, at least one disk memory.
- the memory may also be at least one storage device located far away from the foregoing processor.
- the above-mentioned processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
- an embodiment of the present invention also provides a computer-readable storage medium in which a computer program is stored; when the computer program is executed by a processor, any voice processing method provided by the foregoing embodiments of the present invention is implemented.
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Telephonic Communication Services (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
The present invention relates to a voice processing method and apparatus, a smart device, and a storage medium. The method comprises: acquiring voice information to be recognized that is collected by a smart device, and broadcast status information corresponding to each voice segment included in the voice information to be recognized, the broadcast status information corresponding to each voice segment representing whether the smart device was performing a voice broadcast when the voice segment was collected; and determining, on the basis of the acquired broadcast status information, the sound type of the voice information to be recognized. Compared with the prior art, the accuracy of recognizing the sound type of voice information can be improved by applying the solution provided in the embodiments of the present invention.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911398330.X | 2019-12-30 | ||
CN201911398330.XA CN113129902B (zh) | 2019-12-30 | 2019-12-30 | Voice processing method and apparatus, electronic device, and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021136298A1 true WO2021136298A1 (fr) | 2021-07-08 |
Family
ID=76687322
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/141038 WO2021136298A1 (fr) | 2019-12-30 | 2020-12-29 | Procédé et appareil de traitement vocal et dispositif intelligent et support de stockage |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN113129902B (fr) |
WO (1) | WO2021136298A1 (fr) |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101937693B (zh) * | 2010-08-17 | 2012-04-04 | 深圳市子栋科技有限公司 | Video and audio playing method and system based on voice commands |
CN102780646B (zh) * | 2012-07-19 | 2015-12-09 | 上海量明科技发展有限公司 | Method, client and system for implementing sound icons in instant messaging |
CN104484045B (zh) * | 2014-12-26 | 2018-07-20 | 小米科技有限责任公司 | Audio playing control method and device |
CN107507620A (zh) * | 2017-09-25 | 2017-12-22 | 广东小天才科技有限公司 | Voice broadcast sound setting method and device, mobile terminal and storage medium |
CN108509176B (zh) * | 2018-04-10 | 2021-06-08 | Oppo广东移动通信有限公司 | Method and device for playing audio data, storage medium and smart terminal |
CN109524013B (zh) * | 2018-12-18 | 2022-07-22 | 北京猎户星空科技有限公司 | Voice processing method and device, medium and smart device |
CN110070866B (zh) * | 2019-04-09 | 2021-12-24 | 阿波罗智联(北京)科技有限公司 | Voice recognition method and device |
-
2019
- 2019-12-30 CN CN201911398330.XA patent/CN113129902B/zh active Active
-
2020
- 2020-12-29 WO PCT/CN2020/141038 patent/WO2021136298A1/fr active Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102750952A (zh) * | 2011-04-18 | 2012-10-24 | 索尼公司 | Sound signal processing device, method and program |
CN103167174A (zh) * | 2013-02-25 | 2013-06-19 | 广东欧珀移动通信有限公司 | Method and device for outputting greeting of mobile terminal, and mobile terminal |
CN108346425A (zh) * | 2017-01-25 | 2018-07-31 | 北京搜狗科技发展有限公司 | Voice activity detection method and device, and voice recognition method and device |
CN106847285A (zh) * | 2017-03-31 | 2017-06-13 | 上海思依暄机器人科技股份有限公司 | Robot and voice recognition method thereof |
CN110097890A (zh) * | 2019-04-16 | 2019-08-06 | 北京搜狗科技发展有限公司 | Voice processing method and device, and device for voice processing |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114187895A (zh) * | 2021-12-17 | 2022-03-15 | 海尔优家智能科技(北京)有限公司 | Voice recognition method and device, equipment and storage medium |
CN114500590A (zh) * | 2021-12-23 | 2022-05-13 | 珠海格力电器股份有限公司 | Voice broadcast method and device for smart device, computer device and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN113129902A (zh) | 2021-07-16 |
CN113129902B (zh) | 2023-10-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107919130B (zh) | Cloud-based voice processing method and device | |
US10438595B2 (en) | Speaker identification and unsupervised speaker adaptation techniques | |
EP3371809B1 (fr) | Voice commands for multiple devices | |
US20180366105A1 (en) | Providing an indication of the suitability of speech recognition | |
JP7348288B2 (ja) | Voice dialogue method, device, and system | |
CN108009303B (zh) | Search method and device based on voice recognition, electronic device and storage medium | |
US20200035241A1 (en) | Method, device and computer storage medium for speech interaction | |
WO2017012242A1 (fr) | Voice recognition method and apparatus | |
WO2021136298A1 (fr) | Voice processing method and apparatus, smart device, and storage medium | |
CN111833902B (zh) | Wake-up model training method, wake-up word recognition method, device and electronic device | |
CN110875059B (zh) | Method and device for judging end of sound pickup, and storage device | |
CN105139858A (zh) | Information processing method and electronic device | |
CN109697981B (zh) | Voice interaction method, device, equipment and storage medium | |
CN107886944A (zh) | Voice recognition method, device, equipment and storage medium | |
CN108648765A (zh) | Method, device and terminal for voice abnormality detection | |
US10950221B2 (en) | Keyword confirmation method and apparatus | |
US8868419B2 (en) | Generalizing text content summary from speech content | |
CN112002349B (zh) | Voice endpoint detection method and device | |
CN110956958A (zh) | Search method and device, terminal equipment and storage medium | |
CN111063356B (zh) | Electronic device response method and system, speaker, and computer-readable storage medium | |
CN111933149A (zh) | Voice interaction method, wearable device, terminal, and voice interaction system | |
WO2024099359A1 (fr) | Voice detection method and apparatus, electronic device, and storage medium | |
CN111077997B (zh) | Point-reading control method in point-reading mode, and electronic device | |
US10818298B2 (en) | Audio processing | |
CN113096651A (zh) | Voice signal processing method and device, readable storage medium, and electronic device | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20910235 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 31.10.2022) |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20910235 Country of ref document: EP Kind code of ref document: A1 |