WO2021136298A1 - Voice processing method and apparatus, and intelligent device and storage medium - Google Patents

Voice processing method and apparatus, and intelligent device and storage medium

Info

Publication number
WO2021136298A1
WO2021136298A1 · PCT/CN2020/141038 · CN2020141038W
Authority
WO
WIPO (PCT)
Prior art keywords
voice
information
recognized
type
segment
Prior art date
Application number
PCT/CN2020/141038
Other languages
French (fr)
Chinese (zh)
Inventor
刘浩
任海海
Original Assignee
北京猎户星空科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京猎户星空科技有限公司 filed Critical 北京猎户星空科技有限公司
Publication of WO2021136298A1 publication Critical patent/WO2021136298A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/22 Interactive procedures; man-machine interfaces
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; selection of recognition unit
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225 Feedback of the input speech
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L2025/783 Detection of presence or absence of voice signals based on threshold decision

Definitions

  • The present invention relates to the technical field of intelligent robots, and in particular to a voice processing method and apparatus, a smart device, and a storage medium.
  • Smart devices, such as smart robots and smart speakers, are usually configured to conduct continuous conversations with users. After waking up the smart device once, the user can perform multiple voice interactions with it, with no need to wake it up again between interactions.
  • For example, the user can issue the voice message "How is the weather today", and the smart device broadcasts the queried weather conditions to the user. The user can then issue the voice message "Where is the Starbucks", and the smart device broadcasts the queried location of the Starbucks. The smart device remains in the wake-up state between the two voice messages "How is the weather today" and "Where is the Starbucks".
  • However, while awake, the smart device may also pick up the voice information it broadcasts itself and respond to it as if it were voice information issued by the user; that is, the smart device can mistake its own machine sound for the user's voice. This produces erroneous "self-questioning and self-answering" behavior, which degrades the user experience.
  • the purpose of the embodiments of the present invention is to provide a voice processing method, device, smart device, and storage medium to improve the recognition accuracy of the voice type of voice information.
  • the specific technical solutions are as follows:
  • an embodiment of the present invention provides a voice processing method, and the method includes:
  • Acquire the voice information to be recognized collected by the smart device and the broadcast status information corresponding to each voice segment contained in the voice information to be recognized, where the broadcast status information corresponding to each voice segment represents whether the smart device was performing a voice broadcast when the voice segment was collected;
  • Based on the acquired broadcast status information, determine the sound type of the voice information to be recognized.
  • Optionally, the step of determining the sound type of the voice information to be recognized based on the acquired broadcast status information includes: if the broadcast status information corresponding to the first voice segment indicates that the smart device was not performing a voice broadcast when that segment was collected, determining that the sound type of the voice information to be recognized is a human voice.
  • Optionally, the step of determining the sound type of the voice information to be recognized based on the acquired broadcast status information includes: determining, from the acquired broadcast status information, a first quantity of the first type of information, where the first type of information indicates that the smart device was not performing a voice broadcast when the corresponding voice segment was collected; determining proportion information of the first type of information based on the first quantity; and determining the sound type of the voice information to be recognized according to the relationship between the proportion information and a set threshold.
  • Optionally, the step of determining the proportion information of the first type of information based on the first quantity of the first type of information includes calculating a ratio involving the acquired broadcast status information. Here, the second type of information indicates that the smart device was performing a voice broadcast when the corresponding voice segment was collected.
  • Optionally, determining the sound type according to the relationship between the proportion information and the set threshold includes: if the proportion information is greater than the set threshold, determining that the voice information to be recognized is a human voice; or, if the proportion information is not greater than the set threshold and the detection result of a voiceprint model on the voice information to be recognized indicates a human voice, determining that the voice information to be recognized is a human voice; or, if the proportion information is not greater than the set threshold and the detection result of the voiceprint model indicates a machine sound, determining that the voice information to be recognized is a machine sound.
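The three-branch decision above can be sketched as a single function. This is an illustrative sketch, not the patented implementation; the function name, signature, and the idea of passing the voiceprint model's verdict as a boolean are assumptions:

```python
def classify_sound_type(proportion: float, threshold: float,
                        voiceprint_says_human: bool) -> str:
    """Decide the sound type of the voice information to be recognized.

    proportion -- share of voice segments whose broadcast status is the
    first type of information (the device was not broadcasting).
    """
    if proportion > threshold:
        # Enough segments were collected while the device was silent,
        # so the voice is judged to be a human voice outright.
        return "human voice"
    # Otherwise defer to the voiceprint model's detection result.
    return "human voice" if voiceprint_says_human else "machine sound"
```

With an assumed threshold of 0.5, a proportion of 0.9 yields a human voice regardless of the voiceprint result, while a proportion of 0.3 makes the voiceprint model's verdict decisive.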
  • Optionally, the method further includes: obtaining a text recognition result corresponding to the voice information to be recognized; and, if it is determined that the voice information to be recognized is a human voice, performing semantic recognition based on the text recognition result and determining response information for the voice information to be recognized.
  • an embodiment of the present invention provides a voice processing device, the device including:
  • the type determination module is configured to determine the sound type of the voice information to be recognized based on the acquired broadcast status information.
  • Optionally, the type determination module is specifically configured to: determine, from the acquired broadcast status information, the first quantity of the first type of information, where the first type of information indicates that the smart device was not performing a voice broadcast when the corresponding voice segment was collected; determine the proportion information of the first type of information based on the first quantity; and determine the sound type of the voice information to be recognized according to the relationship between the proportion information and the set threshold.
  • Optionally, the type determination module is specifically configured to: determine, from the acquired broadcast status information, the first quantity of the first type of information; calculate the first ratio of the first quantity to the total quantity of acquired broadcast status information, and use the first ratio as the proportion information of the first type of information; and determine the sound type of the voice information to be recognized according to the relationship between the proportion information and the set threshold.
  • Optionally, the type determination module is specifically configured to: if the proportion information is greater than the set threshold, determine that the voice information to be recognized is a human voice; or, if the proportion information is not greater than the set threshold and the detection result of the voiceprint model on the voice information to be recognized indicates a human voice, determine that the voice information to be recognized is a human voice; or, if the proportion information is not greater than the set threshold and the detection result of the voiceprint model indicates a machine sound, determine that the voice information to be recognized is a machine sound.
  • the device further includes:
  • The information feedback module is configured to, if it is determined that the voice information to be recognized is a machine sound, feed back to the smart device prompt information indicating that the voice information to be recognized is a machine sound.
  • the device further includes:
  • the result obtaining module is used to obtain the text recognition result corresponding to the voice information to be recognized;
  • the information determining module is configured to, if it is determined that the voice information to be recognized is a human voice, perform semantic recognition based on the text recognition result, and determine the response information of the voice information to be recognized.
  • an embodiment of the present invention provides an electronic device, which is characterized by including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete mutual communication through the communication bus;
  • The memory is configured to store a computer program;
  • the processor is configured to implement the steps of any voice processing method provided in the first aspect when executing the program stored in the memory.
  • In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium having a computer program stored therein; when the computer program is executed by a processor, the steps of any voice processing method provided in the first aspect are implemented.
  • In a fifth aspect, an embodiment of the present invention provides a computer program product. The computer program product includes a computer program stored on a computer-readable storage medium; the computer program includes program instructions that, when executed by a processor, implement the steps of any voice processing method provided in the first aspect.
  • In the above solutions, the voice information to be recognized collected by the smart device contains at least one voice segment, and for each voice segment it can be determined whether the smart device was performing a voice broadcast when that segment was collected, i.e. the broadcast status information corresponding to the segment. In this way, when recognizing the sound type of the voice information to be recognized, the sound type can be determined based on the broadcast status information corresponding to each voice segment. That is, in the solution provided by the embodiment of the present invention, the voice broadcast status information of each voice segment in the voice information to be recognized can be used to recognize the sound type of the voice to be recognized. Because the broadcast status information reflects whether the received voice information to be recognized may contain machine sound generated by the smart device's own voice broadcast, the accuracy of recognizing the sound type of the voice information can be improved.
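The core of this summary can be illustrated end to end. The helper below is a hypothetical sketch: the names, the 0/1 status encoding (1 when the device was not broadcasting), and the 0.5 threshold are assumptions for illustration:

```python
# Broadcast status per collected voice segment:
# 1 = device was NOT broadcasting (first type of information)
# 0 = device was broadcasting    (second type of information)

def proportion_of_first_type(statuses: list) -> float:
    """Share of segments collected while the device was silent."""
    first_quantity = sum(1 for s in statuses if s == 1)
    return first_quantity / len(statuses)

statuses = [1, 1, 0, 1]           # 3 of 4 segments collected in silence
proportion = proportion_of_first_type(statuses)
is_human = proportion > 0.5       # assumed threshold
```

When the proportion does not exceed the threshold, the document's scheme falls back to a voiceprint model rather than declaring a machine sound outright.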
  • FIG. 1 is a schematic flowchart of a voice processing method provided by an embodiment of the present invention
  • FIG. 2 is a schematic flowchart of a specific implementation of S101 in FIG. 1;
  • FIG. 3 is a schematic flowchart of another specific implementation of S101 in FIG. 1;
  • FIG. 4 is a schematic flowchart of a specific implementation of S102 in FIG. 1;
  • FIG. 5 is a schematic flowchart of another specific implementation manner of S102 in FIG. 1;
  • FIG. 6 is a schematic flowchart of another specific implementation manner of S102 in FIG. 1;
  • FIG. 8 is a schematic structural diagram of a voice processing device provided by an embodiment of the present invention.
  • FIG. 9 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.
  • In the related art, the smart device uses a preset voiceprint model to detect voice information and determine its sound type, that is, whether the voice information is a human voice or a machine sound. Since the voiceprint model is trained on the machine sound of the smart device, and the voiceprint used for training is similar to the voice spectrum of some users, the voiceprint model may misjudge the voice of those users as machine sound. As a result, part of the human voice cannot be responded to by the smart device, which still affects the user experience. Based on this, how to improve the recognition accuracy of the sound type of voice information is a problem to be solved urgently.
  • In order to solve the above problem, an embodiment of the present invention provides a voice processing method. The method includes: acquiring the voice information to be recognized collected by the smart device and the broadcast status information corresponding to each voice segment contained in it, where the broadcast status information corresponding to each voice segment represents whether the smart device was performing a voice broadcast when the voice segment was collected; and determining the sound type of the voice information to be recognized based on the acquired broadcast status information.
  • In the above solution, the voice information to be recognized collected by the smart device contains at least one voice segment, and for each voice segment it can be determined whether the smart device was performing a voice broadcast when that segment was collected, i.e. the broadcast status information corresponding to the segment. In this way, when recognizing the sound type of the voice information to be recognized, the sound type can be determined based on the broadcast status information corresponding to each voice segment. That is to say, the voice broadcast status information of each voice segment can be used to recognize the sound type of the voice to be recognized. Because the broadcast status information reflects whether the received voice information to be recognized may contain machine sound generated by the smart device's voice broadcast, the accuracy of recognizing the sound type of the voice information can be improved.
  • The execution subject of the voice processing method provided in the embodiment of the present invention may be the smart device that collects the voice information to be recognized; in this case, the recognition method can be completed offline.
  • The smart device may be any smart electronic device that needs to perform voice processing, for example, a smart robot, a smart speaker, a smart phone, or a tablet computer, which is not specifically limited in the embodiment of the present invention.
  • the execution subject may also be a server that provides voice processing for the smart device that collects the voice information to be recognized, so that the recognition method may be completed online.
  • When the execution subject is the server, the smart device, while collecting various sound signals in the environment, can process the sound signals locally to obtain the voice information to be recognized and the broadcast status information corresponding to each voice segment contained in it, and can then upload the voice information to be recognized and each piece of broadcast status information to the server, so that the server can execute the voice processing method provided by the embodiment of the present invention.
  • FIG. 1 is a schematic flowchart of a voice processing method provided by an embodiment of the present invention. As shown in Figure 1, the method may include the following steps:
  • S101 Obtain the voice information to be recognized collected by the smart device and the broadcast status information corresponding to each voice segment included in the voice information to be recognized;
  • the broadcast status information corresponding to each voice segment represents whether the smart device is performing voice broadcast when the voice segment is collected;
  • What the electronic device determines is the sound type of the received voice information to be recognized; therefore, the electronic device first needs to obtain the voice information to be recognized. When the types of electronic devices differ, the ways in which they obtain the voice information to be recognized may differ. In addition, the electronic device uses the broadcast status information corresponding to each voice segment contained in the voice information to be recognized to determine its sound type; therefore, the electronic device also needs to obtain the broadcast status information corresponding to each voice segment. Similarly, when the types of electronic devices differ, the manner in which they obtain this broadcast status information may also differ.
  • Specifically, when the electronic device is a smart device, it can process the sound signals it collects from the environment to obtain the voice information to be recognized and the broadcast status information corresponding to each voice segment contained in it. When the electronic device is a server, it can receive the voice information to be recognized uploaded by the corresponding smart device together with the broadcast status information corresponding to each voice segment contained in it.
  • step S101 will be described in detail later.
  • S102 Determine the sound type of the voice information to be recognized based on the acquired broadcast status information.
  • the electronic device can determine the sound type of the voice information to be recognized based on the acquired broadcast status information.
  • the electronic device can perform the above step S102 in a variety of ways, which is not specifically limited in the embodiment of the present invention.
  • the specific implementation manner of the above step S102 will be described with an example in the following.
  • the voice broadcast status information of each voice segment included in the voice information to be recognized can be used to recognize the sound type of the voice to be recognized.
  • the voice broadcast status information can reflect whether there is a machine sound generated by the voice broadcast of the smart device in the received voice information to be recognized, the accuracy of the recognition of the voice type of the voice information can be improved.
  • Optionally, step S101 may include the following steps:
  • S201: Perform voice activity detection on the collected sound signals;
  • S202: Starting from the target moment, divide the collected sound signals to obtain multiple voice segments, where the target moment is the moment when the voice start signal is collected;
  • S203: While collecting each voice segment, detect whether the smart device is performing a voice broadcast, and determine the broadcast status information of the voice segment according to the detection result;
  • S204: Determine the voice information to be recognized based on the multiple voice segments obtained by the division.
  • Here, the broadcast status information corresponding to each voice segment is the broadcast status information of the smart device read when the voice segment was collected.
  • After the smart device is started, it can collect sound signals in the environment in real time.
  • the sound signal may include the voice information sent by the user, may also include the voice information sent by the smart device itself, and may also include the sound signals of various noises as background sounds of the environment.
  • For each collected sound signal, the smart device can detect whether it can serve as a voice start signal. When a sound signal is detected as the voice start signal, the sound signals collected after the moment the voice start signal was collected can be used as the voice information contained in the voice information to be recognized, with the voice start signal serving as its start information. The smart device can also detect, one by one, the sound signals collected after that moment to determine whether a signal can serve as a voice termination signal; when one is detected, it is determined to be the termination information of the voice information to be recognized. In this way, the detected voice start signal, the voice termination signal, and the sound signals located between them constitute the voice information to be recognized.
  • the voice start signal may be used as the start information of the voice information to be recognized
  • the voice termination signal is the termination information in the voice information to be recognized.
  • Specifically, the smart device continuously collects sound from the environment and generates corresponding sound signals in sequence. Starting from the target moment at which the voice start signal is collected, the smart device can divide the collected sound signals into segments according to a preset division rule, obtaining multiple voice segments in turn until the voice termination signal is detected. The detected voice termination signal is included in the last voice segment, and the sound signal contained in the last segment may not satisfy the preset division rule. The preset division rule may be, for example, that the duration of the collected sound signal reaches a preset value, or that the collected sound signal corresponds to one syllable; this is not described in detail in the embodiment of the present invention.
  • Optionally, the voice activity detection may be VAD (Voice Activity Detection, also called voice endpoint detection). The smart device can use VAD to detect the voice start endpoint and the voice termination endpoint in the sound signal.
  • the voice initiation endpoint is the voice initiation signal of the voice information to be recognized
  • the voice termination endpoint is the voice termination signal of the voice information to be recognized.
  • In this way, starting from the detection of the voice start endpoint, the smart device can divide the collected sound signals into voice segments according to the preset division rule, until the voice termination endpoint is detected and is divided into the last voice segment contained in the voice information to be recognized.
  • the smart device can determine the voice information to be recognized based on the divided voice segments.
  • The last sound signal in the last voice segment obtained by division is the termination information of the voice information to be recognized. The sound signals in the voice segments can be arranged in the order of division, and the resulting combination of sound signals is the voice information to be recognized.
  • For example, suppose the preset division rule is that the duration of the collected sound signal reaches 0.1 seconds. If the voice start endpoint is detected at the 1st second of collection, the currently collected signal is determined to be the voice start signal. At the 1.1th second, the sound signal collected between the 1st and 1.1th seconds is divided into the first voice segment; at the 1.2th second, the sound signal collected between the 1.1th and 1.2th seconds is divided into the second voice segment, and so on. If the voice termination endpoint is detected at the 1.75th second, the combination of sound signals collected from the 1st to the 1.75th second is the voice information to be recognized.
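The timeline in this example can be reproduced with a short sketch. The helper is illustrative only; the 0.1 s division rule and the endpoints at 1 s and 1.75 s come from the example above, and rounding is used merely to keep the boundaries tidy:

```python
def divide_segments(start: float, end: float, step: float = 0.1):
    """Divide [start, end] into fixed-length voice segments; the last
    segment absorbs the remainder and may be shorter than `step`."""
    bounds, t = [], start
    while t + step < end:
        bounds.append((round(t, 2), round(t + step, 2)))
        t += step
    # The final segment contains the voice termination signal.
    bounds.append((round(t, 2), end))
    return bounds

segments = divide_segments(1.0, 1.75)
# seven full 0.1 s segments plus a final 0.05 s segment
```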
  • Optionally, the broadcast status information may be TTS (Text To Speech) status information. In one case, the smart device converts the text information to be broadcast into voice information through an offline model and then broadcasts the voice information; in another case, the server converts the text information to be broadcast into voice information through a cloud model and feeds the converted voice information back to the smart device, which then broadcasts the received voice information. Here, the conversion of the text information to be broadcast into voice information is TTS; this process can be performed through an offline model in the smart device, or online through a cloud model on the server side.
  • When the smart device is not performing a voice broadcast while a voice segment is collected, the TTS status information corresponding to that segment can be recorded as the TTS idle state, and the TTS idle state can be defined as 1; that is, the first type of information is defined as 1. When the smart device is performing a voice broadcast while a voice segment is collected, the TTS status information corresponding to that segment can be recorded as the TTS broadcast state, and the TTS broadcast state can be defined as 0; that is, the second type of information is defined as 0.
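The 0/1 encoding just described can be written down directly. A minimal sketch, assuming the function name and the per-segment boolean input (neither is specified by the document):

```python
TTS_IDLE = 1       # first type of information: no broadcast in progress
TTS_BROADCAST = 0  # second type of information: broadcast in progress

def tts_status(device_is_broadcasting: bool) -> int:
    """Record the TTS status read while a voice segment is collected."""
    return TTS_BROADCAST if device_is_broadcasting else TTS_IDLE

# One status value is recorded per collected voice segment.
statuses = [tts_status(b) for b in (False, False, True)]
```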
  • In addition, when the smart device collects various sound signals in the environment in real time, in order to prevent noise in the collected environmental background sound from affecting the smart device's detection, the collected sound signals may be preprocessed to attenuate the collected noise and enhance the sound signals that can serve as the voice information to be detected.
  • On this basis, before the above step S201, the method may further include the following step:
  • S200: Perform signal preprocessing on the sound signal according to the sound wave shape of the collected sound signal.
  • Specifically, for each collected sound signal, the smart device can obtain its sound wave shape and perform signal preprocessing on the signal accordingly: sound signals whose wave shape matches the wave shape of noise are attenuated, and sound signals whose wave shape matches that of signals usable as the voice information to be recognized are enhanced.
  • Correspondingly, the above step S201 performs voice activity detection on the sound signal after signal preprocessing.
  • Optionally, the smart device can pre-collect the sound wave shapes of various kinds of noise and of various sound signals that can serve as voice information to be detected, and use these wave shapes and their corresponding labels for model training to obtain a sound wave detection model. The label corresponding to each sound wave shape characterizes whether that shape is the wave shape of noise or of a sound signal usable as voice information to be detected. The sound signal usable as voice information to be detected may be a voice signal issued by the user or a voice signal broadcast by the smart device; that is, its sound type may be a human voice or a machine sound.
  • Optionally, when the execution subject of the embodiment of the present invention is a server, the above step S101 may include the following, and the sound type determination is completed online: the smart device collects various sound signals in the environment, obtains the voice information to be recognized from the collected sound signals, determines the broadcast status information corresponding to each voice segment contained in it, and sends the voice information to be recognized and each piece of broadcast status information to the server, so that the server executes the voice processing method provided in the embodiment of the present invention to determine the sound type of the voice information to be recognized.
  • Specifically, the smart device can determine the voice information to be recognized and the broadcast status information corresponding to each voice segment contained in it through the solution provided in the embodiment shown in FIG. 2 or FIG. 3 above, and send the determined voice information to be recognized and the corresponding broadcast status information to the server.
  • The specific content sent can be each voice segment obtained by division and the broadcast status information corresponding to each voice segment, so that the server simultaneously receives each voice segment included in the voice information to be recognized and its corresponding broadcast status information. In this way, after receiving each voice segment in sequence, the server can obtain the voice information to be recognized; in other words, the entirety of the voice segments received by the server is the voice information to be recognized.
  • step S102 may include the following steps:
  • the electronic device can obtain the broadcast status information corresponding to each voice segment contained in the voice information to be recognized, and in particular the broadcast status information corresponding to the first voice segment; the electronic device can then determine whether that broadcast status information indicates that the smart device was not performing voice broadcast when the segment was collected.
  • if the smart device was not performing voice broadcast, it can be concluded that the voice information to be recognized was uttered by the user; therefore, the electronic device can determine that the sound type of the voice information to be recognized is human voice.
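The first-segment check above can be sketched as follows. The list of boolean flags (True meaning the smart device was idle, i.e. not broadcasting, when the segment was collected) is an assumed representation of the broadcast status information, chosen for illustration; the patent does not prescribe a concrete data format.

```python
def is_human_by_first_segment(idle_flags):
    """Return True (human voice) when the device was not broadcasting
    while the first segment of the utterance was being collected."""
    if not idle_flags:
        raise ValueError("no broadcast status information")
    # Only the flag of the first voice segment matters in this variant.
    return bool(idle_flags[0])
```

This variant is cheap because it inspects a single flag, at the cost of ignoring later segments entirely.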
  • step S102 may include the following steps:
  • S401 Determine the first quantity of the first type of information from the acquired broadcast status information
  • the first type of information indicates that the smart device did not perform voice broadcast when the corresponding voice segment was collected
  • after obtaining the voice information to be recognized and the broadcast state information corresponding to each voice segment it contains, the electronic device can determine the first quantity of the first type of information from the broadcast state information.
  • the determined first quantity represents the number of voice segments, among those contained in the voice information to be recognized, whose sound type is human voice.
  • after determining the first quantity of the first type of information, the electronic device can determine the proportion information of the first type of information based on that first quantity.
  • step S402 may include the following steps:
  • S402A Calculate the first ratio of the first quantity to the total quantity of the acquired broadcast status information, and use the first ratio as the proportion information of the first type of information.
  • if the smart device was not performing voice broadcast when a voice segment was collected, then, since that segment forms part of the voice information to be recognized, it can be determined that the segment was uttered by the user and that its sound type is human voice.
  • if the broadcast status information of a voice segment is the second type of information, it indicates that the smart device was performing voice broadcast while the segment was being collected; since that segment forms part of the voice information to be recognized, it can be determined that the segment contains voice information broadcast by the smart device, either purely the broadcast voice or a mixture of the user's voice and the broadcast voice.
  • the above two situations may lead to the wrong behavior of "self-questioning and self-answering" in smart devices.
  • the first ratio of the first quantity to the total quantity of the acquired broadcast status information can be calculated, and the first ratio can be used as the proportion information of the first type of information.
  • the proportion information of the first type of information calculated above can be understood as the proportion of voice segments whose sound type is human voice among all voice segments contained in the voice information to be recognized.
  • the higher this ratio, the greater the possibility that the sound type of the voice information to be recognized is human voice.
  • a first ratio of 0 indicates that the sound type of the voice information to be recognized is more likely to be machine sound;
  • a first ratio of 1 indicates that the sound type of the voice information to be recognized is more likely to be human voice.
  • for example, when the broadcast status information is TTS status information, with the TTS playing status defined as 0 and the TTS idle status defined as 1, the first ratio calculated above is the ratio of the number of 1 values among the acquired TTS status information to the total number of acquired TTS status information.
  • the first ratio can be calculated to be 0.9.
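The first-ratio calculation of step S402A can be sketched as follows, assuming the TTS status values defined above (0 = playing, 1 = idle); the list-of-integers representation is an assumption for illustration.

```python
def first_ratio(tts_states):
    """Proportion of segments collected while TTS was idle (state 1)
    among all segments of the utterance (first ratio, S402A)."""
    if not tts_states:
        raise ValueError("no TTS status information")
    return tts_states.count(1) / len(tts_states)
```

For instance, nine idle segments out of ten yield the 0.9 ratio mentioned above.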
  • step S402 may include the following steps:
  • S402B Determine the second quantity of the second type of information from the acquired broadcast status information, calculate the second ratio of the first quantity to the second quantity, and use the second ratio as the proportion information of the first type of information;
  • the second type of information indicates that the smart device is performing voice broadcast when the corresponding voice segment is collected.
  • the electronic device may further determine the second quantity of the second type of information from the broadcast status information; it can then calculate the second ratio of the first quantity to the second quantity and use that ratio as the proportion information of the first type of information.
  • if the smart device was not performing voice broadcast when a voice segment was collected, then, since that segment forms part of the voice information to be recognized, it can be determined that the segment was uttered by the user and that its sound type is human voice.
  • if the broadcast status information of a voice segment is the second type of information, it indicates that the smart device was performing voice broadcast while the segment was being collected; since that segment forms part of the voice information to be recognized, it can be determined that the segment contains voice information broadcast by the smart device, either purely the broadcast voice or a mixture of the user's voice and the broadcast voice.
  • both situations may lead to the erroneous "self-questioning and self-answering" behavior in smart devices; in such cases, it can be determined that the sound type of the voice segment is machine sound.
  • for example, when the broadcast status information is TTS status information, with the TTS playing status defined as 0 and the TTS idle status defined as 1, the second ratio calculated above is the ratio of the number of 1 values to the number of 0 values among the acquired TTS status information.
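The second-ratio calculation of step S402B can be sketched in the same representation. Note that the patent does not specify what happens when no segment was collected during playback (no 0 values); returning infinity in that edge case is an assumption made here so the ratio still compares sensibly against a threshold.

```python
def second_ratio(tts_states):
    """Ratio of idle-state segments (1) to playing-state segments (0),
    i.e. the second ratio of step S402B."""
    idle = tts_states.count(1)
    playing = tts_states.count(0)
    if playing == 0:
        # Assumption: no playback segments at all counts as maximally human.
        return float("inf")
    return idle / playing
```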
  • S403 Determine the sound type of the voice information to be recognized according to the relationship between the proportion information and the set threshold.
  • after determining the proportion information of the first type of information, the electronic device can determine the sound type of the voice information to be recognized according to the relationship between the proportion information and the set threshold.
  • step S403 may include the following steps:
  • if the proportion information is greater than the set threshold, it is determined that the voice information to be recognized is a human voice; or,
  • if the proportion information is not greater than the set threshold, and the detection result of the voiceprint model on the voice information to be recognized is a human voice, it is determined that the voice information to be recognized is a human voice; or,
  • if the proportion information is not greater than the set threshold, and the detection result of the voiceprint model on the voice information to be recognized is a machine sound, it is determined that the voice information to be recognized is a machine sound.
  • the greater the determined proportion information of the first type of information, the greater the possibility that the sound type of the voice information to be recognized is human voice.
  • therefore, if the proportion information is greater than the set threshold, it can be determined that the voice information to be recognized is a human voice.
  • otherwise, the electronic device can obtain the detection result produced by the voiceprint model for the voice information to be recognized, so that when the detection result is a human voice, it can still be determined that the voice information to be recognized is a human voice.
  • if the proportion information is not greater than the set threshold, and the detection result of the voiceprint model on the voice information to be recognized is a machine sound, it can be determined that the voice information to be recognized is a machine sound.
  • the thresholds set in the above cases may be the same or different.
  • the electronic device may use a preset voiceprint model to detect the voice information to be recognized as soon as it is received in step S101, so as to obtain the detection result in advance; in this specific implementation, the already obtained detection result can be used directly. Alternatively, when performing the above step S403 and determining that the proportion information is not greater than the set threshold, the electronic device may at that point use the preset voiceprint model to detect the voice information to be recognized, obtain the detection result, and then use it.
  • that is, the electronic device may first determine whether the proportion information is greater than the set threshold, and when it is, determine that the voice information to be recognized is a human voice.
  • the voiceprint model can obtain the detection result of the voice information to be recognized.
  • if the detection result is a human voice, it can be determined that the voice information to be recognized is a human voice;
  • if the detection result is a machine sound, it can be determined that the voice information to be recognized is a machine sound.
  • the voiceprint model may first obtain the detection result of the voice information to be recognized, and when the detection result is a human voice, it may be determined that the voice information to be recognized is a human voice.
  • when the detection result is a machine sound, it can then be judged whether the calculated proportion information is greater than the set threshold: if it is, the voice information to be recognized can be determined to be a human voice; if it is not, the voice information to be recognized can be determined to be a machine sound.
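The combined decision of step S403 can be sketched as follows. The threshold value of 0.5 and the "human"/"machine" string labels for the voiceprint model's verdict are illustrative assumptions, not values fixed by the patent.

```python
def classify_sound_type(proportion, voiceprint_result, threshold=0.5):
    """Decide the sound type from the proportion information of the first
    type of information, falling back to the voiceprint model's detection
    result when the proportion does not exceed the threshold."""
    if proportion > threshold:
        return "human"
    # Below or at the threshold: defer to the voiceprint model.
    return "human" if voiceprint_result == "human" else "machine"
```

The same function covers both orderings described above, since the voiceprint verdict is only consulted when the proportion test alone cannot establish a human voice.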
  • if it is determined that the voice information to be recognized is a machine sound, prompt information for prompting that the voice information to be recognized is a machine sound is fed back to the smart device.
  • when it is determined that the voice information to be recognized is a machine sound, the electronic device can feed back, to the smart device that collected it, prompt information indicating that the voice information to be recognized is a machine sound.
  • the smart device will not respond to the to-be-recognized voice information, thereby avoiding "self-questioning and self-answering" behaviors.
  • the prompt information may be a preset "error code".
  • the electronic device may not perform semantic recognition on the text recognition result of the voice information to be recognized.
  • the electronic device may not perform voice recognition on the acquired voice information to be recognized, that is, the electronic device may not obtain a text recognition result corresponding to the voice information to be recognized.
  • the embodiment of the present invention may further include the following steps:
  • after obtaining the voice information to be recognized, the electronic device can subsequently obtain the corresponding text recognition result.
  • if it is determined that the voice information to be recognized is a human voice, the electronic device can determine that it is voice information sent by the user, and therefore the electronic device needs to respond to it.
  • the electronic device can perform semantic recognition on the obtained text recognition result, thereby determining the response information of the voice information to be recognized.
  • the electronic device can input the text recognition result to the semantic model, so that the semantic model can analyze the semantics of the text recognition result and then determine the response result corresponding to those semantics as the response information of the voice information to be recognized.
  • the semantic model is used to recognize the semantics of the text recognition information, obtain the user need corresponding to the voice information to be recognized, and perform the action corresponding to that need, thereby obtaining the response result corresponding to the semantics as the response information of the voice information to be recognized; for example, the result corresponding to the user need may be obtained from a designated website or storage space, or an action corresponding to the user need may be executed.
  • the text recognition information is: how is the weather today.
  • the semantic model can recognize the keywords "today" and "weather" in the text recognition information, and obtain the current geographic location through the positioning system; it can thus determine the user need as: today's weather conditions at the current geographic location. The semantic model can then automatically connect to a website for querying the weather and obtain the current weather conditions at the current geographic location from that website (for example, the weather in Beijing is 23 degrees Celsius), and determine the acquired weather conditions as the response result corresponding to the semantics, i.e. as the response information of the voice information to be recognized.
  • the text recognition information is: Where is Starbucks.
  • the semantic model can recognize the keywords "Starbucks" and "Where” in the text recognition information.
  • the semantic model can determine the user's needs as: the location of Starbucks.
  • the semantic model can read the location information of Starbucks from the information stored in a preset storage space, for example, the northeast corner of the third floor of this commercial building, and then determine the obtained location information as the response result corresponding to the semantics, i.e. as the response information of the voice information to be recognized.
  • the text recognition information is: two meters ahead.
  • the semantic model can recognize the keywords "forward” and "two meters” in the text recognition information.
  • the semantic model can determine the user need as: move forward two meters; it can then generate the corresponding control instruction, so that the smart device controls itself to move forward a distance of two meters.
  • the action of the smart device moving forward is the response result corresponding to the semantics.
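The keyword-recognition behavior shown in the three examples above can be illustrated with a toy sketch. The keyword table, the intent names, and the substring matching are all invented for illustration; an actual semantic model would be far more sophisticated.

```python
# Hypothetical keyword table mapping an intent to the keywords that
# must all appear in the text recognition result.
INTENT_KEYWORDS = {
    "weather_query": ("weather", "today"),
    "location_query": ("where",),
    "move_command": ("forward", "meters"),
}

def match_intent(text):
    """Return the first intent whose keywords all occur in the text,
    mirroring how the semantic model extracts the user need."""
    lowered = text.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if all(keyword in lowered for keyword in keywords):
            return intent
    return "unknown"
```

Once the intent (user need) is known, the response result would be produced per intent: querying a weather site, reading stored location data, or emitting a motion control instruction.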
  • the voice information to be recognized acquired by the electronic device includes multiple voice segments; therefore, in order to ensure the accuracy of the obtained text recognition result, the manner of obtaining the text recognition result corresponding to the voice information to be recognized can include the following steps:
  • when the first voice segment is received, perform speech recognition on it to obtain a temporary text result; when a subsequent voice segment is received, perform speech recognition on all voice segments received so far, based on the temporary text result already obtained, to obtain a new temporary text result; when the last voice segment has been received, the result obtained is the text recognition result corresponding to the voice information to be recognized.
  • specifically, when the first voice segment is received, speech recognition is performed on it to obtain the temporary text result of the first segment; when the second voice segment is received, the voice information composed of the first and second segments is recognized based on the temporary text result of the first segment, yielding the temporary text result of the first two segments; when the third voice segment is received, the voice information composed of the first three segments is recognized based on the temporary text result of the first two segments, yielding the temporary text result of the first three segments; and so on.
  • when the last voice segment is received, the voice information composed of the first through last segments is recognized based on the temporary text result of the segments up to the penultimate one, yielding the temporary text result of the whole sequence; the result obtained at this point is the text recognition result corresponding to the voice information to be recognized.
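The incremental recognition loop described above can be sketched as follows. The `decode` callable standing in for the decoder, and its `(received_segments, prior_text)` signature, are placeholders assumed for illustration.

```python
def recognize_incrementally(segments, decode):
    """Run recognition again over all segments received so far each time
    a new segment arrives, reusing the previous temporary text result."""
    temp_text = ""
    received = []
    for segment in segments:
        received.append(segment)
        # Recognition covers everything received so far; the prior
        # temporary result lets the decoder refine rather than restart.
        temp_text = decode(list(received), temp_text)
    # After the last segment, temp_text is the final text recognition result.
    return temp_text
```

A trivial stand-in decoder that joins segment labels shows the control flow without a real speech model.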
  • the voice recognition model in the electronic device may be used to perform voice recognition on the voice information to be recognized.
  • the voice recognition model may be obtained by training on voice samples, where each voice sample includes voice information and the text information corresponding to that voice information.
  • through training, the voice recognition model can establish the correspondence between voice information and text information; in this way, after the trained voice recognition model receives the voice information to be recognized, it can determine the corresponding text recognition result according to the established correspondence.
  • the speech recognition model can be called a decoder.
  • the electronic device may output the temporary recognition result to the user.
  • when the electronic device is a smart device, it can directly output the temporary recognition result through the display screen.
  • the electronic device may also output the text recognition result to the user.
  • when the electronic device is a server, it sends the text recognition result to the smart device that sent the voice information to be recognized, so that the smart device outputs the text recognition result through the display screen;
  • when the electronic device is a smart device, it can directly output the text recognition result through the display screen.
  • the electronic device may broadcast the response information to the user.
  • when the electronic device is a server, it sends the response information to the smart device that sent the voice information to be recognized, so that the smart device broadcasts the response information to the user;
  • when the electronic device is a smart device, it can directly broadcast the response information.
  • in the following example, the above-mentioned electronic device is a server. Specifically:
  • the smart device collects each sound signal in the environment in real time, and performs signal preprocessing on the sound signal according to the sound wave shape of the collected sound signal.
  • the smart device performs voice activity detection on the sound signal after signal preprocessing.
  • VAD can be used to detect the voice start endpoint and the voice termination endpoint in the preprocessed sound signal; after the voice start endpoint is detected, the collected sound signals are divided in sequence into voice segments according to the preset division rule, until the voice termination endpoint is detected.
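The endpoint detection and segmentation just described can be illustrated with an energy-threshold sketch. The frame-energy representation, both thresholds, and the fixed three-frame segment length standing in for the "preset division rule" are all illustrative assumptions; a production VAD would use a trained model rather than raw energy.

```python
def segment_stream(frame_energies, start_thr=0.5, stop_thr=0.2, seg_len=3):
    """Start segmenting once energy reaches start_thr (voice start
    endpoint), emit fixed-size segments, and stop once energy drops
    below stop_thr (voice termination endpoint)."""
    segments, current, speaking = [], [], False
    for energy in frame_energies:
        if not speaking:
            if energy >= start_thr:
                speaking = True          # voice start endpoint detected
            else:
                continue                 # still silence before speech
        if energy < stop_thr:
            break                        # voice termination endpoint detected
        current.append(energy)
        if len(current) == seg_len:      # preset division rule (assumed)
            segments.append(current)
            current = []
    if current:                          # flush a final partial segment
        segments.append(current)
    return segments
```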
  • the decoder performs voice recognition on all the currently received voice segments to obtain a temporary recognition result, and sends the temporary recognition result to the smart device, so that the smart device outputs the temporary recognition result through the display screen.
  • when the text recognition result of the voice information to be recognized is obtained, it is sent to the smart device, so that the smart device outputs the text recognition result through the display screen.
  • the voiceprint model performs voiceprint detection on all voice segments currently received and records the detection results; accordingly, by the time all voice segments constituting the voice information to be recognized have been received, voiceprint detection has been performed on the voice information to be recognized and the detection result recorded.
  • after the server receives the TTS status information corresponding to each of the voice segments constituting the voice information to be recognized, it counts the number of 1s in the received TTS status information, calculates the ratio of that number to the total number of received TTS status information, and determines the relationship between the ratio and the set threshold.
  • if the ratio is greater than the set threshold, it is determined that the voice information to be recognized is a human voice.
  • if the ratio is not greater than the set threshold, and the detection result of the voiceprint model on the voice information to be recognized is a human voice, it is determined that the voice information to be recognized is a human voice.
  • if the ratio is not greater than the set threshold, and the detection result of the voiceprint model on the voice information to be recognized is a machine sound, it is determined that the voice information to be recognized is a machine sound.
  • after receiving the response information, the smart device can output it.
  • the embodiment of the present invention also provides a voice processing device.
  • FIG. 8 is a schematic structural diagram of a voice processing device provided by an embodiment of the present invention. As shown in Figure 8, the voice processing device includes the following modules:
  • the information acquisition module 810 is configured to acquire the voice information to be recognized collected by the smart device and the broadcast status information corresponding to each voice segment contained in the voice information to be recognized; wherein the broadcast status information corresponding to each voice segment indicates whether the smart device was performing voice broadcast when that voice segment was collected;
  • the type determining module 820 is configured to determine the sound type of the voice information to be recognized based on the acquired broadcast status information.
  • optionally, the type determining module 820 is specifically configured to: judge whether the broadcast status information corresponding to the first voice segment among the voice segments indicates that the smart device was not performing voice broadcast when that segment was collected; and if so, determine that the sound type of the voice information to be recognized is human voice.
  • optionally, the type determining module 820 is specifically configured to:
  • from the acquired broadcast status information, determine the first quantity of the first type of information, wherein the first type of information indicates that the smart device was not performing voice broadcast when the corresponding voice segment was collected; determine the proportion information of the first type of information based on the first quantity; and determine the sound type of the voice information to be recognized according to the relationship between the proportion information and the set threshold.
  • the type determining module 820 is specifically configured to:
  • from the acquired broadcast status information, determine the first quantity of the first type of information; calculate the first ratio of the first quantity to the total quantity of acquired broadcast status information, and use the first ratio as the proportion information of the first type of information; determine the sound type of the voice information to be recognized according to the relationship between the proportion information and the set threshold; or,
  • from the acquired broadcast status information, determine the second quantity of the second type of information, wherein the second type of information indicates that the smart device was performing voice broadcast when the corresponding voice segment was collected; calculate the second ratio of the first quantity to the second quantity, and use the second ratio as the proportion information of the first type of information; determine the sound type of the voice information to be recognized according to the relationship between the proportion information and the set threshold.
  • the type determination module is specifically configured to:
  • if the proportion information is greater than the set threshold, it is determined that the voice information to be recognized is a human voice; or,
  • if the proportion information is not greater than the set threshold, and the detection result of the voiceprint model on the voice information to be recognized is a human voice, it is determined that the voice information to be recognized is a human voice; or,
  • if the proportion information is not greater than the set threshold, and the detection result of the voiceprint model on the voice information to be recognized is a machine sound, it is determined that the voice information to be recognized is a machine sound.
  • the device further includes:
  • the information feedback module is configured to, if it is determined that the voice information to be recognized is machine sound, feed back to the smart device prompt information for prompting that the voice information to be recognized is machine sound.
  • the device further includes:
  • the result obtaining module is used to obtain the text recognition result corresponding to the voice information to be recognized;
  • the information determining module is configured to, if it is determined that the voice information to be recognized is a human voice, perform semantic recognition based on the text recognition result, and determine the response information of the voice information to be recognized.
  • an embodiment of the present invention also provides an electronic device, as shown in FIG. 9, including a processor 901, a communication interface 902, a memory 903, and a communication bus 904, wherein the processor 901, the communication interface 902, and the memory 903 communicate with each other through the communication bus 904;
  • the memory 903 is used to store computer programs
  • the processor 901 is configured to implement a voice processing method provided in the foregoing embodiment of the present invention when executing a program stored in the memory 903.
  • the aforementioned voice processing method includes:
  • acquire the voice information to be recognized collected by the smart device and the broadcast status information corresponding to each voice segment contained in the voice information to be recognized; wherein the broadcast status information corresponding to each voice segment indicates whether the smart device was performing voice broadcast when that voice segment was collected;
  • determine, based on the acquired broadcast status information, the sound type of the voice information to be recognized.
  • the voice broadcast status information of each voice segment in the voice information to be recognized can be used to recognize the voice type of the voice to be recognized.
  • since the voice broadcast status information can reflect whether the received voice information to be recognized contains machine sound generated by the smart device's voice broadcast, the accuracy of recognizing the sound type of the voice information can be improved.
  • the communication interface is used for communication between the above-mentioned electronic device and other devices.
  • the memory may include random access memory (Random Access Memory, RAM), and may also include non-volatile memory (Non-Volatile Memory, NVM), such as at least one disk storage.
  • the memory may also be at least one storage device located far away from the foregoing processor.
  • the above-mentioned processor can be a general-purpose processor, including a central processing unit (CPU), a network processor (Network Processor, NP), etc.; it can also be a digital signal processor (Digital Signal Processing, DSP), a dedicated integrated Circuit (Application Specific Integrated Circuit, ASIC), Field-Programmable Gate Array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components.
  • the embodiment of the present invention also provides a computer-readable storage medium.
  • when the computer program stored in the computer-readable storage medium is executed by a processor, any voice processing method provided in the foregoing embodiments of the present invention is implemented.

Abstract

Provided are a voice processing method and apparatus, and an intelligent device and a storage medium. The method comprises: acquiring voice information to be recognized that is collected by an intelligent device and broadcast state information corresponding to each voice segment included in the voice information to be recognized, wherein the broadcast state information corresponding to each voice segment represents whether the intelligent device is conducting a voice broadcast when the voice segment is collected; and determining, on the basis of the acquired broadcast state information, the sound type of the voice information to be recognized. Compared with the prior art, the recognition accuracy of the sound type of voice information can be improved by applying the solution provided in the embodiments of the present invention.

Description

一种语音处理方法、装置、智能设备及存储介质Voice processing method, device, intelligent equipment and storage medium
相关申请的交叉引用Cross-references to related applications
This application is based on, and claims priority to, Chinese patent application No. 201911398330.X, filed on December 30, 2019, the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the technical field of intelligent robots, and in particular to a voice processing method and apparatus, a smart device, and a storage medium.
Background
Areas such as shopping malls are usually equipped with smart devices capable of holding continuous conversations with users, for example intelligent robots and smart speakers. After waking such a smart device, a user can carry out multiple rounds of voice interaction with it without having to wake it again between rounds.
For example, after waking the smart device by touch, the user may utter the voice message "What is the weather like today?", whereupon the smart device broadcasts the queried weather conditions to the user. The user may then utter another voice message, "Where is the Starbucks?", and the smart device continues by broadcasting the queried location of the Starbucks to the user. Between the two utterances "What is the weather like today?" and "Where is the Starbucks?", the smart device remains in the awake state, so the user does not need to wake it again.
However, in the above process, while the smart device is awake it may also pick up the voice information that it broadcasts itself and respond to it as though it were voice information uttered by the user; that is, the smart device may mistake its own machine voice for a human voice. The resulting erroneous "asking and answering itself" behavior degrades the user experience.
Accordingly, how to improve the accuracy of recognizing the sound type of voice information is a problem that urgently needs to be solved.
Summary
Embodiments of the present invention aim to provide a voice processing method and apparatus, a smart device, and a storage medium, so as to improve the accuracy of recognizing the sound type of voice information. The specific technical solutions are as follows.
In a first aspect, an embodiment of the present invention provides a voice processing method, the method including:
acquiring to-be-recognized voice information collected by a smart device, and broadcast status information corresponding to each voice segment contained in the to-be-recognized voice information, where the broadcast status information corresponding to each voice segment indicates whether the smart device was performing a voice broadcast while that voice segment was being collected; and
determining the sound type of the to-be-recognized voice information based on the acquired broadcast status information.
Optionally, in a specific implementation, the step of determining the sound type of the to-be-recognized voice information based on the acquired broadcast status information includes:
judging whether the broadcast status information corresponding to the first of the voice segments indicates that the smart device was not performing a voice broadcast while that segment was being collected; and
if so, determining that the sound type of the to-be-recognized voice information is a human voice.
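As a minimal illustrative sketch (not part of the claimed disclosure; the function name and the boolean flag encoding are assumptions introduced here), the first-segment rule above reduces to inspecting the broadcast-status flag of the earliest collected segment:

```python
def is_human_by_first_segment(broadcast_flags):
    """broadcast_flags[i] is True if the smart device was performing a voice
    broadcast while segment i was being collected, and False otherwise."""
    if not broadcast_flags:
        raise ValueError("no voice segments collected")
    # A human speaker must have begun talking while the device was silent,
    # so a non-broadcasting first segment is taken to indicate a human voice.
    return broadcast_flags[0] is False

# Device silent when speech began -> classified as a human voice
print(is_human_by_first_segment([False, True, True]))
```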
Optionally, in a specific implementation, the step of determining the sound type of the to-be-recognized voice information based on the acquired broadcast status information includes:
determining, from the acquired broadcast status information, a first quantity of first-type information, where the first-type information indicates that the smart device was not performing a voice broadcast while the corresponding voice segment was being collected;
determining proportion information of the first-type information based on the first quantity of the first-type information; and
determining the sound type of the to-be-recognized voice information according to the magnitude relationship between the proportion information and a set threshold.
Optionally, in a specific implementation, the step of determining the proportion information of the first-type information based on the first quantity of the first-type information includes:
calculating a first ratio of the first quantity to the total quantity of the acquired broadcast status information, and taking the first ratio as the proportion information of the first-type information; or
determining, from the acquired broadcast status information, a second quantity of second-type information, calculating a second ratio of the first quantity to the second quantity, and taking the second ratio as the proportion information of the first-type information,
where the second-type information indicates that the smart device was performing a voice broadcast while the corresponding voice segment was being collected.
Optionally, in a specific implementation, the step of determining the sound type of the to-be-recognized voice information according to the magnitude relationship between the proportion information and the set threshold includes:
if the proportion information is greater than the set threshold, determining that the to-be-recognized voice information is a human voice; or
if the proportion information is not greater than the set threshold, and a detection result of a voiceprint model on the to-be-recognized voice information indicates that the to-be-recognized voice information is a human voice, determining that the to-be-recognized voice information is a human voice; or
if the proportion information is not greater than the set threshold, and a detection result of a voiceprint model on the to-be-recognized voice information indicates that the to-be-recognized voice information is a machine voice, determining that the to-be-recognized voice information is a machine voice.
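A hedged sketch of the proportion-plus-voiceprint decision chain described in the branches above (the function names and the boolean flag encoding are illustrative assumptions; the voiceprint model is treated here as an opaque boolean verdict rather than any concrete model):

```python
def classify_voice(broadcast_flags, threshold, voiceprint_says_human):
    """broadcast_flags[i] is True if the device was broadcasting while
    segment i was collected (second-type information); False means it was
    silent (first-type information)."""
    first_quantity = sum(1 for flag in broadcast_flags if not flag)
    # First-ratio variant: the first quantity over the total quantity of
    # acquired broadcast status information.
    proportion = first_quantity / len(broadcast_flags)
    # (The second-ratio variant would instead divide first_quantity by the
    # count of second-type information, guarding against division by zero.)
    if proportion > threshold:
        return "human"
    # Proportion not conclusive: defer to the voiceprint model's verdict.
    return "human" if voiceprint_says_human else "machine"
```

For example, with three segments of which two were captured while the device was silent and a threshold of 0.5, the proportion branch alone classifies the utterance as a human voice, without consulting the voiceprint model.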
Optionally, in a specific implementation, the method further includes:
if it is determined that the to-be-recognized voice information is a machine voice, feeding back, to the smart device, prompt information for indicating that the to-be-recognized voice information is a machine voice.
Optionally, in a specific implementation, the method further includes:
acquiring a text recognition result corresponding to the to-be-recognized voice information; and
if it is determined that the to-be-recognized voice information is a human voice, performing semantic recognition based on the text recognition result to determine response information for the to-be-recognized voice information.
In a second aspect, an embodiment of the present invention provides a voice processing apparatus, the apparatus including:
an information acquisition module, configured to acquire to-be-recognized voice information collected by a smart device, and broadcast status information corresponding to each voice segment contained in the to-be-recognized voice information, where the broadcast status information corresponding to each voice segment indicates whether the smart device was performing a voice broadcast while that voice segment was being collected; and
a type determination module, configured to determine the sound type of the to-be-recognized voice information based on the acquired broadcast status information.
Optionally, in a specific implementation, the type determination module is specifically configured to:
judge whether the broadcast status information corresponding to the first of the voice segments indicates that the smart device was not performing a voice broadcast while that segment was being collected, and if so, determine that the sound type of the to-be-recognized voice information is a human voice.
Optionally, in a specific implementation, the type determination module is specifically configured to:
determine, from the acquired broadcast status information, a first quantity of first-type information, where the first-type information indicates that the smart device was not performing a voice broadcast while the corresponding voice segment was being collected; determine proportion information of the first-type information based on the first quantity of the first-type information; and determine the sound type of the to-be-recognized voice information according to the magnitude relationship between the proportion information and a set threshold.
Optionally, in a specific implementation, the type determination module is specifically configured to:
determine, from the acquired broadcast status information, a first quantity of first-type information; calculate a first ratio of the first quantity to the total quantity of the acquired broadcast status information, and take the first ratio as the proportion information of the first-type information; and determine the sound type of the to-be-recognized voice information according to the magnitude relationship between the proportion information and a set threshold; or
determine, from the acquired broadcast status information, a second quantity of second-type information; calculate a second ratio of the first quantity to the second quantity, and take the second ratio as the proportion information of the first-type information; and determine the sound type of the to-be-recognized voice information according to the magnitude relationship between the proportion information and the set threshold, where the second-type information indicates that the smart device was performing a voice broadcast while the corresponding voice segment was being collected.
Optionally, in a specific implementation, the type determination module is specifically configured to:
if the proportion information is greater than the set threshold, determine that the to-be-recognized voice information is a human voice; or
if the proportion information is not greater than the set threshold, and a detection result of a voiceprint model on the to-be-recognized voice information indicates that the to-be-recognized voice information is a human voice, determine that the to-be-recognized voice information is a human voice; or
if the proportion information is not greater than the set threshold, and a detection result of a voiceprint model on the to-be-recognized voice information indicates that the to-be-recognized voice information is a machine voice, determine that the to-be-recognized voice information is a machine voice.
Optionally, in a specific implementation, the apparatus further includes:
an information feedback module, configured to, if it is determined that the to-be-recognized voice information is a machine voice, feed back, to the smart device, prompt information for indicating that the to-be-recognized voice information is a machine voice.
Optionally, in a specific implementation, the apparatus further includes:
a result acquisition module, configured to acquire a text recognition result corresponding to the to-be-recognized voice information; and
an information determination module, configured to, if it is determined that the to-be-recognized voice information is a human voice, perform semantic recognition based on the text recognition result to determine response information for the to-be-recognized voice information.
In a third aspect, an embodiment of the present invention provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with one another through the communication bus;
the memory is configured to store a computer program; and
the processor is configured to, when executing the program stored in the memory, implement the steps of any one of the voice processing methods provided in the first aspect above.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium having a computer program stored therein, where the computer program, when executed by a processor, implements the steps of any one of the voice processing methods provided in the first aspect above.
In a fifth aspect, an embodiment of the present invention provides a computer program product, which includes a computer program stored on a computer-readable storage medium, where the computer program includes program instructions that, when executed by a processor, implement the steps of any one of the voice processing methods provided in the first aspect above.
As can be seen from the above, with the solutions provided by the embodiments of the present invention, the to-be-recognized voice information collected by the smart device contains at least one voice segment, and the broadcast status information corresponding to each voice segment can be determined by detecting whether the smart device was performing a voice broadcast while that segment was being collected. Thus, when recognizing the sound type of the to-be-recognized voice information, the sound type can be determined based on the broadcast status information corresponding to each voice segment. That is, in the solutions provided by the embodiments of the present invention, the voice broadcast status information of the voice segments in the to-be-recognized voice information can be used to recognize the sound type of the voice to be recognized. Since the voice broadcast status information reflects whether the received to-be-recognized voice information contains machine sound emitted by the smart device's own voice broadcast, the accuracy of recognizing the sound type of voice information can be improved.
Brief Description of the Drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic flowchart of a voice processing method provided by an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a specific implementation of S101 in FIG. 1;
FIG. 3 is a schematic flowchart of another specific implementation of S101 in FIG. 1;
FIG. 4 is a schematic flowchart of a specific implementation of S102 in FIG. 1;
FIG. 5 is a schematic flowchart of another specific implementation of S102 in FIG. 1;
FIG. 6 is a schematic flowchart of yet another specific implementation of S102 in FIG. 1;
FIG. 7 is a schematic flowchart of another voice processing method provided by an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a voice processing apparatus provided by an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
To reduce the occurrence of "asking and answering itself" behavior, after collecting voice information, a smart device may use a preset voiceprint model to detect the voice information so as to determine its sound type, that is, whether the voice information is a human voice or a machine voice. However, since the voiceprint model is trained on the machine voice of the smart device, and the voiceprints used for training are similar to the speech spectra of some users' voices, the voiceprint model may misjudge those users' voices as machine sound. As a result, those human voices receive no response from the smart device, which still degrades the user experience. On this basis, how to improve the accuracy of recognizing the sound type of voice information is a problem that urgently needs to be solved.
In order to solve the above technical problem, an embodiment of the present invention provides a voice processing method. The method includes:
acquiring to-be-recognized voice information collected by a smart device, and broadcast status information corresponding to each voice segment contained in the to-be-recognized voice information, where the broadcast status information corresponding to each voice segment indicates whether the smart device was performing a voice broadcast while that voice segment was being collected; and
determining the sound type of the to-be-recognized voice information based on the acquired broadcast status information.
As can be seen from the above, with the solutions provided by the embodiments of the present invention, the to-be-recognized voice information collected by the smart device contains at least one voice segment, and the broadcast status information corresponding to each voice segment can be determined by detecting whether the smart device was performing a voice broadcast while that segment was being collected. Thus, when recognizing the sound type of the to-be-recognized voice information, the sound type can be determined based on the broadcast status information corresponding to each voice segment. That is, in the solutions provided by the embodiments of the present invention, the voice broadcast status information of the voice segments in the to-be-recognized voice information can be used to recognize the sound type of the voice to be recognized. Since the voice broadcast status information reflects whether the received to-be-recognized voice information may contain machine sound emitted by the smart device's own voice broadcast, the accuracy of recognizing the sound type of voice information can be improved.
A voice processing method provided by an embodiment of the present invention is described in detail below.
The execution subject of the voice processing method provided by the embodiments of the present invention may be the smart device that collects the to-be-recognized voice information, in which case the recognition can be completed offline. Specifically, the smart device may be any intelligent electronic device that needs to perform voice processing, for example an intelligent robot, a smart speaker, a smartphone, or a tablet computer, which is not specifically limited in the embodiments of the present invention.
Correspondingly, the execution subject may also be a server that provides voice processing for the smart device that collects the to-be-recognized voice information, in which case the recognition can be completed online. Specifically, when the execution subject is the server, the smart device, upon collecting the sound signals in its environment, can process those sound signals locally to obtain the to-be-recognized voice information and the broadcast status information corresponding to each voice segment contained in the to-be-recognized voice information, and can then upload the to-be-recognized voice information and the corresponding broadcast status information to the server, so that the server can execute the voice processing method provided by the embodiments of the present invention.
On this basis, for ease of description, the execution subjects of the voice processing method provided by the embodiments of the present invention are collectively referred to below as the electronic device.
FIG. 1 is a schematic flowchart of a voice processing method provided by an embodiment of the present invention. As shown in FIG. 1, the method may include the following steps.
S101: acquiring to-be-recognized voice information collected by a smart device, and broadcast status information corresponding to each voice segment contained in the to-be-recognized voice information,
where the broadcast status information corresponding to each voice segment indicates whether the smart device was performing a voice broadcast while that voice segment was being collected.
In the embodiments of the present invention, what the electronic device determines is the sound type of the received to-be-recognized voice information; therefore, the electronic device first needs to acquire the to-be-recognized voice information. Depending on the type of the electronic device, the manner in which the electronic device acquires the to-be-recognized voice information may differ.
Further, in the embodiments of the present invention, the electronic device determines the sound type of the to-be-recognized voice information using the broadcast status information corresponding to each voice segment contained in the to-be-recognized voice information; therefore, the electronic device also needs to acquire that broadcast status information. Similarly, depending on the type of the electronic device, the manner of acquiring the broadcast status information corresponding to each voice segment may also differ.
For example, when the electronic device is a smart device, it can process the sound signals it collects from its environment to obtain the to-be-recognized voice information and the broadcast status information corresponding to each voice segment contained therein; when the electronic device is a server, it can receive, from the corresponding smart device, the uploaded to-be-recognized voice information and the broadcast status information corresponding to each voice segment contained therein.
For clarity, the specific implementation of step S101 will be described in detail later.
S102: determining the sound type of the to-be-recognized voice information based on the acquired broadcast status information.
In this way, after acquiring the to-be-recognized voice information and the broadcast status information corresponding to each voice segment contained in the to-be-recognized voice information, the electronic device can determine the sound type of the to-be-recognized voice information based on the acquired broadcast status information.
The electronic device may perform step S102 in a variety of ways, which are not specifically limited in the embodiments of the present invention. For clarity, specific implementations of step S102 will be illustrated with examples later.
As can be seen from the above, in the solutions provided by the embodiments of the present invention, the voice broadcast status information of the voice segments contained in the to-be-recognized voice information can be used to recognize the sound type of the voice to be recognized. Since the voice broadcast status information reflects whether the received to-be-recognized voice information contains machine sound emitted by the smart device's voice broadcast, the accuracy of recognizing the sound type of voice information can be improved.
Optionally, in a specific implementation, as shown in FIG. 2, when the electronic device is a smart device, step S101 may include the following steps.
S201: performing voice activity detection on the collected sound signal.
S202: when a speech start signal is detected, dividing the sound signal collected from a target moment onward according to a preset division rule to obtain multiple voice segments, until a speech termination signal is detected,
where the target moment is the moment at which the speech start signal is collected.
S203: while collecting each voice segment, detecting whether the smart device is performing a voice broadcast, and determining the broadcast status information of the voice segment according to the detection result.
S204: determining the to-be-recognized voice information based on the multiple voice segments obtained by the division.
在本具体实现方式中,每个语音片段对应的播报状态信息为:在采集该语音片段时,所读取到的智能设备的播报状态信息。In this specific implementation, the broadcast status information corresponding to each voice segment is: the broadcast status information of the smart device that is read when the voice segment is collected.
智能设备在启动后,可以实时采集所处环境中的声音信号。其中,该声音信号中可以包括用户发出的语音信息,也可以包括智能设备自身发出的语音信息,还可以包括作为该环境的背景声音的各类噪音的声音信号。After the smart device is started, it can collect sound signals in the environment in real time. Wherein, the sound signal may include the voice information sent by the user, may also include the voice information sent by the smart device itself, and may also include the sound signals of various noises as background sounds of the environment.
这样,在采集到声音信号后,智能设备便可以对所采集到的声音信号进行语音活动检测,以检测得到所采集到的声音信号中的可以作为待识别语音信息的声音信号。In this way, after collecting the sound signal, the smart device can perform voice activity detection on the collected sound signal to detect the sound signal that can be used as the voice information to be recognized among the collected sound signals.
具体的,在每接收到一声音信号时,智能设备便可以检测该声音信号是否可以作为语音 起始信号。进而,当检测到一声音信号为语音起始信号时,智能设备便可以确定该语音起始信号,以及在采集到该语音起始信号的时刻之后所采集到的声音信号可以作为待识别语音信息中所包括的语音信息。并且,该语音起始信号可以作为待识别语音信息的起始信息。Specifically, every time a sound signal is received, the smart device can detect whether the sound signal can be used as a voice start signal. Furthermore, when a sound signal is detected as a voice start signal, the smart device can determine the voice start signal, and the sound signal collected after the time when the voice start signal is collected can be used as the voice information to be recognized Voice information included in. In addition, the voice start signal can be used as the start information of the voice information to be recognized.
进一步的,智能设备还可以对采集到语音起始信号的时刻之后所采集到的声音信号进行逐一检测,以确定该声音信号是否可以作为语音终止信号。进而,在检测到一声音信号为语音终止信号时,便可以确定该语音终止信号为待识别语音信息中的终止信息。Further, the smart device can also perform one-by-one detection on the sound signals collected after the moment when the voice start signal is collected, to determine whether the sound signal can be used as a voice termination signal. Furthermore, when it is detected that a voice signal is a voice termination signal, it can be determined that the voice termination signal is termination information in the voice information to be recognized.
这样,上述所检测到的语音起始信号、语音终止信号,以及位于语音起始信号和语音终止信号之间的声音信号构成了待识别语音信息。并且,该语音起始信号可以作为待识别语音信息的起始信息,该语音终止信号为待识别语音信息中的终止信息。In this way, the detected voice start signal, voice termination signal, and the sound signal located between the voice start signal and the voice termination signal constitute the voice information to be recognized. In addition, the voice start signal may be used as the start information of the voice information to be recognized, and the voice termination signal is the termination information in the voice information to be recognized.
In addition, because the sound signal is streamed, the smart device can continuously capture the sound in its environment and generate the corresponding sound signals in sequence.
On this basis, after detecting the voice start signal, the smart device can divide the sound signals collected from the target moment (the moment at which the voice start signal was collected) into segments according to a preset division rule, obtaining multiple voice segments in sequence until the voice termination signal is detected.
The division into voice segments is performed while the voice information to be recognized is being collected. Specifically, after detecting the voice start signal, the smart device keeps collecting sound signals. At a first moment at which the smart device determines that the sound signals collected since the target moment satisfy the preset division rule, those sound signals are divided off as one voice segment. Collection then continues; at a second moment at which the sound signals collected since the first moment again satisfy the preset division rule, those sound signals are divided off as the next voice segment, and so on, until the voice termination signal is detected.
Obviously, the detected voice termination signal is included in the last voice segment, and the sound signals included in the last voice segment need not satisfy the preset division rule.
The preset division rule may be, for example, that the duration of the collected sound signals reaches a preset value, or that the collected sound signals correspond to one syllable; the embodiment of the present invention does not specifically limit this.
Optionally, the voice activity detection may be VAD (Voice Activity Detection). Specifically, after collecting the sound signals of its environment, the smart device can use VAD to detect the voice start endpoint and the voice termination endpoint in those signals, where the voice start endpoint is the voice start signal of the voice information to be recognized and the voice termination endpoint is its voice termination signal. After the voice start endpoint is detected, the smart device divides the sound signals collected from that point on into voice segments according to the preset division rule, until the voice termination endpoint is detected, at which point the voice termination endpoint is placed into the last voice segment of the voice information to be recognized.
In this way, after the voice segments are obtained, the smart device can determine the voice information to be recognized based on the divided voice segments.
Since the first sound signal in the first divided voice segment is the start information of the voice information to be recognized, and the last sound signal in the last divided voice segment is its termination information, the sound signals in the voice segments can be arranged in the order of division, and the resulting combination of sound signals is the voice information to be recognized.
For example, suppose the preset division rule is that the duration of the collected sound signals reaches 0.1 second, and that at second 1 of collection the voice start endpoint is detected, i.e., the signal collected at that moment is determined to be the voice start signal. Then, at second 1.1, the sound signals collected between second 1 and second 1.1 are divided off as the first voice segment; at second 1.2, the sound signals collected between second 1.1 and second 1.2 are divided off as the second voice segment; and so on, until the sound signal collected at second 1.75 is detected as the voice termination endpoint, so that the sound signals collected between second 1.7 and second 1.75 form the last voice segment. In this way, 8 voice segments are obtained, and the 8th (last) segment covers only 0.05 second, which need not satisfy the preset division rule.
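The fixed-duration division rule in the example above can be sketched as follows. This is a minimal illustrative sketch, not the patented implementation; the function name and the use of timestamps instead of raw audio are assumptions for illustration.

```python
# Hypothetical sketch of the 0.1 s division rule: divide the interval between
# the detected start endpoint and the detected termination endpoint into
# fixed-length windows; the last window may be shorter than the rule requires.
def divide_into_segments(start_time, end_time, window=0.1):
    """Return (segment_start, segment_end) pairs covering [start_time, end_time]."""
    segments = []
    t = start_time
    while t < end_time:
        seg_end = min(t + window, end_time)  # last segment may fall short of `window`
        segments.append((t, seg_end))
        t = seg_end
    return segments

segments = divide_into_segments(1.0, 1.75)
print(len(segments))                                 # 8 segments, as in the example
print(round(segments[-1][1] - segments[-1][0], 2))   # the last segment lasts 0.05 s
```

With a start endpoint at second 1 and a termination endpoint at second 1.75, this reproduces the 8 segments of the example, the last covering 0.05 second.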
Thus, the combination formed by the sound signals collected from second 1 to second 1.75 is the voice information to be recognized.
Moreover, in this specific implementation, while collecting a voice segment, the smart device also detects whether it is itself performing a voice broadcast while the sound signals of that segment are being collected, and determines the broadcast state information corresponding to that segment according to the detection result.
If the smart device was not performing a voice broadcast while a voice segment was collected, the broadcast state information corresponding to that segment may be called the first type of information; correspondingly, if the smart device was performing a voice broadcast while a voice segment was collected, the broadcast state information corresponding to that segment may be called the second type of information.
Optionally, the smart device may record in a state file whether it was performing a voice broadcast at each moment, i.e., record the broadcast state information of the smart device corresponding to each moment. Then, for each divided voice segment, the smart device can determine the moment at which that segment was collected and read the broadcast state information for that moment directly from the state file; the value read is the broadcast state information of that voice segment.
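The state-file lookup described above can be sketched as follows. The in-memory dictionary standing in for the state file, and the use of a segment's start moment as the lookup key, are assumptions for illustration.

```python
# Hypothetical sketch of labelling each segment from a per-moment state record.
# `state_log` maps a collection timestamp (in seconds) to 1 when the device was
# idle (first type of information) and 0 while it was broadcasting (second type).
def label_segments(segment_moments, state_log):
    """Return the broadcast state information recorded for each segment's moment."""
    return [state_log[t] for t in segment_moments]

state_log = {1.0: 1, 1.1: 1, 1.2: 0}
print(label_segments([1.0, 1.1, 1.2], state_log))  # [1, 1, 0]
```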
Optionally, the broadcast state information may be TTS (Text To Speech) state information. Specifically, in one case, when the smart device broadcasts, it converts the text information to be broadcast into voice information through an offline model and then broadcasts that voice information; in another case, a server converts the text information to be broadcast into voice information through a cloud model and feeds the converted voice information back to the smart device, which can then broadcast the received voice information. Converting the text information to be broadcast into voice information is TTS; clearly, this conversion can be performed either by an offline model on the smart device or online, on the server side, by a cloud model.
If the smart device was not performing a voice broadcast while a voice segment was collected, the TTS state information corresponding to that segment may be recorded as the TTS idle state, which may be defined as 1, i.e., the first type of information is defined as 1; correspondingly, if the smart device was performing a voice broadcast while a voice segment was collected, the TTS state information corresponding to that segment may be recorded as the TTS broadcast state, which may be defined as 0, i.e., the second type of information is defined as 0.
Further, in the specific implementation shown in Figure 2 above, when the smart device collects the sound signals of its environment in real time, in order to prevent noise in the collected background sound from affecting the detection of the voice information to be recognized within the collected sound signals, the smart device may first perform signal preprocessing on the collected sound signals, attenuating the collected noise and enhancing the sound signals that can serve as the voice information to be detected.
On this basis, optionally, in another specific implementation, as shown in Figure 3, the above step S101 may further include the following step:
S200: performing signal preprocessing on the sound signals according to the waveform shape of the collected sound signals.
Correspondingly, the above step S201 may include the following step:
S201A: performing voice activity detection on the preprocessed sound signals.
When a sound signal is collected, the smart device can obtain the waveform shape of that sound signal and perform signal preprocessing on the sound signal according to that shape.
Specifically, sound signals whose waveform shape matches the waveform shape of noise are attenuated, and sound signals whose waveform shape matches the waveform shape of signals that can serve as the voice information to be recognized are enhanced.
Correspondingly, in this specific implementation, performing voice activity detection on the collected sound signals in the above step S201 means performing voice activity detection on the preprocessed sound signals.
Optionally, the smart device can collect in advance the waveform shapes of various kinds of noise and of various sound signals that can serve as the voice information to be detected, and use these waveform shapes, together with the label corresponding to each shape, for model training to obtain a waveform detection model. The label corresponding to each waveform shape indicates whether that shape is the waveform shape of noise or the waveform shape of a sound signal that can serve as the voice information to be detected. Moreover, a sound signal that can serve as the voice information to be detected may be a voice signal uttered by a user or a voice signal broadcast by the smart device; that is, its sound type may be either human voice or machine voice.
In this way, by learning the image characteristics of a large number of waveform shapes, the waveform detection model can establish the correspondence between the image characteristics of a waveform shape and its label. When a sound signal is collected, the model can be used to detect the collected sound signal and determine its label, so that sound signals labelled as noise are attenuated and sound signals labelled as candidates for the voice information to be detected are enhanced.
Corresponding to the case in which the above electronic device is a smart device, optionally, in another specific implementation, when the electronic device is a server, the above step S101 may include the following step:
receiving the voice information to be recognized sent by the smart device and the broadcast state information corresponding to each voice segment contained in the voice information to be recognized.
Obviously, in this specific implementation, the sound type determination is completed online. The smart device collects the sound signals of its environment, obtains the voice information to be recognized from the collected sound signals, and determines the broadcast state information corresponding to each voice segment contained in the voice information to be recognized; it then sends the voice information to be recognized and each piece of broadcast state information to the server, so that the server executes the voice processing method provided by the embodiment of the present invention and determines the sound type of the voice information to be recognized.
Optionally, in this specific implementation, the smart device may determine the voice information to be recognized and the broadcast state information corresponding to each of its voice segments through the solution provided by the embodiment shown in Figure 2 or Figure 3 above, and send the determined voice information to be recognized and the corresponding broadcast state information to the server.
On this basis, when the smart device sends the voice information to be recognized to the server, the specific content sent may be the divided voice segments and the broadcast state information corresponding to each segment, so that the server receives, at the same time, each voice segment of the voice information to be recognized and the broadcast state information corresponding to each segment.
Furthermore, since the combination formed by arranging the sound signals of the voice segments in the order of division is the voice information to be recognized, the server obtains the voice information to be recognized once it has received each of its voice segments in sequence. In other words, the entirety of the voice segments received by the server is the voice information to be recognized.
Based on any of the foregoing embodiments, optionally, in a specific implementation, the above step S102 may include the following step:
judging whether the broadcast state information corresponding to the first of the voice segments indicates that the smart device was not performing a voice broadcast when that segment was collected; if so, determining that the sound type of the voice information to be recognized is human voice.
In this specific implementation, the electronic device can obtain the broadcast state information corresponding to each voice segment contained in the voice information to be recognized, and in particular the broadcast state information corresponding to the first of those segments; the electronic device can then judge whether that broadcast state information indicates that the smart device was not performing a voice broadcast when the segment was collected.
If the judgment result is yes, i.e., the smart device was not performing a voice broadcast when the first voice segment of the voice information to be recognized was collected, it can be concluded that the voice information to be recognized was uttered by a user, and the electronic device can therefore determine that the sound type of the voice information to be recognized is human voice.
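The first-segment rule above can be sketched as follows, using the TTS state convention defined earlier (1 = idle, 0 = broadcasting); the function name is an assumption for illustration.

```python
# Hypothetical sketch of the first-segment rule: if the device was idle
# (state 1) while the first segment was collected, the utterance is treated
# as human voice.
def first_segment_is_human(segment_states):
    """segment_states: per-segment TTS states in division order, 1 = idle, 0 = broadcasting."""
    return segment_states[0] == 1

print(first_segment_is_human([1, 0, 0]))  # True: device idle during the first segment
print(first_segment_is_human([0, 1, 1]))  # False: device was broadcasting
```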
Optionally, in another specific implementation, as shown in Figure 4, step S102 may include the following steps:
S401: determining, from the obtained broadcast state information, a first quantity of the first type of information,
where the first type of information indicates that the smart device was not performing a voice broadcast when the corresponding voice segment was collected.
After obtaining the voice information to be recognized and the broadcast state information corresponding to each of its voice segments, the electronic device can determine, from those pieces of broadcast state information, the first quantity of the first type of information.
Since the first type of information indicates that the smart device was not performing a voice broadcast when the corresponding voice segment was collected, the determined first quantity can represent the number of voice segments, among those contained in the voice information to be recognized, whose sound type is human voice.
S402: determining proportion information of the first type of information based on the first quantity of the first type of information.
After determining the first quantity of the first type of information, the electronic device can determine the proportion information of the first type of information based on that first quantity.
Optionally, in a specific implementation, as shown in Figure 5, step S402 may include the following step:
S402A: calculating a first ratio of the first quantity to the total quantity of the obtained broadcast state information, and taking the first ratio as the proportion information of the first type of information.
When the broadcast state information of a voice segment is the first type of information, the smart device was not performing a voice broadcast while that segment was collected; since the segment is part of the voice information to be recognized, it can be determined to be voice information uttered by a user, i.e., the sound type of that segment can be determined to be human voice.
Correspondingly, when the broadcast state information of a voice segment is the second type of information, indicating that the smart device was performing a voice broadcast while the segment was collected, the voice information of that segment contains voice information broadcast by the smart device: the segment may consist of the broadcast voice information alone, or of both the user's voice information and the broadcast voice information. Either situation may cause the smart device to exhibit the erroneous "asking and answering itself" behavior.
On this basis, the first ratio of the first quantity to the total quantity of the obtained broadcast state information can be calculated and taken as the proportion information of the first type of information. In this specific implementation, the calculated proportion information can be understood as the proportion, among the voice segments contained in the voice information to be recognized, of segments whose sound type is human voice; obviously, the higher this ratio, the more likely it is that the sound type of the voice information to be recognized is human voice.
Accordingly, when the quantity of the first type of information among the obtained broadcast state information is 0, the first ratio is 0, indicating that the sound type of the voice information to be recognized is more likely to be machine voice;
correspondingly, when the quantity of the second type of information among the obtained broadcast state information is 0, the first ratio is 1, indicating that the sound type of the voice information to be recognized is more likely to be human voice.
Optionally, when the broadcast state information is TTS state information, with the TTS broadcast state defined as 0 and the TTS idle state defined as 1, the first ratio calculated above is the ratio of the number of values equal to 1 among the obtained TTS state information to the total number of obtained TTS state information.
For example, if the total number of obtained TTS state information is 10, of which 9 have the value 1, the first ratio is calculated to be 0.9.
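The first-ratio computation of step S402A, with the TTS state convention above, can be sketched as follows; the function name is an assumption for illustration.

```python
# Hypothetical sketch of the first ratio (step S402A): the share of segments
# collected while the device was idle (TTS state 1) among all segments.
def first_ratio(tts_states):
    """tts_states: per-segment TTS states, 1 = idle, 0 = broadcasting."""
    return tts_states.count(1) / len(tts_states)

states = [1] * 9 + [0]      # 10 segments, 9 collected while the device was idle
print(first_ratio(states))  # 0.9, as in the example above
```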
Optionally, in another specific implementation, as shown in Figure 6, step S402 may include the following step:
S402B: determining, from the obtained broadcast state information, a second quantity of the second type of information, calculating a second ratio of the first quantity to the second quantity, and taking the second ratio as the proportion information of the first type of information,
where the second type of information indicates that the smart device was performing a voice broadcast when the corresponding voice segment was collected.
After determining the first quantity of the first type of information, the electronic device can further determine, from the pieces of broadcast state information, the second quantity of the second type of information; it can then calculate the second ratio of the determined first quantity to the second quantity and take the second ratio as the proportion information of the first type of information.
As explained above for step S402A, a voice segment whose broadcast state information is the first type of information can be determined to be voice information uttered by a user, i.e., its sound type is human voice, while a voice segment whose broadcast state information is the second type of information contains voice information broadcast by the smart device and may cause the erroneous "asking and answering itself" behavior; the sound type of such a segment can therefore be determined to be machine voice.
On this basis, the second ratio of the first quantity to the second quantity can be calculated and taken as the proportion information of the first type of information. In this specific implementation, the calculated proportion information can be understood as the ratio, among the voice segments contained in the voice information to be recognized, of segments whose sound type is human voice to segments whose sound type is machine voice; obviously, the higher this ratio, the more likely it is that the sound type of the voice information to be recognized is human voice.
Accordingly, when the quantity of the first type of information among the obtained broadcast state information is 0, the second ratio is 0, indicating that the sound type of the voice information to be recognized is more likely to be machine voice;
correspondingly, when the quantity of the second type of information among the obtained broadcast state information is 0, this directly indicates that the sound type of the voice information to be recognized is more likely to be human voice.
Optionally, when the broadcast state information is TTS state information, with the TTS broadcast state defined as 0 and the TTS idle state defined as 1, the second ratio calculated above is the ratio of the number of values equal to 1 among the obtained TTS state information to the number of values equal to 0.
For example, if the total number of obtained TTS state information is 10, of which 7 have the value 1 and 3 have the value 0, the second ratio is calculated to be 7/3.
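The second-ratio computation of step S402B can be sketched as follows. The handling of the no-broadcast case as infinity is an assumption chosen to mirror the statement above that a second quantity of 0 directly indicates human voice.

```python
# Hypothetical sketch of the second ratio (step S402B): idle segments
# (state 1) relative to broadcasting segments (state 0).
import math

def second_ratio(tts_states):
    """tts_states: per-segment TTS states, 1 = idle, 0 = broadcasting."""
    zeros = tts_states.count(0)
    if zeros == 0:
        # No broadcast segments: treated directly as indicating human voice.
        return math.inf
    return tts_states.count(1) / zeros

states = [1] * 7 + [0] * 3   # 10 segments: 7 idle, 3 broadcasting
print(second_ratio(states))  # 7/3, as in the example above
```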
S403: determining the sound type of the voice information to be recognized according to the relationship between the proportion information and a set threshold.
After determining the proportion information of the first type of information, the electronic device can determine the sound type of the voice information to be recognized according to the relationship between that proportion information and the set threshold.
Optionally, in a specific implementation, the above step S403 may include the following steps:
if the proportion information is greater than the set threshold, determining that the voice information to be recognized is human voice; or,
if the proportion information is not greater than the set threshold, and the voice information to be recognized is determined to be human voice based on the detection result of a voiceprint model on the voice information to be recognized, determining that the voice information to be recognized is human voice; or,
if the proportion information is not greater than the set threshold, and the voice information to be recognized is determined to be machine voice based on the detection result of the voiceprint model on the voice information to be recognized, determining that the voice information to be recognized is machine voice.
According to the above description of the specific implementations shown in FIG. 5 and FIG. 6, the greater the proportion information of the first type of information, the more likely it is that the sound type of the voice information to be recognized is a human voice.
Based on this, in this specific implementation, if the proportion information is greater than the set threshold, it can be determined that the voice information to be recognized is a human voice.
Correspondingly, when the proportion information is not greater than the set threshold, the voice information to be recognized may be a machine voice. In order to determine the sound type of the voice information to be recognized more accurately, the electronic device can obtain the result of the detection performed by the voiceprint model on the voice information to be recognized; thus, when the detection result indicates a human voice, it can be determined that the voice information to be recognized is a human voice.
Further, when the proportion information is not greater than the set threshold and the voiceprint model's detection result for the voice information to be recognized indicates a machine voice, it can be determined that the voice information to be recognized is a machine voice.
It should be noted that, for the two ways of calculating the proportion information provided in steps S402A and S402B of the specific implementations shown in FIG. 5 and FIG. 6, the set thresholds may be the same or different.
The electronic device may run the preset voiceprint model on the voice information to be recognized as soon as that information is received in step S101, so as to obtain the detection result in advance; in this specific implementation, that already obtained detection result can then be used directly. Alternatively, when executing step S403, the electronic device may run the preset voiceprint model on the voice information to be recognized only after determining that the proportion information is not greater than the set threshold, and then use the resulting detection result.
Optionally, in one embodiment, it may first be judged whether the proportion information is greater than the set threshold; when it is, the voice information to be recognized can be determined to be a human voice.
When the proportion information is judged to be not greater than the set threshold, the detection result of the voiceprint model on the voice information to be recognized can be obtained: when the detection result indicates a human voice, the voice information to be recognized is determined to be a human voice; correspondingly, when the detection result indicates a machine voice, the voice information to be recognized is determined to be a machine voice.
Optionally, in another embodiment, the detection result of the voiceprint model on the voice information to be recognized may be obtained first; when the detection result indicates a human voice, the voice information to be recognized can be determined to be a human voice.
Correspondingly, when the detection result indicates a machine voice, it can be judged whether the calculated proportion information is greater than the set threshold: if it is, the voice information to be recognized is determined to be a human voice; if it is not, the voice information to be recognized is determined to be a machine voice.
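The two check orderings described above, ratio first or voiceprint first, reach the same decision for every input. A minimal Python sketch of both embodiments follows; the threshold value and the "human"/"machine" labels are illustrative assumptions, not part of the specification:

```python
def decide_ratio_first(proportion, threshold, voiceprint_result):
    """Embodiment 1: compare the proportion with the set threshold first;
    fall back to the voiceprint detection result only when needed."""
    if proportion > threshold:
        return "human"
    return voiceprint_result  # "human" or "machine", as detected


def decide_voiceprint_first(proportion, threshold, voiceprint_result):
    """Embodiment 2: consult the voiceprint result first; only when it
    says "machine" does the proportion check get a chance to override."""
    if voiceprint_result == "human":
        return "human"
    return "human" if proportion > threshold else "machine"
```

Both functions implement the same decision table: the utterance is a machine voice only when the proportion is not above the threshold and the voiceprint model also reports a machine voice.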
Optionally, in a specific implementation, the embodiment of the present invention may further include the following step:
if it is determined that the voice information to be recognized is a machine voice, feeding back, to the smart device, prompt information indicating that the voice information to be recognized is a machine voice.
In this specific implementation, when it is determined that the voice information to be recognized is a machine voice, the electronic device can feed back, to the smart device that collected the voice information, prompt information indicating that the voice information is a machine voice. The smart device will then not respond to the voice information to be recognized, thereby avoiding "asking and answering itself". The prompt information may be a preset "error code".
Moreover, when it is determined that the voice information to be recognized is a machine voice, the electronic device may skip semantic recognition of the text recognition result of the voice information to be recognized.
Further, optionally, the electronic device may also skip speech recognition of the acquired voice information to be recognized altogether, i.e., it need not obtain the text recognition result corresponding to the voice information to be recognized.
Optionally, in a specific implementation, as shown in FIG. 7, the embodiment of the present invention may further include the following steps:
S103: obtaining a text recognition result corresponding to the voice information to be recognized;
S104: if it is determined that the voice information to be recognized is a human voice, performing semantic recognition based on the text recognition result to determine response information for the voice information to be recognized.
After acquiring the voice information to be recognized, the electronic device can then obtain the corresponding text recognition result.
Further, after determining that the voice information to be recognized is a human voice, the electronic device can conclude that the voice information was uttered by a user, and therefore needs to respond to it.
Based on this, after determining that the voice information to be recognized is a human voice, the electronic device can perform semantic recognition on the obtained text recognition result, thereby determining the response information for the voice information to be recognized.
Optionally, the electronic device may input the text recognition result into a semantic model, so that the semantic model can analyze the semantics of the text recognition result and then determine the response result corresponding to that semantics as the response information for the voice information to be recognized.
The semantic model is used to recognize the semantics of the text recognition result, obtain the user need corresponding to the voice information to be recognized, and perform the action corresponding to that need, thereby obtaining the response result corresponding to the semantics as the response information for the voice information to be recognized. For example, it may fetch the result corresponding to the user need from a designated website or storage space, or execute the action corresponding to the user need.
Exemplarily, the text recognition result is: "How is the weather today". The semantic model recognizes the keywords "today" and "weather" in the text and obtains the current geographic location through the positioning system, so it can determine that the user need is today's weather at the current location. The semantic model can then automatically connect to a weather-query website and obtain today's weather at the current location, e.g., "Beijing, sunny, 23 degrees Celsius"; the obtained weather condition is determined as the response result corresponding to the semantics, i.e., the response information for the voice information to be recognized.
Exemplarily, the text recognition result is: "Where is Starbucks". The semantic model recognizes the keywords "Starbucks" and "where" and determines that the user need is the location of Starbucks. The semantic model can then read the location information of Starbucks from information pre-stored in a preset storage space, e.g., "northeast corner of the third floor of this mall"; the obtained location information is determined as the response result corresponding to the semantics, i.e., the response information for the voice information to be recognized.
Exemplarily, the text recognition result is: "Move forward two meters". The semantic model recognizes the keywords "forward" and "two meters" and determines that the user wants the device to move forward two meters; the semantic model can then generate the corresponding control instruction, controlling the device to move forward a distance of two meters. Obviously, the forward movement of the smart device is the response result corresponding to the semantics.
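The keyword-driven flow in the three examples above can be sketched as a toy intent router. The keyword lists, intent names, and canned action descriptions below are invented for illustration; a real semantic model would be far more capable:

```python
def route_intent(text):
    """Map a recognized text to a (intent, action) pair by keyword
    spotting, mirroring the weather / location / motion examples."""
    text = text.lower()
    if "weather" in text and "today" in text:
        return ("weather_query", "look up today's forecast for the current location")
    if "where" in text:
        # e.g. "Where is Starbucks" -> read the stored location of the place
        return ("location_query", "read the stored location of the mentioned place")
    if "forward" in text:
        # e.g. "Move forward two meters" -> generate a control instruction
        return ("motion_command", "generate a control instruction to move the device")
    return ("unknown", "no matching user need")
```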
Further, optionally, the voice information to be recognized acquired by the electronic device includes multiple voice segments. Therefore, to ensure the accuracy of the obtained text recognition result, obtaining the text recognition result corresponding to the voice information to be recognized may include the following steps:
when the first voice segment is received, performing speech recognition on the first voice segment to obtain a temporary text result; when a voice segment other than the first is received, performing speech recognition on all received voice segments based on the temporary text result already obtained, to obtain a new temporary text result; and when the last voice segment is received, obtaining the text recognition result corresponding to the voice information to be recognized.
Specifically, when the first voice segment is received, speech recognition is performed on it to obtain the temporary text result of the first segment; when the second voice segment is received, the voice information formed by the first and second segments is recognized based on the temporary text result of the first segment, yielding the temporary text result of the first two segments; when the third voice segment is received, the voice information formed by the first to third segments is recognized based on the temporary text result of the first two segments, yielding the temporary text result of the first three segments; and so on, until, when the last voice segment is received, the voice information formed by the first to last segments is recognized based on the temporary text result of the first to penultimate segments. Obviously, the result obtained at that point is the text recognition result corresponding to the voice information to be recognized.
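The accumulation loop described above can be sketched as follows. The `recognize` callback stands in for a real ASR decoder (an assumption: here it simply maps a list of audio segments to text):

```python
def incremental_transcribe(segments, recognize):
    """Re-run recognition over all segments received so far.

    Returns the list of temporary text results and the final text
    recognition result (the last temporary result, covering all segments).
    """
    temporary_results = []
    received = []
    for segment in segments:
        received.append(segment)
        # Each pass sees the full context of everything received so far,
        # so earlier words can be revised as later audio arrives.
        temporary_results.append(recognize(received))
    return temporary_results, temporary_results[-1]
```

With a trivial stand-in recognizer such as `lambda segs: " ".join(segs)`, each temporary result extends the previous one, and the final result covers the whole utterance.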
In this specific implementation, the speech recognition process fully considers the influence of the contextual relationships within the voice information to be recognized on the text recognition result, so the accuracy of the obtained text recognition result can be improved.
Optionally, a speech recognition model in the electronic device may be used to perform speech recognition on the voice information to be recognized. The speech recognition model is trained with voice samples, each of which includes voice information and the text corresponding to that voice information; by learning from a large number of voice samples, the model can establish the correspondence between voice information and text. In this way, after receiving the voice information to be recognized, the trained speech recognition model can determine the corresponding text recognition result according to the established correspondence. Such a speech recognition model may be called a decoder.
Further, optionally, each time a temporary recognition result for at least one voice segment is obtained, the electronic device may output that temporary recognition result to the user.
When the electronic device is a server, it sends the temporary recognition result to the smart device that sent the voice information to be recognized, so that the smart device outputs the temporary recognition result on its display screen;
when the electronic device is itself the smart device, it can output the temporary recognition result directly on the display screen.
Correspondingly, optionally, when the text recognition result of the voice information to be recognized is obtained, the electronic device may also output that text recognition result to the user.
When the electronic device is a server, it sends the text recognition result to the smart device that sent the voice information to be recognized, so that the smart device outputs the text recognition result on its display screen;
when the electronic device is itself the smart device, it can output the text recognition result directly on the display screen.
Further, optionally, after obtaining the response information for the voice information to be recognized, the electronic device may announce the response information to the user.
When the electronic device is a server, it sends the response information to the smart device that sent the voice information to be recognized, so that the smart device announces the response information to the user;
when the electronic device is itself the smart device, it can announce the response information directly.
For a better understanding of the voice processing method provided by the embodiments of the present invention, the method is described below through a specific embodiment.
In this specific embodiment, the above electronic device is a server. Specifically:
The smart device collects the sound signals in its environment in real time and performs signal preprocessing on them according to the waveform of the collected sound signals.
The smart device then performs voice activity detection on the preprocessed sound signal. Specifically, VAD may be used to detect the voice start endpoint and voice termination endpoint in the preprocessed sound signal; after the voice start endpoint is detected, the collected sound signal is divided into voice segments in sequence according to a preset division rule, until the voice termination endpoint is detected.
Moreover, in the above process, each time a voice segment is obtained by division, the TTS state information of the smart device is read, and each voice segment together with its corresponding TTS state information is sent to the server.
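The segmentation-plus-state step above can be sketched with a toy energy-threshold VAD: frames above the threshold count as speech, and each emitted segment is paired with the device's TTS state read at the moment the segment is produced. The frame representation, the threshold, and the `read_tts_state` callback are all illustrative assumptions, not the patent's actual VAD:

```python
def segment_with_tts_state(frames, energy_threshold, read_tts_state):
    """Split a sample stream into speech segments and pair each segment
    with the broadcast (TTS) state read when the segment is emitted."""
    segments = []
    current = []
    for frame in frames:
        if abs(frame) >= energy_threshold:
            current.append(frame)       # speech frame: extend the segment
        elif current:
            # Speech just ended: emit the (segment, broadcast-state) pair.
            segments.append((current, read_tts_state()))
            current = []
    if current:                          # flush a trailing segment
        segments.append((current, read_tts_state()))
    return segments
```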
The server receives each voice segment and its corresponding TTS state information sent by the smart device, and sends each voice segment to the decoder and the voiceprint model.
The decoder performs speech recognition on all voice segments received so far to obtain a temporary recognition result, and sends the temporary recognition result to the smart device, so that the smart device outputs it on its display screen.
Correspondingly, when the text recognition result of the voice information to be recognized is obtained, the text recognition result is sent to the smart device, so that the smart device outputs the text recognition result on its display screen.
In this way, when the complete voice information to be recognized has been received, the corresponding text recognition result can be obtained and output by the smart device on its display screen.
In addition, the voiceprint model performs voiceprint detection on all voice segments received so far and records the detection results; correspondingly, when all voice segments constituting the voice information to be recognized have been received, it performs voiceprint detection on the voice information to be recognized and records the detection result.
After receiving the TTS state information corresponding to each of the voice segments constituting the voice information to be recognized, the server counts the number of 1s in the received TTS state information, calculates the ratio of that number to the total number of received TTS state information items, and compares the ratio with the set threshold.
When the ratio is judged to be greater than the set threshold, the voice information to be recognized can be determined to be a human voice. When the ratio is not greater than the set threshold: if the detection result of the voiceprint model on the voice information to be recognized indicates a human voice, the voice information to be recognized is determined to be a human voice; and if the detection result indicates a machine voice, the voice information to be recognized is determined to be a machine voice.
Further, after determining that the voice information to be recognized is a human voice, the server can determine the response information for the voice information through the semantic model and send the response information to the smart device.
After receiving the response information, the smart device can output it.
Corresponding to the voice processing method provided by the above embodiments of the present invention, an embodiment of the present invention further provides a voice processing apparatus.
FIG. 8 is a schematic structural diagram of a voice processing apparatus provided by an embodiment of the present invention. As shown in FIG. 8, the voice processing apparatus includes the following modules:
an information acquisition module 810, configured to acquire the voice information to be recognized collected by a smart device and the broadcast state information corresponding to each voice segment contained in the voice information to be recognized, wherein the broadcast state information corresponding to each voice segment indicates whether the smart device was performing voice broadcasting when that voice segment was collected; and
a type determination module 820, configured to determine the sound type of the voice information to be recognized based on the acquired broadcast state information.
As can be seen from the above, in the solution provided by the embodiments of the present invention, the broadcast state information of each voice segment in the voice information to be recognized can be used to recognize the sound type of the voice to be recognized. Since the broadcast state information can reflect whether the received voice information to be recognized contains machine sound produced by the smart device's voice broadcasting, the accuracy of recognizing the sound type of the voice information can be improved.
Optionally, in a specific implementation, the type determination module 820 is specifically configured to:
judge whether, among the voice segments, the broadcast state information corresponding to the first voice segment indicates that the smart device was not performing voice broadcasting when that segment was collected; and if so, determine that the sound type of the voice information to be recognized is a human voice.
Optionally, in a specific implementation, the type determination module 820 is specifically configured to:
determine, from the acquired broadcast state information, a first quantity of first-type information, wherein the first-type information indicates that the smart device was not performing voice broadcasting when the corresponding voice segment was collected; determine proportion information of the first-type information based on the first quantity; and determine the sound type of the voice information to be recognized according to the relationship between the proportion information and a set threshold.
Optionally, in a specific implementation, the type determination module 820 is specifically configured to:
determine, from the acquired broadcast state information, the first quantity of first-type information; calculate a first ratio of the first quantity to the total quantity of the acquired broadcast state information, and take the first ratio as the proportion information of the first-type information; and determine the sound type of the voice information to be recognized according to the relationship between the proportion information and the set threshold; or,
determine, from the acquired broadcast state information, a second quantity of second-type information; calculate a second ratio of the first quantity to the second quantity, and take the second ratio as the proportion information of the first-type information; and determine the sound type of the voice information to be recognized according to the relationship between the proportion information and the set threshold, wherein the second-type information indicates that the smart device was performing voice broadcasting when the corresponding voice segment was collected.
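The two proportion computations described above, against the total count and against the second-type count, can be sketched as follows. The boolean encoding (`True` meaning the device was not broadcasting for that segment) is an assumption made for illustration:

```python
def first_type_proportion(statuses, use_total=True):
    """Compute the proportion information of the first-type information in
    the two ways described above (cf. steps S402A and S402B)."""
    n1 = sum(1 for s in statuses if s)      # first type: not broadcasting
    n2 = len(statuses) - n1                 # second type: broadcasting
    if use_total:
        return n1 / len(statuses)           # first ratio: n1 / total
    return n1 / n2 if n2 else float("inf")  # second ratio: n1 / n2
```

Either ratio grows as more segments were captured while the device was silent, which is why a larger value points toward a human voice.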
Optionally, in a specific implementation, the type determination module is specifically configured to:
if the proportion information is greater than the set threshold, determine that the voice information to be recognized is a human voice; or,
if the proportion information is not greater than the set threshold, and the detection result of the voiceprint model on the voice information to be recognized indicates a human voice, determine that the voice information to be recognized is a human voice; or,
if the proportion information is not greater than the set threshold, and the detection result of the voiceprint model on the voice information to be recognized indicates a machine voice, determine that the voice information to be recognized is a machine voice.
Optionally, in a specific implementation, the apparatus further includes:
an information feedback module, configured to, if it is determined that the voice information to be recognized is a machine voice, feed back to the smart device prompt information indicating that the voice information to be recognized is a machine voice.
Optionally, in a specific implementation, the apparatus further includes:
a result acquisition module, configured to obtain the text recognition result corresponding to the voice information to be recognized; and
an information determination module, configured to, if it is determined that the voice information to be recognized is a human voice, perform semantic recognition based on the text recognition result to determine the response information for the voice information to be recognized.
Corresponding to the voice processing method provided by the embodiments of the present invention, an embodiment of the present invention further provides an electronic device. As shown in FIG. 9, the electronic device includes a processor 901, a communication interface 902, a memory 903 and a communication bus 904, wherein the processor 901, the communication interface 902 and the memory 903 communicate with each other through the communication bus 904;
the memory 903 is configured to store a computer program; and
the processor 901 is configured to implement the voice processing method provided by the above embodiments of the present invention when executing the program stored in the memory 903.
Specifically, the voice processing method includes:
acquiring the voice information to be recognized collected by a smart device and the broadcast state information corresponding to each voice segment contained in the voice information to be recognized, wherein the broadcast state information corresponding to each voice segment indicates whether the smart device was performing voice broadcasting when that voice segment was collected; and
determining the sound type of the voice information to be recognized based on the acquired broadcast state information.
It should be noted that other implementations of the voice processing method realized by the processor 901 executing the program stored in the memory 903 are the same as the voice processing method embodiments provided in the foregoing method embodiment section, and are not repeated here.
As can be seen from the above, in the solution provided by the embodiments of the present invention, the broadcast state information of each voice segment in the voice information to be recognized can be used to recognize the sound type of the voice to be recognized. Since the broadcast state information can reflect whether the received voice information to be recognized contains machine sound produced by the smart device's voice broadcasting, the accuracy of recognizing the sound type of the voice information can be improved.
上述电子设备提到的通信总线可以是外设部件互连标准(Peripheral Component Interconnect,PCI)总线或扩展工业标准结构(Extended Industry Standard Architecture,EISA)总线等。该通信总线可以分为地址总线、数据总线、控制总线等。为便于表示,图中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。The communication bus mentioned in the above electronic device may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. The communication bus can be divided into address bus, data bus, control bus and so on. For ease of representation, only one thick line is used in the figure, but it does not mean that there is only one bus or one type of bus.
通信接口用于上述电子设备与其他设备之间的通信。The communication interface is used for communication between the above-mentioned electronic device and other devices.
存储器可以包括随机存取存储器(Random Access Memory,RAM),也可以包括非易失性存储器(Non-Volatile Memory,NVM),例如至少一个磁盘存储器。可选的,存储器还可以是至少一个位于远离前述处理器的存储装置。The memory may include random access memory (Random Access Memory, RAM), and may also include non-volatile memory (Non-Volatile Memory, NVM), such as at least one disk storage. Optionally, the memory may also be at least one storage device located far away from the foregoing processor.
上述的处理器可以是通用处理器,包括中央处理器(Central Processing Unit,CPU)、网络处理器(Network Processor,NP)等;还可以是数字信号处理器(Digital Signal Processing,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。The above-mentioned processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
相应于上述本发明实施例提供的一种语音处理方法,本发明实施例还提供了一种计算机可读存储介质,该计算机程序被处理器执行时实现上述本发明实施例提供的任一种语音处理方法。Corresponding to the voice processing method provided by the foregoing embodiments of the present invention, an embodiment of the present invention further provides a computer-readable storage medium in which a computer program is stored; when the computer program is executed by a processor, any of the voice processing methods provided by the foregoing embodiments of the present invention is implemented.
相应于上述本发明实施例提供的一种语音处理方法,本发明实施例还提供了一种计算机程序,所述计算机程序产品包括存储在计算机可读存储介质上的计算机程序,所述计算机程序包括程序指令,所述程序指令被处理器执行时实现上述本发明实施例提供的任一种语音处理方法。Corresponding to the voice processing method provided by the foregoing embodiments of the present invention, an embodiment of the present invention further provides a computer program product; the computer program product includes a computer program stored on a computer-readable storage medium, and the computer program includes program instructions which, when executed by a processor, implement any of the voice processing methods provided by the foregoing embodiments of the present invention.
需要说明的是,在本文中,诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
本说明书中的各个实施例均采用相关的方式描述,各个实施例之间相同相似的部分互相参见即可,每个实施例重点说明的都是与其他实施例的不同之处。尤其,对于装置实施例、电子设备实施例、计算机可读存储介质实施例而言,由于其基本相似于方法实施例,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。The various embodiments in this specification are described in a related manner, and the same or similar parts between the various embodiments can be referred to each other, and each embodiment focuses on the differences from other embodiments. In particular, for the device embodiment, the electronic device embodiment, and the computer-readable storage medium embodiment, since they are basically similar to the method embodiment, the description is relatively simple, and for related parts, please refer to the part of the description of the method embodiment.
以上所述仅为本发明的较佳实施例而已,并非用于限定本发明的保护范围。凡在本发明的精神和原则之内所作的任何修改、等同替换、改进等,均包含在本发明的保护范围内。The above descriptions are only preferred embodiments of the present invention, and are not intended to limit the protection scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are all included in the protection scope of the present invention.

Claims (10)

  1. 一种语音处理方法,其特征在于,所述方法包括:A voice processing method, characterized in that the method includes:
    获取智能设备采集的待识别语音信息以及所述待识别语音信息包含的各个语音片段对应的播报状态信息;其中,每个语音片段对应的播报状态信息表征在采集该语音片段时所述智能设备是否在进行语音播报;acquiring the voice information to be recognized collected by the smart device, and the broadcast status information corresponding to each voice segment contained in the voice information to be recognized; wherein the broadcast status information corresponding to each voice segment indicates whether the smart device was performing a voice broadcast when that voice segment was collected;
    基于所获取的播报状态信息,确定所述待识别语音信息的声音类型。Based on the acquired broadcast status information, the sound type of the voice information to be recognized is determined.
  2. 根据权利要求1所述的方法,其特征在于,所述基于所获取的播报状态信息,确定所述待识别语音信息的声音类型的步骤,包括:The method according to claim 1, wherein the step of determining the sound type of the voice information to be recognized based on the acquired broadcast status information comprises:
    判断所述各个语音片段中,首个语音片段对应的播报状态信息是否表征采集该语音片段时所述智能设备未进行语音播报;determining whether, among the voice segments, the broadcast status information corresponding to the first voice segment indicates that the smart device was not performing a voice broadcast when that voice segment was collected;
    如果是,确定所述待识别语音信息的声音类型为人声。if yes, determining that the sound type of the voice information to be recognized is human voice.
  3. 根据权利要求1所述的方法,其特征在于,所述基于所获取的播报状态信息,确定所述待识别语音信息的声音类型的步骤,包括:The method according to claim 1, wherein the step of determining the sound type of the voice information to be recognized based on the acquired broadcast status information comprises:
    从所获取的播报状态信息中,确定第一类信息的第一数量;其中,所述第一类信息表征在采集所对应语音片段时所述智能设备未进行语音播报;From the acquired broadcast status information, determine the first quantity of the first type of information; wherein, the first type of information indicates that the smart device did not perform voice broadcast when the corresponding voice segment was collected;
    基于所述第一类信息的第一数量,确定所述第一类信息的占比信息;Determine the proportion information of the first type of information based on the first quantity of the first type of information;
    根据所述占比信息与设定阈值的大小关系,确定所述待识别语音信息的声音类型。The sound type of the voice information to be recognized is determined according to the relationship between the proportion information and the set threshold.
  4. 根据权利要求3所述的方法,其特征在于,所述基于所述第一类信息的第一数量,确定所述第一类信息的占比信息的步骤,包括:The method according to claim 3, wherein the step of determining the proportion information of the first type of information based on the first quantity of the first type of information comprises:
    计算所述第一数量与所获取的播报状态信息的总数量的第一比值,将所述第一比值作为所述第一类信息的占比信息;或者,Calculate the first ratio of the first number to the total number of acquired broadcast status information, and use the first ratio as the proportion information of the first type of information; or,
    从所获取的播报状态信息中,确定第二类信息的第二数量,计算所述第一数量与所述第二数量的第二比值,将所述第二比值作为所述第一类信息的占比信息;from the acquired broadcast status information, determining a second quantity of a second type of information, calculating a second ratio of the first quantity to the second quantity, and using the second ratio as the proportion information of the first type of information;
    其中,所述第二类信息表征在采集所对应语音片段时所述智能设备正在进行语音播报。Wherein, the second type of information indicates that the smart device is performing voice broadcast when the corresponding voice segment is collected.
  5. 根据权利要求3或4所述的方法,其特征在于,所述根据所述占比信息与设定阈值的大小关系,确定所述待识别语音信息的声音类型的步骤,包括:The method according to claim 3 or 4, wherein the step of determining the sound type of the voice information to be recognized according to the relationship between the proportion information and a set threshold value comprises:
    若所述占比信息大于所述设定阈值,确定所述待识别语音信息为人声;或者,If the proportion information is greater than the set threshold, it is determined that the voice information to be recognized is a human voice; or,
    若所述占比信息不大于所述设定阈值,且基于声纹模型对所述待识别语音信息的检测结果确定所述待识别语音信息为人声,确定所述待识别语音信息为人声;或者,if the proportion information is not greater than the set threshold, and the voice information to be recognized is determined to be a human voice based on a detection result of a voiceprint model on the voice information to be recognized, determining that the voice information to be recognized is a human voice; or,
    若所述占比信息不大于所述设定阈值,且基于声纹模型对所述待识别语音信息的检测结果确定所述待识别语音信息为机器声,确定所述待识别语音信息为机器声。if the proportion information is not greater than the set threshold, and the voice information to be recognized is determined to be a machine sound based on a detection result of a voiceprint model on the voice information to be recognized, determining that the voice information to be recognized is a machine sound.
  6. 根据权利要求1-5任一项所述的方法,其特征在于,所述方法还包括:The method according to any one of claims 1-5, wherein the method further comprises:
    若确定所述待识别语音信息为机器声,向所述智能设备反馈用于提示所述待识别语音信息为机器声的提示信息。If it is determined that the voice information to be recognized is a machine sound, prompt information for prompting that the voice information to be recognized is a machine sound is fed back to the smart device.
  7. 根据权利要求1-5任一项所述的方法,其特征在于,所述方法还包括:The method according to any one of claims 1-5, wherein the method further comprises:
    获取所述待识别语音信息对应的文本识别结果;Obtaining a text recognition result corresponding to the voice information to be recognized;
    若确定所述待识别语音信息为人声,基于所述文本识别结果进行语义识别,确定所述待识别语音信息的响应信息。If it is determined that the voice information to be recognized is a human voice, semantic recognition is performed based on the text recognition result, and the response information of the voice information to be recognized is determined.
  8. 一种语音处理装置,其特征在于,所述装置包括:A voice processing device, characterized in that the device includes:
    信息获取模块,用于获取智能设备采集的待识别语音信息以及所述待识别语音信息包含的各个语音片段对应的播报状态信息;其中,每个语音片段对应的播报状态信息表征在采集该语音片段时所述智能设备是否在进行语音播报;an information acquisition module, configured to acquire the voice information to be recognized collected by the smart device and the broadcast status information corresponding to each voice segment contained in the voice information to be recognized; wherein the broadcast status information corresponding to each voice segment indicates whether the smart device was performing a voice broadcast when that voice segment was collected;
    类型确定模块,用于基于所获取的播报状态信息,确定所述待识别语音信息的声音类型。The type determination module is configured to determine the sound type of the voice information to be recognized based on the acquired broadcast status information.
  9. 一种电子设备,其特征在于,包括处理器、通信接口、存储器和通信总线,其中,处理器,通信接口,存储器通过通信总线完成相互间的通信;An electronic device characterized by comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory complete mutual communication through the communication bus;
    存储器,用于存放计算机程序;Memory, used to store computer programs;
    处理器,用于执行存储器上所存放的程序时,实现权利要求1-7任一所述的方法步骤。The processor is configured to implement the method steps of any one of claims 1-7 when executing the program stored in the memory.
  10. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质内存储有计算机程序,所述计算机程序被处理器执行时实现权利要求1-7任一所述的方法步骤。A computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the method steps according to any one of claims 1-7 are realized.
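Claim 4 above admits two alternative ways of forming the proportion information: a ratio against the total count of status entries, or a ratio of first-type entries (device not broadcasting) to second-type entries (device broadcasting). A hypothetical sketch of both variants follows; the function name and the `mode` switch are illustrative, not from the patent:

```python
def proportion_first_type(flags, mode="total"):
    """Return the proportion of first-type status entries (device not
    broadcasting).  mode='total' divides by all entries (the first ratio
    of claim 4); mode='second' divides by the count of second-type
    entries, i.e. segments captured during a broadcast (the second
    ratio).  Names and the mode switch are illustrative only."""
    first = sum(1 for flag in flags if not flag)  # first-type: not broadcasting
    second = len(flags) - first                   # second-type: broadcasting
    if mode == "total":
        return first / len(flags)
    if second == 0:
        # No broadcast segments at all: the second ratio is unbounded,
        # which in practice would always clear any finite threshold.
        return float("inf")
    return first / second
```

Either value would then be compared against the set threshold as in claim 5.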
PCT/CN2020/141038 2019-12-30 2020-12-29 Voice processing method and apparatus, and intelligent device and storage medium WO2021136298A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911398330.X 2019-12-30
CN201911398330.XA CN113129902B (en) 2019-12-30 2019-12-30 Voice processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2021136298A1 true WO2021136298A1 (en) 2021-07-08

Family

ID=76687322

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/141038 WO2021136298A1 (en) 2019-12-30 2020-12-29 Voice processing method and apparatus, and intelligent device and storage medium

Country Status (2)

Country Link
CN (1) CN113129902B (en)
WO (1) WO2021136298A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114500590A (en) * 2021-12-23 2022-05-13 珠海格力电器股份有限公司 Intelligent device voice broadcasting method and device, computer device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102750952A (en) * 2011-04-18 2012-10-24 索尼公司 Sound signal processing device, method, and program
CN103167174A (en) * 2013-02-25 2013-06-19 广东欧珀移动通信有限公司 Output method, device and mobile terminal of mobile terminal greetings
CN106847285A (en) * 2017-03-31 2017-06-13 上海思依暄机器人科技股份有限公司 A kind of robot and its audio recognition method
CN108346425A (en) * 2017-01-25 2018-07-31 北京搜狗科技发展有限公司 A kind of method and apparatus of voice activity detection, the method and apparatus of speech recognition
CN110097890A (en) * 2019-04-16 2019-08-06 北京搜狗科技发展有限公司 A kind of method of speech processing, device and the device for speech processes

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101937693B (en) * 2010-08-17 2012-04-04 深圳市子栋科技有限公司 Video and audio playing method and system based on voice command
CN102780646B (en) * 2012-07-19 2015-12-09 上海量明科技发展有限公司 The implementation method of the sound's icon, client and system in instant messaging
CN104484045B (en) * 2014-12-26 2018-07-20 小米科技有限责任公司 Audio play control method and device
CN107507620A (en) * 2017-09-25 2017-12-22 广东小天才科技有限公司 Voice broadcast sound setting method and device, mobile terminal and storage medium
CN108509176B (en) * 2018-04-10 2021-06-08 Oppo广东移动通信有限公司 Method and device for playing audio data, storage medium and intelligent terminal
CN109524013B (en) * 2018-12-18 2022-07-22 北京猎户星空科技有限公司 Voice processing method, device, medium and intelligent equipment
CN113990309A (en) * 2019-04-09 2022-01-28 百度国际科技(深圳)有限公司 Voice recognition method and device

Also Published As

Publication number Publication date
CN113129902A (en) 2021-07-16
CN113129902B (en) 2023-10-24

Similar Documents

Publication Publication Date Title
CN107919130B (en) Cloud-based voice processing method and device
US10438595B2 (en) Speaker identification and unsupervised speaker adaptation techniques
EP3371809B1 (en) Voice commands across devices
US10074360B2 (en) Providing an indication of the suitability of speech recognition
WO2018188586A1 (en) Method and device for user registration, and electronic device
JP7348288B2 (en) Voice interaction methods, devices, and systems
CN108009303B (en) Search method and device based on voice recognition, electronic equipment and storage medium
US20200035241A1 (en) Method, device and computer storage medium for speech interaction
WO2017012242A1 (en) Voice recognition method and apparatus
CN104732975A (en) Method and device for voice instant messaging
CN110875059B (en) Method and device for judging reception end and storage device
CN105139858A (en) Information processing method and electronic equipment
CN108648765A (en) A kind of method, apparatus and terminal of voice abnormality detection
US10950221B2 (en) Keyword confirmation method and apparatus
US8868419B2 (en) Generalizing text content summary from speech content
CN109697981B (en) Voice interaction method, device, equipment and storage medium
TW201926315A (en) Audio processing method, device and terminal device recognizing the sound information made by a user more quickly and accurately
CN112002349B (en) Voice endpoint detection method and device
WO2021136298A1 (en) Voice processing method and apparatus, and intelligent device and storage medium
CN111063356B (en) Electronic equipment response method and system, sound box and computer readable storage medium
CN111933149A (en) Voice interaction method, wearable device, terminal and voice interaction system
CN110956958A (en) Searching method, searching device, terminal equipment and storage medium
WO2024099359A1 (en) Voice detection method and apparatus, electronic device and storage medium
CN111506183A (en) Intelligent terminal and user interaction method
CN113889091A (en) Voice recognition method and device, computer readable storage medium and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20910235

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 31.10.2022)

122 Ep: pct application non-entry in european phase

Ref document number: 20910235

Country of ref document: EP

Kind code of ref document: A1