WO2021136298A1 - Voice processing method and apparatus, and intelligent device and storage medium - Google Patents

Voice processing method and apparatus, and intelligent device and storage medium

Info

Publication number
WO2021136298A1
WO2021136298A1 · PCT/CN2020/141038 · CN2020141038W
Authority
WO
WIPO (PCT)
Prior art keywords
voice
information
recognized
type
segment
Prior art date
Application number
PCT/CN2020/141038
Other languages
French (fr)
Chinese (zh)
Inventor
刘浩
任海海
Original Assignee
北京猎户星空科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京猎户星空科技有限公司 filed Critical 北京猎户星空科技有限公司
Publication of WO2021136298A1 publication Critical patent/WO2021136298A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/22 Interactive procedures; man-machine interfaces
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; selection of recognition unit
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225 Feedback of the input speech
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L2025/783 Detection of presence or absence of voice signals based on threshold decision

Definitions

  • The present invention relates to the technical field of intelligent robots, and in particular to a voice processing method and apparatus, a smart device, and a storage medium.
  • Smart devices, such as smart robots and smart speakers, are usually configured to conduct continuous conversations with users. After waking up the smart device once, the user can perform multiple voice interactions with it, with no need to wake it up again between interactions.
  • For example, the user can issue the voice message "How is the weather today", and the smart device broadcasts the queried weather conditions to the user. The user can then issue the voice message "Where is the Starbucks", and the smart device broadcasts the queried location of the Starbucks. The smart device remains in the wake-up state between the two voice messages "How is the weather today" and "Where is the Starbucks".
  • However, while awake, the smart device may also pick up the voice information it broadcasts itself and respond to it as if it were voice information issued by the user; that is, the smart device can mistake its own machine sound for the user's voice. This produces erroneous "self-questioning and self-answering" behavior, which degrades the user experience.
  • the purpose of the embodiments of the present invention is to provide a voice processing method, device, smart device, and storage medium to improve the recognition accuracy of the voice type of voice information.
  • the specific technical solutions are as follows:
  • an embodiment of the present invention provides a voice processing method, and the method includes:
  • Acquire the voice information to be recognized collected by the smart device and the broadcast status information corresponding to each voice segment contained in the voice information to be recognized, where the broadcast status information corresponding to each voice segment represents whether the smart device was performing a voice broadcast when the voice segment was collected;
  • Based on the acquired broadcast status information, determine the sound type of the voice information to be recognized.
  • Optionally, the step of determining the sound type of the voice information to be recognized based on the acquired broadcast status information includes: if the broadcast status information corresponding to the first voice segment indicates that the smart device was not performing a voice broadcast when that segment was collected, determining that the sound type of the voice information to be recognized is a human voice.
  • Optionally, the step of determining the sound type of the voice information to be recognized based on the acquired broadcast status information includes: determining, from the acquired broadcast status information, a first quantity of the first type of information, where the first type of information indicates that the smart device was not performing a voice broadcast when the corresponding voice segment was collected; determining proportion information of the first type of information based on the first quantity; and determining the sound type of the voice information to be recognized according to the relationship between the proportion information and a set threshold.
  • Optionally, the step of determining the proportion information of the first type of information based on the first quantity of the first type of information includes calculating a ratio involving the acquired broadcast status information. Here, the second type of information indicates that the smart device was performing a voice broadcast when the corresponding voice segment was collected.
  • Optionally, determining the sound type according to the relationship between the proportion information and the set threshold includes: if the proportion information is greater than the set threshold, determining that the voice information to be recognized is a human voice; or, if the proportion information is not greater than the set threshold and the detection result of a voiceprint model on the voice information to be recognized indicates a human voice, determining that the voice information to be recognized is a human voice; or, if the proportion information is not greater than the set threshold and the detection result of the voiceprint model indicates a machine sound, determining that the voice information to be recognized is a machine sound.
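The three-branch decision above can be sketched as a single function. This is an illustrative sketch, not the patented implementation; the function name, signature, and the idea of passing the voiceprint model's verdict as a boolean are assumptions:

```python
def classify_sound_type(proportion: float, threshold: float,
                        voiceprint_says_human: bool) -> str:
    """Decide the sound type of the voice information to be recognized.

    proportion -- share of voice segments whose broadcast status is the
    first type of information (the device was not broadcasting).
    """
    if proportion > threshold:
        # Enough segments were collected while the device was silent,
        # so the voice is judged to be a human voice outright.
        return "human voice"
    # Otherwise defer to the voiceprint model's detection result.
    return "human voice" if voiceprint_says_human else "machine sound"
```

With an assumed threshold of 0.5, a proportion of 0.9 yields a human voice regardless of the voiceprint result, while a proportion of 0.3 makes the voiceprint model's verdict decisive.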
  • Optionally, the method further includes: obtaining a text recognition result corresponding to the voice information to be recognized; and, if it is determined that the voice information to be recognized is a human voice, performing semantic recognition based on the text recognition result and determining response information for the voice information to be recognized.
  • an embodiment of the present invention provides a voice processing device, the device including:
  • the type determination module is configured to determine the sound type of the voice information to be recognized based on the acquired broadcast status information.
  • Optionally, the type determination module is specifically configured to: determine, from the acquired broadcast status information, the first quantity of the first type of information, where the first type of information indicates that the smart device was not performing a voice broadcast when the corresponding voice segment was collected; determine the proportion information of the first type of information based on the first quantity; and determine the sound type of the voice information to be recognized according to the relationship between the proportion information and the set threshold.
  • Optionally, the type determination module is specifically configured to: determine, from the acquired broadcast status information, the first quantity of the first type of information; calculate the first ratio of the first quantity to the total quantity of acquired broadcast status information, and use the first ratio as the proportion information of the first type of information; and determine the sound type of the voice information to be recognized according to the relationship between the proportion information and the set threshold.
  • Optionally, the type determination module is specifically configured to: if the proportion information is greater than the set threshold, determine that the voice information to be recognized is a human voice; or, if the proportion information is not greater than the set threshold and the detection result of the voiceprint model on the voice information to be recognized indicates a human voice, determine that the voice information to be recognized is a human voice; or, if the proportion information is not greater than the set threshold and the detection result of the voiceprint model indicates a machine sound, determine that the voice information to be recognized is a machine sound.
  • the device further includes:
  • The information feedback module is configured to, if it is determined that the voice information to be recognized is a machine sound, feed back to the smart device prompt information indicating that the voice information to be recognized is a machine sound.
  • the device further includes:
  • the result obtaining module is used to obtain the text recognition result corresponding to the voice information to be recognized;
  • the information determining module is configured to, if it is determined that the voice information to be recognized is a human voice, perform semantic recognition based on the text recognition result, and determine the response information of the voice information to be recognized.
  • an embodiment of the present invention provides an electronic device, which is characterized by including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete mutual communication through the communication bus;
  • The memory is configured to store a computer program;
  • the processor is configured to implement the steps of any voice processing method provided in the first aspect when executing the program stored in the memory.
  • In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium having a computer program stored therein; when the computer program is executed by a processor, the steps of any voice processing method provided in the first aspect are implemented.
  • In a fifth aspect, an embodiment of the present invention provides a computer program product. The computer program product includes a computer program stored on a computer-readable storage medium; the computer program includes program instructions that, when executed by a processor, implement the steps of any voice processing method provided in the first aspect.
  • In the above solutions, the voice information to be recognized collected by the smart device contains at least one voice segment, and for each voice segment it can be determined whether the smart device was performing a voice broadcast when that segment was collected, i.e. the broadcast status information corresponding to the segment. In this way, when recognizing the sound type of the voice information to be recognized, the sound type can be determined based on the broadcast status information corresponding to each voice segment. That is, in the solution provided by the embodiment of the present invention, the voice broadcast status information of each voice segment in the voice information to be recognized can be used to recognize the sound type of the voice to be recognized. Because the broadcast status information reflects whether the received voice information to be recognized may contain machine sound generated by the smart device's own voice broadcast, the accuracy of recognizing the sound type of the voice information can be improved.
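The core of this summary can be illustrated end to end. The helper below is a hypothetical sketch: the names, the 0/1 status encoding (1 when the device was not broadcasting), and the 0.5 threshold are assumptions for illustration:

```python
# Broadcast status per collected voice segment:
# 1 = device was NOT broadcasting (first type of information)
# 0 = device was broadcasting    (second type of information)

def proportion_of_first_type(statuses: list) -> float:
    """Share of segments collected while the device was silent."""
    first_quantity = sum(1 for s in statuses if s == 1)
    return first_quantity / len(statuses)

statuses = [1, 1, 0, 1]           # 3 of 4 segments collected in silence
proportion = proportion_of_first_type(statuses)
is_human = proportion > 0.5       # assumed threshold
```

When the proportion does not exceed the threshold, the document's scheme falls back to a voiceprint model rather than declaring a machine sound outright.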
  • FIG. 1 is a schematic flowchart of a voice processing method provided by an embodiment of the present invention
  • FIG. 2 is a schematic flowchart of a specific implementation of S101 in FIG. 1;
  • FIG. 3 is a schematic flowchart of another specific implementation of S101 in FIG. 1;
  • FIG. 4 is a schematic flowchart of a specific implementation of S102 in FIG. 1;
  • FIG. 5 is a schematic flowchart of another specific implementation manner of S102 in FIG. 1;
  • FIG. 6 is a schematic flowchart of another specific implementation manner of S102 in FIG. 1;
  • FIG. 8 is a schematic structural diagram of a voice processing device provided by an embodiment of the present invention.
  • FIG. 9 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.
  • In the related art, the smart device uses a preset voiceprint model to detect voice information and determine its sound type, that is, whether the voice information is a human voice or a machine sound. Since the voiceprint model is trained on the machine sound of the smart device, and the voiceprint used for training is similar to the voice spectrum of some users, the voiceprint model may misjudge the voice of those users as machine sound. As a result, part of the human voice cannot be responded to by the smart device, which still affects the user experience. Based on this, how to improve the recognition accuracy of the sound type of voice information is a problem to be solved urgently.
  • In order to solve the above problem, an embodiment of the present invention provides a voice processing method. The method includes: acquiring the voice information to be recognized collected by the smart device and the broadcast status information corresponding to each voice segment contained in it, where the broadcast status information corresponding to each voice segment represents whether the smart device was performing a voice broadcast when the voice segment was collected; and determining the sound type of the voice information to be recognized based on the acquired broadcast status information.
  • In the above solution, the voice information to be recognized collected by the smart device contains at least one voice segment, and for each voice segment it can be determined whether the smart device was performing a voice broadcast when that segment was collected, i.e. the broadcast status information corresponding to the segment. In this way, when recognizing the sound type of the voice information to be recognized, the sound type can be determined based on the broadcast status information corresponding to each voice segment. That is to say, the voice broadcast status information of each voice segment can be used to recognize the sound type of the voice to be recognized. Because the broadcast status information reflects whether the received voice information to be recognized may contain machine sound generated by the smart device's voice broadcast, the accuracy of recognizing the sound type of the voice information can be improved.
  • The execution subject of the voice processing method provided in the embodiment of the present invention may be the smart device that collects the voice information to be recognized; in this case, the recognition method can be completed offline.
  • The smart device may be any smart electronic device that needs to perform voice processing, for example, a smart robot, a smart speaker, a smart phone, or a tablet computer, which is not specifically limited in the embodiment of the present invention.
  • the execution subject may also be a server that provides voice processing for the smart device that collects the voice information to be recognized, so that the recognition method may be completed online.
  • When the execution subject is the server, the smart device, while collecting various sound signals in the environment, can process the sound signals locally to obtain the voice information to be recognized and the broadcast status information corresponding to each voice segment contained in it, and can then upload the voice information to be recognized and each piece of broadcast status information to the server, so that the server can execute the voice processing method provided by the embodiment of the present invention.
  • FIG. 1 is a schematic flowchart of a voice processing method provided by an embodiment of the present invention. As shown in Figure 1, the method may include the following steps:
  • S101 Obtain the voice information to be recognized collected by the smart device and the broadcast status information corresponding to each voice segment included in the voice information to be recognized;
  • the broadcast status information corresponding to each voice segment represents whether the smart device is performing voice broadcast when the voice segment is collected;
  • What the electronic device determines is the sound type of the received voice information to be recognized; therefore, the electronic device first needs to obtain the voice information to be recognized. When the types of electronic devices differ, the ways in which they obtain the voice information to be recognized may differ. In addition, the electronic device uses the broadcast status information corresponding to each voice segment contained in the voice information to be recognized to determine its sound type; therefore, the electronic device also needs to obtain the broadcast status information corresponding to each voice segment. Similarly, when the types of electronic devices differ, the manner in which they obtain this broadcast status information may also differ.
  • Specifically, when the electronic device is a smart device, it can process the sound signals it collects from the environment to obtain the voice information to be recognized and the broadcast status information corresponding to each voice segment contained in it. When the electronic device is a server, it can receive the voice information to be recognized uploaded by the corresponding smart device together with the broadcast status information corresponding to each voice segment contained in it.
  • step S101 will be described in detail later.
  • S102 Determine the sound type of the voice information to be recognized based on the acquired broadcast status information.
  • the electronic device can determine the sound type of the voice information to be recognized based on the acquired broadcast status information.
  • the electronic device can perform the above step S102 in a variety of ways, which is not specifically limited in the embodiment of the present invention.
  • the specific implementation manner of the above step S102 will be described with an example in the following.
  • the voice broadcast status information of each voice segment included in the voice information to be recognized can be used to recognize the sound type of the voice to be recognized.
  • the voice broadcast status information can reflect whether there is a machine sound generated by the voice broadcast of the smart device in the received voice information to be recognized, the accuracy of the recognition of the voice type of the voice information can be improved.
  • Optionally, step S101 may include the following steps:
  • S201: Perform voice activity detection on the collected sound signals;
  • S202: Starting from the target moment, divide the collected sound signals to obtain multiple voice segments, where the target moment is the moment when the voice start signal is collected;
  • S203: While collecting each voice segment, detect whether the smart device is performing a voice broadcast, and determine the broadcast status information of the voice segment according to the detection result;
  • S204: Determine the voice information to be recognized based on the multiple voice segments obtained by the division.
  • Here, the broadcast status information corresponding to each voice segment is the broadcast status information of the smart device read when the voice segment was collected.
  • After the smart device is started, it can collect sound signals in the environment in real time.
  • the sound signal may include the voice information sent by the user, may also include the voice information sent by the smart device itself, and may also include the sound signals of various noises as background sounds of the environment.
  • For each collected sound signal, the smart device can detect whether it can serve as a voice start signal. When a sound signal is detected as the voice start signal, the sound signals collected after the moment the voice start signal was collected can be used as the voice information contained in the voice information to be recognized, with the voice start signal serving as its start information. The smart device can also detect, one by one, the sound signals collected after that moment to determine whether a signal can serve as a voice termination signal; when one is detected, it is determined to be the termination information of the voice information to be recognized. In this way, the detected voice start signal, the voice termination signal, and the sound signals located between them constitute the voice information to be recognized.
  • the voice start signal may be used as the start information of the voice information to be recognized
  • the voice termination signal is the termination information in the voice information to be recognized.
  • Specifically, the smart device continuously collects sound from the environment and generates corresponding sound signals in sequence. Starting from the target moment at which the voice start signal is collected, the smart device can divide the collected sound signals into segments according to a preset division rule, obtaining multiple voice segments in turn until the voice termination signal is detected. The detected voice termination signal is included in the last voice segment, and the sound signal contained in the last segment may not satisfy the preset division rule. The preset division rule may be, for example, that the duration of the collected sound signal reaches a preset value, or that the collected sound signal corresponds to one syllable; this is not described in detail in the embodiment of the present invention.
  • Optionally, the voice activity detection may be VAD (Voice Activity Detection, also called voice endpoint detection). The smart device can use VAD to detect the voice start endpoint and the voice termination endpoint in the sound signal.
  • the voice initiation endpoint is the voice initiation signal of the voice information to be recognized
  • the voice termination endpoint is the voice termination signal of the voice information to be recognized.
  • In this way, starting from the detection of the voice start endpoint, the smart device can divide the collected sound signals into voice segments according to the preset division rule, until the voice termination endpoint is detected and is divided into the last voice segment contained in the voice information to be recognized.
  • the smart device can determine the voice information to be recognized based on the divided voice segments.
  • The last sound signal in the last voice segment obtained by division is the termination information of the voice information to be recognized. The sound signals in the voice segments can be arranged in the order of division, and the resulting combination of sound signals is the voice information to be recognized.
  • For example, suppose the preset division rule is that the duration of the collected sound signal reaches 0.1 seconds. If the voice start endpoint is detected at the 1st second of collection, the currently collected signal is determined to be the voice start signal. At the 1.1th second, the sound signal collected between the 1st and 1.1th seconds is divided into the first voice segment; at the 1.2th second, the sound signal collected between the 1.1th and 1.2th seconds is divided into the second voice segment, and so on. If the voice termination endpoint is detected at the 1.75th second, the combination of sound signals collected from the 1st to the 1.75th second is the voice information to be recognized.
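The timeline in this example can be reproduced with a short sketch. The helper is illustrative only; the 0.1 s division rule and the endpoints at 1 s and 1.75 s come from the example above, and rounding is used merely to keep the boundaries tidy:

```python
def divide_segments(start: float, end: float, step: float = 0.1):
    """Divide [start, end] into fixed-length voice segments; the last
    segment absorbs the remainder and may be shorter than `step`."""
    bounds, t = [], start
    while t + step < end:
        bounds.append((round(t, 2), round(t + step, 2)))
        t += step
    # The final segment contains the voice termination signal.
    bounds.append((round(t, 2), end))
    return bounds

segments = divide_segments(1.0, 1.75)
# seven full 0.1 s segments plus a final 0.05 s segment
```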
  • Optionally, the broadcast status information may be TTS (Text To Speech) status information. In one case, the smart device converts the text information to be broadcast into voice information through an offline model and then broadcasts the voice information; in another case, the server converts the text information to be broadcast into voice information through a cloud model and feeds the converted voice information back to the smart device, which then broadcasts the received voice information. Here, the conversion of the text information to be broadcast into voice information is TTS; this process can be performed through an offline model in the smart device, or online through a cloud model on the server side.
  • When the smart device is not performing a voice broadcast while a voice segment is collected, the TTS status information corresponding to that segment can be recorded as the TTS idle state, and the TTS idle state can be defined as 1; that is, the first type of information is defined as 1. When the smart device is performing a voice broadcast while a voice segment is collected, the TTS status information corresponding to that segment can be recorded as the TTS broadcast state, and the TTS broadcast state can be defined as 0; that is, the second type of information is defined as 0.
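The 0/1 encoding just described can be written down directly. A minimal sketch, assuming the function name and the per-segment boolean input (neither is specified by the document):

```python
TTS_IDLE = 1       # first type of information: no broadcast in progress
TTS_BROADCAST = 0  # second type of information: broadcast in progress

def tts_status(device_is_broadcasting: bool) -> int:
    """Record the TTS status read while a voice segment is collected."""
    return TTS_BROADCAST if device_is_broadcasting else TTS_IDLE

# One status value is recorded per collected voice segment.
statuses = [tts_status(b) for b in (False, False, True)]
```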
  • In addition, when the smart device collects various sound signals in the environment in real time, in order to prevent noise in the collected environmental background sound from affecting the smart device's detection, the collected sound signals may be preprocessed to attenuate the collected noise and enhance the sound signals that can serve as the voice information to be detected.
  • On this basis, before the above step S201, the method may further include the following step:
  • S200: Perform signal preprocessing on the sound signal according to the sound wave shape of the collected sound signal.
  • Specifically, for each collected sound signal, the smart device can obtain its sound wave shape and perform signal preprocessing on the signal accordingly: sound signals whose wave shape matches the wave shape of noise are attenuated, and sound signals whose wave shape matches that of signals usable as the voice information to be recognized are enhanced.
  • Correspondingly, the above step S201 performs voice activity detection on the sound signal after signal preprocessing.
  • Optionally, the smart device can pre-collect the sound wave shapes of various kinds of noise and of various sound signals that can serve as voice information to be detected, and use these wave shapes and their corresponding labels for model training to obtain a sound wave detection model. The label corresponding to each sound wave shape characterizes whether that shape is the wave shape of noise or of a sound signal usable as voice information to be detected. The sound signal usable as voice information to be detected may be a voice signal issued by the user or a voice signal broadcast by the smart device; that is, its sound type may be a human voice or a machine sound.
  • Optionally, when the execution subject of the embodiment of the present invention is a server, the above step S101 may include the following, and the sound type determination is completed online: the smart device collects various sound signals in the environment, obtains the voice information to be recognized from the collected sound signals, determines the broadcast status information corresponding to each voice segment contained in it, and sends the voice information to be recognized and each piece of broadcast status information to the server, so that the server executes the voice processing method provided in the embodiment of the present invention to determine the sound type of the voice information to be recognized.
  • Specifically, the smart device can determine the voice information to be recognized and the broadcast status information corresponding to each voice segment contained in it through the solution provided in the embodiment shown in FIG. 2 or FIG. 3 above, and send the determined voice information to be recognized and the corresponding broadcast status information to the server.
  • The specific content sent can be each voice segment obtained by division and the broadcast status information corresponding to each voice segment, so that the server simultaneously receives each voice segment included in the voice information to be recognized and its corresponding broadcast status information. In this way, after receiving each voice segment in sequence, the server can obtain the voice information to be recognized; in other words, the entirety of the voice segments received by the server is the voice information to be recognized.
  • step S102 may include the following steps:
  • the electronic device can obtain the broadcast status information corresponding to each voice segment contained in the voice information to be recognized, and in particular the broadcast status information corresponding to the first voice segment; the electronic device can then determine whether that broadcast status information indicates that the smart device was not performing voice broadcast when the segment was collected.
  • if the smart device was not performing voice broadcast, it can be concluded that the voice information to be recognized was uttered by the user; therefore, the electronic device can determine that the sound type of the voice information to be recognized is human voice.
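The first-segment check above can be sketched as follows. The list of boolean flags (True meaning the smart device was idle, i.e. not broadcasting, when the segment was collected) is an assumed representation of the broadcast status information, chosen for illustration; the patent does not prescribe a concrete data format.

```python
def is_human_by_first_segment(idle_flags):
    """Return True (human voice) when the device was not broadcasting
    while the first segment of the utterance was being collected."""
    if not idle_flags:
        raise ValueError("no broadcast status information")
    # Only the flag of the first voice segment matters in this variant.
    return bool(idle_flags[0])
```

This variant is cheap because it inspects a single flag, at the cost of ignoring later segments entirely.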
  • step S102 may include the following steps:
  • S401 Determine the first quantity of the first type of information from the acquired broadcast status information
  • the first type of information indicates that the smart device did not perform voice broadcast when the corresponding voice segment was collected
  • after obtaining the voice information to be recognized and the broadcast state information corresponding to each voice segment it contains, the electronic device can determine the first quantity of the first type of information from the broadcast state information.
  • the determined first quantity represents the number of voice segments, among those contained in the voice information to be recognized, whose sound type is human voice.
  • after determining the first quantity of the first type of information, the electronic device can determine the proportion information of the first type of information based on that first quantity.
  • step S402 may include the following steps:
  • S402A Calculate the first ratio of the first quantity to the total quantity of the acquired broadcast status information, and use the first ratio as the proportion information of the first type of information.
  • if the smart device was not performing voice broadcast when a voice segment was collected, then, since that segment forms part of the voice information to be recognized, it can be determined that the segment was uttered by the user and that its sound type is human voice.
  • if the broadcast status information of a voice segment is the second type of information, it indicates that the smart device was performing voice broadcast while the segment was being collected; since that segment forms part of the voice information to be recognized, it can be determined that the segment contains voice information broadcast by the smart device, either purely the broadcast voice or a mixture of the user's voice and the broadcast voice.
  • the above two situations may lead to the wrong behavior of "self-questioning and self-answering" in smart devices.
  • the first ratio of the first quantity to the total quantity of the acquired broadcast status information can be calculated, and the first ratio can be used as the proportion information of the first type of information.
  • the proportion information of the first type of information calculated above can be understood as the proportion of voice segments whose sound type is human voice among all voice segments contained in the voice information to be recognized.
  • the higher this ratio, the greater the possibility that the sound type of the voice information to be recognized is human voice.
  • a first ratio of 0 indicates that the sound type of the voice information to be recognized is more likely to be machine sound;
  • a first ratio of 1 indicates that the sound type of the voice information to be recognized is more likely to be human voice.
  • for example, when the broadcast status information is TTS status information, with the TTS playing status defined as 0 and the TTS idle status defined as 1, the first ratio calculated above is the ratio of the number of 1 values among the acquired TTS status information to the total number of acquired TTS status information.
  • the first ratio can be calculated to be 0.9.
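The first-ratio calculation of step S402A can be sketched as follows, assuming the TTS status values defined above (0 = playing, 1 = idle); the list-of-integers representation is an assumption for illustration.

```python
def first_ratio(tts_states):
    """Proportion of segments collected while TTS was idle (state 1)
    among all segments of the utterance (first ratio, S402A)."""
    if not tts_states:
        raise ValueError("no TTS status information")
    return tts_states.count(1) / len(tts_states)
```

For instance, nine idle segments out of ten yield the 0.9 ratio mentioned above.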
  • step S402 may include the following steps:
  • S402B Determine the second quantity of the second type of information from the acquired broadcast status information, calculate the second ratio of the first quantity to the second quantity, and use the second ratio as the proportion information of the first type of information;
  • the second type of information indicates that the smart device is performing voice broadcast when the corresponding voice segment is collected.
  • the electronic device may further determine the second quantity of the second type of information from the broadcast status information; it can then calculate the second ratio of the first quantity to the second quantity and use that ratio as the proportion information of the first type of information.
  • if the smart device was not performing voice broadcast when a voice segment was collected, then, since that segment forms part of the voice information to be recognized, it can be determined that the segment was uttered by the user and that its sound type is human voice.
  • if the broadcast status information of a voice segment is the second type of information, it indicates that the smart device was performing voice broadcast while the segment was being collected; since that segment forms part of the voice information to be recognized, it can be determined that the segment contains voice information broadcast by the smart device, either purely the broadcast voice or a mixture of the user's voice and the broadcast voice.
  • both situations may lead to the erroneous "self-questioning and self-answering" behavior in smart devices; in such cases, it can be determined that the sound type of the voice segment is machine sound.
  • for example, when the broadcast status information is TTS status information, with the TTS playing status defined as 0 and the TTS idle status defined as 1, the second ratio calculated above is the ratio of the number of 1 values to the number of 0 values among the acquired TTS status information.
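The second-ratio calculation of step S402B can be sketched in the same representation. Note that the patent does not specify what happens when no segment was collected during playback (no 0 values); returning infinity in that edge case is an assumption made here so the ratio still compares sensibly against a threshold.

```python
def second_ratio(tts_states):
    """Ratio of idle-state segments (1) to playing-state segments (0),
    i.e. the second ratio of step S402B."""
    idle = tts_states.count(1)
    playing = tts_states.count(0)
    if playing == 0:
        # Assumption: no playback segments at all counts as maximally human.
        return float("inf")
    return idle / playing
```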
  • S403 Determine the sound type of the voice information to be recognized according to the relationship between the proportion information and the set threshold.
  • after determining the proportion information of the first type of information, the electronic device can determine the sound type of the voice information to be recognized according to the relationship between the proportion information and the set threshold.
  • step S403 may include the following steps:
  • if the proportion information is greater than the set threshold, it is determined that the voice information to be recognized is a human voice; or,
  • if the proportion information is not greater than the set threshold, and the detection result of the voiceprint model on the voice information to be recognized is a human voice, it is determined that the voice information to be recognized is a human voice; or,
  • if the proportion information is not greater than the set threshold, and the detection result of the voiceprint model on the voice information to be recognized is a machine sound, it is determined that the voice information to be recognized is a machine sound.
  • the greater the determined proportion information of the first type of information, the greater the possibility that the sound type of the voice information to be recognized is human voice.
  • therefore, if the proportion information is greater than the set threshold, it can be determined that the voice information to be recognized is a human voice.
  • otherwise, the electronic device can obtain the detection result produced by the voiceprint model for the voice information to be recognized, so that when the detection result is a human voice, it can still be determined that the voice information to be recognized is a human voice.
  • if the proportion information is not greater than the set threshold, and the detection result of the voiceprint model on the voice information to be recognized is a machine sound, it can be determined that the voice information to be recognized is a machine sound.
  • the thresholds set in the above cases may be the same or different.
  • the electronic device may use a preset voiceprint model to detect the voice information to be recognized as soon as it is received in step S101, so as to obtain the detection result in advance; in this specific implementation, the already obtained detection result can be used directly. Alternatively, when performing the above step S403 and determining that the proportion information is not greater than the set threshold, the electronic device may at that point use the preset voiceprint model to detect the voice information to be recognized, obtain the detection result, and then use it.
  • that is, the electronic device may first determine whether the proportion information is greater than the set threshold, and when it is, determine that the voice information to be recognized is a human voice.
  • the voiceprint model can obtain the detection result of the voice information to be recognized.
  • if the detection result is a human voice, it can be determined that the voice information to be recognized is a human voice;
  • if the detection result is a machine sound, it can be determined that the voice information to be recognized is a machine sound.
  • the voiceprint model may first obtain the detection result of the voice information to be recognized, and when the detection result is a human voice, it may be determined that the voice information to be recognized is a human voice.
  • when the detection result is a machine sound, it can then be judged whether the calculated proportion information is greater than the set threshold: if it is, the voice information to be recognized can be determined to be a human voice; if it is not, the voice information to be recognized can be determined to be a machine sound.
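The combined decision of step S403 can be sketched as follows. The threshold value of 0.5 and the "human"/"machine" string labels for the voiceprint model's verdict are illustrative assumptions, not values fixed by the patent.

```python
def classify_sound_type(proportion, voiceprint_result, threshold=0.5):
    """Decide the sound type from the proportion information of the first
    type of information, falling back to the voiceprint model's detection
    result when the proportion does not exceed the threshold."""
    if proportion > threshold:
        return "human"
    # Below or at the threshold: defer to the voiceprint model.
    return "human" if voiceprint_result == "human" else "machine"
```

The same function covers both orderings described above, since the voiceprint verdict is only consulted when the proportion test alone cannot establish a human voice.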
  • if it is determined that the voice information to be recognized is a machine sound, prompt information for prompting that the voice information to be recognized is a machine sound is fed back to the smart device.
  • when it is determined that the voice information to be recognized is a machine sound, the electronic device can feed back, to the smart device that collected it, prompt information indicating that the voice information to be recognized is a machine sound.
  • the smart device will not respond to the to-be-recognized voice information, thereby avoiding "self-questioning and self-answering" behaviors.
  • the prompt information may be a preset "error code".
  • the electronic device may not perform semantic recognition on the text recognition result of the voice information to be recognized.
  • the electronic device may not perform voice recognition on the acquired voice information to be recognized, that is, the electronic device may not obtain a text recognition result corresponding to the voice information to be recognized.
  • the embodiment of the present invention may further include the following steps:
  • after obtaining the voice information to be recognized, the electronic device can subsequently obtain the corresponding text recognition result.
  • if it is determined that the voice information to be recognized is a human voice, the electronic device can determine that it is voice information sent by the user, and therefore the electronic device needs to respond to it.
  • the electronic device can perform semantic recognition on the obtained text recognition result, thereby determining the response information of the voice information to be recognized.
  • the electronic device can input the text recognition result to the semantic model, so that the semantic model can analyze the semantics of the text recognition result and then determine the response result corresponding to those semantics as the response information of the voice information to be recognized.
  • the semantic model is used to recognize the semantics of the text recognition information, obtain the user need corresponding to the voice information to be recognized, and perform the action corresponding to that need, thereby obtaining the response result corresponding to the semantics as the response information of the voice information to be recognized; for example, the result corresponding to the user need may be obtained from a designated website or storage space, or an action corresponding to the user need may be executed.
  • the text recognition information is: how is the weather today.
  • the semantic model can recognize the keywords "today" and "weather" in the text recognition information, and obtain the current geographic location through the positioning system; it can thus determine the user need as: today's weather conditions at the current geographic location. The semantic model can then automatically connect to a website for querying the weather and obtain the current weather conditions at the current geographic location from that website (for example, the weather in Beijing is 23 degrees Celsius), and determine the acquired weather conditions as the response result corresponding to the semantics, i.e. as the response information of the voice information to be recognized.
  • the text recognition information is: Where is Starbucks.
  • the semantic model can recognize the keywords "Starbucks" and "Where” in the text recognition information.
  • the semantic model can determine the user's needs as: the location of Starbucks.
  • the semantic model can read the location information of Starbucks from the information stored in a preset storage space, for example, the northeast corner of the third floor of this commercial building, and then determine the obtained location information as the response result corresponding to the semantics, i.e. as the response information of the voice information to be recognized.
  • the text recognition information is: two meters ahead.
  • the semantic model can recognize the keywords "forward” and "two meters” in the text recognition information.
  • the semantic model can determine the user need as: move forward two meters; it can then generate the corresponding control instruction, so that the smart device controls itself to move forward a distance of two meters.
  • the action of the smart device moving forward is the response result corresponding to the semantics.
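The keyword-recognition behavior shown in the three examples above can be illustrated with a toy sketch. The keyword table, the intent names, and the substring matching are all invented for illustration; an actual semantic model would be far more sophisticated.

```python
# Hypothetical keyword table mapping an intent to the keywords that
# must all appear in the text recognition result.
INTENT_KEYWORDS = {
    "weather_query": ("weather", "today"),
    "location_query": ("where",),
    "move_command": ("forward", "meters"),
}

def match_intent(text):
    """Return the first intent whose keywords all occur in the text,
    mirroring how the semantic model extracts the user need."""
    lowered = text.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if all(keyword in lowered for keyword in keywords):
            return intent
    return "unknown"
```

Once the intent (user need) is known, the response result would be produced per intent: querying a weather site, reading stored location data, or emitting a motion control instruction.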
  • the voice information to be recognized acquired by the electronic device includes multiple voice segments; therefore, in order to ensure the accuracy of the obtained text recognition result, the manner of obtaining the text recognition result corresponding to the voice information to be recognized can include the following steps:
  • when the first voice segment is received, perform speech recognition on it to obtain a temporary text result; when a subsequent voice segment is received, perform speech recognition on all voice segments received so far, based on the temporary text result already obtained, to obtain a new temporary text result; when the last voice segment has been received, the result obtained is the text recognition result corresponding to the voice information to be recognized.
  • specifically, when the first voice segment is received, speech recognition is performed on it to obtain the temporary text result of the first segment; when the second voice segment is received, the voice information composed of the first and second segments is recognized based on the temporary text result of the first segment, yielding the temporary text result of the first two segments; when the third voice segment is received, the voice information composed of the first three segments is recognized based on the temporary text result of the first two segments, yielding the temporary text result of the first three segments; and so on.
  • when the last voice segment is received, the voice information composed of the first through last segments is recognized based on the temporary text result of the segments up to the penultimate one, yielding the temporary text result of the whole sequence; the result obtained at this point is the text recognition result corresponding to the voice information to be recognized.
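The incremental recognition loop described above can be sketched as follows. The `decode` callable standing in for the decoder, and its `(received_segments, prior_text)` signature, are placeholders assumed for illustration.

```python
def recognize_incrementally(segments, decode):
    """Run recognition again over all segments received so far each time
    a new segment arrives, reusing the previous temporary text result."""
    temp_text = ""
    received = []
    for segment in segments:
        received.append(segment)
        # Recognition covers everything received so far; the prior
        # temporary result lets the decoder refine rather than restart.
        temp_text = decode(list(received), temp_text)
    # After the last segment, temp_text is the final text recognition result.
    return temp_text
```

A trivial stand-in decoder that joins segment labels shows the control flow without a real speech model.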
  • the voice recognition model in the electronic device may be used to perform voice recognition on the voice information to be recognized.
  • the voice recognition model may be obtained by training on voice samples, where each voice sample includes voice information and the text information corresponding to that voice information.
  • through training, the voice recognition model can establish the correspondence between voice information and text information; in this way, after the trained voice recognition model receives the voice information to be recognized, it can determine the corresponding text recognition result according to the established correspondence.
  • the speech recognition model can be called a decoder.
  • the electronic device may output the temporary recognition result to the user.
  • when the electronic device is a smart device, it can directly output the temporary recognition result through the display screen.
  • the electronic device may also output the text recognition result to the user.
  • when the electronic device is a server, it sends the text recognition result to the smart device that sent the voice information to be recognized, so that the smart device outputs the text recognition result through the display screen;
  • when the electronic device is a smart device, it can directly output the text recognition result through the display screen.
  • the electronic device may broadcast the response information to the user.
  • when the electronic device is a server, it sends the response information to the smart device that sent the voice information to be recognized, so that the smart device broadcasts the response information to the user;
  • when the electronic device is a smart device, it can directly broadcast the response information.
  • in the following example, the above-mentioned electronic device is a server. Specifically:
  • the smart device collects each sound signal in the environment in real time, and performs signal preprocessing on the sound signal according to the sound wave shape of the collected sound signal.
  • the smart device performs voice activity detection on the sound signal after signal preprocessing.
  • VAD can be used to detect the voice start endpoint and the voice termination endpoint in the preprocessed sound signal; after the voice start endpoint is detected, the collected sound signals are divided in sequence into voice segments according to the preset division rule, until the voice termination endpoint is detected.
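The endpoint detection and segmentation just described can be illustrated with an energy-threshold sketch. The frame-energy representation, both thresholds, and the fixed three-frame segment length standing in for the "preset division rule" are all illustrative assumptions; a production VAD would use a trained model rather than raw energy.

```python
def segment_stream(frame_energies, start_thr=0.5, stop_thr=0.2, seg_len=3):
    """Start segmenting once energy reaches start_thr (voice start
    endpoint), emit fixed-size segments, and stop once energy drops
    below stop_thr (voice termination endpoint)."""
    segments, current, speaking = [], [], False
    for energy in frame_energies:
        if not speaking:
            if energy >= start_thr:
                speaking = True          # voice start endpoint detected
            else:
                continue                 # still silence before speech
        if energy < stop_thr:
            break                        # voice termination endpoint detected
        current.append(energy)
        if len(current) == seg_len:      # preset division rule (assumed)
            segments.append(current)
            current = []
    if current:                          # flush a final partial segment
        segments.append(current)
    return segments
```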
  • the decoder performs voice recognition on all the currently received voice segments to obtain a temporary recognition result, and sends the temporary recognition result to the smart device, so that the smart device outputs the temporary recognition result through the display screen.
  • when the text recognition result of the voice information to be recognized is obtained, it is sent to the smart device, so that the smart device outputs the text recognition result through the display screen.
  • the voiceprint model performs voiceprint detection on all voice segments currently received and records the detection results; accordingly, by the time all voice segments constituting the voice information to be recognized have been received, voiceprint detection has been performed on the voice information to be recognized and the detection result recorded.
  • after the server receives the TTS status information corresponding to each of the voice segments constituting the voice information to be recognized, it counts the number of 1s in the received TTS status information, calculates the ratio of that number to the total number of received TTS status information, and determines the relationship between the ratio and the set threshold.
  • if the ratio is greater than the set threshold, it is determined that the voice information to be recognized is a human voice.
  • if the ratio is not greater than the set threshold, and the detection result of the voiceprint model on the voice information to be recognized is a human voice, it is determined that the voice information to be recognized is a human voice.
  • if the ratio is not greater than the set threshold, and the detection result of the voiceprint model on the voice information to be recognized is a machine sound, it is determined that the voice information to be recognized is a machine sound.
  • after receiving the response information, the smart device can output it.
  • the embodiment of the present invention also provides a voice processing device.
  • FIG. 8 is a schematic structural diagram of a voice processing device provided by an embodiment of the present invention. As shown in Figure 8, the voice processing device includes the following modules:
  • the information acquisition module 810 is configured to acquire the voice information to be recognized collected by the smart device and the broadcast status information corresponding to each voice segment contained in the voice information to be recognized; wherein the broadcast status information corresponding to each voice segment indicates whether the smart device was performing voice broadcast when that voice segment was collected;
  • the type determining module 820 is configured to determine the sound type of the voice information to be recognized based on the acquired broadcast status information.
  • optionally, the type determining module 820 is specifically configured to: judge whether the broadcast status information corresponding to the first voice segment among the voice segments indicates that the smart device was not performing voice broadcast when that segment was collected; and if so, determine that the sound type of the voice information to be recognized is human voice.
  • optionally, the type determining module 820 is specifically configured to:
  • from the acquired broadcast status information, determine the first quantity of the first type of information, wherein the first type of information indicates that the smart device was not performing voice broadcast when the corresponding voice segment was collected; determine the proportion information of the first type of information based on the first quantity; and determine the sound type of the voice information to be recognized according to the relationship between the proportion information and the set threshold.
  • the type determining module 820 is specifically configured to:
  • from the acquired broadcast status information, determine the first quantity of the first type of information; calculate the first ratio of the first quantity to the total quantity of acquired broadcast status information, and use the first ratio as the proportion information of the first type of information; determine the sound type of the voice information to be recognized according to the relationship between the proportion information and the set threshold; or,
  • from the acquired broadcast status information, determine the second quantity of the second type of information, wherein the second type of information indicates that the smart device was performing voice broadcast when the corresponding voice segment was collected; calculate the second ratio of the first quantity to the second quantity, and use the second ratio as the proportion information of the first type of information; determine the sound type of the voice information to be recognized according to the relationship between the proportion information and the set threshold.
  • the type determination module is specifically configured to:
  • if the proportion information is greater than the set threshold, it is determined that the voice information to be recognized is a human voice; or,
  • if the proportion information is not greater than the set threshold, and the detection result of the voiceprint model on the voice information to be recognized is a human voice, it is determined that the voice information to be recognized is a human voice; or,
  • if the proportion information is not greater than the set threshold, and the detection result of the voiceprint model on the voice information to be recognized is a machine sound, it is determined that the voice information to be recognized is a machine sound.
  • the device further includes:
  • the information feedback module is configured to, if it is determined that the voice information to be recognized is machine sound, feed back to the smart device prompt information for prompting that the voice information to be recognized is machine sound.
  • the device further includes:
  • the result obtaining module is used to obtain the text recognition result corresponding to the voice information to be recognized;
  • the information determining module is configured to, if it is determined that the voice information to be recognized is a human voice, perform semantic recognition based on the text recognition result, and determine the response information of the voice information to be recognized.
  • an embodiment of the present invention also provides an electronic device, as shown in FIG. 9, including a processor 901, a communication interface 902, a memory 903, and a communication bus 904, wherein the processor 901, the communication interface 902, and the memory 903 communicate with each other through the communication bus 904;
  • the memory 903 is used to store computer programs
  • the processor 901 is configured to implement a voice processing method provided in the foregoing embodiment of the present invention when executing a program stored in the memory 903.
  • the aforementioned voice processing method includes:
  • acquire the voice information to be recognized collected by the smart device and the broadcast status information corresponding to each voice segment contained in the voice information to be recognized; wherein the broadcast status information corresponding to each voice segment indicates whether the smart device was performing voice broadcast when that voice segment was collected;
  • determine, based on the acquired broadcast status information, the sound type of the voice information to be recognized.
  • the voice broadcast status information of each voice segment in the voice information to be recognized can be used to recognize the voice type of the voice to be recognized.
  • since the voice broadcast status information can reflect whether the received voice information to be recognized contains machine sound generated by the smart device's voice broadcast, the accuracy of recognizing the sound type of the voice information can be improved.
  • the communication interface is used for communication between the above-mentioned electronic device and other devices.
  • the memory may include random access memory (Random Access Memory, RAM), and may also include non-volatile memory (Non-Volatile Memory, NVM), such as at least one disk storage.
  • the memory may also be at least one storage device located far away from the foregoing processor.
  • the above-mentioned processor can be a general-purpose processor, including a central processing unit (CPU), a network processor (Network Processor, NP), etc.; it can also be a digital signal processor (Digital Signal Processing, DSP), a dedicated integrated Circuit (Application Specific Integrated Circuit, ASIC), Field-Programmable Gate Array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components.
  • the embodiment of the present invention also provides a computer-readable storage medium.
  • when the computer program stored in the computer-readable storage medium is executed by a processor, any voice processing method provided in the foregoing embodiments of the present invention is implemented.

Abstract

Provided are a voice processing method and apparatus, and an intelligent device and a storage medium. The method comprises: acquiring voice information to be recognized that is collected by an intelligent device and broadcast state information corresponding to each voice segment included in the voice information to be recognized, wherein the broadcast state information corresponding to each voice segment represents whether the intelligent device is conducting a voice broadcast when the voice segment is collected; and determining, on the basis of the acquired broadcast state information, the sound type of the voice information to be recognized. Compared with the prior art, the recognition accuracy of the sound type of voice information can be improved by applying the solution provided in the embodiments of the present invention.

Description

一种语音处理方法、装置、智能设备及存储介质Voice processing method, device, intelligent equipment and storage medium
相关申请的交叉引用Cross-references to related applications
This application is based on, and claims priority to, Chinese patent application No. 201911398330.X, filed on December 30, 2019, the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the technical field of intelligent robots, and in particular to a voice processing method and apparatus, a smart device, and a storage medium.
Background
Areas such as shopping malls are usually equipped with smart devices capable of holding continuous conversations with users, for example intelligent robots and smart speakers. After waking such a smart device, a user can carry out multiple rounds of voice interaction with it without having to wake it again between rounds.
For example, after waking the smart device by touch, the user may utter the voice message "What is the weather like today?", whereupon the smart device broadcasts the queried weather conditions to the user. The user may then utter another voice message, "Where is the Starbucks?", and the smart device continues by broadcasting the queried location of the Starbucks to the user. Between the two utterances "What is the weather like today?" and "Where is the Starbucks?", the smart device remains in the awake state, so the user does not need to wake it again.
However, in the above process, while the smart device is awake it may also pick up the voice information that it broadcasts itself and respond to it as though it were voice information uttered by the user; that is, the smart device may mistake its own machine voice for a human voice. The resulting erroneous "asking and answering itself" behavior degrades the user experience.
Accordingly, how to improve the accuracy of recognizing the sound type of voice information is a problem that urgently needs to be solved.
Summary
Embodiments of the present invention aim to provide a voice processing method and apparatus, a smart device, and a storage medium, so as to improve the accuracy of recognizing the sound type of voice information. The specific technical solutions are as follows.
In a first aspect, an embodiment of the present invention provides a voice processing method, the method including:
acquiring to-be-recognized voice information collected by a smart device, and broadcast status information corresponding to each voice segment contained in the to-be-recognized voice information, where the broadcast status information corresponding to each voice segment indicates whether the smart device was performing a voice broadcast while that voice segment was being collected; and
determining the sound type of the to-be-recognized voice information based on the acquired broadcast status information.
Optionally, in a specific implementation, the step of determining the sound type of the to-be-recognized voice information based on the acquired broadcast status information includes:
judging whether the broadcast status information corresponding to the first of the voice segments indicates that the smart device was not performing a voice broadcast while that segment was being collected; and
if so, determining that the sound type of the to-be-recognized voice information is a human voice.
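As a minimal illustrative sketch (not part of the claimed disclosure; the function name and the boolean flag encoding are assumptions introduced here), the first-segment rule above reduces to inspecting the broadcast-status flag of the earliest collected segment:

```python
def is_human_by_first_segment(broadcast_flags):
    """broadcast_flags[i] is True if the smart device was performing a voice
    broadcast while segment i was being collected, and False otherwise."""
    if not broadcast_flags:
        raise ValueError("no voice segments collected")
    # A human speaker must have begun talking while the device was silent,
    # so a non-broadcasting first segment is taken to indicate a human voice.
    return broadcast_flags[0] is False

# Device silent when speech began -> classified as a human voice
print(is_human_by_first_segment([False, True, True]))
```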
Optionally, in a specific implementation, the step of determining the sound type of the to-be-recognized voice information based on the acquired broadcast status information includes:
determining, from the acquired broadcast status information, a first quantity of first-type information, where the first-type information indicates that the smart device was not performing a voice broadcast while the corresponding voice segment was being collected;
determining proportion information of the first-type information based on the first quantity of the first-type information; and
determining the sound type of the to-be-recognized voice information according to the magnitude relationship between the proportion information and a set threshold.
Optionally, in a specific implementation, the step of determining the proportion information of the first-type information based on the first quantity of the first-type information includes:
calculating a first ratio of the first quantity to the total quantity of the acquired broadcast status information, and taking the first ratio as the proportion information of the first-type information; or
determining, from the acquired broadcast status information, a second quantity of second-type information, calculating a second ratio of the first quantity to the second quantity, and taking the second ratio as the proportion information of the first-type information,
where the second-type information indicates that the smart device was performing a voice broadcast while the corresponding voice segment was being collected.
Optionally, in a specific implementation, the step of determining the sound type of the to-be-recognized voice information according to the magnitude relationship between the proportion information and the set threshold includes:
if the proportion information is greater than the set threshold, determining that the to-be-recognized voice information is a human voice; or
if the proportion information is not greater than the set threshold, and a detection result of a voiceprint model on the to-be-recognized voice information indicates that the to-be-recognized voice information is a human voice, determining that the to-be-recognized voice information is a human voice; or
if the proportion information is not greater than the set threshold, and a detection result of a voiceprint model on the to-be-recognized voice information indicates that the to-be-recognized voice information is a machine voice, determining that the to-be-recognized voice information is a machine voice.
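A hedged sketch of the proportion-plus-voiceprint decision chain described in the branches above (the function names and the boolean flag encoding are illustrative assumptions; the voiceprint model is treated here as an opaque boolean verdict rather than any concrete model):

```python
def classify_voice(broadcast_flags, threshold, voiceprint_says_human):
    """broadcast_flags[i] is True if the device was broadcasting while
    segment i was collected (second-type information); False means it was
    silent (first-type information)."""
    first_quantity = sum(1 for flag in broadcast_flags if not flag)
    # First-ratio variant: the first quantity over the total quantity of
    # acquired broadcast status information.
    proportion = first_quantity / len(broadcast_flags)
    # (The second-ratio variant would instead divide first_quantity by the
    # count of second-type information, guarding against division by zero.)
    if proportion > threshold:
        return "human"
    # Proportion not conclusive: defer to the voiceprint model's verdict.
    return "human" if voiceprint_says_human else "machine"
```

For example, with three segments of which two were captured while the device was silent and a threshold of 0.5, the proportion branch alone classifies the utterance as a human voice, without consulting the voiceprint model.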
Optionally, in a specific implementation, the method further includes:
if it is determined that the to-be-recognized voice information is a machine voice, feeding back, to the smart device, prompt information for indicating that the to-be-recognized voice information is a machine voice.
Optionally, in a specific implementation, the method further includes:
acquiring a text recognition result corresponding to the to-be-recognized voice information; and
if it is determined that the to-be-recognized voice information is a human voice, performing semantic recognition based on the text recognition result to determine response information for the to-be-recognized voice information.
In a second aspect, an embodiment of the present invention provides a voice processing apparatus, the apparatus including:
an information acquisition module, configured to acquire to-be-recognized voice information collected by a smart device, and broadcast status information corresponding to each voice segment contained in the to-be-recognized voice information, where the broadcast status information corresponding to each voice segment indicates whether the smart device was performing a voice broadcast while that voice segment was being collected; and
a type determination module, configured to determine the sound type of the to-be-recognized voice information based on the acquired broadcast status information.
Optionally, in a specific implementation, the type determination module is specifically configured to:
judge whether the broadcast status information corresponding to the first of the voice segments indicates that the smart device was not performing a voice broadcast while that segment was being collected, and if so, determine that the sound type of the to-be-recognized voice information is a human voice.
Optionally, in a specific implementation, the type determination module is specifically configured to:
determine, from the acquired broadcast status information, a first quantity of first-type information, where the first-type information indicates that the smart device was not performing a voice broadcast while the corresponding voice segment was being collected; determine proportion information of the first-type information based on the first quantity of the first-type information; and determine the sound type of the to-be-recognized voice information according to the magnitude relationship between the proportion information and a set threshold.
Optionally, in a specific implementation, the type determination module is specifically configured to:
determine, from the acquired broadcast status information, a first quantity of first-type information; calculate a first ratio of the first quantity to the total quantity of the acquired broadcast status information, and take the first ratio as the proportion information of the first-type information; and determine the sound type of the to-be-recognized voice information according to the magnitude relationship between the proportion information and a set threshold; or
determine, from the acquired broadcast status information, a second quantity of second-type information; calculate a second ratio of the first quantity to the second quantity, and take the second ratio as the proportion information of the first-type information; and determine the sound type of the to-be-recognized voice information according to the magnitude relationship between the proportion information and the set threshold, where the second-type information indicates that the smart device was performing a voice broadcast while the corresponding voice segment was being collected.
Optionally, in a specific implementation, the type determination module is specifically configured to:
if the proportion information is greater than the set threshold, determine that the to-be-recognized voice information is a human voice; or
if the proportion information is not greater than the set threshold, and a detection result of a voiceprint model on the to-be-recognized voice information indicates that the to-be-recognized voice information is a human voice, determine that the to-be-recognized voice information is a human voice; or
if the proportion information is not greater than the set threshold, and a detection result of a voiceprint model on the to-be-recognized voice information indicates that the to-be-recognized voice information is a machine voice, determine that the to-be-recognized voice information is a machine voice.
Optionally, in a specific implementation, the apparatus further includes:
an information feedback module, configured to, if it is determined that the to-be-recognized voice information is a machine voice, feed back, to the smart device, prompt information for indicating that the to-be-recognized voice information is a machine voice.
Optionally, in a specific implementation, the apparatus further includes:
a result acquisition module, configured to acquire a text recognition result corresponding to the to-be-recognized voice information; and
an information determination module, configured to, if it is determined that the to-be-recognized voice information is a human voice, perform semantic recognition based on the text recognition result to determine response information for the to-be-recognized voice information.
In a third aspect, an embodiment of the present invention provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with one another through the communication bus;
the memory is configured to store a computer program; and
the processor is configured to, when executing the program stored in the memory, implement the steps of any one of the voice processing methods provided in the first aspect above.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium having a computer program stored therein, where the computer program, when executed by a processor, implements the steps of any one of the voice processing methods provided in the first aspect above.
In a fifth aspect, an embodiment of the present invention provides a computer program product, which includes a computer program stored on a computer-readable storage medium, where the computer program includes program instructions that, when executed by a processor, implement the steps of any one of the voice processing methods provided in the first aspect above.
As can be seen from the above, with the solutions provided by the embodiments of the present invention, the to-be-recognized voice information collected by the smart device contains at least one voice segment, and the broadcast status information corresponding to each voice segment can be determined by detecting whether the smart device was performing a voice broadcast while that segment was being collected. Thus, when recognizing the sound type of the to-be-recognized voice information, the sound type can be determined based on the broadcast status information corresponding to each voice segment. That is, in the solutions provided by the embodiments of the present invention, the voice broadcast status information of the voice segments in the to-be-recognized voice information can be used to recognize the sound type of the voice to be recognized. Since the voice broadcast status information reflects whether the received to-be-recognized voice information contains machine sound emitted by the smart device's own voice broadcast, the accuracy of recognizing the sound type of voice information can be improved.
Brief Description of the Drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic flowchart of a voice processing method provided by an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a specific implementation of S101 in FIG. 1;
FIG. 3 is a schematic flowchart of another specific implementation of S101 in FIG. 1;
FIG. 4 is a schematic flowchart of a specific implementation of S102 in FIG. 1;
FIG. 5 is a schematic flowchart of another specific implementation of S102 in FIG. 1;
FIG. 6 is a schematic flowchart of yet another specific implementation of S102 in FIG. 1;
FIG. 7 is a schematic flowchart of another voice processing method provided by an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a voice processing apparatus provided by an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
To reduce the occurrence of "asking and answering itself" behavior, after collecting voice information, a smart device may use a preset voiceprint model to detect the voice information so as to determine its sound type, that is, whether the voice information is a human voice or a machine voice. However, since the voiceprint model is trained on the machine voice of the smart device, and the voiceprints used for training are similar to the speech spectra of some users' voices, the voiceprint model may misjudge those users' voices as machine sound. As a result, those human voices receive no response from the smart device, which still degrades the user experience. On this basis, how to improve the accuracy of recognizing the sound type of voice information is a problem that urgently needs to be solved.
In order to solve the above technical problem, an embodiment of the present invention provides a voice processing method. The method includes:
acquiring to-be-recognized voice information collected by a smart device, and broadcast status information corresponding to each voice segment contained in the to-be-recognized voice information, where the broadcast status information corresponding to each voice segment indicates whether the smart device was performing a voice broadcast while that voice segment was being collected; and
determining the sound type of the to-be-recognized voice information based on the acquired broadcast status information.
As can be seen from the above, with the solutions provided by the embodiments of the present invention, the to-be-recognized voice information collected by the smart device contains at least one voice segment, and the broadcast status information corresponding to each voice segment can be determined by detecting whether the smart device was performing a voice broadcast while that segment was being collected. Thus, when recognizing the sound type of the to-be-recognized voice information, the sound type can be determined based on the broadcast status information corresponding to each voice segment. That is, in the solutions provided by the embodiments of the present invention, the voice broadcast status information of the voice segments in the to-be-recognized voice information can be used to recognize the sound type of the voice to be recognized. Since the voice broadcast status information reflects whether the received to-be-recognized voice information may contain machine sound emitted by the smart device's own voice broadcast, the accuracy of recognizing the sound type of voice information can be improved.
A voice processing method provided by an embodiment of the present invention is described in detail below.
The execution subject of the voice processing method provided by the embodiments of the present invention may be the smart device that collects the to-be-recognized voice information, in which case the recognition can be completed offline. Specifically, the smart device may be any intelligent electronic device that needs to perform voice processing, for example an intelligent robot, a smart speaker, a smartphone, or a tablet computer, which is not specifically limited in the embodiments of the present invention.
Correspondingly, the execution subject may also be a server that provides voice processing for the smart device that collects the to-be-recognized voice information, in which case the recognition can be completed online. Specifically, when the execution subject is the server, the smart device, upon collecting the sound signals in its environment, can process those sound signals locally to obtain the to-be-recognized voice information and the broadcast status information corresponding to each voice segment contained in the to-be-recognized voice information, and can then upload the to-be-recognized voice information and the corresponding broadcast status information to the server, so that the server can execute the voice processing method provided by the embodiments of the present invention.
On this basis, for ease of description, the execution subjects of the voice processing method provided by the embodiments of the present invention are collectively referred to below as the electronic device.
FIG. 1 is a schematic flowchart of a voice processing method provided by an embodiment of the present invention. As shown in FIG. 1, the method may include the following steps.
S101: acquiring to-be-recognized voice information collected by a smart device, and broadcast status information corresponding to each voice segment contained in the to-be-recognized voice information,
where the broadcast status information corresponding to each voice segment indicates whether the smart device was performing a voice broadcast while that voice segment was being collected.
In the embodiments of the present invention, what the electronic device determines is the sound type of the received to-be-recognized voice information; therefore, the electronic device first needs to acquire the to-be-recognized voice information. Depending on the type of the electronic device, the manner in which the electronic device acquires the to-be-recognized voice information may differ.
Further, in the embodiments of the present invention, the electronic device determines the sound type of the to-be-recognized voice information using the broadcast status information corresponding to each voice segment contained in the to-be-recognized voice information; therefore, the electronic device also needs to acquire that broadcast status information. Similarly, depending on the type of the electronic device, the manner of acquiring the broadcast status information corresponding to each voice segment may also differ.
For example, when the electronic device is a smart device, it can process the sound signals it collects from its environment to obtain the to-be-recognized voice information and the broadcast status information corresponding to each voice segment contained therein; when the electronic device is a server, it can receive, from the corresponding smart device, the uploaded to-be-recognized voice information and the broadcast status information corresponding to each voice segment contained therein.
For clarity, the specific implementation of step S101 will be described in detail later.
S102: determining the sound type of the to-be-recognized voice information based on the acquired broadcast status information.
In this way, after acquiring the to-be-recognized voice information and the broadcast status information corresponding to each voice segment contained in the to-be-recognized voice information, the electronic device can determine the sound type of the to-be-recognized voice information based on the acquired broadcast status information.
The electronic device may perform step S102 in a variety of ways, which are not specifically limited in the embodiments of the present invention. For clarity, specific implementations of step S102 will be illustrated with examples later.
As can be seen from the above, in the solutions provided by the embodiments of the present invention, the voice broadcast status information of the voice segments contained in the to-be-recognized voice information can be used to recognize the sound type of the voice to be recognized. Since the voice broadcast status information reflects whether the received to-be-recognized voice information contains machine sound emitted by the smart device's voice broadcast, the accuracy of recognizing the sound type of voice information can be improved.
Optionally, in a specific implementation, as shown in FIG. 2, when the electronic device is a smart device, step S101 may include the following steps.
S201: performing voice activity detection on the collected sound signal.
S202: when a speech start signal is detected, dividing the sound signal collected from a target moment onward according to a preset division rule to obtain multiple voice segments, until a speech termination signal is detected,
where the target moment is the moment at which the speech start signal is collected.
S203: while collecting each voice segment, detecting whether the smart device is performing a voice broadcast, and determining the broadcast status information of the voice segment according to the detection result.
S204: determining the to-be-recognized voice information based on the multiple voice segments obtained by the division.
在本具体实现方式中,每个语音片段对应的播报状态信息为:在采集该语音片段时,所读取到的智能设备的播报状态信息。In this specific implementation, the broadcast status information corresponding to each voice segment is: the broadcast status information of the smart device that is read when the voice segment is collected.
智能设备在启动后,可以实时采集所处环境中的声音信号。其中,该声音信号中可以包括用户发出的语音信息,也可以包括智能设备自身发出的语音信息,还可以包括作为该环境的背景声音的各类噪音的声音信号。After the smart device is started, it can collect sound signals in the environment in real time. Wherein, the sound signal may include the voice information sent by the user, may also include the voice information sent by the smart device itself, and may also include the sound signals of various noises as background sounds of the environment.
这样,在采集到声音信号后,智能设备便可以对所采集到的声音信号进行语音活动检测,以检测得到所采集到的声音信号中的可以作为待识别语音信息的声音信号。In this way, after collecting the sound signal, the smart device can perform voice activity detection on the collected sound signal to detect the sound signal that can be used as the voice information to be recognized among the collected sound signals.
具体的,在每接收到一声音信号时,智能设备便可以检测该声音信号是否可以作为语音 起始信号。进而,当检测到一声音信号为语音起始信号时,智能设备便可以确定该语音起始信号,以及在采集到该语音起始信号的时刻之后所采集到的声音信号可以作为待识别语音信息中所包括的语音信息。并且,该语音起始信号可以作为待识别语音信息的起始信息。Specifically, every time a sound signal is received, the smart device can detect whether the sound signal can be used as a voice start signal. Furthermore, when a sound signal is detected as a voice start signal, the smart device can determine the voice start signal, and the sound signal collected after the time when the voice start signal is collected can be used as the voice information to be recognized Voice information included in. In addition, the voice start signal can be used as the start information of the voice information to be recognized.
进一步的,智能设备还可以对采集到语音起始信号的时刻之后所采集到的声音信号进行逐一检测,以确定该声音信号是否可以作为语音终止信号。进而,在检测到一声音信号为语音终止信号时,便可以确定该语音终止信号为待识别语音信息中的终止信息。Further, the smart device can also perform one-by-one detection on the sound signals collected after the moment when the voice start signal is collected, to determine whether the sound signal can be used as a voice termination signal. Furthermore, when it is detected that a voice signal is a voice termination signal, it can be determined that the voice termination signal is termination information in the voice information to be recognized.
这样,上述所检测到的语音起始信号、语音终止信号,以及位于语音起始信号和语音终止信号之间的声音信号构成了待识别语音信息。并且,该语音起始信号可以作为待识别语音信息的起始信息,该语音终止信号为待识别语音信息中的终止信息。In this way, the detected voice start signal, voice termination signal, and the sound signal located between the voice start signal and the voice termination signal constitute the voice information to be recognized. In addition, the voice start signal may be used as the start information of the voice information to be recognized, and the voice termination signal is the termination information in the voice information to be recognized.
In addition, because the sound signal is streamed, the smart device can continuously capture the sound in its environment and generate the corresponding sound signals in sequence.
On this basis, after detecting the voice start signal, the smart device can divide the sound signals collected from the target moment (the moment at which the voice start signal was collected) into segments according to a preset division rule, obtaining multiple voice segments in sequence until the voice termination signal is detected.
The division into voice segments is performed while the voice information to be recognized is being collected. Specifically, after detecting the voice start signal, the smart device keeps collecting sound signals. At a first moment at which the smart device determines that the sound signals collected since the target moment satisfy the preset division rule, those sound signals are divided off as one voice segment. Collection then continues; at a second moment at which the sound signals collected since the first moment again satisfy the preset division rule, those sound signals are divided off as the next voice segment, and so on, until the voice termination signal is detected.
Obviously, the detected voice termination signal is included in the last voice segment, and the sound signals included in the last voice segment need not satisfy the preset division rule.
The preset division rule may be, for example, that the duration of the collected sound signals reaches a preset value, or that the collected sound signals correspond to one syllable; the embodiment of the present invention does not specifically limit this.
Optionally, the voice activity detection may be VAD (Voice Activity Detection). Specifically, after collecting the sound signals of its environment, the smart device can use VAD to detect the voice start endpoint and the voice termination endpoint in those signals, where the voice start endpoint is the voice start signal of the voice information to be recognized and the voice termination endpoint is its voice termination signal. After the voice start endpoint is detected, the smart device divides the sound signals collected from that point on into voice segments according to the preset division rule, until the voice termination endpoint is detected, at which point the voice termination endpoint is placed into the last voice segment of the voice information to be recognized.
In this way, after the voice segments are obtained, the smart device can determine the voice information to be recognized based on the divided voice segments.
Since the first sound signal in the first divided voice segment is the start information of the voice information to be recognized, and the last sound signal in the last divided voice segment is its termination information, the sound signals in the voice segments can be arranged in the order of division, and the resulting combination of sound signals is the voice information to be recognized.
For example, suppose the preset division rule is that the duration of the collected sound signals reaches 0.1 second, and that at second 1 of collection the voice start endpoint is detected, i.e., the signal collected at that moment is determined to be the voice start signal. Then, at second 1.1, the sound signals collected between second 1 and second 1.1 are divided off as the first voice segment; at second 1.2, the sound signals collected between second 1.1 and second 1.2 are divided off as the second voice segment; and so on, until the sound signal collected at second 1.75 is detected as the voice termination endpoint, so that the sound signals collected between second 1.7 and second 1.75 form the last voice segment. In this way, 8 voice segments are obtained, and the 8th (last) segment covers only 0.05 second, which need not satisfy the preset division rule.
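The fixed-duration division rule in the example above can be sketched as follows. This is a minimal illustrative sketch, not the patented implementation; the function name and the use of timestamps instead of raw audio are assumptions for illustration.

```python
# Hypothetical sketch of the 0.1 s division rule: divide the interval between
# the detected start endpoint and the detected termination endpoint into
# fixed-length windows; the last window may be shorter than the rule requires.
def divide_into_segments(start_time, end_time, window=0.1):
    """Return (segment_start, segment_end) pairs covering [start_time, end_time]."""
    segments = []
    t = start_time
    while t < end_time:
        seg_end = min(t + window, end_time)  # last segment may fall short of `window`
        segments.append((t, seg_end))
        t = seg_end
    return segments

segments = divide_into_segments(1.0, 1.75)
print(len(segments))                                 # 8 segments, as in the example
print(round(segments[-1][1] - segments[-1][0], 2))   # the last segment lasts 0.05 s
```

With a start endpoint at second 1 and a termination endpoint at second 1.75, this reproduces the 8 segments of the example, the last covering 0.05 second.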
Thus, the combination formed by the sound signals collected from second 1 to second 1.75 is the voice information to be recognized.
Moreover, in this specific implementation, while collecting a voice segment, the smart device also detects whether it is itself performing a voice broadcast while the sound signals of that segment are being collected, and determines the broadcast state information corresponding to that segment according to the detection result.
If the smart device was not performing a voice broadcast while a voice segment was collected, the broadcast state information corresponding to that segment may be called the first type of information; correspondingly, if the smart device was performing a voice broadcast while a voice segment was collected, the broadcast state information corresponding to that segment may be called the second type of information.
Optionally, the smart device may record in a state file whether it was performing a voice broadcast at each moment, i.e., record the broadcast state information of the smart device corresponding to each moment. Then, for each divided voice segment, the smart device can determine the moment at which that segment was collected and read the broadcast state information for that moment directly from the state file; the value read is the broadcast state information of that voice segment.
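The state-file lookup described above can be sketched as follows. The in-memory dictionary standing in for the state file, and the use of a segment's start moment as the lookup key, are assumptions for illustration.

```python
# Hypothetical sketch of labelling each segment from a per-moment state record.
# `state_log` maps a collection timestamp (in seconds) to 1 when the device was
# idle (first type of information) and 0 while it was broadcasting (second type).
def label_segments(segment_moments, state_log):
    """Return the broadcast state information recorded for each segment's moment."""
    return [state_log[t] for t in segment_moments]

state_log = {1.0: 1, 1.1: 1, 1.2: 0}
print(label_segments([1.0, 1.1, 1.2], state_log))  # [1, 1, 0]
```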
Optionally, the broadcast state information may be TTS (Text To Speech) state information. Specifically, in one case, when the smart device broadcasts, it converts the text information to be broadcast into voice information through an offline model and then broadcasts that voice information; in another case, a server converts the text information to be broadcast into voice information through a cloud model and feeds the converted voice information back to the smart device, which can then broadcast the received voice information. Converting the text information to be broadcast into voice information is TTS; clearly, this conversion can be performed either by an offline model on the smart device or online, on the server side, by a cloud model.
If the smart device was not performing a voice broadcast while a voice segment was collected, the TTS state information corresponding to that segment may be recorded as the TTS idle state, which may be defined as 1, i.e., the first type of information is defined as 1; correspondingly, if the smart device was performing a voice broadcast while a voice segment was collected, the TTS state information corresponding to that segment may be recorded as the TTS broadcast state, which may be defined as 0, i.e., the second type of information is defined as 0.
Further, in the specific implementation shown in Figure 2 above, when the smart device collects the sound signals of its environment in real time, in order to prevent noise in the collected background sound from affecting the detection of the voice information to be recognized within the collected sound signals, the smart device may first perform signal preprocessing on the collected sound signals, attenuating the collected noise and enhancing the sound signals that can serve as the voice information to be detected.
On this basis, optionally, in another specific implementation, as shown in Figure 3, the above step S101 may further include the following step:
S200: performing signal preprocessing on the sound signals according to the waveform shape of the collected sound signals.
Correspondingly, the above step S201 may include the following step:
S201A: performing voice activity detection on the preprocessed sound signals.
When a sound signal is collected, the smart device can obtain the waveform shape of that sound signal and perform signal preprocessing on the sound signal according to that shape.
Specifically, sound signals whose waveform shape matches the waveform shape of noise are attenuated, and sound signals whose waveform shape matches the waveform shape of signals that can serve as the voice information to be recognized are enhanced.
Correspondingly, in this specific implementation, performing voice activity detection on the collected sound signals in the above step S201 means performing voice activity detection on the preprocessed sound signals.
Optionally, the smart device can collect in advance the waveform shapes of various kinds of noise and of various sound signals that can serve as the voice information to be detected, and use these waveform shapes, together with the label corresponding to each shape, for model training to obtain a waveform detection model. The label corresponding to each waveform shape indicates whether that shape is the waveform shape of noise or the waveform shape of a sound signal that can serve as the voice information to be detected. Moreover, a sound signal that can serve as the voice information to be detected may be a voice signal uttered by a user or a voice signal broadcast by the smart device; that is, its sound type may be either human voice or machine voice.
In this way, by learning the image characteristics of a large number of waveform shapes, the waveform detection model can establish the correspondence between the image characteristics of a waveform shape and its label. When a sound signal is collected, the model can be used to detect the collected sound signal and determine its label, so that sound signals labelled as noise are attenuated and sound signals labelled as candidates for the voice information to be detected are enhanced.
Corresponding to the case in which the above electronic device is a smart device, optionally, in another specific implementation, when the electronic device is a server, the above step S101 may include the following step:
receiving the voice information to be recognized sent by the smart device and the broadcast state information corresponding to each voice segment contained in the voice information to be recognized.
Obviously, in this specific implementation, the sound type determination is completed online. The smart device collects the sound signals of its environment, obtains the voice information to be recognized from the collected sound signals, and determines the broadcast state information corresponding to each voice segment contained in the voice information to be recognized; it then sends the voice information to be recognized and each piece of broadcast state information to the server, so that the server executes the voice processing method provided by the embodiment of the present invention and determines the sound type of the voice information to be recognized.
Optionally, in this specific implementation, the smart device may determine the voice information to be recognized and the broadcast state information corresponding to each of its voice segments through the solution provided by the embodiment shown in Figure 2 or Figure 3 above, and send the determined voice information to be recognized and the corresponding broadcast state information to the server.
On this basis, when the smart device sends the voice information to be recognized to the server, the specific content sent may be the divided voice segments and the broadcast state information corresponding to each segment, so that the server receives, at the same time, each voice segment of the voice information to be recognized and the broadcast state information corresponding to each segment.
Furthermore, since the combination formed by arranging the sound signals of the voice segments in the order of division is the voice information to be recognized, the server obtains the voice information to be recognized once it has received each of its voice segments in sequence. In other words, the entirety of the voice segments received by the server is the voice information to be recognized.
Based on any of the foregoing embodiments, optionally, in a specific implementation, the above step S102 may include the following step:
judging whether the broadcast state information corresponding to the first of the voice segments indicates that the smart device was not performing a voice broadcast when that segment was collected; if so, determining that the sound type of the voice information to be recognized is human voice.
In this specific implementation, the electronic device can obtain the broadcast state information corresponding to each voice segment contained in the voice information to be recognized, and in particular the broadcast state information corresponding to the first of those segments; the electronic device can then judge whether that broadcast state information indicates that the smart device was not performing a voice broadcast when the segment was collected.
If the judgment result is yes, i.e., the smart device was not performing a voice broadcast when the first voice segment of the voice information to be recognized was collected, it can be concluded that the voice information to be recognized was uttered by a user, and the electronic device can therefore determine that the sound type of the voice information to be recognized is human voice.
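The first-segment rule above can be sketched as follows, using the TTS state convention defined earlier (1 = idle, 0 = broadcasting); the function name is an assumption for illustration.

```python
# Hypothetical sketch of the first-segment rule: if the device was idle
# (state 1) while the first segment was collected, the utterance is treated
# as human voice.
def first_segment_is_human(segment_states):
    """segment_states: per-segment TTS states in division order, 1 = idle, 0 = broadcasting."""
    return segment_states[0] == 1

print(first_segment_is_human([1, 0, 0]))  # True: device idle during the first segment
print(first_segment_is_human([0, 1, 1]))  # False: device was broadcasting
```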
Optionally, in another specific implementation, as shown in Figure 4, step S102 may include the following steps:
S401: determining, from the obtained broadcast state information, a first quantity of the first type of information,
where the first type of information indicates that the smart device was not performing a voice broadcast when the corresponding voice segment was collected.
After obtaining the voice information to be recognized and the broadcast state information corresponding to each of its voice segments, the electronic device can determine, from those pieces of broadcast state information, the first quantity of the first type of information.
Since the first type of information indicates that the smart device was not performing a voice broadcast when the corresponding voice segment was collected, the determined first quantity can represent the number of voice segments, among those contained in the voice information to be recognized, whose sound type is human voice.
S402: determining proportion information of the first type of information based on the first quantity of the first type of information.
After determining the first quantity of the first type of information, the electronic device can determine the proportion information of the first type of information based on that first quantity.
Optionally, in a specific implementation, as shown in Figure 5, step S402 may include the following step:
S402A: calculating a first ratio of the first quantity to the total quantity of the obtained broadcast state information, and taking the first ratio as the proportion information of the first type of information.
When the broadcast state information of a voice segment is the first type of information, the smart device was not performing a voice broadcast while that segment was collected; since the segment is part of the voice information to be recognized, it can be determined to be voice information uttered by a user, i.e., the sound type of that segment can be determined to be human voice.
Correspondingly, when the broadcast state information of a voice segment is the second type of information, indicating that the smart device was performing a voice broadcast while the segment was collected, the voice information of that segment contains voice information broadcast by the smart device: the segment may consist of the broadcast voice information alone, or of both the user's voice information and the broadcast voice information. Either situation may cause the smart device to exhibit the erroneous "asking and answering itself" behavior.
On this basis, the first ratio of the first quantity to the total quantity of the obtained broadcast state information can be calculated and taken as the proportion information of the first type of information. In this specific implementation, the calculated proportion information can be understood as the proportion, among the voice segments contained in the voice information to be recognized, of segments whose sound type is human voice; obviously, the higher this ratio, the more likely it is that the sound type of the voice information to be recognized is human voice.
Accordingly, when the quantity of the first type of information among the obtained broadcast state information is 0, the first ratio is 0, indicating that the sound type of the voice information to be recognized is more likely to be machine voice;
correspondingly, when the quantity of the second type of information among the obtained broadcast state information is 0, the first ratio is 1, indicating that the sound type of the voice information to be recognized is more likely to be human voice.
Optionally, when the broadcast state information is TTS state information, with the TTS broadcast state defined as 0 and the TTS idle state defined as 1, the first ratio calculated above is the ratio of the number of values equal to 1 among the obtained TTS state information to the total number of obtained TTS state information.
For example, if the total number of obtained TTS state information is 10, of which 9 have the value 1, the first ratio is calculated to be 0.9.
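The first-ratio computation of step S402A, with the TTS state convention above, can be sketched as follows; the function name is an assumption for illustration.

```python
# Hypothetical sketch of the first ratio (step S402A): the share of segments
# collected while the device was idle (TTS state 1) among all segments.
def first_ratio(tts_states):
    """tts_states: per-segment TTS states, 1 = idle, 0 = broadcasting."""
    return tts_states.count(1) / len(tts_states)

states = [1] * 9 + [0]      # 10 segments, 9 collected while the device was idle
print(first_ratio(states))  # 0.9, as in the example above
```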
Optionally, in another specific implementation, as shown in Figure 6, step S402 may include the following step:
S402B: determining, from the obtained broadcast state information, a second quantity of the second type of information, calculating a second ratio of the first quantity to the second quantity, and taking the second ratio as the proportion information of the first type of information,
where the second type of information indicates that the smart device was performing a voice broadcast when the corresponding voice segment was collected.
After determining the first quantity of the first type of information, the electronic device can further determine, from the pieces of broadcast state information, the second quantity of the second type of information; it can then calculate the second ratio of the determined first quantity to the second quantity and take the second ratio as the proportion information of the first type of information.
As explained above for step S402A, a voice segment whose broadcast state information is the first type of information can be determined to be voice information uttered by a user, i.e., its sound type is human voice, while a voice segment whose broadcast state information is the second type of information contains voice information broadcast by the smart device and may cause the erroneous "asking and answering itself" behavior; the sound type of such a segment can therefore be determined to be machine voice.
On this basis, the second ratio of the first quantity to the second quantity can be calculated and taken as the proportion information of the first type of information. In this specific implementation, the calculated proportion information can be understood as the ratio, among the voice segments contained in the voice information to be recognized, of segments whose sound type is human voice to segments whose sound type is machine voice; obviously, the higher this ratio, the more likely it is that the sound type of the voice information to be recognized is human voice.
Accordingly, when the quantity of the first type of information among the obtained broadcast state information is 0, the second ratio is 0, indicating that the sound type of the voice information to be recognized is more likely to be machine voice;
correspondingly, when the quantity of the second type of information among the obtained broadcast state information is 0, this directly indicates that the sound type of the voice information to be recognized is more likely to be human voice.
Optionally, when the broadcast state information is TTS state information, with the TTS broadcast state defined as 0 and the TTS idle state defined as 1, the second ratio calculated above is the ratio of the number of values equal to 1 among the obtained TTS state information to the number of values equal to 0.
For example, if the total number of obtained TTS state information is 10, of which 7 have the value 1 and 3 have the value 0, the second ratio is calculated to be 7/3.
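The second-ratio computation of step S402B can be sketched as follows. The handling of the no-broadcast case as infinity is an assumption chosen to mirror the statement above that a second quantity of 0 directly indicates human voice.

```python
# Hypothetical sketch of the second ratio (step S402B): idle segments
# (state 1) relative to broadcasting segments (state 0).
import math

def second_ratio(tts_states):
    """tts_states: per-segment TTS states, 1 = idle, 0 = broadcasting."""
    zeros = tts_states.count(0)
    if zeros == 0:
        # No broadcast segments: treated directly as indicating human voice.
        return math.inf
    return tts_states.count(1) / zeros

states = [1] * 7 + [0] * 3   # 10 segments: 7 idle, 3 broadcasting
print(second_ratio(states))  # 7/3, as in the example above
```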
S403: determining the sound type of the voice information to be recognized according to the relationship between the proportion information and a set threshold.
After determining the proportion information of the first type of information, the electronic device can determine the sound type of the voice information to be recognized according to the relationship between that proportion information and the set threshold.
Optionally, in a specific implementation, the above step S403 may include the following steps:
if the proportion information is greater than the set threshold, determining that the voice information to be recognized is human voice; or,
if the proportion information is not greater than the set threshold, and the voice information to be recognized is determined to be human voice based on the detection result of a voiceprint model on the voice information to be recognized, determining that the voice information to be recognized is human voice; or,
if the proportion information is not greater than the set threshold, and the voice information to be recognized is determined to be machine voice based on the detection result of the voiceprint model on the voice information to be recognized, determining that the voice information to be recognized is machine voice.
According to the above description of the specific implementations shown in FIG. 5 and FIG. 6, the greater the proportion information of the first type of information, the more likely it is that the sound type of the voice information to be recognized is a human voice.
Based on this, in this specific implementation, if the proportion information is greater than the set threshold, it can be determined that the voice information to be recognized is a human voice.
Correspondingly, when the proportion information is not greater than the set threshold, the voice information to be recognized may be a machine voice. In order to determine the sound type of the voice information to be recognized more accurately, the electronic device can obtain the result of the detection performed by the voiceprint model on the voice information to be recognized; thus, when the detection result indicates a human voice, it can be determined that the voice information to be recognized is a human voice.
Further, when the proportion information is not greater than the set threshold and the voiceprint model's detection result for the voice information to be recognized indicates a machine voice, it can be determined that the voice information to be recognized is a machine voice.
It should be noted that, for the two ways of calculating the proportion information provided in steps S402A and S402B of the specific implementations shown in FIG. 5 and FIG. 6, the set thresholds may be the same or different.
The electronic device may run the preset voiceprint model on the voice information to be recognized as soon as that information is received in step S101, so as to obtain the detection result in advance; in this specific implementation, that already obtained detection result can then be used directly. Alternatively, when executing step S403, the electronic device may run the preset voiceprint model on the voice information to be recognized only after determining that the proportion information is not greater than the set threshold, and then use the resulting detection result.
Optionally, in one embodiment, it may first be judged whether the proportion information is greater than the set threshold; when it is, the voice information to be recognized can be determined to be a human voice.
When the proportion information is judged to be not greater than the set threshold, the detection result of the voiceprint model on the voice information to be recognized can be obtained: when the detection result indicates a human voice, the voice information to be recognized is determined to be a human voice; correspondingly, when the detection result indicates a machine voice, the voice information to be recognized is determined to be a machine voice.
Optionally, in another embodiment, the detection result of the voiceprint model on the voice information to be recognized may be obtained first; when the detection result indicates a human voice, the voice information to be recognized can be determined to be a human voice.
Correspondingly, when the detection result indicates a machine voice, it can be judged whether the calculated proportion information is greater than the set threshold: if it is, the voice information to be recognized is determined to be a human voice; if it is not, the voice information to be recognized is determined to be a machine voice.
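The two check orderings described above, ratio first or voiceprint first, reach the same decision for every input. A minimal Python sketch of both embodiments follows; the threshold value and the "human"/"machine" labels are illustrative assumptions, not part of the specification:

```python
def decide_ratio_first(proportion, threshold, voiceprint_result):
    """Embodiment 1: compare the proportion with the set threshold first;
    fall back to the voiceprint detection result only when needed."""
    if proportion > threshold:
        return "human"
    return voiceprint_result  # "human" or "machine", as detected


def decide_voiceprint_first(proportion, threshold, voiceprint_result):
    """Embodiment 2: consult the voiceprint result first; only when it
    says "machine" does the proportion check get a chance to override."""
    if voiceprint_result == "human":
        return "human"
    return "human" if proportion > threshold else "machine"
```

Both functions implement the same decision table: the utterance is a machine voice only when the proportion is not above the threshold and the voiceprint model also reports a machine voice.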
Optionally, in a specific implementation, the embodiment of the present invention may further include the following step:
if it is determined that the voice information to be recognized is a machine voice, feeding back, to the smart device, prompt information indicating that the voice information to be recognized is a machine voice.
In this specific implementation, when it is determined that the voice information to be recognized is a machine voice, the electronic device can feed back, to the smart device that collected the voice information, prompt information indicating that the voice information is a machine voice. The smart device will then not respond to the voice information to be recognized, thereby avoiding "asking and answering itself". The prompt information may be a preset "error code".
Moreover, when it is determined that the voice information to be recognized is a machine voice, the electronic device may skip semantic recognition of the text recognition result of the voice information to be recognized.
Further, optionally, the electronic device may also skip speech recognition of the acquired voice information to be recognized altogether, i.e., it need not obtain the text recognition result corresponding to the voice information to be recognized.
Optionally, in a specific implementation, as shown in FIG. 7, the embodiment of the present invention may further include the following steps:
S103: obtaining a text recognition result corresponding to the voice information to be recognized;
S104: if it is determined that the voice information to be recognized is a human voice, performing semantic recognition based on the text recognition result to determine response information for the voice information to be recognized.
After acquiring the voice information to be recognized, the electronic device can then obtain the corresponding text recognition result.
Further, after determining that the voice information to be recognized is a human voice, the electronic device can conclude that the voice information was uttered by a user, and therefore needs to respond to it.
Based on this, after determining that the voice information to be recognized is a human voice, the electronic device can perform semantic recognition on the obtained text recognition result, thereby determining the response information for the voice information to be recognized.
Optionally, the electronic device may input the text recognition result into a semantic model, so that the semantic model can analyze the semantics of the text recognition result and then determine the response result corresponding to that semantics as the response information for the voice information to be recognized.
The semantic model is used to recognize the semantics of the text recognition result, obtain the user need corresponding to the voice information to be recognized, and perform the action corresponding to that need, thereby obtaining the response result corresponding to the semantics as the response information for the voice information to be recognized. For example, it may fetch the result corresponding to the user need from a designated website or storage space, or execute the action corresponding to the user need.
Exemplarily, the text recognition result is: "How is the weather today". The semantic model recognizes the keywords "today" and "weather" in the text and obtains the current geographic location through the positioning system, so it can determine that the user need is today's weather at the current location. The semantic model can then automatically connect to a weather-query website and obtain today's weather at the current location, e.g., "Beijing, sunny, 23 degrees Celsius"; the obtained weather condition is determined as the response result corresponding to the semantics, i.e., the response information for the voice information to be recognized.
Exemplarily, the text recognition result is: "Where is Starbucks". The semantic model recognizes the keywords "Starbucks" and "where" and determines that the user need is the location of Starbucks. The semantic model can then read the location information of Starbucks from information pre-stored in a preset storage space, e.g., "northeast corner of the third floor of this mall"; the obtained location information is determined as the response result corresponding to the semantics, i.e., the response information for the voice information to be recognized.
Exemplarily, the text recognition result is: "Move forward two meters". The semantic model recognizes the keywords "forward" and "two meters" and determines that the user wants the device to move forward two meters; the semantic model can then generate the corresponding control instruction, controlling the device to move forward a distance of two meters. Obviously, the forward movement of the smart device is the response result corresponding to the semantics.
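The keyword-driven flow in the three examples above can be sketched as a toy intent router. The keyword lists, intent names, and canned action descriptions below are invented for illustration; a real semantic model would be far more capable:

```python
def route_intent(text):
    """Map a recognized text to a (intent, action) pair by keyword
    spotting, mirroring the weather / location / motion examples."""
    text = text.lower()
    if "weather" in text and "today" in text:
        return ("weather_query", "look up today's forecast for the current location")
    if "where" in text:
        # e.g. "Where is Starbucks" -> read the stored location of the place
        return ("location_query", "read the stored location of the mentioned place")
    if "forward" in text:
        # e.g. "Move forward two meters" -> generate a control instruction
        return ("motion_command", "generate a control instruction to move the device")
    return ("unknown", "no matching user need")
```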
Further, optionally, the voice information to be recognized acquired by the electronic device includes multiple voice segments. Therefore, to ensure the accuracy of the obtained text recognition result, obtaining the text recognition result corresponding to the voice information to be recognized may include the following steps:
when the first voice segment is received, performing speech recognition on the first voice segment to obtain a temporary text result; when a voice segment other than the first is received, performing speech recognition on all received voice segments based on the temporary text result already obtained, to obtain a new temporary text result; and when the last voice segment is received, obtaining the text recognition result corresponding to the voice information to be recognized.
Specifically, when the first voice segment is received, speech recognition is performed on it to obtain the temporary text result of the first segment; when the second voice segment is received, the voice information formed by the first and second segments is recognized based on the temporary text result of the first segment, yielding the temporary text result of the first two segments; when the third voice segment is received, the voice information formed by the first to third segments is recognized based on the temporary text result of the first two segments, yielding the temporary text result of the first three segments; and so on, until, when the last voice segment is received, the voice information formed by the first to last segments is recognized based on the temporary text result of the first to penultimate segments. Obviously, the result obtained at that point is the text recognition result corresponding to the voice information to be recognized.
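The accumulation loop described above can be sketched as follows. The `recognize` callback stands in for a real ASR decoder (an assumption: here it simply maps a list of audio segments to text):

```python
def incremental_transcribe(segments, recognize):
    """Re-run recognition over all segments received so far.

    Returns the list of temporary text results and the final text
    recognition result (the last temporary result, covering all segments).
    """
    temporary_results = []
    received = []
    for segment in segments:
        received.append(segment)
        # Each pass sees the full context of everything received so far,
        # so earlier words can be revised as later audio arrives.
        temporary_results.append(recognize(received))
    return temporary_results, temporary_results[-1]
```

With a trivial stand-in recognizer such as `lambda segs: " ".join(segs)`, each temporary result extends the previous one, and the final result covers the whole utterance.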
In this specific implementation, the speech recognition process fully considers the influence of the contextual relationships within the voice information to be recognized on the text recognition result, so the accuracy of the obtained text recognition result can be improved.
Optionally, a speech recognition model in the electronic device may be used to perform speech recognition on the voice information to be recognized. The speech recognition model is trained with voice samples, each of which includes voice information and the text corresponding to that voice information; by learning from a large number of voice samples, the model can establish the correspondence between voice information and text. In this way, after receiving the voice information to be recognized, the trained speech recognition model can determine the corresponding text recognition result according to the established correspondence. Such a speech recognition model may be called a decoder.
Further, optionally, each time a temporary recognition result for at least one voice segment is obtained, the electronic device may output that temporary recognition result to the user.
When the electronic device is a server, it sends the temporary recognition result to the smart device that sent the voice information to be recognized, so that the smart device outputs the temporary recognition result on its display screen;
when the electronic device is itself the smart device, it can output the temporary recognition result directly on the display screen.
Correspondingly, optionally, when the text recognition result of the voice information to be recognized is obtained, the electronic device may also output that text recognition result to the user.
When the electronic device is a server, it sends the text recognition result to the smart device that sent the voice information to be recognized, so that the smart device outputs the text recognition result on its display screen;
when the electronic device is itself the smart device, it can output the text recognition result directly on the display screen.
Further, optionally, after obtaining the response information for the voice information to be recognized, the electronic device may announce the response information to the user.
When the electronic device is a server, it sends the response information to the smart device that sent the voice information to be recognized, so that the smart device announces the response information to the user;
when the electronic device is itself the smart device, it can announce the response information directly.
For a better understanding of the voice processing method provided by the embodiments of the present invention, the method is described below through a specific embodiment.
In this specific embodiment, the above electronic device is a server. Specifically:
The smart device collects the sound signals in its environment in real time and performs signal preprocessing on them according to the waveform of the collected sound signals.
The smart device then performs voice activity detection on the preprocessed sound signal. Specifically, VAD may be used to detect the voice start endpoint and voice termination endpoint in the preprocessed sound signal; after the voice start endpoint is detected, the collected sound signal is divided into voice segments in sequence according to a preset division rule, until the voice termination endpoint is detected.
Moreover, in the above process, each time a voice segment is obtained by division, the TTS state information of the smart device is read, and each voice segment together with its corresponding TTS state information is sent to the server.
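The segmentation-plus-state step above can be sketched with a toy energy-threshold VAD: frames above the threshold count as speech, and each emitted segment is paired with the device's TTS state read at the moment the segment is produced. The frame representation, the threshold, and the `read_tts_state` callback are all illustrative assumptions, not the patent's actual VAD:

```python
def segment_with_tts_state(frames, energy_threshold, read_tts_state):
    """Split a sample stream into speech segments and pair each segment
    with the broadcast (TTS) state read when the segment is emitted."""
    segments = []
    current = []
    for frame in frames:
        if abs(frame) >= energy_threshold:
            current.append(frame)       # speech frame: extend the segment
        elif current:
            # Speech just ended: emit the (segment, broadcast-state) pair.
            segments.append((current, read_tts_state()))
            current = []
    if current:                          # flush a trailing segment
        segments.append((current, read_tts_state()))
    return segments
```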
The server receives each voice segment and its corresponding TTS state information sent by the smart device, and sends each voice segment to the decoder and the voiceprint model.
The decoder performs speech recognition on all voice segments received so far to obtain a temporary recognition result, and sends the temporary recognition result to the smart device, so that the smart device outputs it on its display screen.
Correspondingly, when the text recognition result of the voice information to be recognized is obtained, the text recognition result is sent to the smart device, so that the smart device outputs the text recognition result on its display screen.
In this way, when the complete voice information to be recognized has been received, the corresponding text recognition result can be obtained and output by the smart device on its display screen.
In addition, the voiceprint model performs voiceprint detection on all voice segments received so far and records the detection results; correspondingly, when all voice segments constituting the voice information to be recognized have been received, it performs voiceprint detection on the voice information to be recognized and records the detection result.
After receiving the TTS state information corresponding to each of the voice segments constituting the voice information to be recognized, the server counts the number of 1s in the received TTS state information, calculates the ratio of that number to the total number of received TTS state information items, and compares the ratio with the set threshold.
When the ratio is judged to be greater than the set threshold, the voice information to be recognized can be determined to be a human voice. When the ratio is not greater than the set threshold: if the detection result of the voiceprint model on the voice information to be recognized indicates a human voice, the voice information to be recognized is determined to be a human voice; and if the detection result indicates a machine voice, the voice information to be recognized is determined to be a machine voice.
Further, after determining that the voice information to be recognized is a human voice, the server can determine the response information for the voice information through the semantic model and send the response information to the smart device.
After receiving the response information, the smart device can output it.
Corresponding to the voice processing method provided by the above embodiments of the present invention, an embodiment of the present invention further provides a voice processing apparatus.
FIG. 8 is a schematic structural diagram of a voice processing apparatus provided by an embodiment of the present invention. As shown in FIG. 8, the voice processing apparatus includes the following modules:
an information acquisition module 810, configured to acquire the voice information to be recognized collected by a smart device and the broadcast state information corresponding to each voice segment contained in the voice information to be recognized, wherein the broadcast state information corresponding to each voice segment indicates whether the smart device was performing voice broadcasting when that voice segment was collected; and
a type determination module 820, configured to determine the sound type of the voice information to be recognized based on the acquired broadcast state information.
As can be seen from the above, in the solution provided by the embodiments of the present invention, the broadcast state information of each voice segment in the voice information to be recognized can be used to recognize the sound type of the voice to be recognized. Since the broadcast state information can reflect whether the received voice information to be recognized contains machine sound produced by the smart device's voice broadcasting, the accuracy of recognizing the sound type of the voice information can be improved.
Optionally, in a specific implementation, the type determination module 820 is specifically configured to:
judge whether, among the voice segments, the broadcast state information corresponding to the first voice segment indicates that the smart device was not performing voice broadcasting when that segment was collected; and if so, determine that the sound type of the voice information to be recognized is a human voice.
Optionally, in a specific implementation, the type determination module 820 is specifically configured to:
determine, from the acquired broadcast state information, a first quantity of first-type information, wherein the first-type information indicates that the smart device was not performing voice broadcasting when the corresponding voice segment was collected; determine proportion information of the first-type information based on the first quantity; and determine the sound type of the voice information to be recognized according to the relationship between the proportion information and a set threshold.
Optionally, in a specific implementation, the type determination module 820 is specifically configured to:
determine, from the acquired broadcast state information, the first quantity of first-type information; calculate a first ratio of the first quantity to the total quantity of the acquired broadcast state information, and take the first ratio as the proportion information of the first-type information; and determine the sound type of the voice information to be recognized according to the relationship between the proportion information and the set threshold; or,
determine, from the acquired broadcast state information, a second quantity of second-type information; calculate a second ratio of the first quantity to the second quantity, and take the second ratio as the proportion information of the first-type information; and determine the sound type of the voice information to be recognized according to the relationship between the proportion information and the set threshold, wherein the second-type information indicates that the smart device was performing voice broadcasting when the corresponding voice segment was collected.
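The two proportion computations described above, against the total count and against the second-type count, can be sketched as follows. The boolean encoding (`True` meaning the device was not broadcasting for that segment) is an assumption made for illustration:

```python
def first_type_proportion(statuses, use_total=True):
    """Compute the proportion information of the first-type information in
    the two ways described above (cf. steps S402A and S402B)."""
    n1 = sum(1 for s in statuses if s)      # first type: not broadcasting
    n2 = len(statuses) - n1                 # second type: broadcasting
    if use_total:
        return n1 / len(statuses)           # first ratio: n1 / total
    return n1 / n2 if n2 else float("inf")  # second ratio: n1 / n2
```

Either ratio grows as more segments were captured while the device was silent, which is why a larger value points toward a human voice.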
Optionally, in a specific implementation, the type determination module is specifically configured to:
if the proportion information is greater than the set threshold, determine that the voice information to be recognized is a human voice; or,
if the proportion information is not greater than the set threshold, and the detection result of the voiceprint model on the voice information to be recognized indicates a human voice, determine that the voice information to be recognized is a human voice; or,
if the proportion information is not greater than the set threshold, and the detection result of the voiceprint model on the voice information to be recognized indicates a machine voice, determine that the voice information to be recognized is a machine voice.
Optionally, in a specific implementation, the apparatus further includes:
an information feedback module, configured to, if it is determined that the voice information to be recognized is a machine voice, feed back to the smart device prompt information indicating that the voice information to be recognized is a machine voice.
Optionally, in a specific implementation, the apparatus further includes:
a result acquisition module, configured to obtain the text recognition result corresponding to the voice information to be recognized; and
an information determination module, configured to, if it is determined that the voice information to be recognized is a human voice, perform semantic recognition based on the text recognition result to determine the response information for the voice information to be recognized.
Corresponding to the voice processing method provided by the embodiments of the present invention, an embodiment of the present invention further provides an electronic device. As shown in FIG. 9, the electronic device includes a processor 901, a communication interface 902, a memory 903 and a communication bus 904, wherein the processor 901, the communication interface 902 and the memory 903 communicate with each other through the communication bus 904;
the memory 903 is configured to store a computer program; and
the processor 901 is configured to implement the voice processing method provided by the above embodiments of the present invention when executing the program stored in the memory 903.
Specifically, the voice processing method includes:
acquiring the voice information to be recognized collected by a smart device and the broadcast state information corresponding to each voice segment contained in the voice information to be recognized, wherein the broadcast state information corresponding to each voice segment indicates whether the smart device was performing voice broadcasting when that voice segment was collected; and
determining the sound type of the voice information to be recognized based on the acquired broadcast state information.
It should be noted that other implementations of the voice processing method realized by the processor 901 executing the program stored in the memory 903 are the same as the voice processing method embodiments provided in the foregoing method embodiment section, and are not repeated here.
As can be seen from the above, in the solution provided by the embodiments of the present invention, the broadcast state information of each voice segment in the voice information to be recognized can be used to recognize the sound type of the voice to be recognized. Since the broadcast state information can reflect whether the received voice information to be recognized contains machine sound produced by the smart device's voice broadcasting, the accuracy of recognizing the sound type of the voice information can be improved.
上述电子设备提到的通信总线可以是外设部件互连标准(Peripheral Component Interconnect,PCI)总线或扩展工业标准结构(Extended Industry Standard Architecture,EISA)总线等。该通信总线可以分为地址总线、数据总线、控制总线等。为便于表示,图中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。The communication bus mentioned in the above electronic device may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. The communication bus can be divided into address bus, data bus, control bus and so on. For ease of representation, only one thick line is used in the figure, but it does not mean that there is only one bus or one type of bus.
通信接口用于上述电子设备与其他设备之间的通信。The communication interface is used for communication between the above-mentioned electronic device and other devices.
存储器可以包括随机存取存储器(Random Access Memory,RAM),也可以包括非易失性存储器(Non-Volatile Memory,NVM),例如至少一个磁盘存储器。可选的,存储器还可以是至少一个位于远离前述处理器的存储装置。The memory may include random access memory (Random Access Memory, RAM), and may also include non-volatile memory (Non-Volatile Memory, NVM), such as at least one disk storage. Optionally, the memory may also be at least one storage device located far away from the foregoing processor.
上述的处理器可以是通用处理器,包括中央处理器(Central Processing Unit,CPU)、网络处理器(Network Processor,NP)等;还可以是数字信号处理器(Digital Signal Processing,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。The above-mentioned processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
相应于上述本发明实施例提供的一种语音处理方法,本发明实施例还提供了一种计算机可读存储介质,该计算机程序被处理器执行时实现上述本发明实施例提供的任一种语音处理方法。Corresponding to the voice processing method provided by the foregoing embodiments of the present invention, an embodiment of the present invention further provides a computer-readable storage medium in which a computer program is stored; when the computer program is executed by a processor, any of the voice processing methods provided by the foregoing embodiments of the present invention is implemented.
相应于上述本发明实施例提供的一种语音处理方法,本发明实施例还提供了一种计算机程序,所述计算机程序产品包括存储在计算机可读存储介质上的计算机程序,所述计算机程序包括程序指令,所述程序指令被处理器执行时实现上述本发明实施例提供的任一种语音处理方法。Corresponding to the voice processing method provided by the foregoing embodiments of the present invention, an embodiment of the present invention further provides a computer program product; the computer program product includes a computer program stored on a computer-readable storage medium, and the computer program includes program instructions which, when executed by a processor, implement any of the voice processing methods provided by the foregoing embodiments of the present invention.
需要说明的是,在本文中,诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
本说明书中的各个实施例均采用相关的方式描述,各个实施例之间相同相似的部分互相参见即可,每个实施例重点说明的都是与其他实施例的不同之处。尤其,对于装置实施例、电子设备实施例、计算机可读存储介质实施例而言,由于其基本相似于方法实施例,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。The various embodiments in this specification are described in a related manner, and the same or similar parts between the various embodiments can be referred to each other, and each embodiment focuses on the differences from other embodiments. In particular, for the device embodiment, the electronic device embodiment, and the computer-readable storage medium embodiment, since they are basically similar to the method embodiment, the description is relatively simple, and for related parts, please refer to the part of the description of the method embodiment.
以上所述仅为本发明的较佳实施例而已,并非用于限定本发明的保护范围。凡在本发明的精神和原则之内所作的任何修改、等同替换、改进等,均包含在本发明的保护范围内。The above descriptions are only preferred embodiments of the present invention, and are not intended to limit the protection scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are all included in the protection scope of the present invention.

Claims (10)

  1. 一种语音处理方法,其特征在于,所述方法包括:A voice processing method, characterized in that the method includes:
    获取智能设备采集的待识别语音信息以及所述待识别语音信息包含的各个语音片段对应的播报状态信息;其中,每个语音片段对应的播报状态信息表征在采集该语音片段时所述智能设备是否在进行语音播报;acquiring the voice information to be recognized collected by the smart device, and the broadcast status information corresponding to each voice segment contained in the voice information to be recognized; wherein the broadcast status information corresponding to each voice segment indicates whether the smart device was performing a voice broadcast when that voice segment was collected;
    基于所获取的播报状态信息,确定所述待识别语音信息的声音类型。Based on the acquired broadcast status information, the sound type of the voice information to be recognized is determined.
  2. 根据权利要求1所述的方法,其特征在于,所述基于所获取的播报状态信息,确定所述待识别语音信息的声音类型的步骤,包括:The method according to claim 1, wherein the step of determining the sound type of the voice information to be recognized based on the acquired broadcast status information comprises:
    判断所述各个语音片段中,首个语音片段对应的播报状态信息是否表征采集该语音片段时所述智能设备未进行语音播报;determining whether, among the voice segments, the broadcast status information corresponding to the first voice segment indicates that the smart device was not performing a voice broadcast when that voice segment was collected;
    如果是,确定所述待识别语音信息的声音类型为人声。if yes, determining that the sound type of the voice information to be recognized is human voice.
  3. 根据权利要求1所述的方法,其特征在于,所述基于所获取的播报状态信息,确定所述待识别语音信息的声音类型的步骤,包括:The method according to claim 1, wherein the step of determining the sound type of the voice information to be recognized based on the acquired broadcast status information comprises:
    从所获取的播报状态信息中,确定第一类信息的第一数量;其中,所述第一类信息表征在采集所对应语音片段时所述智能设备未进行语音播报;From the acquired broadcast status information, determine the first quantity of the first type of information; wherein, the first type of information indicates that the smart device did not perform voice broadcast when the corresponding voice segment was collected;
    基于所述第一类信息的第一数量,确定所述第一类信息的占比信息;Determine the proportion information of the first type of information based on the first quantity of the first type of information;
    根据所述占比信息与设定阈值的大小关系,确定所述待识别语音信息的声音类型。The sound type of the voice information to be recognized is determined according to the relationship between the proportion information and the set threshold.
  4. 根据权利要求3所述的方法,其特征在于,所述基于所述第一类信息的第一数量,确定所述第一类信息的占比信息的步骤,包括:The method according to claim 3, wherein the step of determining the proportion information of the first type of information based on the first quantity of the first type of information comprises:
    计算所述第一数量与所获取的播报状态信息的总数量的第一比值,将所述第一比值作为所述第一类信息的占比信息;或者,Calculate the first ratio of the first number to the total number of acquired broadcast status information, and use the first ratio as the proportion information of the first type of information; or,
    从所获取的播报状态信息中,确定第二类信息的第二数量,计算所述第一数量与所述第二数量的第二比值,将所述第二比值作为所述第一类信息的占比信息;from the acquired broadcast status information, determining a second quantity of a second type of information, calculating a second ratio of the first quantity to the second quantity, and using the second ratio as the proportion information of the first type of information;
    其中,所述第二类信息表征在采集所对应语音片段时所述智能设备正在进行语音播报。Wherein, the second type of information indicates that the smart device is performing voice broadcast when the corresponding voice segment is collected.
  5. 根据权利要求3或4所述的方法,其特征在于,所述根据所述占比信息与设定阈值的大小关系,确定所述待识别语音信息的声音类型的步骤,包括:The method according to claim 3 or 4, wherein the step of determining the sound type of the voice information to be recognized according to the relationship between the proportion information and a set threshold value comprises:
    若所述占比信息大于所述设定阈值,确定所述待识别语音信息为人声;或者,If the proportion information is greater than the set threshold, it is determined that the voice information to be recognized is a human voice; or,
    若所述占比信息不大于所述设定阈值,且基于声纹模型对所述待识别语音信息的检测结果确定所述待识别语音信息为人声,确定所述待识别语音信息为人声;或者,if the proportion information is not greater than the set threshold, and the voice information to be recognized is determined to be a human voice based on a detection result of a voiceprint model on the voice information to be recognized, determining that the voice information to be recognized is a human voice; or,
    若所述占比信息不大于所述设定阈值,且基于声纹模型对所述待识别语音信息的检测结果确定所述待识别语音信息为机器声,确定所述待识别语音信息为机器声。if the proportion information is not greater than the set threshold, and the voice information to be recognized is determined to be a machine sound based on a detection result of a voiceprint model on the voice information to be recognized, determining that the voice information to be recognized is a machine sound.
  6. 根据权利要求1-5任一项所述的方法,其特征在于,所述方法还包括:The method according to any one of claims 1-5, wherein the method further comprises:
    若确定所述待识别语音信息为机器声,向所述智能设备反馈用于提示所述待识别语音信息为机器声的提示信息。If it is determined that the voice information to be recognized is a machine sound, prompt information for prompting that the voice information to be recognized is a machine sound is fed back to the smart device.
  7. 根据权利要求1-5任一项所述的方法,其特征在于,所述方法还包括:The method according to any one of claims 1-5, wherein the method further comprises:
    获取所述待识别语音信息对应的文本识别结果;Obtaining a text recognition result corresponding to the voice information to be recognized;
    若确定所述待识别语音信息为人声,基于所述文本识别结果进行语义识别,确定所述待识别语音信息的响应信息。If it is determined that the voice information to be recognized is a human voice, semantic recognition is performed based on the text recognition result, and the response information of the voice information to be recognized is determined.
  8. 一种语音处理装置,其特征在于,所述装置包括:A voice processing device, characterized in that the device includes:
    信息获取模块,用于获取智能设备采集的待识别语音信息以及所述待识别语音信息包含的各个语音片段对应的播报状态信息;其中,每个语音片段对应的播报状态信息表征在采集该语音片段时所述智能设备是否在进行语音播报;an information acquisition module, configured to acquire the voice information to be recognized collected by the smart device and the broadcast status information corresponding to each voice segment contained in the voice information to be recognized; wherein the broadcast status information corresponding to each voice segment indicates whether the smart device was performing a voice broadcast when that voice segment was collected;
    类型确定模块,用于基于所获取的播报状态信息,确定所述待识别语音信息的声音类型。The type determination module is configured to determine the sound type of the voice information to be recognized based on the acquired broadcast status information.
  9. 一种电子设备,其特征在于,包括处理器、通信接口、存储器和通信总线,其中,处理器,通信接口,存储器通过通信总线完成相互间的通信;An electronic device characterized by comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory complete mutual communication through the communication bus;
    存储器,用于存放计算机程序;Memory, used to store computer programs;
    处理器,用于执行存储器上所存放的程序时,实现权利要求1-7任一所述的方法步骤。The processor is configured to implement the method steps of any one of claims 1-7 when executing the program stored in the memory.
  10. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质内存储有计算机程序,所述计算机程序被处理器执行时实现权利要求1-7任一所述的方法步骤。A computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the method steps according to any one of claims 1-7 are realized.
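Claim 4 above admits two alternative ways of forming the proportion information: a ratio against the total count of status entries, or a ratio of first-type entries (device not broadcasting) to second-type entries (device broadcasting). A hypothetical sketch of both variants follows; the function name and the `mode` switch are illustrative, not from the patent:

```python
def proportion_first_type(flags, mode="total"):
    """Return the proportion of first-type status entries (device not
    broadcasting).  mode='total' divides by all entries (the first ratio
    of claim 4); mode='second' divides by the count of second-type
    entries, i.e. segments captured during a broadcast (the second
    ratio).  Names and the mode switch are illustrative only."""
    first = sum(1 for flag in flags if not flag)  # first-type: not broadcasting
    second = len(flags) - first                   # second-type: broadcasting
    if mode == "total":
        return first / len(flags)
    if second == 0:
        # No broadcast segments at all: the second ratio is unbounded,
        # which in practice would always clear any finite threshold.
        return float("inf")
    return first / second
```

Either value would then be compared against the set threshold as in claim 5.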
PCT/CN2020/141038 2019-12-30 2020-12-29 Voice processing method and apparatus, and intelligent device and storage medium WO2021136298A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911398330.X 2019-12-30
CN201911398330.XA CN113129902B (en) 2019-12-30 2019-12-30 Voice processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2021136298A1 true WO2021136298A1 (en) 2021-07-08

Family

ID=76687322

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/141038 WO2021136298A1 (en) 2019-12-30 2020-12-29 Voice processing method and apparatus, and intelligent device and storage medium

Country Status (2)

Country Link
CN (1) CN113129902B (en)
WO (1) WO2021136298A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114500590A (en) * 2021-12-23 2022-05-13 珠海格力电器股份有限公司 Intelligent device voice broadcasting method and device, computer device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102750952A (en) * 2011-04-18 2012-10-24 索尼公司 Sound signal processing device, method, and program
CN103167174A (en) * 2013-02-25 2013-06-19 广东欧珀移动通信有限公司 Output method, device and mobile terminal of mobile terminal greetings
CN106847285A (en) * 2017-03-31 2017-06-13 上海思依暄机器人科技股份有限公司 A kind of robot and its audio recognition method
CN108346425A (en) * 2017-01-25 2018-07-31 北京搜狗科技发展有限公司 A kind of method and apparatus of voice activity detection, the method and apparatus of speech recognition
CN110097890A (en) * 2019-04-16 2019-08-06 北京搜狗科技发展有限公司 A kind of method of speech processing, device and the device for speech processes

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101937693B (en) * 2010-08-17 2012-04-04 深圳市子栋科技有限公司 Video and audio playing method and system based on voice command
CN102780646B (en) * 2012-07-19 2015-12-09 上海量明科技发展有限公司 The implementation method of the sound's icon, client and system in instant messaging
CN104484045B (en) * 2014-12-26 2018-07-20 小米科技有限责任公司 Audio play control method and device
CN107507620A (en) * 2017-09-25 2017-12-22 广东小天才科技有限公司 Voice broadcast sound setting method and device, mobile terminal and storage medium
CN108509176B (en) * 2018-04-10 2021-06-08 Oppo广东移动通信有限公司 Method and device for playing audio data, storage medium and intelligent terminal
CN109524013B (en) * 2018-12-18 2022-07-22 北京猎户星空科技有限公司 Voice processing method, device, medium and intelligent equipment
CN113990309A (en) * 2019-04-09 2022-01-28 百度国际科技(深圳)有限公司 Voice recognition method and device

Also Published As

Publication number Publication date
CN113129902A (en) 2021-07-16
CN113129902B (en) 2023-10-24

Similar Documents

Publication Publication Date Title
CN107919130B (en) Cloud-based voice processing method and device
US10438595B2 (en) Speaker identification and unsupervised speaker adaptation techniques
EP3371809B1 (en) Voice commands across devices
US10074360B2 (en) Providing an indication of the suitability of speech recognition
WO2018188586A1 (en) Method and device for user registration, and electronic device
JP7348288B2 (en) Voice interaction methods, devices, and systems
CN108009303B (en) Search method and device based on voice recognition, electronic equipment and storage medium
US20200035241A1 (en) Method, device and computer storage medium for speech interaction
WO2017012242A1 (en) Voice recognition method and apparatus
CN104732975A (en) Method and device for voice instant messaging
CN110875059B (en) Method and device for judging reception end and storage device
CN105139858A (en) Information processing method and electronic equipment
CN108648765A (en) A kind of method, apparatus and terminal of voice abnormality detection
US10950221B2 (en) Keyword confirmation method and apparatus
US8868419B2 (en) Generalizing text content summary from speech content
CN109697981B (en) Voice interaction method, device, equipment and storage medium
TW201926315A (en) Audio processing method, device and terminal device recognizing the sound information made by a user more quickly and accurately
CN112002349B (en) Voice endpoint detection method and device
WO2021136298A1 (en) Voice processing method and apparatus, and intelligent device and storage medium
CN111063356B (en) Electronic equipment response method and system, sound box and computer readable storage medium
CN111933149A (en) Voice interaction method, wearable device, terminal and voice interaction system
CN110956958A (en) Searching method, searching device, terminal equipment and storage medium
WO2024099359A1 (en) Voice detection method and apparatus, electronic device and storage medium
CN111506183A (en) Intelligent terminal and user interaction method
CN113889091A (en) Voice recognition method and device, computer readable storage medium and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20910235

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 31.10.2022)

122 Ep: pct application non-entry in european phase

Ref document number: 20910235

Country of ref document: EP

Kind code of ref document: A1