CN114005436A - Method, device and storage medium for determining voice endpoint - Google Patents

Method, device and storage medium for determining voice endpoint

Info

Publication number
CN114005436A
Authority
CN
China
Prior art keywords
voice
segment
starting point
endpoint
voice segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111436597.0A
Other languages
Chinese (zh)
Inventor
赵晴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingdong Technology Information Technology Co Ltd
Original Assignee
Jingdong Technology Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jingdong Technology Information Technology Co Ltd filed Critical Jingdong Technology Information Technology Co Ltd
Priority to CN202111436597.0A
Publication of CN114005436A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/04 - Segmentation; Word boundary detection
    • G10L15/05 - Word boundary detection
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 - Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The present disclosure relates to a method, an apparatus, and a storage medium for determining a voice endpoint. The method includes: receiving a voice segment, and performing a framing operation on the voice segment to obtain a plurality of voice segment frames; performing fast Fourier transform processing on each voice segment frame to obtain a plurality of Fourier spectrums corresponding to the voice segment frames; inputting the Fourier spectrums into a neural network model, which outputs a judgment score and a starting point detection result; and determining, through a voice endpoint detection algorithm, whether a voice endpoint is detected according to the judgment score and the starting point detection result. These technical means solve the prior-art problem of low accuracy in detecting voice endpoints in complex noise environments.

Description

Method, device and storage medium for determining voice endpoint
Technical Field
The present disclosure relates to the field of speech recognition, and in particular, to a method and an apparatus for determining a speech endpoint, and a storage medium.
Background
With the development of science and technology, speech recognition is widely used in daily life; scenarios such as waking up a voice assistant on a client device or controlling an intelligent robot by voice all involve speech recognition. In speech recognition, the detection, or determination, of voice endpoints is particularly important. A voice endpoint is a voice starting point or a voice ending point: detecting voice endpoints means finding the voice starting point in a signal that also contains silence, noise, and the like, at which point speech recognition begins, and ending speech recognition when the voice ending point is detected, thereby enabling multiple rounds of voice interaction. Conventionally, voice endpoints are judged using statistical signal-processing indicators such as energy and zero-crossing rate. Such methods are simple but not robust enough, especially in complex acoustic scenarios. Machine-learning models have also been used to judge voice endpoints; their judgments are relatively robust, but their accuracy remains low in complex noise environments containing music noise and the like.
In the course of implementing the disclosed concept, the inventors found that there are at least the following technical problems in the related art: the accuracy rate of judging the voice endpoint under the complex noise environment is low.
Disclosure of Invention
In order to solve the above technical problem, or at least partially solve it, embodiments of the present disclosure provide a method, an apparatus, and a storage medium for determining a voice endpoint, so as to at least solve the prior-art problem that the accuracy of determining a voice endpoint in a complex noise environment is low.
The purpose of the present disclosure is realized by the following technical scheme:
in a first aspect, an embodiment of the present disclosure provides a method for determining a voice endpoint, including: receiving a voice segment, and performing a framing operation on the voice segment to obtain a plurality of voice segment frames; performing fast Fourier transform processing on each voice segment frame to obtain a plurality of Fourier spectrums corresponding to the voice segment frames; inputting the Fourier spectrums into a neural network model, which outputs a judgment score and a starting point detection result; and determining, through a voice endpoint detection algorithm, whether a voice endpoint is detected according to the judgment score and the starting point detection result.
In an exemplary embodiment, determining whether a voice endpoint is detected through the voice endpoint detection algorithm according to the judgment score and the starting point detection result includes: when the starting point detection result indicates that the starting point of the voice segment has not been detected, judging whether the voice segment is uttered by a target object: if the voice segment is uttered by the target object, marking the voice starting point as a true value point and receiving the next voice segment, where marking the voice starting point as a true value point indicates that the starting point of the voice segment has been detected, and the voice endpoint includes the starting point of the voice segment; if the voice segment is not uttered by the target object, marking the voice starting point as a false value point and receiving the next voice segment, where marking the voice starting point as a false value point indicates that the starting point of the voice segment has not been detected.
In an exemplary embodiment, determining whether a voice endpoint is detected through the voice endpoint detection algorithm according to the judgment score and the starting point detection result includes: when the starting point detection result indicates that the starting point of the voice segment has been detected, judging whether the voice segment is uttered by the target object: if the voice segment is uttered by the target object, incrementing the number of voiced voice segments by one and receiving the next voice segment, where the voice endpoint includes the starting point of the voice segment and the ending point of the voice segment; if the voice segment is not uttered by the target object, incrementing the number of mute voice segments by one, and, when the number of mute voice segments is greater than a first preset threshold and the number of voiced voice segments is greater than a second preset threshold, determining that the ending point of the voice segment has been detected and no longer receiving the next voice segment.
In an exemplary embodiment, after incrementing the number of mute voice segments by one when the voice segment is not uttered by the target object, the method further includes: if the number of mute voice segments is not greater than the first preset threshold, marking the voice starting point as a false value point and receiving the next voice segment, where marking the voice starting point as a false value point indicates that the starting point of the voice segment has not been detected; or, if the number of mute voice segments is greater than the first preset threshold but the number of voiced voice segments is not greater than the second preset threshold, marking the voice starting point as a false value point and receiving the next voice segment.
In an exemplary embodiment, performing the framing operation on the voice segment to obtain a plurality of voice segment frames includes: receiving a framing instruction sent by the target object, and determining the frame length and frame shift corresponding to the framing operation according to the framing instruction; and performing the framing operation on the voice segment according to the frame length, the frame shift, and the size of the voice segment to obtain the plurality of voice segment frames.
In an exemplary embodiment, before the framing operation is performed on the voice segment to obtain the plurality of voice segment frames, the method further includes: continuously receiving voice packets until the total size of the received voice packets meets a preset size, and merging the received voice packets into the voice segment.
In one exemplary embodiment, the neural network model includes: a first preset number of convolutional layers, a second preset number of fully connected layers, and an output layer, where the output layer consists of a fully connected layer and a softmax (normalized exponential) layer.
In one exemplary embodiment, the method further includes: acquiring environmental noise data and call data; performing phoneme alignment processing on the call data by using a speech recognition tool to obtain phoneme alignment data; performing the framing operation on the environmental noise data and the phoneme alignment data to obtain a plurality of training data segment frames; performing the fast Fourier transform processing on each training data segment frame to obtain a plurality of training data frequency spectrums corresponding to the training data segment frames; labeling the plurality of training data frequency spectrums; and training the neural network model by using the plurality of labeled training data frequency spectrums.
In a second aspect, an embodiment of the present disclosure provides an apparatus for determining a voice endpoint, including: the framing module is used for receiving the voice segments and performing framing operation on the voice segments to obtain a plurality of voice segment frames; the processing module is used for respectively carrying out fast Fourier transform processing on the voice fragment frames to obtain a plurality of Fourier spectrums corresponding to the voice fragment frames; the model module is used for inputting the Fourier spectrums into a neural network model and outputting judgment scores and a starting point detection result; and the determining module is used for determining whether a voice endpoint is detected through the voice endpoint detection algorithm according to the judgment score and the starting point detection result.
In a third aspect, embodiments of the present disclosure provide an electronic device. The electronic device includes a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with each other through the communication bus; the memory is configured to store a computer program; and the processor is configured to implement the method for determining a voice endpoint described above when executing the program stored in the memory.
In a fourth aspect, embodiments of the present disclosure provide a computer-readable storage medium storing a computer program that, when executed by a processor, implements the method for determining a voice endpoint described above.
Compared with the prior art, the technical solution provided by the embodiments of the present disclosure has at least some or all of the following advantages: a voice segment is received and a framing operation is performed on it to obtain a plurality of voice segment frames; fast Fourier transform processing is performed on each voice segment frame to obtain a plurality of Fourier spectrums corresponding to the voice segment frames; the Fourier spectrums are input into a neural network model, which outputs a judgment score and a starting point detection result; and whether a voice endpoint is detected is determined through a voice endpoint detection algorithm according to the judgment score and the starting point detection result. Because the voice segment is framed, processed by the fast Fourier transform, and input into the neural network model in sequence, and whether a voice endpoint is detected is determined by the voice endpoint detection algorithm from the resulting judgment score and starting point detection result, these technical means solve the prior-art problem of low accuracy in detecting voice endpoints in complex noise environments and thereby improve the accuracy of voice endpoint detection.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
To more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the related art are briefly described below; other drawings can evidently be obtained from them by those skilled in the art without inventive effort.
Fig. 1 is a block diagram schematically illustrating a hardware structure of a computer terminal according to a method for determining a voice endpoint according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a method of determining a voice endpoint of an embodiment of the present disclosure;
FIG. 3 schematically illustrates an internal network diagram of a neural network model of an embodiment of the present disclosure;
fig. 4 schematically illustrates a flowchart of a method for determining a voice endpoint according to an embodiment of the present disclosure;
fig. 5 is a block diagram schematically illustrating a structure of a speech endpoint determination apparatus according to an embodiment of the present disclosure;
fig. 6 schematically shows a block diagram of an electronic device provided in an embodiment of the present disclosure.
Detailed Description
The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments. It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
The method embodiments provided by the embodiments of the present disclosure may be executed on a computer terminal or a similar computing device. Taking execution on a computer terminal as an example, fig. 1 schematically shows a hardware structure block diagram of a computer terminal for a method for determining a voice endpoint according to an embodiment of the present disclosure. As shown in fig. 1, the computer terminal may include one or more processors 102 (only one is shown in fig. 1), where the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MPU) or a programmable logic device (PLD), and a memory 104 for storing data. Optionally, the computer terminal may further include a transmission device 106 for communication functions and an input/output device 108. Those skilled in the art will understand that the structure shown in fig. 1 is merely illustrative and does not limit the structure of the computer terminal; for example, the computer terminal may include more or fewer components than shown in fig. 1, or have an equivalent or different configuration.
The memory 104 may be used to store a computer program, for example, a software program and a module of an application software, such as a computer program corresponding to the voice endpoint determination method in the embodiment of the present disclosure, and the processor 102 executes various functional applications and data processing by running the computer program stored in the memory 104, so as to implement the method described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to a computer terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal. In one example, the transmission device 106 includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.
In an embodiment of the present disclosure, a method for determining a voice endpoint is provided, and fig. 2 schematically illustrates a flowchart of the method for determining a voice endpoint according to the embodiment of the present disclosure, where as shown in fig. 2, the flowchart includes the following steps:
step S202, receiving a voice segment, and performing framing operation on the voice segment to obtain a plurality of voice segment frames;
step S204, respectively carrying out fast Fourier transform processing on the voice fragment frames to obtain a plurality of Fourier spectrums corresponding to the voice fragment frames;
step S206, inputting the Fourier spectrums into a neural network model, and outputting judgment scores and a starting point detection result;
and step S208, determining, through a voice endpoint detection algorithm, whether a voice endpoint is detected according to the judgment score and the starting point detection result.
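For illustration only, the following is a minimal Python sketch of steps S202 to S206. It assumes 8 kHz audio, the 25 ms frame length and 10 ms frame shift used in the examples below, a Hamming window, and a hypothetical model callable standing in for the neural network model of step S206; none of these specifics are fixed by the present disclosure.

```python
import numpy as np

def frame_signal(segment, sample_rate=8000, frame_ms=25, shift_ms=10):
    """S202: split a 1-D voice segment into overlapping frames."""
    frame_len = sample_rate * frame_ms // 1000
    shift = sample_rate * shift_ms // 1000
    n_frames = 1 + (len(segment) - frame_len) // shift
    return np.stack([segment[i * shift:i * shift + frame_len]
                     for i in range(n_frames)])

def spectra_of(frames):
    """S204: magnitude Fourier spectrum of each windowed frame."""
    return np.abs(np.fft.rfft(frames * np.hamming(frames.shape[1]), axis=1))

def score_segment(segment, model):
    """S206: per-frame judgment scores from the neural network model."""
    return model(spectra_of(frame_signal(segment)))
```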
The embodiment of the disclosure can be used for a conversation scene between the intelligent voice robot and the user, and at the moment, the target object is a person. The execution subject of the disclosed embodiment is an intelligent voice robot.
In the method, a voice segment is received and a framing operation is performed on it to obtain a plurality of voice segment frames; fast Fourier transform processing is performed on each voice segment frame to obtain a plurality of Fourier spectrums corresponding to the voice segment frames; the Fourier spectrums are input into a neural network model, which outputs a judgment score and a starting point detection result; and whether a voice endpoint is detected is determined through the voice endpoint detection algorithm according to the judgment score and the starting point detection result. Because the voice segment is framed, processed by the fast Fourier transform, and input into the neural network model in sequence, and whether the voice endpoint is detected is determined by the voice endpoint detection algorithm from the resulting judgment score and starting point detection result, these technical means solve the prior-art problem of low accuracy in detecting voice endpoints in complex noise environments and thereby improve the accuracy of voice endpoint detection.
In step S208, determining whether a voice endpoint is detected through the voice endpoint detection algorithm according to the judgment score and the starting point detection result includes: when the starting point detection result indicates that the starting point of the voice segment has not been detected, judging whether the voice segment is uttered by a target object: if the voice segment is uttered by the target object, marking the voice starting point as a true value point and receiving the next voice segment, where marking the voice starting point as a true value point indicates that the starting point of the voice segment has been detected, and the voice endpoint includes the starting point of the voice segment; if the voice segment is not uttered by the target object, marking the voice starting point as a false value point and receiving the next voice segment, where marking the voice starting point as a false value point indicates that the starting point of the voice segment has not been detected.
For example, 1 is used to represent true, 0 is used to represent false, the speech starting point is marked as a false value point, i.e., the label of the speech starting point is marked as 0, and the speech starting point is marked as a true value point, i.e., the label of the speech starting point is marked as 1.
The target object may be a person, an animal, or any other object capable of making sound; if the target object is a person, judging whether the voice segment is uttered by the target object means judging whether the voice segment is uttered by a person. When a voice segment is received for the first time, the starting point detection result defaults to the starting point of the voice segment not having been detected, and it is judged whether the voice segment is uttered by the target object. If the voice segment is uttered by the target object, the voice starting point is marked as a true value point; if not, it is marked as a false value point. If, on first reception, the voice starting point is marked as a false value point, the first voice segment is environmental noise and contains no sound from the target object; when the second voice segment is received, the starting point detection result again defaults to the starting point not having been detected. All sounds not uttered by the target object count as environmental noise.
Whether the voice segment is uttered by the target object is judged according to the judgment score, and the voice starting point is marked as a true or false value point accordingly. Since whether the previously received voice segment's starting point was marked as a true or false value point determines the starting point detection result, the starting point detection result is related to the judgment score.
It can be understood that each voice segment inherits the state of the previous voice segment in order. For example, if the voice starting point is marked as a false value point when a voice segment is received for the first time, then when a voice segment is received for the second time, its starting point detection result defaults to the starting point not having been detected. Likewise, if the voice starting point is marked as a true value point for the first voice segment, the starting point detection result for the second voice segment defaults to the starting point having been detected.
Further, the state of a voice segment includes: whether a voice starting point has been detected, whether a voice ending point has been detected, the number of voiced voice segments, and the number of mute voice segments. Each time a voice segment is received and a voice endpoint determination is made, the state is updated.
In step S208, determining whether a voice endpoint is detected through the voice endpoint detection algorithm according to the judgment score and the starting point detection result includes: when the starting point detection result indicates that the starting point of the voice segment has been detected, judging whether the voice segment is uttered by the target object: if the voice segment is uttered by the target object, incrementing the number of voiced voice segments by one and receiving the next voice segment, where the voice endpoint includes the starting point of the voice segment and the ending point of the voice segment; if the voice segment is not uttered by the target object, incrementing the number of mute voice segments by one, and, when the number of mute voice segments is greater than a first preset threshold and the number of voiced voice segments is greater than a second preset threshold, determining that the ending point of the voice segment has been detected and no longer receiving the next voice segment.
If the starting point detection result indicates that the starting point of the voice segment has been detected, the voice segment is necessarily not the first one received, because the starting point detection result for the first received voice segment defaults to the starting point not having been detected. A detected starting point indicates that, when the previous voice segment was processed, the voice starting point was marked as a true value point. In this case, if the voice segment is judged to be uttered by the target object, the number of voiced voice segments is incremented by one and the next voice segment is received; if the voice segment is judged not to be uttered by the target object, the number of mute voice segments is incremented by one. That is, the state of the voice segment is updated each time a voice segment is received and a voice endpoint determination is made. When the number of mute voice segments is greater than the first preset threshold and the number of voiced voice segments is greater than the second preset threshold, the ending point of the voice segment is determined to have been detected and no further voice segments are received. Detection of the ending point indicates that the voice communication containing the current voice segment has been completed. The first and second preset thresholds are set according to the specific application.
In step S208, after the number of mute voice segments is incremented by one when the voice segment is not uttered by the target object, the method further includes: if the number of mute voice segments is not greater than the first preset threshold, marking the voice starting point as a false value point and receiving the next voice segment, where marking the voice starting point as a false value point indicates that the starting point of the voice segment has not been detected; or, if the number of mute voice segments is greater than the first preset threshold but the number of voiced voice segments is not greater than the second preset threshold, marking the voice starting point as a false value point and receiving the next voice segment. In other words, when the starting point detection result indicates that the starting point has been detected but the voice segment is not uttered by the target object, the voice starting point is marked as a false value point and the next voice segment is received whenever the number of mute voice segments is not greater than the first preset threshold, or is greater than the first preset threshold while the number of voiced voice segments is not greater than the second preset threshold. Marking the voice starting point as a false value point covers the situation in which the model detects a voiced voice segment followed only by non-voiced segments, with the number of voiced segments below the threshold; the detected speech is then treated as a false start, that is, a false speech starting point. The voice robot is an intelligent conversation robot with functions such as automatic call dialing, multi-round voice interaction, and intelligent intention judgment. It can be applied to business scenarios such as automatic telephone sales, product and service promotion, information verification, voice notification, and telephone collection.
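As a sketch of one consistent reading of this decision logic: the threshold values below and the reset of the counters on a false start are assumptions for illustration, not values or behavior fixed by the disclosure.

```python
SILENT_MAX = 30   # assumed first preset threshold (mute segments)
VOICED_MIN = 10   # assumed second preset threshold (voiced segments)

class EndpointState:
    """Per-utterance state inherited from one voice segment to the next."""
    def __init__(self):
        self.start_detected = False   # voice starting point marked true?
        self.voiced = 0               # number of voiced voice segments
        self.silent = 0               # number of mute voice segments

def update(state, is_target_speech):
    """Process one segment's classification; True => ending point detected."""
    if not state.start_detected:
        # No starting point yet: mark it as a true or false value point.
        state.start_detected = is_target_speech
        return False
    if is_target_speech:
        state.voiced += 1
        return False
    state.silent += 1
    if state.silent > SILENT_MAX:
        if state.voiced > VOICED_MIN:
            return True               # ending point detected, stop receiving
        # Too little voiced speech: a false speaking voice; re-mark the
        # starting point as a false value point (counter reset is assumed).
        state.__init__()
    return False
```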
Through the above technical means, the embodiments of the present disclosure can receive an entire utterance from the target object in full, avoiding truncation.
In step S202, performing framing operation on the speech segment to obtain a plurality of speech segment frames, including: receiving a framing instruction sent by a target object, and determining the frame length and frame shift corresponding to the framing operation according to the framing instruction; and performing the framing operation on the voice segment according to the frame length, the frame shift and the size of the voice segment to obtain a plurality of voice segment frames.
The framing operation in the embodiments of the present disclosure may be a method that applies a window to the voice segment; in fact, it may be any prior-art speech framing method. It should also be noted that, besides being determined from the framing instruction, the frame length and frame shift corresponding to the framing operation may be determined from default settings of a system, where the system corresponds to the apparatus for determining a voice endpoint. For example, with a frame length of 25 ms and a frame shift of 10 ms, 1 second of audio can be divided into 97 frames.
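Using the frame_signal sketch above, the frame count for this example can be checked directly; the exact count depends on how the final partial frame is handled, so conventions differ by one frame:

```python
import numpy as np

frames = frame_signal(np.zeros(8000), sample_rate=8000, frame_ms=25, shift_ms=10)
print(frames.shape)   # (98, 200) with this convention; counting only full
                      # shifts, (8000 - 200) // 80 = 97 matches the text
```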
Before step S202 is executed, that is, before the speech segment is framed to obtain a plurality of speech segment frames, the method further includes: and continuously receiving voice packets until the size of the received voice packets meets the preset size, and combining the received voice packets into the voice segments.
It should be noted that, in the embodiments of the present disclosure, a voice segment may be received directly and its voice endpoint determined; alternatively, voice packets may be received continuously, and when the total size of the received voice packets meets the preset size, the received voice packets are merged into a voice segment, whose voice endpoint is then determined.
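A sketch of the packet-merging alternative, assuming 16-bit, 8 kHz audio delivered as raw byte packets and an assumed 200 ms preset segment size (the disclosure does not fix these values):

```python
SEGMENT_BYTES = 2 * 8000 * 200 // 1000   # 200 ms of 16-bit, 8 kHz audio

def receive_segment(packet_source):
    """Accumulate voice packets until the preset size is met, then merge."""
    buf = bytearray()
    while len(buf) < SEGMENT_BYTES:
        buf.extend(next(packet_source))  # packet_source yields raw bytes
    return bytes(buf)
```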
In an optional embodiment, the neural network model includes: a first preset number of convolutional layers, a second preset number of fully connected layers, and an output layer, where the output layer consists of a fully connected layer and a softmax (normalized exponential) layer.
It should be noted that the last convolutional layer of the first preset number of convolutional layers is connected to the first fully connected layer of the second preset number of fully connected layers, and the last fully connected layer of the second preset number of fully connected layers is connected to the output layer.
Optionally, the neural network model includes: three convolutional layers, one fully connected layer, and an output layer, where the output layer consists of a fully connected layer and a softmax layer.
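A minimal PyTorch sketch of this topology. The disclosure fixes only the layer counts (three convolutional layers, one fully connected layer, and an output layer of a fully connected layer plus softmax); the channel widths, kernel sizes, input size, and two-class output below are assumptions, chosen so the parameter count lands near the roughly 26,000 reported below.

```python
import torch
import torch.nn as nn

class EndpointNet(nn.Module):
    def __init__(self, n_bins=101):      # e.g. 101 rfft bins for 200-sample frames
        super().__init__()
        self.convs = nn.Sequential(      # three convolutional layers
            nn.Conv1d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(8, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(8, 8, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(8 * n_bins, 32)          # one fully connected layer
        self.out = nn.Sequential(                    # output layer: FC + softmax
            nn.Linear(32, 2),
            nn.Softmax(dim=-1),
        )

    def forward(self, spectra):                      # spectra: (batch, n_bins)
        x = self.convs(spectra.unsqueeze(1))         # -> (batch, 8, n_bins)
        x = torch.relu(self.fc(x.flatten(1)))        # -> (batch, 32)
        return self.out(x)                           # per-frame judgment scores
```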
In an alternative embodiment, ambient noise data and call data are obtained; performing phoneme alignment processing on the call data by using a voice recognition tool to obtain phoneme alignment data; performing the framing operation on the environmental noise data and the phoneme alignment data to obtain a plurality of training data segment frames; respectively carrying out the fast Fourier transform processing on the training data segment frames to obtain a plurality of training data frequency spectrums corresponding to the training data segment frames; labeling the plurality of training data frequency spectrums; and training the neural network model by using a plurality of training data spectrums after the labeling processing.
The training data frequency spectrums are labeled by attaching a label to each spectrum, where the label is the judgment score corresponding to that spectrum. It should be noted that, because the starting point detection result is related to the judgment score, labeling the training data frequency spectrums in effect labels the corresponding judgment scores and starting point detection results at the same time; that is, the labels attached to the training data frequency spectrums include both the judgment scores and the starting point detection results corresponding to those spectrums.
The training method of the neural network model can be any one of the existing training methods in machine learning.
The ambient noise data may include noise from cars, wind, factories, rain, and the like; the call data may be telephone-channel data with a sampling rate of 8 kHz; and the speech recognition tool may be Kaldi, with the phoneme-level alignment of the call data generated by the Kaldi tool. For example, with a frame length of 25 ms and a frame shift of 10 ms, 1 second of audio can be divided into 97 frames, giving 97 alignment results such as "aabb silence cddde …". The non-silent parts are then extracted as phoneme alignment data according to the alignment result.
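A sketch of this data-preparation step, reusing the frame_signal and spectra_of helpers sketched earlier. The per-frame alignment format and the "sil" silence symbol are assumptions standing in for Kaldi's actual output; label 1 marks a speech frame and label 0 silence or ambient noise.

```python
import numpy as np

def build_training_set(noise_wavs, call_wavs, alignments):
    """Turn noise audio and phoneme-aligned call audio into labeled spectra."""
    X, y = [], []
    for wav in noise_wavs:                       # ambient noise: all frames 0
        s = spectra_of(frame_signal(wav))
        X.append(s)
        y.append(np.zeros(len(s), dtype=np.int64))
    for wav, ali in zip(call_wavs, alignments):  # ali: one phoneme per frame
        s = spectra_of(frame_signal(wav))
        n = min(len(s), len(ali))
        X.append(s[:n])
        y.append(np.array([0 if ph == "sil" else 1 for ph in ali[:n]],
                          dtype=np.int64))
    return np.concatenate(X), np.concatenate(y)
```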
With these technical means, the model size is about 126 KB and the model has about 26,000 parameters.
In order to better understand the technical solutions, the embodiments of the present disclosure also provide an alternative embodiment for explaining the technical solutions.
Fig. 3 schematically illustrates an internal network diagram of a neural network model according to an embodiment of the present disclosure, as shown in fig. 3:
the neural network model includes three convolutional layers, one fully connected layer, and an output layer, where the output layer consists of a fully connected layer and a softmax layer. The last convolutional layer is connected to the first fully connected layer, and the last fully connected layer is connected to the output layer.
Fig. 4 schematically shows a flowchart of a method for determining a voice endpoint according to an embodiment of the present disclosure, as shown in fig. 4:
s402: receiving a voice segment;
s404: performing framing operation on the voice segments to obtain a plurality of voice segment frames;
s406: respectively carrying out fast Fourier transform processing on the voice fragment frames to obtain a plurality of Fourier spectrums corresponding to the voice fragment frames;
s408: inputting the Fourier spectrums into a neural network model, and outputting judgment scores and a starting point detection result;
s410: judging whether the starting point detection result indicates that the starting point of the voice segment has been detected;
s412: if the starting point detection result indicates that the starting point of the voice segment has not been detected, judging whether the voice segment is uttered by a target object;
s414: if the voice segment is uttered by the target object, marking the voice starting point as a true value point and receiving the next voice segment;
s416: if the voice segment is not uttered by the target object, marking the voice starting point as a false value point and receiving the next voice segment;
s418: if the starting point detection result indicates that the starting point of the voice segment has been detected, judging whether the voice segment is uttered by the target object;
s420: if the voice segment is uttered by the target object, incrementing the number of voiced voice segments by one and receiving the next voice segment;
s422: if the voice segment is not uttered by the target object, incrementing the number of mute voice segments by one, and, if the number of mute voice segments is greater than a first preset threshold and the number of voiced voice segments is greater than a second preset threshold, determining that the ending point of the voice segment has been detected and no longer receiving the next voice segment;
s424: if the voice segment is not uttered by the target object and the number of mute voice segments, after incrementing by one, is not greater than the first preset threshold, or is greater than the first preset threshold while the number of voiced voice segments is not greater than the second preset threshold, marking the voice starting point as a false value point and receiving the next voice segment.
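Tying the sketches above together, a hypothetical driver for the S402-S424 flow might look as follows. The score threshold and the majority vote over frames are assumptions; the disclosure does not specify how per-frame judgment scores are reduced to a per-segment decision.

```python
import numpy as np
import torch

def run_endpointing(packet_source, model, score_threshold=0.5):
    """Receive voice segments until the ending point of the utterance is found."""
    state = EndpointState()
    while True:
        pcm = np.frombuffer(receive_segment(packet_source), dtype=np.int16)
        spectra = spectra_of(frame_signal(pcm.astype(np.float32)))
        with torch.no_grad():
            scores = model(torch.from_numpy(spectra).float())[:, 1]
        # Assumed reduction: segment is target speech if most frames score high.
        is_target = bool((scores > score_threshold).float().mean() > 0.5)
        if update(state, is_target):   # S410-S424 decision logic
            break                      # ending point detected (S422)
```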
In the method, a voice segment is received and a framing operation is performed on it to obtain a plurality of voice segment frames; fast Fourier transform processing is performed on each voice segment frame to obtain a plurality of Fourier spectrums corresponding to the voice segment frames; the Fourier spectrums are input into a neural network model, which outputs a judgment score and a starting point detection result; and whether a voice endpoint is detected is determined through the voice endpoint detection algorithm according to the judgment score and the starting point detection result. Because the voice segment is framed, processed by the fast Fourier transform, and input into the neural network model in sequence, and whether the voice endpoint is detected is determined by the voice endpoint detection algorithm from the resulting judgment score and starting point detection result, these technical means solve the prior-art problem of low accuracy in detecting voice endpoints in complex noise environments and thereby improve the accuracy of voice endpoint detection.
Through the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, or by hardware, though the former is in many cases the better implementation. Based on this understanding, the technical solutions of the present disclosure, or the portions contributing over the prior art, may be embodied in the form of a software product stored in a storage medium (such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc), including several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device) to execute the methods of the embodiments of the present disclosure.
In this embodiment, a device for determining a voice endpoint is further provided, where the device for determining a voice endpoint is used to implement the foregoing embodiments and preferred embodiments, and details are not repeated for what has been described. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
Fig. 5 is a block diagram schematically illustrating a structure of an apparatus for determining a speech endpoint according to an alternative embodiment of the present disclosure, and as shown in fig. 5, the apparatus includes:
a framing module 502, configured to receive a voice segment and perform framing operation on the voice segment to obtain multiple voice segment frames;
a processing module 504, configured to perform fast fourier transform processing on the multiple voice fragment frames respectively to obtain multiple fourier spectrums corresponding to the multiple voice fragment frames;
a model module 506, configured to input the multiple fourier spectrums into a neural network model, and output a judgment score and a starting point detection result;
a determining module 508, configured to determine whether a voice endpoint is detected through the voice endpoint detection algorithm according to the judgment score and the starting point detection result.
The embodiment of the disclosure can be used for a conversation scene between the intelligent voice robot and the user, and at the moment, the target object is a person. The execution subject of the disclosed embodiment is an intelligent voice robot.
In the apparatus, a voice segment is received and a framing operation is performed on it to obtain a plurality of voice segment frames; fast Fourier transform processing is performed on each voice segment frame to obtain a plurality of Fourier spectrums corresponding to the voice segment frames; the Fourier spectrums are input into a neural network model, which outputs a judgment score and a starting point detection result; and whether a voice endpoint is detected is determined through the voice endpoint detection algorithm according to the judgment score and the starting point detection result. Because the voice segment is framed, processed by the fast Fourier transform, and input into the neural network model in sequence, and whether the voice endpoint is detected is determined by the voice endpoint detection algorithm from the resulting judgment score and starting point detection result, these technical means solve the prior-art problem of low accuracy in detecting voice endpoints in complex noise environments and thereby improve the accuracy of voice endpoint detection.
Optionally, the determining module 508 is further configured to: when the starting point detection result indicates that the starting point of the voice segment has not been detected, judge whether the voice segment is uttered by a target object: if the voice segment is uttered by the target object, mark the voice starting point as a true value point and receive the next voice segment, where marking the voice starting point as a true value point indicates that the starting point of the voice segment has been detected, and the voice endpoint includes the starting point of the voice segment; if the voice segment is not uttered by the target object, mark the voice starting point as a false value point and receive the next voice segment, where marking the voice starting point as a false value point indicates that the starting point of the voice segment has not been detected.
For example, 1 is used to represent true, 0 is used to represent false, the speech starting point is marked as a false value point, i.e., the label of the speech starting point is marked as 0, and the speech starting point is marked as a true value point, i.e., the label of the speech starting point is marked as 1.
The target object may be a person, an animal, or any other object capable of making sound; if the target object is a person, judging whether the voice segment is uttered by the target object means judging whether the voice segment is uttered by a person. When a voice segment is received for the first time, the starting point detection result defaults to the starting point of the voice segment not having been detected, and it is judged whether the voice segment is uttered by the target object. If the voice segment is uttered by the target object, the voice starting point is marked as a true value point; if not, it is marked as a false value point. If, on first reception, the voice starting point is marked as a false value point, the first voice segment is environmental noise and contains no sound from the target object; when the second voice segment is received, the starting point detection result again defaults to the starting point not having been detected. All sounds not uttered by the target object count as environmental noise.
It can be understood that each voice segment inherits the state of the previous voice segment in order. For example, if the voice starting point is marked as a false value point when a voice segment is received for the first time, then when a voice segment is received for the second time, its starting point detection result defaults to the starting point not having been detected. Likewise, if the voice starting point is marked as a true value point for the first voice segment, the starting point detection result for the second voice segment defaults to the starting point having been detected.
Further, the state of a voice segment includes: whether a voice starting point has been detected, whether a voice ending point has been detected, the number of voiced voice segments, and the number of mute voice segments. Each time a voice segment is received and a voice endpoint determination is made, the state is updated.
Optionally, the determining module 508 is further configured to: when the starting point detection result indicates that the starting point of the voice segment has been detected, judge whether the voice segment is uttered by the target object: if the voice segment is uttered by the target object, increment the number of voiced voice segments by one and receive the next voice segment, where the voice endpoint includes the starting point of the voice segment and the ending point of the voice segment; if the voice segment is not uttered by the target object, increment the number of mute voice segments by one, and, when the number of mute voice segments is greater than a first preset threshold and the number of voiced voice segments is greater than a second preset threshold, determine that the ending point of the voice segment has been detected and no longer receive the next voice segment.
If the starting point detection result indicates that the starting point of the voice segment has been detected, the voice segment is necessarily not the first one received, because the starting point detection result for the first received voice segment defaults to the starting point not having been detected. A detected starting point indicates that, when the previous voice segment was processed, the voice starting point was marked as a true value point. In this case, if the voice segment is judged to be uttered by the target object, the number of voiced voice segments is incremented by one and the next voice segment is received; if the voice segment is judged not to be uttered by the target object, the number of mute voice segments is incremented by one. That is, the state of the voice segment is updated each time a voice segment is received and a voice endpoint determination is made. When the number of mute voice segments is greater than the first preset threshold and the number of voiced voice segments is greater than the second preset threshold, the ending point of the voice segment is determined to have been detected and no further voice segments are received. Detection of the ending point indicates that the voice communication containing the current voice segment has been completed. The first and second preset thresholds are set according to the specific application.
Optionally, the determining module 508 is further configured to: when the number of mute voice segments is not greater than the first preset threshold, mark the voice starting point as a false value point and receive the next voice segment, where marking the voice starting point as a false value point indicates that the starting point of the voice segment has not been detected; or, when the number of mute voice segments is greater than the first preset threshold but the number of voiced voice segments is not greater than the second preset threshold, mark the voice starting point as a false value point and receive the next voice segment.
When the starting point detection result indicates that the starting point of the voice segment has been detected but the voice segment is not uttered by the target object, the voice starting point is marked as a false value point and the next voice segment is received whenever the number of mute voice segments is not greater than the first preset threshold, or is greater than the first preset threshold while the number of voiced voice segments is not greater than the second preset threshold. Marking the voice starting point as a false value point covers the situation in which the model detects a voiced voice segment followed only by non-voiced segments, with the number of voiced segments below the threshold; the detected speech is then treated as a false start, that is, a false speech starting point. The voice robot is an intelligent conversation robot with functions such as automatic call dialing, multi-round voice interaction, and intelligent intention judgment. It can be applied to business scenarios such as automatic telephone sales, product and service promotion, information verification, voice notification, and telephone collection.
Through the above technical means, the embodiments of the present disclosure can receive an entire utterance from the target object in full, avoiding truncation.
Optionally, the framing module 502 is further configured to receive a framing instruction sent by the target object, and determine a frame length and a frame shift corresponding to the framing operation according to the framing instruction; and performing the framing operation on the voice segment according to the frame length, the frame shift and the size of the voice segment to obtain a plurality of voice segment frames.
The framing operation in the embodiments of the present disclosure may be a method that applies a window to the voice segment; in fact, it may be any prior-art speech framing method. It should also be noted that, besides being determined from the framing instruction, the frame length and frame shift corresponding to the framing operation may be determined from default settings of a system, where the system corresponds to the apparatus for determining a voice endpoint. For example, with a frame length of 25 ms and a frame shift of 10 ms, 1 second of audio can be divided into 97 frames.
Optionally, the framing module 502 is further configured to continuously receive the voice packets until the size of the received voice packets meets a preset size, and combine the received voice packets into the voice segments.
It should be noted that, in the embodiments of the present disclosure, a voice segment may be received directly and its voice endpoint determined; alternatively, voice packets may be received continuously, and when the total size of the received voice packets meets the preset size, the received voice packets are merged into a voice segment, whose voice endpoint is then determined.
In an optional embodiment, the neural network model comprises: the device comprises a first preset number of convolution layers, a second preset number of full-connection layers and an output layer, wherein the output layer is composed of the full-connection layers and a normalization index layer.
It should be noted that the last of the first preset number of convolutional layers is connected to the first of the second preset number of fully connected layers, and the last fully connected layer is connected to the output layer.
Optionally, the neural network model comprises: three convolutional layers, one fully connected layer, and an output layer, wherein the output layer consists of a fully connected layer and the softmax layer.
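A minimal PyTorch sketch of this topology is shown below; the channel counts, kernel sizes, and pooling are assumptions, since the disclosure fixes only the layer types:

```python
import torch
import torch.nn as nn

class VadNet(nn.Module):
    """Three convolutional layers, one fully connected layer, and an
    output layer built from a fully connected layer plus softmax."""

    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(8),   # makes the net input-length agnostic
        )
        self.fc = nn.Sequential(nn.Linear(16 * 8, 64), nn.ReLU())
        # Output layer: fully connected layer followed by softmax
        # (the "normalized exponential" layer in the text).
        self.out = nn.Sequential(nn.Linear(64, n_classes), nn.Softmax(dim=-1))

    def forward(self, spectrum: torch.Tensor) -> torch.Tensor:
        # spectrum: (batch, n_bins) magnitude spectrum of one frame,
        # e.g. 101 bins from a 200-sample (25 ms at 8 kHz) frame
        x = self.convs(spectrum.unsqueeze(1))    # -> (batch, 16, 8)
        return self.out(self.fc(x.flatten(1)))   # -> (batch, n_classes)
```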
Optionally, the model module 506 is further configured to obtain ambient noise data and call data; performing phoneme alignment processing on the call data by using a voice recognition tool to obtain phoneme alignment data; performing the framing operation on the environmental noise data and the phoneme alignment data to obtain a plurality of training data segment frames; respectively carrying out the fast Fourier transform processing on the training data segment frames to obtain a plurality of training data frequency spectrums corresponding to the training data segment frames; labeling the plurality of training data frequency spectrums; and training the neural network model by using a plurality of training data spectrums after the labeling processing.
The ambient noise data may include car, wind, factory, and rain noise; the call data may be telephone-channel data sampled at 8 kHz; and the speech recognition tool may be Kaldi, which generates a phoneme-level alignment of the call data. For example, with a frame length of 25 ms and a frame shift of 10 ms, one second of audio is divided into 97 frames, giving 97 per-frame alignment results such as "a a b b silence c d d e …". The non-silent parts are then extracted as the phoneme alignment data.
The training data spectra are labeled by attaching a label to each spectrum, where the label is the judgment score corresponding to that spectrum. It should be noted that the starting point detection result is derived from the judgment score, so labeling the training data spectra in effect annotates both quantities at once; that is, the label attached to each training data spectrum comprises its corresponding judgment score and starting point detection result.
The neural network model may be trained with any existing machine-learning training method.
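Assuming the framing helper sketched earlier is available, the data-preparation steps above might be strung together as follows; the binary speech/non-speech labels are a simplification of the judgment-score labels described in the text:

```python
import numpy as np

def prepare_training_spectra(noise_wavs, aligned_speech_wavs):
    """Turn raw waveforms into labeled magnitude spectra.

    noise_wavs: ambient noise clips -> label 0 (non-speech).
    aligned_speech_wavs: non-silent regions kept after Kaldi phoneme
    alignment -> label 1 (speech).
    """
    examples = []
    for label, wavs in ((0, noise_wavs), (1, aligned_speech_wavs)):
        for wav in wavs:
            for frame in frame_segment(wav):        # 25 ms / 10 ms framing
                spec = np.abs(np.fft.rfft(frame))   # fast Fourier transform
                examples.append((spec, label))
    return examples
```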
With the technical means of the embodiments above, the resulting model is only about 126 KB in size and has roughly 26,000 parameters.
It should be noted that the above modules may be implemented in software or hardware; in the latter case, implementations include, but are not limited to, all modules residing in the same processor, or the modules being distributed among different processors in any combination.
Embodiments of the present disclosure provide an electronic device.
Fig. 6 schematically shows a block diagram of an electronic device provided in an embodiment of the present disclosure.
Referring to fig. 6, an electronic device 600 provided in the embodiment of the present disclosure includes a processor 601, a communication interface 602, a memory 603, and a communication bus 604, where the processor 601, the communication interface 602, and the memory 603 communicate with each other through the communication bus 604; the memory 603 is configured to store a computer program; and the processor 601 is configured to implement the steps in any of the above method embodiments when executing the program stored in the memory 603.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
S1, receiving a voice segment, and performing a framing operation on the voice segment to obtain a plurality of voice segment frames;
S2, performing fast Fourier transform processing on the voice segment frames respectively to obtain a plurality of Fourier spectrums corresponding to the voice segment frames;
S3, inputting the Fourier spectrums into a neural network model, and outputting judgment scores and a starting point detection result;
and S4, determining, by a voice endpoint detection algorithm, whether a voice endpoint is detected according to the judgment scores and the starting point detection result.
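Read together, steps S1 to S4 amount to the following loop. This is a hedged sketch reusing the helpers and model sketched earlier (`next_segment`, `frame_segment`, `VadNet`, `EndpointDecision`); the 0.5 score threshold is an assumption:

```python
import numpy as np
import torch

def run_vad(recv_packet, model, detector):
    """Receive voice segments and run steps S1-S4 until an endpoint is found."""
    while True:
        raw = next_segment(recv_packet)                             # S1: receive
        samples = np.frombuffer(raw, dtype=np.int16).astype(np.float32)
        frames = frame_segment(samples)                             # S1: framing
        spectra = np.abs(np.fft.rfft(frames, axis=1))               # S2: FFT
        with torch.no_grad():                                       # S3: model
            scores = model(torch.from_numpy(spectra).float())[:, 1]
        is_speech = scores.mean().item() > 0.5                      # assumed cut-off
        if detector.update(is_speech) == "endpoint":                # S4: decision
            return
```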
Embodiments of the present disclosure also provide a computer-readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of any of the method embodiments described above.
Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the following steps:
S1, receiving a voice segment, and performing a framing operation on the voice segment to obtain a plurality of voice segment frames;
S2, performing fast Fourier transform processing on the voice segment frames respectively to obtain a plurality of Fourier spectrums corresponding to the voice segment frames;
S3, inputting the Fourier spectrums into a neural network model, and outputting judgment scores and a starting point detection result;
and S4, determining, by a voice endpoint detection algorithm, whether a voice endpoint is detected according to the judgment scores and the starting point detection result.
The computer-readable storage medium may be contained in the apparatus/device described in the above embodiments; or may be present alone without being assembled into the device/apparatus. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.
It will be apparent to those skilled in the art that the modules or steps of the present disclosure described above may be implemented by a general-purpose computing device; they may be centralized on a single computing device or distributed across a network of multiple computing devices. Alternatively, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by a computing device; in some cases, the steps shown or described may be performed in an order different from that described herein. They may also be fabricated separately as individual integrated circuit modules, or multiple modules or steps among them may be fabricated as a single integrated circuit module. As such, the present disclosure is not limited to any specific combination of hardware and software.
The above description covers only preferred embodiments of the present disclosure and is not intended to limit it; those skilled in the art may make various modifications and changes to the present disclosure. Any modification, equivalent replacement, improvement, or the like made within the principle of the present disclosure shall fall within its protection scope.

Claims (10)

1. A method for determining a voice endpoint, comprising:
receiving a voice segment, and performing a framing operation on the voice segment to obtain a plurality of voice segment frames;
performing fast Fourier transform processing on the voice segment frames respectively to obtain a plurality of Fourier spectrums corresponding to the voice segment frames;
inputting the Fourier spectrums into a pre-trained neural network model, and outputting judgment scores and a starting point detection result;
and determining, by a voice endpoint detection algorithm, whether a voice endpoint is detected according to the judgment scores and the starting point detection result.
2. The method according to claim 1, wherein the determining whether a voice endpoint is detected by the voice endpoint detection algorithm according to the judgment score and the starting point detection result comprises:
and under the condition that the starting point detection result is that the starting point of the voice segment is not detected, judging whether the voice segment is sent by a target object or not:
in the case that the voice segment is sent by the target object, marking a voice starting point as a true value point, and receiving the next voice segment, wherein the voice starting point is marked as the true value point and is used for indicating that the starting point of the voice segment is detected, and the voice end point comprises the starting point of the voice segment;
and in the case that the voice segment is not sent by the target object, marking the voice starting point as a false value point, and receiving the next voice segment, wherein the voice starting point marked as the false value point indicates that the starting point of the voice segment is not detected.
3. The method according to claim 1, wherein the determining whether a voice endpoint is detected by the voice endpoint detection algorithm according to the judgment score and the starting point detection result comprises:
and under the condition that the starting point detection result is that the starting point of the voice segment is detected, judging whether the voice segment is sent by a target object or not:
in the case that the voice segment is sent by the target object, adding one to the number of voiced voice segments, and receiving the next voice segment, wherein the voice endpoint comprises: a starting point of the voice segment and an ending point of the voice segment;
and in the case that the voice segment is not sent by the target object, adding one to the number of mute voice segments, and in the case that the number of mute voice segments is greater than a first preset threshold and the number of voiced voice segments is greater than a second preset threshold, determining that the ending point of the voice segment is detected and no longer receiving a next voice segment.
4. The method according to claim 3, wherein after adding one to the number of mute voice segments in the case that the voice segment is not sent by the target object, the method further comprises:
in the case that the number of mute voice segments is not greater than the first preset threshold, marking a voice starting point as a false value point and receiving the next voice segment, wherein the voice starting point marked as the false value point indicates that the starting point of the voice segment is not detected; or
in the case that the number of mute voice segments is greater than the first preset threshold but the number of voiced voice segments is not greater than the second preset threshold, marking the voice starting point as a false value point and receiving the next voice segment.
5. The method of claim 1, wherein the performing of the framing operation on the voice segment to obtain the plurality of voice segment frames comprises:
receiving a framing instruction sent by a target object, and determining the frame length and frame shift corresponding to the framing operation according to the framing instruction;
and performing the framing operation on the voice segment according to the frame length, the frame shift and the size of the voice segment to obtain a plurality of voice segment frames.
6. The method of claim 1, wherein before the performing of the framing operation on the voice segment to obtain the plurality of voice segment frames, the method further comprises:
and continuously receiving voice packets until the size of the received voice packets meets the preset size, and combining the received voice packets into the voice segments.
7. The method of claim 1, wherein the neural network model comprises: a first preset number of convolutional layers, a second preset number of fully connected layers, and an output layer, wherein the output layer consists of a fully connected layer and a normalized exponential (softmax) layer.
8. The method of claim 1, further comprising:
acquiring environmental noise data and call data;
performing phoneme alignment processing on the call data by using a voice recognition tool to obtain phoneme alignment data;
performing the framing operation on the environmental noise data and the phoneme alignment data to obtain a plurality of training data segment frames;
respectively carrying out the fast Fourier transform processing on the training data segment frames to obtain a plurality of training data frequency spectrums corresponding to the training data segment frames;
labeling the plurality of training data frequency spectrums;
and training the neural network model by using a plurality of training data spectrums after the labeling processing.
9. An apparatus for determining a voice endpoint, comprising:
the framing module is used for receiving the voice segments and performing framing operation on the voice segments to obtain a plurality of voice segment frames;
the processing module is used for performing fast Fourier transform processing on the voice segment frames respectively to obtain a plurality of Fourier spectrums corresponding to the voice segment frames;
the model module is used for inputting the Fourier spectrums into a neural network model and outputting judgment scores and a starting point detection result;
and the determining module is used for determining, by a voice endpoint detection algorithm, whether a voice endpoint is detected according to the judgment scores and the starting point detection result.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1 to 8.
CN202111436597.0A 2021-11-29 2021-11-29 Method, device and storage medium for determining voice endpoint Pending CN114005436A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111436597.0A CN114005436A (en) 2021-11-29 2021-11-29 Method, device and storage medium for determining voice endpoint

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111436597.0A CN114005436A (en) 2021-11-29 2021-11-29 Method, device and storage medium for determining voice endpoint

Publications (1)

Publication Number Publication Date
CN114005436A true CN114005436A (en) 2022-02-01

Family

ID=79930856

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111436597.0A Pending CN114005436A (en) 2021-11-29 2021-11-29 Method, device and storage medium for determining voice endpoint

Country Status (1)

Country Link
CN (1) CN114005436A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH03118600A (en) * 1989-10-02 1991-05-21 Toshiba Corp Speech recognizing device
KR20020095502A (en) * 2001-06-14 2002-12-27 엘지전자 주식회사 Method for detecting end point of noise surroundings
CN107633842A (en) * 2017-06-12 2018-01-26 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium
US20190266997A1 (en) * 2018-02-23 2019-08-29 Kabushiki Kaisha Toshiba Word detection system, word detection method, and storage medium
CN111105803A (en) * 2019-12-30 2020-05-05 苏州思必驰信息科技有限公司 Method and device for quickly identifying gender and method for generating algorithm model for identifying gender
WO2021139425A1 (en) * 2020-07-31 2021-07-15 平安科技(深圳)有限公司 Voice activity detection method, apparatus and device, and storage medium
CN112802498A (en) * 2020-12-29 2021-05-14 深圳追一科技有限公司 Voice detection method and device, computer equipment and storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115580682A (en) * 2022-12-07 2023-01-06 北京云迹科技股份有限公司 Method and device for determining on-hook time of robot call dialing

Similar Documents

Publication Publication Date Title
CN107623614B (en) Method and device for pushing information
CN110534099B (en) Voice wake-up processing method and device, storage medium and electronic equipment
US9899021B1 (en) Stochastic modeling of user interactions with a detection system
CN105190746B (en) Method and apparatus for detecting target keyword
CN110047481B (en) Method and apparatus for speech recognition
CN112530408A (en) Method, apparatus, electronic device, and medium for recognizing speech
CN113362828A (en) Method and apparatus for recognizing speech
CN112581938B (en) Speech breakpoint detection method, device and equipment based on artificial intelligence
US11626104B2 (en) User speech profile management
CN109712610A (en) The method and apparatus of voice for identification
CN109065051B (en) Voice recognition processing method and device
WO2023030235A1 (en) Target audio output method and system, readable storage medium, and electronic apparatus
CN112017642B (en) Speech recognition method, apparatus, device and computer readable storage medium
CN104781862A (en) Real-time traffic detection
CN103514882A (en) Voice identification method and system
US11410685B1 (en) Method for detecting voice splicing points and storage medium
CN111768789A (en) Electronic equipment and method, device and medium for determining identity of voice sender thereof
CN114005436A (en) Method, device and storage medium for determining voice endpoint
CN112669821B (en) Voice intention recognition method, device, equipment and storage medium
JP2003241788A (en) Device and system for speech recognition
CN108989551B (en) Position prompting method and device, storage medium and electronic equipment
WO2019228140A1 (en) Instruction execution method and apparatus, storage medium, and electronic device
CN115132197B (en) Data processing method, device, electronic equipment, program product and medium
CN113555037B (en) Method and device for detecting tampered area of tampered audio and storage medium
CN113724690B (en) PPG feature output method, target audio output method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination