CN116844541A - Audio data processing method, man-machine interaction method, equipment and storage medium - Google Patents

Audio data processing method, man-machine interaction method, equipment and storage medium

Info

Publication number
CN116844541A
Authority
CN
China
Prior art keywords
audio
training
fragment
audio fragment
segment
Prior art date
Legal status
Pending
Application number
CN202310678990.3A
Other languages
Chinese (zh)
Inventor
史莫晗
左玲云
陈谦
张仕良
舒钰淳
张结
戴礼荣
Current Assignee
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba China Co Ltd
Priority to CN202310678990.3A
Publication of CN116844541A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26 Speech to text systems
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/78 Detection of presence or absence of voice signals

Abstract

The embodiment of the invention provides an audio data processing method, a man-machine interaction method, a device and a storage medium. The method comprises: acquiring a first audio segment; determining, according to semantic features of the first audio segment, whether the first audio segment and a second audio segment acquired before it form complete semantics; if they do, determining the first audio segment and the second audio segment as the audio to be processed; and finally responding to the audio to be processed. Because the semantic features of the first audio segment are available as soon as the first audio segment is acquired, whether the user has produced semantically complete audio can be determined immediately from those features, which shortens the time needed to establish semantic completeness. The processing device can then respond to the semantically complete audio right away, reducing the time delay of the man-machine conversation.

Description

Audio data processing method, man-machine interaction method, equipment and storage medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to an audio data processing method, a man-machine interaction method, a device, and a storage medium.
Background
A common application scenario for artificial intelligence technology is the man-machine conversation. For example, a dialogue system may conduct a voice conversation with a user. The dialogue system may be integrated into a hardware interaction device, such as a smart speaker or a mobile terminal, or deployed in the cloud as an online intelligent customer service to meet user needs such as information inquiry and information feedback.
During a voice conversation, the dialogue system collects the audio data generated by the user in real time; after determining that the audio data has complete semantics, it performs semantic recognition on the audio data and responds according to the recognized semantics, thereby completing one round of the conversation. The time between the dialogue system collecting the audio data and the dialogue system responding to it is the time delay of the man-machine conversation.
This time delay clearly has a serious impact on the conversation experience, so reducing the time delay of the man-machine conversation is an urgent problem to be solved.
Disclosure of Invention
In view of this, the embodiments of the present invention provide an audio data processing method, a man-machine interaction method, a device and a storage medium, which are used for reducing the time delay of a man-machine conversation and improving the fluency of the man-machine conversation.
In a first aspect, an embodiment of the present invention provides an audio data processing method, including:
acquiring a first audio fragment;
determining whether the first audio fragment and a second audio fragment form complete semantics according to semantic features of the first audio fragment, wherein the second audio fragment is acquired before the first audio fragment;
if the first audio fragment and the second audio fragment form complete semantics, determining the first audio fragment and the second audio fragment as audio to be processed;
responding to the audio to be processed.
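For illustration only, the following is a minimal sketch of how the steps of the first aspect could be combined in code; the helper callables extract_semantic_features, is_semantically_complete and respond are hypothetical placeholders for the feature extraction, semantic-completeness judgment and response steps described in the embodiments, not part of the claimed method.

    def process_stream(audio_segments, extract_semantic_features,
                       is_semantically_complete, respond):
        buffered = []                            # second audio segments acquired earlier
        for first_segment in audio_segments:     # newly acquired first audio segment
            buffered.append(first_segment)
            features = extract_semantic_features(first_segment)
            if is_semantically_complete(features):
                respond(buffered)                # buffered now holds the audio to be processed
                buffered = []                    # start collecting the next utterance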
In a second aspect, an embodiment of the present invention provides an audio data processing method, including:
acquiring semantic features of a training audio fragment and a reference type of the training audio fragment, wherein the training audio fragment is any audio fragment in training audio;
training a classification model by taking the semantic features of the training audio fragment as training data and the reference type as supervision information, wherein the classification result output by the classification model is used for judging whether the training audio fragment and the audio fragments located before it in the training audio form complete semantics;
performing loss calculation on the prediction type of the training audio fragment output by the classification model and the reference type;
optimizing the classification model according to the loss calculation result.
In a third aspect, an embodiment of the present invention provides a human-computer interaction method, including:
responding to input operation of a user, and collecting a first audio fragment generated by the user;
outputting a response result of the audio to be processed with complete semantics, wherein whether the semantics of the audio to be processed are complete or not is judged according to the semantic features of the first audio segment, and the audio to be processed comprises the first audio segment and a second audio segment acquired before the first audio segment.
In a fourth aspect, an embodiment of the present invention provides an electronic device, including a processor and a memory, where the memory is configured to store one or more computer instructions, where the one or more computer instructions, when executed by the processor, implement the audio data processing method in the first aspect or the second aspect, or the man-machine interaction method in the third aspect. The electronic device may also include a communication interface for communicating with other devices or communication systems.
In a fifth aspect, embodiments of the present invention provide a non-transitory machine-readable storage medium having executable code stored thereon, which when executed by a processor of an electronic device, causes the processor to implement at least the audio data processing method as in the first or second aspect, or the human-machine interaction method as in the third aspect.
The audio data processing method provided by the embodiment of the invention acquires a first audio segment and determines, according to the semantic features of the first audio segment, whether the first audio segment and a second audio segment acquired before it form complete semantics. If the first audio segment and the second audio segment form complete semantics, they are determined as the audio to be processed, and finally the audio to be processed is responded to.
In this method, because the semantic features of the first audio segment can be obtained as soon as the first audio segment is acquired, whether the user has produced semantically complete audio can be determined immediately from those features; the time required to determine semantic completeness is shortened, and the processing device can respond to the semantically complete audio right away, which reduces the time delay of the man-machine conversation. Moreover, compared with acoustic features, semantic features carry much richer semantic information, so they allow more accurate identification of whether the user has produced semantically complete audio; responding on the basis of that reliable identification improves the fluency of the man-machine conversation.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of an audio data processing method according to an embodiment of the present invention;
FIG. 2 is a flowchart of another audio data processing method according to an embodiment of the present invention;
FIG. 3 is a flowchart of yet another audio data processing method according to an embodiment of the present invention;
FIG. 4 is a flowchart of a model training method according to an embodiment of the present invention;
FIG. 5 is a flowchart of another model training method according to an embodiment of the present invention;
FIG. 6 is a flowchart of yet another model training method according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a model for joint training according to an embodiment of the present invention;
FIG. 8 is a flowchart of another audio data processing method according to an embodiment of the present invention;
FIG. 9 is a flowchart of a man-machine interaction method according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of a man-machine interaction scenario provided in an embodiment of the present invention;
FIG. 11 is a schematic structural diagram of an operation interface according to an embodiment of the present invention;
fig. 12 is a schematic structural diagram of an audio data processing device according to an embodiment of the present invention;
fig. 13 is a schematic structural diagram of another audio data processing device according to an embodiment of the present invention;
fig. 14 is a schematic structural diagram of a man-machine interaction device according to an embodiment of the present invention;
FIG. 15 is a schematic structural diagram of an electronic device according to an embodiment of the present invention;
FIG. 16 is a schematic structural diagram of another electronic device according to an embodiment of the present invention;
fig. 17 is a schematic structural diagram of still another electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, the "plurality" generally includes at least two, but does not exclude the case of at least one.
It should be understood that the term "and/or" as used herein merely describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean that A exists alone, that A and B exist together, or that B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
The word "if", as used herein, may be interpreted as "when" or "upon" or "in response to determining" or "in response to identifying", depending on the context. Similarly, the phrases "if it is determined" or "if (a stated condition or event) is identified" may be interpreted as "when it is determined" or "in response to determining" or "when (the stated condition or event) is identified" or "in response to identifying (the stated condition or event)", depending on the context.
It should be noted that the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a product or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a product or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the product or system that comprises that element.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present invention are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region, and provide corresponding operation entries for the user to select authorization or rejection.
Before describing the audio data processing method and the man-machine interaction method provided in the following embodiments of the present invention in detail, related concepts related to the following embodiments may be explained:
Voice activity detection (VAD): also known as speech endpoint detection, a front-end module for processing long audio that uses acoustic features to divide the audio into silence audio segments and speech audio segments. The classification result can then be used by downstream tasks such as speech recognition (converting the audio into text), sensitive-word detection, and so on.
Audio segment: the result of framing the audio; one frame of audio is one audio segment. The length of an audio segment may be, for example, 10 ms. Audio segments include speech audio segments, which carry semantics, and silence audio segments, which do not.
Acoustic features: the physical characteristics of a speech audio signal, produced by acoustic effects during the generation, transmission and reception of the signal. Since acoustic features are obtained by Fourier-transforming an audio segment, they can be understood as low-level features of the audio segment.
Semantic features: features extracted from an audio segment that capture its textual content, and possibly other information such as pauses, volume and pitch. Because the semantic features of an audio segment contain such rich information, they can be understood as high-level features of the audio segment.
Before describing the audio data processing method and the man-machine interaction method provided by the following embodiments of the present invention in detail, the usage scenario and meaning of the method provided by the present invention may be described first:
in a man-machine conversation scenario, intelligent robots such as service robots, greeting robots and autonomous vending robots may integrate a dialogue system, and so may smart terminals such as mobile terminals, smart home appliances and smart wearable devices. All of these devices can be regarded as hardware interaction devices. The dialogue system can also be deployed in the cloud to provide services such as online consultation and customer-service follow-up visits. In general, any dialogue system, whether deployed on a hardware interaction device or in the cloud, that supports a voice conversation with a user can receive the audio data generated by the user and respond to it.
In the man-machine conversation scenario, in order to reduce the time delay of the man-machine conversation, the method mentioned in the following embodiments of the present invention may be used to ensure the smoothness of the man-machine conversation. In addition, in a scene where long voice is required to be recognized, such as conference recording, real-time caption, voice note, etc., the methods mentioned in the following embodiments of the present invention may also be used in order to ensure the accuracy of voice recognition.
Some embodiments of the invention will now be described in detail with reference to the accompanying drawings, based on the foregoing description. In the case where there is no conflict between the embodiments, the following embodiments and features in the embodiments may be combined with each other. In addition, the sequence of steps in the method embodiments described below is only an example and is not strictly limited.
Fig. 1 is a flowchart of an audio data processing method according to an embodiment of the present invention. The audio data processing method provided by the embodiment of the invention can be executed by processing equipment providing a voice dialogue function. The processing device may be either a physical device or a virtual device, and may be part of a dialog system. As shown in fig. 1, the method may include the steps of:
s101, acquiring a first audio fragment.
The user may generate audio data. Optionally, the audio data may be generated in real time during the man-machine interaction between the user and the processing device, or may be obtained by recording the conference content and the teaching content in real time by the user. Alternatively, the audio data may be pre-collected, such as pre-recorded meeting content, tutorial content, and the like.
The processing device can collect, in real time, the audio data continuously generated by the user and frame the collected audio data to obtain audio segments. The audio segment most recently acquired at the current time is the first audio segment. Optionally, the length of the first audio segment may be a preset length, such as 10 ms.
For example, in a man-machine interaction scenario such as a user controlling a cleaning robot, the user may produce audio data such as: "Start cleaning the bedroom floor." In a meeting-recording scenario, an intelligent terminal with an audio collection function may, during the user's meeting, collect audio data such as: "The boss is planning to organize a team-building trip to Beijing for the staff." Such audio data can be divided into a plurality of audio segments of the preset length.
S102, determining whether the first audio piece and the second audio piece form complete semantics according to semantic features of the first audio piece, wherein the second audio piece is collected before the first audio piece.
And S103, if the first audio fragment and the second audio fragment form complete semantics, determining the first audio fragment and the second audio fragment as audio to be processed.
S104, responding to the audio to be processed.
The processing device may then perform feature extraction on the first audio segment to obtain its semantic features, represented as a vector. As noted above, semantic features contain richer semantic information than the acoustic features of an audio segment. Optionally, the processing device may compute the similarity between the semantic features of the first audio segment and the feature vectors in a preset feature-vector library, so as to determine whether the first audio segment and the second audio segments acquired before it form complete semantics, where the number of second audio segments is at least one. The preset feature-vector library stores the vectors of words that commonly appear at the end of a sentence.
If the similarity is greater than or equal to the preset similarity, it can be determined that the first audio segment and the second audio segment form complete semantics. The processing device may also determine the first audio piece and the second audio piece as audio to be processed and further respond to the audio to be processed.
If the similarity is smaller than the preset similarity, it can be determined that the first audio segment and the second audio segment do not form complete semantics. In that case, the processing device continues to collect the audio segment the user produces at the next moment and determines, according to the semantic features of that newly collected segment, whether it forms complete semantics together with the first audio segment and the second audio segment.
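As a concrete illustration of the similarity comparison described above, the sketch below uses cosine similarity and an example threshold of 0.8; both choices are assumptions for illustration rather than values taken from the embodiment.

    import numpy as np

    def forms_complete_semantics(semantic_feature, vector_library, threshold=0.8):
        # vector_library: vectors of words that commonly end a sentence
        for ref in vector_library:
            cos = float(np.dot(semantic_feature, ref) /
                        (np.linalg.norm(semantic_feature) * np.linalg.norm(ref) + 1e-9))
            if cos >= threshold:
                return True      # similarity meets the preset value: complete semantics
        return False             # otherwise keep collecting audio segments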
Alternatively, as with the man-machine conversation scenario and conference recording scenario mentioned above, the audio to be processed may include audio generated by the user and the processing device during the man-machine conversation or audio requiring speech transcription. The corresponding audio to be processed in different scenes can also be responded in different manners. For example, if the audio to be processed is audio generated in a man-machine interaction scene, the processing device may perform semantic recognition on the audio to be processed and play response audio corresponding to the audio to be processed. If the audio to be processed is generated in the conference recording scene, the processing equipment can perform voice recognition on the audio to be processed and display text content corresponding to the audio to be processed.
It should be noted that, after determining that the first audio segment and the second audio segment have complete semantics, the processing device may determine the current time of capturing the first audio segment as a slicing point. The processing device can segment the audio data acquired in real time according to the segmentation point to obtain the audio to be processed with complete semantics. And the processing device may continue to collect audio clips generated by the user after the first audio clip while performing the audio data slicing.
In this embodiment, the processing device obtains the first audio segment, and determines, according to semantic features of the first audio segment, whether the first audio segment and the second audio segment form complete semantics, where the second audio segment is collected before the first audio segment. And if the first audio fragment and the second audio fragment form complete semantics, determining the first audio fragment and the second audio fragment as audio to be processed. Finally, the audio to be processed is responded.
In this method, because the semantic features of the first audio segment can be obtained as soon as the first audio segment is acquired, whether the user has produced semantically complete audio can be determined immediately from those features; the time required to determine semantic completeness is shortened, and the processing device can respond to the semantically complete audio right away, which reduces the time delay of the man-machine conversation. Moreover, compared with acoustic features, semantic features carry much richer semantic information, so they allow more accurate identification of whether the user has produced semantically complete audio; responding on the basis of that reliable identification improves the fluency of the man-machine conversation.
In addition, for the method provided in the embodiment shown in fig. 1, when the method is specifically applied in a man-machine conversation scenario, the technical effects that can be achieved can be further understood in combination with the following:
a dialogue system used in a man-machine conversation scenario typically collects the audio segments generated by the user in real time and uses VAD to detect whether silence audio segments appear. If a preset number of silence audio segments appear in succession, i.e. silence has lasted for a certain period, the system decides that the user produced a piece of semantically complete audio before the silence segments were collected, i.e. the user has finished saying everything to be expressed. The dialogue system then performs subsequent processing, such as speech recognition, on the semantically complete audio. With this approach, the time delay of the man-machine conversation is the sum of the silence duration and the processing time the dialogue system needs for the subsequent processing, which greatly increases the delay.
Compared with the above approach of judging whether the audio semantics are complete from the silence duration or from the acoustic features of the audio segments, the present embodiment judges whether the user has produced a piece of semantically complete audio from the semantic features of the audio segments. The processing device can determine, immediately after acquiring an audio segment, whether the semantics are complete and respond to the semantically complete audio at once, without waiting for a period of time, which reduces the time delay of the man-machine conversation. On the other hand, because semantic features carry much richer semantic information than acoustic features, they identify more accurately whether the user has produced semantically complete audio. Based on this reliable recognition result, the dialogue system responds to the semantically complete audio, which improves the fluency of the man-machine conversation.
In addition, for the method provided by the embodiment shown in fig. 1, when the method is specifically applied to a conference recording scene, the embodiments provided by the invention can accurately divide sentences with complete semantics contained in long voices and accurately recognize the conference recording. When the method is specifically applied to a real-time caption scene, the embodiments provided by the invention can accurately divide sentences with complete semantics contained in long voices with low time delay and display voice recognition results of the sentences with complete semantics, namely captions.
The embodiment shown in fig. 1 has disclosed a way of determining whether a first audio piece and a second audio piece constitute complete semantics based on semantic features. In order to further improve the accuracy of determining whether the semantics are complete, optionally, the processing device may further add a step of determining the mute duration before determining the first audio piece and the second audio piece as the audio to be processed. Wherein the mute duration is equal to the product of the length of one mute audio segment and the number of mute audio segments.
In particular, after determining that the first audio piece and the second audio piece have complete semantics according to the semantic features, the processing device does not directly determine them as audio to be processed and respond to them, but continues to collect a preset number of third audio pieces generated after the first audio piece. The processing device may further perform voice activity detection on the acquired third audio segment. If the detection result shows that the preset number of third audio segments are all mute audio segments, and the mute duration after the first audio segment meets the first preset duration, that is, the user does not generate a voice audio segment within a period of time after generating the first audio segment, at this time, the user can be considered to have already said the content to be expressed, and the processing device can determine the first audio segment and the second audio segment as the audio to be processed. Since it is already possible to determine more accurately and reliably that the first audio piece and the second audio piece are able to constitute complete semantics from the semantic features, the first preset time period may be set to a smaller value, such as 300ms.
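The silence-confirmation step can be pictured as the following sketch; is_silent() stands in for the voice activity detection mentioned above, and the 10 ms frame length and 300 ms first preset duration are the example values from the text.

    def confirmed_by_silence(third_segments, is_silent, frame_ms=10, required_ms=300):
        needed = required_ms // frame_ms          # e.g. 30 silent frames of 10 ms
        silent_count = 0
        for segment in third_segments:            # segments acquired after the first audio segment
            if not is_silent(segment):
                return False                      # speech resumed: keep collecting
            silent_count += 1
            if silent_count >= needed:
                return True                       # silence long enough: cut here
        return False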
In this embodiment, after determining from the semantic features that the first audio segment and the second audio segment form complete semantics, the processing device can additionally use the acoustic features of the audio segments to detect whether the third audio segments produced after the first audio segment are silence segments and how long the silence lasts, and judge from the silence duration whether the user has finished expressing the complete semantics. Using features of multiple dimensions together in this way determines more accurately whether the first audio segment and the second audio segment form complete semantics, which in turn improves the accuracy of the subsequent speech recognition of the audio to be processed.
In the embodiment shown in fig. 1 it has been mentioned that the determination of whether the first audio piece and the second audio piece constitute complete semantics is performed by means of comparing feature vectors. In addition, it is also possible to determine whether the first audio piece and the second audio piece constitute complete semantics by means of classification. Fig. 2 is a flowchart of another audio data processing method according to an embodiment of the present invention, as shown in fig. 2, the method may include the following steps:
s201, a first audio fragment is acquired.
The specific implementation process of the above step S201 may refer to the specific description of the related steps in the embodiment shown in fig. 1, which is not repeated herein.
S202, detecting the voice activity of the first audio fragment to obtain a first type to which the first audio fragment belongs.
Based on the first audio segment obtained in step S201, the processing device may perform voice activity detection on the first audio segment to determine the first type to which it belongs. The first type is obtained by classifying the first audio segment using acoustic features and may specifically be: silence audio segment or speech audio segment. Optionally, voice activity detection may judge whether a segment is a silence segment or a speech segment from the signal strength of the segment, or by means of a neural network model.
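A minimal energy-based check is one possible way to realise the signal-strength judgment; the -40 dBFS threshold is an assumption, and a neural-network VAD could equally be used.

    import numpy as np

    def is_silent(segment, threshold_db=-40.0):
        # segment: waveform samples as floats in [-1, 1]
        rms = np.sqrt(np.mean(np.square(segment)) + 1e-12)
        return 20.0 * np.log10(rms) < threshold_db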
S203, semantic breakpoint classification is carried out on the first audio fragment according to the semantic features of the first audio fragment so as to obtain a second type to which the first audio fragment belongs.
The processing device may then perform feature extraction on the first audio piece to obtain semantic features of the first audio piece. The processing device may then further classify the first audio segment according to semantic features to obtain a second type to which the first audio segment belongs. The second type is obtained by classifying the first audio segment by using semantic features, and specifically may include: semantic breakpoints or non-semantic breakpoints. The output of the first classification model is actually of a single type, as compared to the second classification model in the subsequent embodiments.
The semantic breakpoint indicates that the moment when the first audio segment is acquired is a segmentation point, that is, the processing device may determine that the audio segment acquired before the segmentation point is audio to be processed with complete semantics. The non-semantic break point indicates that the instant at which the first audio piece was acquired is not a cut-off point, i.e. the processing device may determine that the audio piece acquired before the cut-off point does not result in audio to be processed with complete semantics.
For the above-mentioned determination of the second type, alternatively, the processing device may perform semantic breakpoint classification on the first audio segment according to the semantic features of the first audio segment and a classification algorithm, so as to obtain the second type to which the first audio segment belongs. Wherein the classification algorithm may comprise: any one of a naive Bayes algorithm, a K-nearest neighbor algorithm, a support vector machine algorithm and the like.
Alternatively, the processing device may input semantic features of the first audio piece into the first classification model to output, by the first classification model, a second type to which the first audio piece belongs. Alternatively, the first classification model may include any model with a classification function, such as any one of a convolutional neural network (Convolutional Neural Networks, abbreviated as CNN) model, a Long Short-Term Memory (LSTM) model, and a decision tree model.
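As one hypothetical instantiation of such a first classification model, the sketch below runs an LSTM over the semantic feature sequence and applies a two-way head (semantic breakpoint / non-semantic breakpoint); the feature and hidden sizes are illustrative assumptions.

    import torch
    import torch.nn as nn

    class BreakpointClassifier(nn.Module):
        def __init__(self, feat_dim=256, hidden=128):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
            self.head = nn.Linear(hidden, 2)        # 0: non-semantic breakpoint, 1: semantic breakpoint

        def forward(self, semantic_features):       # (batch, time, feat_dim)
            out, _ = self.lstm(semantic_features)
            return self.head(out[:, -1])            # logits for the newest audio segment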
S204, determining whether the first audio piece and the second audio piece form complete semantics according to the first type and the second type, wherein the second audio piece is collected before the first audio piece.
S205, if the first audio piece and the second audio piece form complete semantics, determining the first audio piece and the second audio piece as audio to be processed.
The processing device may then comprehensively analyze whether the first audio segment is a silent audio segment and whether the first audio segment has a semantic breakpoint based on the first type and the second type to determine whether the first audio segment and the second audio segment constitute complete semantics. Wherein the second audio piece is acquired before the first audio piece.
If the first type reflects that the first audio segment is a mute audio segment and the second type reflects that the first audio segment has a semantic breakpoint, indicating that the user has a high probability of not generating a voice audio segment within a period of time after the first audio segment is generated, at this time, the user can be considered to have already said what is desired to be expressed, and it is determined that the first audio segment and the second audio segment form a complete semantic. Alternatively, in this case, the processing device may directly determine the first audio piece and the second audio piece as audio to be processed.
The processing device may further add a step of determining the mute duration before determining the first audio piece and the second audio piece as audio to be processed as mentioned in the above embodiments. If the mute duration after the first audio segment is judged to meet the second preset duration, the processing device may determine the first audio segment and the second audio segment as audio to be processed. Since it is already possible to determine more accurately and reliably that the first audio piece and the second audio piece are able to constitute complete semantics from the semantic features, the second preset time period may also be set to a smaller value, such as 300ms.
In another case, if the first type reflects that the first audio segment is a mute audio segment and the second type reflects that the first audio segment has a non-semantic breakpoint, indicating that the user has a high probability of generating a speech audio segment within a period of time after the first audio segment is generated, and at this time, the user may be considered to have not finished speaking the content desired to be expressed, the processing device may continue to collect a fourth audio segment generated after the first audio segment and perform voice activity detection on the fourth audio segment.
If the detection result shows that the preset number of fourth audio segments are all mute audio segments, which shows that the mute duration after the first audio segment meets the third preset duration, that is, the user does not generate a voice audio segment within a period of time after generating the first audio segment, at this time, the user can be considered to have already said the content to be expressed, and the processing device can determine the first audio segment and the second audio segment as the audio to be processed.
Optionally, a non-semantic breakpoint may correspond to a position with no punctuation, or to a non-ending punctuation mark such as a comma or a pause mark. A user may pause briefly while speaking continuously, and such a pause may occur inside a single sense group of a sentence or between different sense groups.
When the first audio segment corresponds to no punctuation, the user is likely pausing inside a single sense group. When it corresponds to a non-ending punctuation mark, the user is likely pausing between different sense groups. The probability that the user will continue to produce speech audio segments is higher after a position with no punctuation than after a non-ending punctuation mark, so the third preset duration may be set to a larger value, such as 700 ms, when the first audio segment corresponds to no punctuation, and to a smaller value, such as 400 ms, when it corresponds to a non-ending punctuation mark.
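The different preset silence durations can be summarised as a small lookup using the example values from the text (300 ms, 400 ms and 700 ms); the breakpoint-type names are assumed labels.

    def required_silence_ms(breakpoint_type):
        if breakpoint_type == "semantic_breakpoint":
            return 300    # semantics already complete: cut after a short silence
        if breakpoint_type == "non_ending_punctuation":
            return 400    # pause between different sense groups
        return 700        # no punctuation: pause inside a single sense group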
S206, responding to the audio to be processed.
The specific implementation process of step 206 may refer to the specific description of the relevant steps in the embodiment shown in fig. 1, which is not repeated here.
In this embodiment, the processing device primarily determines whether the first audio segment is a mute audio segment according to a first type to which the first audio segment belongs, and further determines whether the first audio segment has a semantic breakpoint according to a second type to which the first audio segment belongs. The processing device can accurately determine whether the first audio fragment and the second audio fragment form complete semantics through comprehensively analyzing the results of the two classifications.
The embodiment shown in fig. 2 has mentioned a way to determine whether the first audio segment and the second audio segment form the complete semantic meaning by classification, and fig. 3 is a flowchart of another audio data processing method according to an embodiment of the present invention, as shown in fig. 3, where the method may include the following steps:
s301, acquiring a first audio fragment.
The specific implementation process of the above step S301 may refer to the specific description of the related steps in the embodiment shown in fig. 1, which is not repeated herein.
S302, inputting the semantic features of the first audio fragment into a second classification model to output the composite type to which the first audio fragment belongs by the second classification model.
Optionally, the processing device may perform feature extraction on the first audio segment to obtain its semantic features, and then input the semantic features into a second classification model, which outputs the composite type to which the first audio segment belongs. The second classification model may be any model with a classification function, such as a CNN model, an LSTM model or a decision-tree model, and may have the same model structure as the first classification model or a different one. The composite type is one of the following four: (1) the first audio segment is a silence audio segment and has a semantic breakpoint; (2) the first audio segment is a silence audio segment and has a non-semantic breakpoint; (3) the first audio segment is a speech audio segment and has a semantic breakpoint; (4) the first audio segment is a speech audio segment and has a non-semantic breakpoint.
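A sketch of what such a second classification model could look like, with a single four-way head covering the composite types; the class ordering and dimensions are assumptions.

    import torch
    import torch.nn as nn

    COMPOSITE_CLASSES = ("silence+breakpoint", "silence+non_breakpoint",
                         "speech+breakpoint", "speech+non_breakpoint")

    class CompositeClassifier(nn.Module):
        def __init__(self, feat_dim=256, hidden=128):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
            self.head = nn.Linear(hidden, len(COMPOSITE_CLASSES))

        def forward(self, semantic_features):       # (batch, time, feat_dim)
            out, _ = self.lstm(semantic_features)
            return self.head(out[:, -1])            # four-way logits for the newest segment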
S303, if the composite type reflects that the first audio segment is a mute audio segment and the first audio segment has a semantic breakpoint, determining that the first audio segment and the second audio segment form complete semantics, wherein the second audio segment is collected before the first audio segment.
S304, if the first audio fragment and the second audio fragment form complete semantics, determining the first audio fragment and the second audio fragment as audio to be processed.
Optionally, in one case, if the composite type reflects that the first audio segment is a mute audio segment and the first audio segment has a semantic breakpoint, which indicates that the user does not generate a voice audio segment within a period of time after generating the first audio segment, the user may be considered to have already said content to be expressed, and it is determined that the first audio segment and the second audio segment form a complete semantic.
Alternatively, in this case, the processing device may directly determine the first audio piece and the second audio piece as audio to be processed. The processing device may further add a step of determining the mute duration before determining the first audio piece and the second audio piece as audio to be processed as mentioned in the above embodiments. The specific process may be referred to the description of the related embodiments, and will not be repeated here.
In another case, if the composite type reflects three cases except that the first audio segment is a mute audio segment and the first audio segment has a semantic breakpoint, it indicates that the user has a high probability of generating a voice audio segment within a period of time after generating the first audio segment, and at this time, the user may be considered to have no speaking about the content to be expressed, the processing device may continue to collect the fourth audio segment after the first audio segment, and determine whether to determine, according to a detection result of performing voice activity detection on the fourth audio segment, the first audio segment and the second audio segment as audio to be processed having complete semantics.
It should be noted that the second classification model used in this embodiment both classifies an audio segment as a speech audio segment or a silence audio segment and classifies whether the segment is a semantic breakpoint; that is, the second classification model can be regarded as a classification model that integrates the VAD detection function with semantic breakpoint classification. In contrast, the first classification model only performs semantic breakpoint classification.
S305, responding to the audio to be processed.
The specific implementation process of the step S305 may refer to the specific description of the related steps in the embodiment shown in fig. 2, which is not repeated herein.
In this embodiment, the processing device can perform semantic breakpoint classification on the first audio segment with the second classification model to obtain the composite type to which the first audio segment belongs. By combining the result reflected by the composite type with the manually set endpoint-ending condition, that is, by combining multiple semantic judgment methods, whether the first audio segment and the second audio segment form complete semantics can be judged more accurately. For details of this embodiment not described here and the technical effects that can be achieved, reference may be made to the descriptions in the above embodiments, which are not repeated here.
It should be noted that the first classification model mentioned in the embodiment shown in fig. 2 and the second classification model mentioned in the embodiment shown in fig. 3 may be used alone or in combination. Optionally, when the first classification model and the second classification model are used in combination, the first audio segment and the second audio segment are determined to be audio to be processed when the first classification model and/or the second classification model determines that the first audio segment is a mute audio segment and has a semantic breakpoint.
As can be seen from the above-described process of the embodiments shown in fig. 1 to 3, the processing device may extract the semantic features of the first audio piece. Specifically, the processing device may extract the acoustic features of the first audio segment by fourier transform, windowing, and other processing methods, that is, extract the low-level features of the first audio segment. The processing device may then extract semantic features of the first audio piece from the acoustic features, which may be considered as high-level features of the first audio piece.
Optionally, the processing device may use a feature extraction model for semantic feature extraction. The feature extraction model may be, for example, one of a Bidirectional Encoder Representations from Transformers (BERT) model, a CNN model, an LSTM model, and the like.
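For illustration, a small transformer-encoder stack is sketched below as one possible feature extraction model that maps low-level acoustic features to high-level semantic features; the architecture and sizes are assumptions rather than a design prescribed here.

    import torch
    import torch.nn as nn

    class SemanticFeatureExtractor(nn.Module):
        def __init__(self, acoustic_dim=80, model_dim=256, num_layers=4):
            super().__init__()
            self.proj = nn.Linear(acoustic_dim, model_dim)
            layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=4,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

        def forward(self, acoustic_features):                   # (batch, frames, acoustic_dim)
            return self.encoder(self.proj(acoustic_features))   # (batch, frames, model_dim)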
As can be seen from the above-described use of the processing device according to the embodiment shown in fig. 1 to 3, the processing device uses at least one of the first classification model, the second classification model, and the feature extraction model when determining whether the first audio piece and the second audio piece constitute complete semantics according to the semantic features. It should be noted that, the models may be deployed in the interaction device or may be deployed to the cloud.
The training process of each model is described below. The embodiments described below may be executed by a training device. Optionally, the training device may be the processing device mentioned in the above embodiments, or another device, which may be deployed in the cloud.
In order to ensure that the first classification model can output more accurate classification results, the training device may train the first classification model in a supervised training manner. Fig. 4 is a flowchart of a model training method according to an embodiment of the present invention. The present embodiment may be performed by the training device described above. As shown in fig. 4, the method may include the steps of:
S401, semantic features of the first training audio piece and a reference type of the first training audio piece are obtained.
Alternatively, the semantic features of the first training audio piece and the reference type of the first training audio piece may be obtained by preprocessing. That is, the training device may obtain a first training audio segment from the pre-collected training audio segment set, and perform feature extraction on the obtained first training audio segment to obtain semantic features corresponding to the first training audio segment. For the extraction process of the semantic features, reference may be made to the description of the use process of the processing device in the above embodiments. The reference type of the first training audio segment may be obtained by manual annotation, and the reference type may include a semantic breakpoint or a non-semantic breakpoint.
S402, training a first classification model by taking semantic features of the first training audio fragment as training data and taking a reference type as supervision information.
S403, performing loss calculation on the prediction type of the first training audio piece and the reference type of the first training audio piece output by the first classification model.
S404, optimizing the first classification model according to the loss calculation result.
Based on the semantic features and the reference types of the first training audio segment acquired in step S401, the training device may train the first classification model by using the semantic features of the first training audio segment as training data and the reference types corresponding to the first training audio segment as supervision information.
Then, the training device may perform loss calculation on the prediction type of the first training audio segment output by the first classification model and the reference type corresponding to the first training audio segment, and optimize the first classification model according to the obtained loss calculation result. Alternatively, the loss function used by the loss calculation process may include one of a cross entropy loss function, a logarithmic loss function, and a square loss function.
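A minimal supervised training step corresponding to S402-S404, assuming a PyTorch classifier (for example the BreakpointClassifier sketched earlier), integer class labels as the reference type, and cross-entropy as the loss function:

    import torch.nn as nn

    def train_step(model, optimizer, semantic_features, reference_type):
        criterion = nn.CrossEntropyLoss()
        logits = model(semantic_features)            # prediction type of the training audio segment
        loss = criterion(logits, reference_type)     # loss against the supervision information
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                             # optimise the first classification model
        return loss.item()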
In this embodiment, the first classification model is trained by using a supervised training manner, that is, by using the reference type corresponding to the first training audio segment, so that the training effect of the first classification model is better, and further, the prediction type of the first training audio segment output by the first classification model is more accurate during the model use stage.
In order to ensure that the second classification model can output more accurate classification results, the training device may also train the second classification model in a supervised training manner, similar to the process of training the first classification model. This embodiment may equally be performed by the training device described above. Fig. 5 is a flowchart of another model training method according to an embodiment of the present invention. As shown in fig. 5, the method may include the steps of:
S501, acquiring semantic features of the training audio fragment and a reference composite type of the training audio fragment.
Alternatively, the semantic features of the training audio segments and the reference composite type of the training audio segments may be obtained by preprocessing. The acquisition process may be described with reference to step S401 in the embodiment of fig. 4, and will not be described herein.
The reference composite types for any training audio segment may include: the training audio segment is a silence audio segment and the training audio segment has semantic breakpoints, the training audio segment is a silence audio segment and the training audio segment has non-semantic breakpoints, the training audio segment is a speech audio segment and the training audio segment has semantic breakpoints and the training audio segment is a speech audio segment and the training audio segment has any one of the non-semantic breakpoints.
S502, training a second classification model by taking semantic features of the training audio fragments as training data and taking the reference composite type as supervision information.
And S503, calculating the loss of the predicted composite type of the training audio fragment output by the second classification model and the reference composite type of the training audio fragment.
S504, optimizing the second classification model according to the loss calculation result.
The training process of the second classification model is similar to that of the first model, and reference is specifically made to the relevant description in the embodiment shown in fig. 4.
In this embodiment, the second classification model is trained by using a supervised training manner, that is, by means of a reference composite type, so that the effect of the second classification model is better, and further, the prediction type of the training audio segment output by the second classification model is more accurate during the model use stage.
Since the accuracy of the classification result depends on the input of the classification model, i.e. the semantic features, the classification accuracy of the classification model can be further improved with the help of an additional speech recognition model together with the feature extraction model mentioned above, and the feature extraction capability of the feature extraction model can in turn be improved with the help of the recognition results of the speech recognition model. The training device may therefore jointly train the classification model, the speech recognition model and the feature extraction model. The classification model here may include the first classification model and/or the second classification model mentioned in the above embodiments.
Fig. 6 is a flowchart of yet another model training method according to an embodiment of the present invention. This embodiment may equally be performed by the training device described above. As shown in fig. 6, the method may include the steps of:
S601, semantic features of the second training audio segment, reference types of the second training audio segment and reference voice recognition results of the second training audio segment are obtained.
S602, training a classification model by taking semantic features of the second training audio fragment as training data and taking a reference type as supervision information, wherein the classification model comprises a first classification model and/or a second classification model.
Optionally, the semantic features of the second training audio segment, the reference type of the second training audio segment, and the reference speech recognition result of the second training audio segment may be obtained by preprocessing. That is, the training device may obtain a second training audio segment from the pre-collected training audio segment set, and perform feature extraction on the obtained second training audio segment, so as to obtain semantic features corresponding to the second training audio segment. For the extraction process of the semantic features, reference may be made to the description of the use process of the processing device in the above embodiments.
The reference type of the second training audio segment can be obtained through manual labeling, and the reference type corresponds to the classification model. Specifically, if the classification model includes the first classification model, the reference type may include: a semantic breakpoint or a non-semantic breakpoint.
If the classification model includes the second classification model, the reference type may be any one of the following: the second training audio segment is a silent audio segment and has a semantic breakpoint; the second training audio segment is a silent audio segment and has a non-semantic breakpoint; the second training audio segment is a speech audio segment and has a semantic breakpoint; or the second training audio segment is a speech audio segment and has a non-semantic breakpoint.
If the classification model comprises a first classification model and a second classification model, the reference type may comprise at least one of all types mentioned above.
The reference speech recognition result of the second training audio segment may be determined manually, and the speech recognition result may include: text information corresponding to the second training audio segment.
Then, the training device may train the classification model by using the semantic features of the second training audio segment as training data and using the reference type corresponding to the second training audio segment as supervision information.
S603, training a voice recognition model by taking semantic features of the second training audio fragment as training data and taking a reference voice recognition result as supervision information.
Based on the semantic features of the second training audio segment and the reference speech recognition result obtained in step S601, the training device may further train the speech recognition model by using the semantic features of the second training audio segment as training data of the speech recognition model and using the reference speech recognition result as supervision information.
S604, determining a first loss value according to the prediction type of the second training audio segment and the reference type of the second training audio segment output by the classification model.
S605, determining a second loss value according to the predicted voice recognition result and the reference voice recognition result of the second training audio segment output by the voice recognition model.
S606, optimizing the classification model, the voice recognition model and the feature extraction model according to the first loss value, the second loss value and the weight parameters.
Then, the training device may determine a first loss value according to the prediction type of the second training audio piece and the reference type of the second training audio piece output by the classification model; and determining a second loss value according to the predicted voice recognition result of the second training audio fragment output by the voice recognition model and the reference voice recognition result. Wherein the first loss value and the second loss value may be calculated by a loss function.
Finally, the training device may optimize the classification model, the speech recognition model, and the feature extraction model based on the first loss value, the second loss value, and the respective weight parameters. Because the method provided by the invention focuses on using the classification result obtained by the classification model to determine whether the first audio fragment and the second audio fragment form complete semantics, the weight parameter of the first loss value corresponding to the classification model can be set to a larger value. Of course, the weight parameters may also be set in other manners according to actual requirements.
The above joint training process of multiple models can also be understood with reference to fig. 7: the training device obtains the semantic features of the second training audio segment through the feature extraction model, and then inputs the semantic features into the speech recognition model and the classification model respectively, so that the speech recognition model outputs a predicted speech recognition result and the classification model outputs a predicted type. The predicted type may include a predicted semantic breakpoint type and/or a predicted composite type. In this process, the feature extraction model may be further trained by using the second loss value generated by the speech recognition model, so that the feature extraction model can better extract the semantic features of the first audio segment. Accurate semantic features can thus be obtained through the speech recognition model and the feature extraction model, and when these accurate semantic features are input into the classification model, the classification accuracy of the classification model can be further improved.
In this embodiment, on the one hand, the feature extraction capability of the feature extraction model can be improved by using the speech recognition result of the speech recognition model, so that the semantic features extracted by the feature extraction model carry richer and more accurate semantic information. On the other hand, the classification model, the speech recognition model and the feature extraction model are jointly trained on the basis of these accurate semantic features, so that the classification accuracy of the classification model can be further improved.
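Purely as an illustrative sketch of the weighted joint optimization in steps S604 to S606 (the model structures, vocabulary size, loss functions and weight values below are assumptions, not the implementation required by this embodiment), the combination of the first loss value and the second loss value could look like this:

```python
import torch
import torch.nn as nn

# Hypothetical sub-models; their internals are placeholders for illustration.
feature_extractor = nn.GRU(input_size=80, hidden_size=512, batch_first=True)  # acoustic -> semantic features
asr_head = nn.Linear(512, 5000)      # speech recognition model head (vocabulary size assumed)
cls_head = nn.Linear(512, 4)         # classification model head (composite types)

cls_loss_fn = nn.CrossEntropyLoss()
asr_loss_fn = nn.CrossEntropyLoss()

# Weight parameters: the classification loss is weighted higher, as suggested above.
W_CLS, W_ASR = 1.0, 0.5

params = (list(feature_extractor.parameters())
          + list(asr_head.parameters())
          + list(cls_head.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-4)

def joint_step(acoustic_feats, ref_type, ref_tokens):
    semantic_feats, _ = feature_extractor(acoustic_feats)       # (batch, time, 512)
    pooled = semantic_feats.mean(dim=1)                          # segment-level feature (assumption)
    first_loss = cls_loss_fn(cls_head(pooled), ref_type)         # S604: classification loss
    token_logits = asr_head(semantic_feats)                      # (batch, time, vocab)
    second_loss = asr_loss_fn(token_logits.transpose(1, 2), ref_tokens)  # S605: recognition loss
    total = W_CLS * first_loss + W_ASR * second_loss             # S606: weighted combination
    optimizer.zero_grad()
    total.backward()
    optimizer.step()                                             # updates all three models jointly
    return total.item()
```

Because the gradient of the second loss flows back through the feature extractor, the speech recognition supervision also improves the semantic features fed to the classifier, which is the effect described above.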
Fig. 8 is a flowchart of another audio data processing method according to an embodiment of the present invention. The audio data processing method provided by this embodiment is essentially a training method of a classification model, and the method may be performed by a training device. Optionally, the training device may be the processing device mentioned in the foregoing embodiments, or may be another device other than the processing device, and the other device may be deployed in the cloud. As shown in fig. 8, the method may include the steps of:
S801, semantic features of a training audio fragment and a reference type of the training audio fragment are acquired, wherein the training audio fragment is any audio fragment in training audio.
S802, training the classification model by taking semantic features of the training audio fragments as training data and taking the reference type as supervision information, wherein the classification result output by the classification model is used for judging whether the training audio fragments and audio fragments played before the training audio fragments in the training audio form complete semantics.
S803, performing loss calculation on the prediction type and the reference type of the training audio segment output by the classification model.
S804, optimizing the classification model according to the loss calculation result.
The classification model in the present embodiment may be the first classification model or the second classification model mentioned in the above embodiment. The training process of the classification model may be referred to the embodiment shown in fig. 4 or fig. 5, and will not be described herein.
In addition, the details of the embodiment that are not described in detail and the technical effects that can be achieved can be referred to the related descriptions in the above embodiments, which are not described herein.
The embodiments shown in fig. 1 to 8 describe in detail how the processing device processes audio data. On this basis, a man-machine interaction method can also be provided from the perspective of the user's interaction flow. Fig. 9 is a flowchart of a man-machine interaction method according to an embodiment of the present invention. The method may also be provided to the user as a SaaS service. The execution subject of the method may specifically be a service platform. As shown in fig. 9, the method may include the steps of:
S901, responding to input operation of a user, and collecting a first audio fragment generated by the user.
S902, outputting a response result of the audio to be processed with complete semantics, wherein whether the semantics of the audio to be processed are complete or not is judged according to semantic features of the first audio segment, and the audio to be processed comprises the first audio segment and a second audio segment acquired before the first audio segment.
Optionally, the service platform may collect the first audio segment generated by the user in response to a touch operation or a voice instruction of the user. For example, in a conference recording scene, conference recording may be started when the user touches a control on an operation interface provided by the service platform, and the user may upload audio to be transcribed to the service platform through that interface. The service platform may perform framing processing on the audio data uploaded by the user to obtain audio segments, and the audio segment most recently acquired by the service platform at the current time may be used as the first audio segment.
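A minimal sketch of such framing processing is given below; the sample rate, frame length and hop length are illustrative assumptions, since this embodiment does not fix specific values.

```python
import numpy as np

def frame_audio(waveform: np.ndarray, sample_rate: int = 16000,
                frame_ms: int = 600, hop_ms: int = 600) -> list:
    """Split an uploaded waveform into consecutive audio segments.

    The most recently produced segment plays the role of the first audio segment;
    the earlier segments are the second audio segments collected before it.
    """
    frame_len = sample_rate * frame_ms // 1000
    hop_len = sample_rate * hop_ms // 1000
    return [waveform[start:start + frame_len]
            for start in range(0, max(len(waveform) - frame_len + 1, 1), hop_len)]
```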
In a cleaning robot cleaning scene, a user can send a voice command to the cleaning robot, and the cleaning robot can acquire audio data in real time in the process of generating the audio data by the user so as to obtain an audio clip which is acquired most recently at the current moment, namely a first audio clip.
Then, the service platform may output a response result of the audio to be processed with complete semantics. Optionally, the audio to be processed may include audio generated by the user during a man-machine conversation with the service platform, or audio requiring speech transcription. The audio to be processed in different scenes may also be responded to in different manners.
For example, if the audio to be processed is audio generated in a man-machine interaction scene, the service platform may perform semantic recognition on the audio to be processed and play response audio corresponding to the audio to be processed. If the audio to be processed is generated in a conference recording scene, the service platform may perform speech recognition on the audio to be processed and display text content corresponding to the audio to be processed.
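As a rough illustration of this scene-dependent responding (the scene labels and handler functions are hypothetical names introduced only for the sketch, not part of this embodiment):

```python
from typing import Callable, Dict

# Hypothetical scene handlers. In a real deployment these would call a
# semantic-recognition/dialogue engine or a speech-recognition service.
def respond_dialogue(audio_to_process: bytes) -> str:
    # Semantic recognition and playing of the response audio would happen here.
    return "play response audio"

def respond_transcription(audio_to_process: bytes) -> str:
    # Speech recognition and display of the transcribed text would happen here.
    return "display transcribed text"

RESPONSE_HANDLERS: Dict[str, Callable[[bytes], str]] = {
    "man_machine_interaction": respond_dialogue,
    "conference_recording": respond_transcription,
}

def respond(scene: str, audio_to_process: bytes) -> str:
    return RESPONSE_HANDLERS[scene](audio_to_process)
```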
Whether the semantics of the audio to be processed are complete or not is judged according to the semantic features of the first audio fragment, and the audio to be processed comprises the first audio fragment and a second audio fragment acquired before the first audio fragment. In addition, for the process of obtaining the first audio segment and the process of determining the complete semantics, reference may be made to the related descriptions in the above embodiments, which are not described herein.
In addition, for the details not described in detail in this embodiment and the technical effects that can be achieved, reference may be made to the related descriptions in the above embodiments, which are not repeated herein.
For ease of understanding, the specific implementation procedure of the audio data processing method provided in the above embodiments is exemplarily described in connection with the following scenario.
The specific implementation process of the audio data processing method and the human-computer interaction method provided in the foregoing embodiments in the human-computer interaction scene can be understood with reference to fig. 10. In a human-machine interaction scenario, such as a user-controlled cleaning robot cleaning scenario, a user may generate audio data for the cleaning robot. Meanwhile, the cleaning robot can acquire the audio data in real time in the process of generating the audio data by the user so as to obtain an audio clip generated at the time T1, namely a first audio clip.
Then, the cleaning robot may perform feature extraction on the acquired first audio piece to obtain semantic features of the first audio piece. Then, the cleaning robot can comprehensively analyze whether the first audio piece and the second audio piece form complete semantics according to the semantic features of the first audio piece and the classification result of the classification model. Wherein the second audio piece is acquired before the first audio piece. Alternatively, the classification model may comprise a first classification model and/or a second classification model.
After the cleaning robot determines that the first audio segment and the second audio segment form complete semantics, the cleaning robot can obtain a piece of audio to be processed with complete semantics, such as "start cleaning the bedroom floor". Finally, the cleaning robot can perform semantic recognition on the obtained audio to be processed, and play response audio corresponding to the audio to be processed, for example, "OK, cleaning will start now."
In addition, the cleaning robot can continuously collect the audio fragments generated by the user at the time T2 while carrying out semantic recognition on the audio to be processed.
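The cleaning-robot flow above can be summarized in the following rough sketch; the placeholder functions stand in for the voice activity detection, the feature extraction model and the classification models of the earlier embodiments, and the simple buffering rule is an assumption made only for illustration.

```python
from typing import List

# Placeholder components: real implementations would wrap the VAD, the feature
# extraction model and the first/second classification models of the embodiments.
def vad_is_silent(segment) -> bool: ...
def extract_features(segment): ...
def classify_breakpoint(features) -> str: ...   # "semantic_breakpoint" / "non_semantic_breakpoint"
def recognize_and_reply(audio_to_process: List) -> None: ...

buffered_segments: List = []   # second audio segments collected so far

def on_new_segment(first_segment) -> None:
    """Called each time a newly collected audio segment (the first audio segment) arrives."""
    features = extract_features(first_segment)
    is_silent = vad_is_silent(first_segment)          # first type (voice activity detection)
    breakpoint_type = classify_breakpoint(features)   # second type (semantic breakpoint classification)

    if is_silent and breakpoint_type == "semantic_breakpoint":
        # The first and second audio segments form complete semantics: respond now.
        audio_to_process = buffered_segments + [first_segment]
        buffered_segments.clear()
        recognize_and_reply(audio_to_process)
    else:
        # Keep buffering; a silent segment with a non-semantic breakpoint would
        # additionally wait for subsequent silent segments, as described above.
        buffered_segments.append(first_segment)
```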
In addition, for the extracting process of the semantic features, the determining process of the complete semantics, the training process of the classification model, and the technical effects that can be achieved, reference may be made to the related descriptions in the above embodiments, which are not repeated herein.
In the meeting recording scenario, the user can upload audio to be transcribed to the service platform using an operation interface provided by the service platform as shown in fig. 11. The service platform can perform framing processing on the audio data uploaded by the user to obtain audio clips.
The service platform may perform the following processing for each audio clip in turn: for any audio fragment, the service platform may perform feature extraction on the obtained audio fragment to obtain semantic features of the audio fragment. Then, the service platform can comprehensively analyze whether the audio fragment and the audio fragment before the audio fragment in the audio data form complete semantics according to the semantic features of the audio fragment and the classification result of the classification model. Alternatively, the classification model may comprise a first classification model and/or a second classification model.
After the service platform determines that the plurality of audio fragments have complete semantics, the service platform can perform voice recognition on the audio fragments with complete semantics, and the voice recognition result can be displayed on an operation interface provided by the service platform, namely, the text content of the audio to be processed is displayed on the interface.
In addition, for the extracting process of the semantic features, the determining process of the complete semantics, the training process of the classification model, and the technical effects that can be achieved, reference may be made to the related descriptions in the above embodiments, which are not repeated herein.
An audio data processing device of one or more embodiments of the present invention will be described in detail below. Those skilled in the art will appreciate that these audio data processing devices may be configured using commercially available hardware components through the steps taught by the present solution.
Fig. 12 is a schematic structural diagram of an audio data processing device according to an embodiment of the present invention, where, as shown in fig. 12, the device includes:
the obtaining module 11 is configured to obtain a first audio segment.
The complete semantic determining module 12 is configured to determine whether the first audio segment and a second audio segment form a complete semantic according to semantic features of the first audio segment, where the second audio segment is collected before the first audio segment.
The to-be-processed audio determining module 13 is configured to determine the first audio segment and the second audio segment as to-be-processed audio if the first audio segment and the second audio segment form complete semantics.
A response module 14 for responding to the audio to be processed.
Optionally, the apparatus further comprises: an activity detection module 15, configured to perform voice activity detection on at least one third audio segment collected after the first audio segment.
The audio to be processed determining module 13 is further configured to determine the first audio segment and the second audio segment as audio to be processed if the detection result indicates that the third audio segment is a mute audio segment.
Optionally, the complete semantic determining module 12 is configured to perform voice activity detection on the first audio segment to obtain a first type to which the first audio segment belongs; performing semantic breakpoint classification on the first audio fragment according to the semantic features of the first audio fragment to obtain a second type to which the first audio fragment belongs; and determining whether the first audio fragment and the second audio fragment form complete semantics according to the first type and the second type.
Optionally, the complete semantic determining module 12 is configured to determine that the first audio segment and the second audio segment form a complete semantic if the first type reflects that the first audio segment is a mute audio segment and the second type reflects that the first audio segment has a semantic breakpoint.
Optionally, the complete semantic determining module 12 is configured to perform voice activity detection on at least one fourth audio segment collected after the first audio segment if the first type reflects that the first audio segment is a silent audio segment and the second type reflects that the first audio segment has a non-semantic breakpoint; and if the fourth audio fragment is a mute audio fragment, determining that the first audio fragment and the second audio fragment form complete semantics.
Optionally, the complete semantic determination module 12 is configured to input semantic features of the first audio piece into a first classification model to output a second type of the first audio piece from the first classification model.
Optionally, the apparatus further comprises: a training module 16, configured to obtain semantic features of a first training audio segment and a reference type of the first training audio segment; training the first classification model by taking semantic features of the first training audio fragment as training data and taking the reference type as supervision information; performing loss calculation on the prediction type of the first training audio fragment output by the first classification model and the reference type of the first training audio fragment; optimizing the first classification model according to the loss calculation result.
Optionally, the complete semantic determining module 12 is configured to input semantic features of the first audio segment into a second classification model, so as to output, by the second classification model, a composite type to which the first audio segment belongs; and if the composite type reflects that the first audio fragment is a mute audio fragment and the first audio fragment is a semantic breakpoint, determining that the first audio fragment and the second audio fragment form complete semantics.
Optionally, the apparatus further comprises: an acoustic feature extraction module 17 and a semantic feature extraction module 18.
The acoustic feature extraction module 17 is configured to perform feature extraction on the first audio segment to obtain acoustic features of the first audio segment.
The semantic feature extraction module 18 is configured to input the acoustic feature into a feature extraction model to output semantic features of the first audio segment from the feature extraction model.
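An illustrative sketch of how modules 17 and 18 could cooperate is given below; the log-mel front end, the feature dimensions and the GRU encoder are assumptions standing in for whatever acoustic front end and feature extraction model are actually deployed.

```python
import torch
import torch.nn as nn
import torchaudio

# Acoustic feature extraction (module 17): log-mel filterbank features, assumed 80-dimensional.
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=80)

# Feature extraction model (module 18): a small encoder mapping acoustic features
# to semantic features; the architecture is purely illustrative.
encoder = nn.GRU(input_size=80, hidden_size=512, batch_first=True)

def extract_semantic_features(segment_waveform: torch.Tensor) -> torch.Tensor:
    acoustic = mel(segment_waveform).clamp(min=1e-10).log()   # (n_mels, frames)
    acoustic = acoustic.transpose(0, 1).unsqueeze(0)          # (1, frames, n_mels)
    semantic, _ = encoder(acoustic)                           # (1, frames, 512)
    return semantic
```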
Optionally, the training module 16 is configured to obtain semantic features of a second training audio segment, a reference type of the second training audio segment, and a reference speech recognition result of the second training audio segment; training a classification model by taking semantic features of the second training audio fragment as training data and taking the reference type as supervision information, wherein the classification model comprises the first classification model and/or the second classification model; training a voice recognition model by taking semantic features of the second training audio fragment as training data and taking the reference voice recognition result as supervision information; determining a first loss value according to the prediction type of the second training audio fragment and the reference type of the second training audio fragment output by the classification model; determining a second loss value according to the predicted voice recognition result of the second training audio fragment and the reference voice recognition result output by the voice recognition model; optimizing the classification model, the speech recognition model and the feature extraction model according to the first loss value, the second loss value and the respective weight parameters.
The apparatus shown in fig. 12 may perform the method of the embodiment shown in fig. 1 to 7, and reference is made to the relevant description of the embodiment shown in fig. 1 to 7 for a part of this embodiment that is not described in detail. The implementation process and technical effects of this technical solution are described in the embodiments shown in fig. 1 to 7, and are not described herein.
Fig. 13 is a schematic structural diagram of another audio data processing device according to an embodiment of the present invention, as shown in fig. 13, where the device includes:
the training audio segment obtaining module 21 is configured to obtain semantic features of a training audio segment and a reference type of the training audio segment, where the training audio segment is any audio segment in training audio.
The classification model training module 22 is configured to train a classification model by taking the semantic features of the training audio segment as training data and taking the reference type as supervision information, where the classification result output by the classification model is used to determine whether the training audio segment and audio segments played before the training audio segment in the training audio form complete semantics.
A loss calculation module 23, configured to perform loss calculation on the predicted type of the training audio segment and the reference type output by the classification model.
An optimization module 24 for optimizing the classification model based on the loss calculation result.
The apparatus shown in fig. 13 may perform the method of the embodiment shown in fig. 8, and reference is made to the relevant description of the embodiment shown in fig. 8 for parts of this embodiment not described in detail. The implementation process and the technical effect of this technical solution refer to the description in the embodiment shown in fig. 8, and are not repeated here.
Fig. 14 is a schematic structural diagram of a man-machine interaction device according to an embodiment of the present invention, as shown in fig. 14, where the device includes:
the acquisition module 31 is configured to acquire a first audio clip generated by a user in response to an input operation of the user.
The output module 32 is configured to output a response result of the audio to be processed having complete semantics, where whether the semantics of the audio to be processed are complete is determined according to the semantic features of the first audio segment, and the audio to be processed includes the first audio segment and a second audio segment collected before the first audio segment.
The apparatus shown in fig. 14 may perform the method of the embodiment shown in fig. 9, and reference is made to the relevant description of the embodiment shown in fig. 9 for parts of this embodiment not described in detail. The implementation process and the technical effect of this technical solution are described in the embodiment shown in fig. 9, and are not described herein.
In one possible design, the audio data processing method provided in the above embodiments may be applied to an electronic device, as shown in fig. 15, where the electronic device may include: a first processor 41 and a first memory 42. Wherein the first memory 42 is for storing a program for supporting the electronic device to execute the audio data processing method provided in the embodiment shown in fig. 1 to 7 described above, the first processor 41 is configured for executing the program stored in the first memory 42.
The program comprises one or more computer instructions, wherein the one or more computer instructions, when executed by the first processor 41, are capable of performing the steps of:
acquiring a first audio fragment;
determining whether the first audio fragment and a second audio fragment form complete semantics according to semantic features of the first audio fragment, wherein the second audio fragment is acquired before the first audio fragment;
if the first audio fragment and the second audio fragment form complete semantics, determining the first audio fragment and the second audio fragment as audio to be processed;
responding to the audio to be processed.
Optionally, the first processor 41 is further configured to perform all or part of the steps in the embodiments shown in fig. 1 to 7.
The electronic device may further include a first communication interface 43 in its structure for communicating with other devices or communication systems.
In addition, an embodiment of the present invention provides a computer storage medium storing computer software instructions for the electronic device, which includes a program for executing the audio data processing method shown in fig. 1 to 7.
In one possible design, the audio data processing method provided in the foregoing embodiments may be applied to another electronic device, as shown in fig. 16, where the electronic device may include: a second processor 51 and a second memory 52. Wherein the second memory 52 is for storing a program for supporting the electronic device to execute the audio data processing method provided in the embodiment shown in fig. 8 described above, the second processor 51 is configured for executing the program stored in the second memory 52.
The program comprises one or more computer instructions, wherein the one or more computer instructions, when executed by the second processor 51, are capable of performing the steps of:
acquiring semantic features of a training audio fragment and a reference type of the training audio fragment, wherein the training audio fragment is any audio fragment in training audio;
training a classification model by taking the semantic features of the training audio fragment as training data and taking the reference type as supervision information, wherein the classification result output by the classification model is used for judging whether the training audio fragment and audio fragments played before the training audio fragment in the training audio form complete semantics;
performing loss calculation on the prediction type of the training audio fragment output by the classification model and the reference type;
optimizing the classification model according to the loss calculation result.
Optionally, the second processor 51 is further configured to perform all or part of the steps in the embodiment shown in fig. 8.
The electronic device may further include a second communication interface 53 in its structure for communicating with other devices or communication systems.
In addition, an embodiment of the present invention provides a computer storage medium storing computer software instructions for the electronic device, which includes a program for executing the audio data processing method shown in fig. 8.
In one possible design, the man-machine interaction method provided in the above embodiments may be applied to another electronic device, as shown in fig. 17, where the electronic device may include: a third processor 61 and a third memory 62. The third memory 62 is used for storing a program for supporting the electronic device to execute the man-machine interaction method provided in the embodiment shown in fig. 9, and the third processor 61 is configured to execute the program stored in the third memory 62.
The program comprises one or more computer instructions, wherein the one or more computer instructions, when executed by the third processor 61, are capable of performing the steps of:
responding to input operation of a user, and collecting a first audio fragment generated by the user;
outputting a response result of the audio to be processed with complete semantics, wherein whether the semantics of the audio to be processed are complete or not is judged according to the semantic features of the first audio segment, and the audio to be processed comprises the first audio segment and a second audio segment acquired before the first audio segment.
Optionally, the third processor 61 is further configured to perform all or part of the steps in the embodiment shown in fig. 9.
A third communication interface 63 may also be included in the structure of the electronic device for the electronic device to communicate with other devices or communication systems.
In addition, an embodiment of the present invention provides a computer storage medium, configured to store computer software instructions for the electronic device, where the computer storage medium includes a program for executing the man-machine interaction method shown in fig. 9.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (14)

1. A method of processing audio data, comprising:
acquiring a first audio fragment;
determining whether the first audio fragment and a second audio fragment form complete semantics according to semantic features of the first audio fragment, wherein the second audio fragment is acquired before the first audio fragment;
if the first audio fragment and the second audio fragment form complete semantics, determining the first audio fragment and the second audio fragment as audio to be processed;
responding to the audio to be processed.
2. The method of claim 1, wherein prior to determining the first audio segment and the second audio segment as audio to be processed, the method further comprises:
performing voice activity detection on at least one third audio segment acquired after the first audio segment;
and if the detection result shows that the third audio fragment is a mute audio fragment, determining the first audio fragment and the second audio fragment as audio to be processed.
3. The method of claim 1, wherein determining whether the first audio piece and the second audio piece constitute complete semantics based on the semantic features comprises:
Performing voice activity detection on the first audio fragment to obtain a first type to which the first audio fragment belongs;
performing semantic breakpoint classification on the first audio fragment according to the semantic features of the first audio fragment to obtain a second type to which the first audio fragment belongs;
and determining whether the first audio fragment and the second audio fragment form complete semantics according to the first type and the second type.
4. The method of claim 3, wherein the determining whether the first audio piece and the second audio piece constitute complete semantics based on the first type and the second type comprises:
and if the first type reflects that the first audio fragment is a mute audio fragment and the second type reflects that the first audio fragment has a semantic breakpoint, determining that the first audio fragment and the second audio fragment form complete semantics.
5. The method of claim 3, wherein the determining whether the first audio piece and the second audio piece constitute complete semantics based on the first type and the second type comprises:
if the first type reflects that the first audio segment is a mute audio segment and the second type reflects that the first audio segment has a non-semantic breakpoint, performing voice activity detection on at least one fourth audio segment acquired after the first audio segment;
And if the fourth audio fragment is a mute audio fragment, determining that the first audio fragment and the second audio fragment form complete semantics.
6. The method of claim 3, wherein said classifying the first audio segment for semantic breakpoints based on the semantic features comprises:
the semantic features of the first audio segment are input into a first classification model to output a second type of the first audio segment from the first classification model.
7. The method of claim 6, wherein the method further comprises:
acquiring semantic features of a first training audio fragment and a reference type of the first training audio fragment;
training the first classification model by taking semantic features of the first training audio fragment as training data and taking the reference type as supervision information;
performing loss calculation on the prediction type of the first training audio fragment output by the first classification model and the reference type of the first training audio fragment;
optimizing the first classification model according to the loss calculation result.
8. The method of claim 1, wherein determining whether the first audio piece and the second audio piece constitute complete semantics based on the semantic features comprises:
Inputting semantic features of the first audio fragment into a second classification model to output a composite type to which the first audio fragment belongs by the second classification model;
and if the composite type reflects that the first audio fragment is a mute audio fragment and the first audio fragment has a semantic breakpoint, determining that the first audio fragment and the second audio fragment form complete semantics.
9. The method according to claim 6 or 8, characterized in that the method further comprises:
extracting features of the first audio segment to obtain acoustic features of the first audio segment;
inputting the acoustic features into a feature extraction model to output semantic features of the first audio segment from the feature extraction model.
10. The method according to claim 9, wherein the method further comprises:
acquiring semantic features of a second training audio fragment, a reference type of the second training audio fragment and a reference voice recognition result of the second training audio fragment;
training a classification model by taking semantic features of the second training audio fragment as training data and taking the reference type as supervision information, wherein the classification model comprises the first classification model and/or the second classification model;
Training a voice recognition model by taking semantic features of the second training audio fragment as training data and taking the reference voice recognition result as supervision information;
determining a first loss value according to the prediction type of the second training audio fragment and the reference type of the second training audio fragment output by the classification model;
determining a second loss value according to the predicted voice recognition result of the second training audio fragment and the reference voice recognition result output by the voice recognition model;
optimizing the classification model, the speech recognition model and the feature extraction model according to the first loss value, the second loss value and the respective weight parameters.
11. A method of audio data processing, the method comprising:
acquiring semantic features of a training audio fragment and a reference type of the training audio fragment, wherein the training audio fragment is any audio fragment in training audio;
training a classification model by taking the semantic features of the training audio fragment as training data and taking the reference type as supervision information, wherein the classification result output by the classification model is used for judging whether the training audio fragment and audio fragments played before the training audio fragment in the training audio form complete semantics;
performing loss calculation on the prediction type of the training audio fragment output by the classification model and the reference type;
optimizing the classification model according to the loss calculation result.
12. The man-machine interaction method is characterized by being applied to a service platform and comprising the following steps of:
responding to input operation of a user, and collecting a first audio fragment generated by the user;
outputting a response result of the audio to be processed with complete semantics, wherein whether the semantics of the audio to be processed are complete or not is judged according to the semantic features of the first audio segment, and the audio to be processed comprises the first audio segment and a second audio segment acquired before the first audio segment.
13. An electronic device, comprising: a memory, a processor; wherein the memory has stored thereon executable code which, when executed by the processor, causes the processor to perform the audio data processing method according to any one of claims 1 to 11 or the human-machine interaction method according to claim 12.
14. A non-transitory machine-readable storage medium having stored thereon executable code which, when executed by a processor of an electronic device, causes the processor to perform the audio data processing method of any of claims 1 to 11 or the human-machine interaction method of claim 12.