CN115424617A - Model training, dialogue recognition and voice interaction method, device and storage medium - Google Patents

Info

Publication number
CN115424617A
Authority
CN
China
Prior art keywords
text
sound source
information
sample
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210782504.8A
Other languages
Chinese (zh)
Inventor
纪璇
朱磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Maojing Artificial Intelligence Technology Co ltd
Original Assignee
Zhejiang Maojing Artificial Intelligence Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Maojing Artificial Intelligence Technology Co., Ltd.
Priority to CN202210782504.8A
Publication of CN115424617A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 Detection; Localisation; Normalisation
    • G06V 40/166 Detection; Localisation; Normalisation using acquisition arrangements
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/1822 Parsing for meaning understanding
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/22 Interactive procedures; Man-machine interfaces

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The embodiments of the invention provide a model training method, a dialogue recognition method, a voice interaction method, corresponding devices, and a storage medium. The model training method includes: acquiring a text sample, a sound source information sample, and a dialogue attribute label corresponding to voice data; fusing at least the text features of the text sample and the sound source features of the sound source information sample to obtain fused features; and training a dialogue attribute prediction model with the fused features as input and the dialogue attribute labels as the supervision condition. In the solutions of the embodiments of the invention, the text features and the sound source features are fused, so that the dialogue attribute prediction model can learn sound source features of reference value during training and more accurately predict whether the user is speaking to the machine, which improves the user experience of voice interaction.

Description

Model training, dialogue recognition and voice interaction method, device and storage medium
Technical Field
The embodiments of the present invention relate to the field of computer technology, and in particular to methods, devices, and storage media for model training, dialogue recognition, and voice interaction.
Background
With the continuous development of computer technology and the continuous progress of artificial intelligence, smart speakers have become increasingly common in the home, and home-oriented smart speakers are developing rapidly.
Most existing home smart speakers adopt a passive-response interaction mode: the speaker has to repeatedly detect external voice information, and every voice interaction must be preceded by a wake-up word. The far-field home environment is complex and changeable, with various household noises, human voices from multiple directions, and moving sound sources, all of which degrade the experience of conversing with the smart speaker. Such an interaction mode is rigid and tiresome, and it can be difficult to distinguish person-to-person chat from person-to-machine chat.
To make the behavior of the smart speaker more natural and lifelike, better match the flow of human-to-human conversation, improve the user experience, and more accurately judge whether the user is speaking to the machine, a new voice interaction method is needed.
Disclosure of Invention
Embodiments of the present invention provide a model training method, a dialogue recognition method, a voice interaction method, corresponding devices, and a storage medium to at least partially solve the above problems.
According to a first aspect of the embodiments of the present invention, there is provided a model training method, including: acquiring a text sample, a sound source information sample, and a dialogue attribute label corresponding to voice data; fusing at least the text features of the text sample and the sound source features of the sound source information sample to obtain fused features; and training a dialogue attribute prediction model with the fused features as input and the dialogue attribute labels as the supervision condition.
In another implementation of the present application, acquiring the text sample corresponding to the voice data includes: extracting a voice information sample from the voice data; and inputting the voice information sample into a text recognition model to obtain the text sample.
In another implementation of the present application, acquiring the sound source information sample includes: extracting the sound source information sample from the voice data.
In another implementation of the present application, fusing at least the text features of the text sample and the sound source features of the sound source information sample to obtain fused features includes: fusing the speech features of the voice information sample, the text features of the text sample, and the sound source features of the sound source information sample to obtain the fused features.
In another implementation of the present application, the model training method further includes: acquiring a user face information sample corresponding to the voice data; and fusing at least the text features of the text sample and the sound source features of the sound source information sample to obtain fused features includes: fusing the facial features of the user face information sample, the text features of the text sample, and the sound source features of the sound source information sample to obtain the fused features.
According to a second aspect of the embodiments of the present invention, there is provided a dialogue recognition method, including: acquiring text information and sound source information corresponding to voice data; fusing at least the text features of the text information and the sound source features of the sound source information to obtain fused features; and inputting the fused features into a dialogue attribute prediction model to obtain a dialogue attribute prediction result, the dialogue attribute prediction model being trained according to the method of the first aspect.
According to a third aspect of the embodiments of the present invention, there is provided a voice interaction method, including: sending the acquired voice data; receiving a dialogue attribute prediction result, the dialogue attribute prediction result being determined based on the method according to the second aspect; and determining the dialogue attribute of the voice data based on the dialogue attribute prediction result.
In another implementation of the present application, determining the dialogue attribute of the voice data based on the dialogue attribute prediction result includes: comparing the dialogue attribute probability indicated by the dialogue attribute prediction result with a preset probability threshold, and judging the dialogue attribute of the voice data.
In another implementation of the present application, the voice interaction method further includes: acquiring user face information corresponding to the voice data; and determining the preset probability threshold based on the user face information.
In another implementation of the present application, determining the preset probability threshold based on the user face information includes: determining the human-machine interaction direction indicated by the user face information; and determining, based on the matching degree between the human-machine interaction direction and a reference human-machine interaction direction, the probability threshold for judging the dialogue attribute to be a human-machine interaction dialogue, wherein the matching degree is inversely related to the probability threshold.
In another implementation manner of the present application, the voice interaction method further includes: and judging whether to perform voice recognition on the voice data according to the dialogue attribute of the voice data.
According to a fourth aspect of the embodiments of the present invention, there is provided an electronic device, including: a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with one another through the communication bus; and the memory is configured to store at least one executable instruction that causes the processor to perform the operations corresponding to the method of any one of the first to third aspects.
According to a fifth aspect of embodiments of the present invention, there is provided a computer storage medium having stored thereon a computer program which, when executed by a processor, implements a method as in any of the first to third aspects.
In the solutions of the embodiments of the present invention, the text features and the sound source features are fused, so that the dialogue attribute prediction model can learn sound source features of reference value during training and more accurately predict whether the user is speaking to the machine, which improves the user experience of voice interaction.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and a person skilled in the art can obtain other drawings based on these drawings.
FIG. 1 is a flow diagram of the steps of a model training method according to one embodiment of the present invention.
Fig. 2A is a schematic diagram of a dialogue attribute prediction model in the embodiment of FIG. 1.
FIG. 2B is a flowchart illustrating steps of the model training method of FIG. 2A.
Fig. 3 is a flowchart illustrating steps of a dialog recognition method according to another embodiment of the present invention.
Fig. 4 is a flowchart illustrating steps of a voice interaction method according to another embodiment of the present invention.
Fig. 5 is a schematic structural diagram of an electronic device according to another embodiment of the invention.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the embodiments of the present invention, the technical solutions in the embodiments of the present invention will be described in detail below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments obtained by a person skilled in the art based on the embodiments of the present invention shall fall within the scope of the protection of the embodiments of the present invention.
The following further describes specific implementation of the embodiments of the present invention with reference to the drawings.
FIG. 1 is an exemplary flow chart of a model training method according to an embodiment of the present application. The solution of this embodiment may be applied to any suitable electronic device with data processing capability, including but not limited to servers, mobile terminals (such as mobile phones and tablets), and PCs. For example, in the model training phase, an encoder-decoder model may be trained on training samples using a computing device (e.g., in a data center) configured with a CPU (an example of a processing unit) plus GPU (an example of an acceleration unit) architecture. Computing devices such as data centers may be deployed on cloud servers, such as a private cloud or a hybrid cloud. Correspondingly, in the inference phase, the inference operations may also be performed by a computing device configured with a CPU plus GPU architecture.
The model training method of this embodiment includes the following steps:
S110: Acquiring a text sample, a sound source information sample, and a dialogue attribute label corresponding to the voice data.
Specifically, the voice data may be collected by a voice collection module, such as a microphone array. The voice collection module may be arranged in the terminal device, and the positional relationship between the sound source emitting the sound waves and the voice collection module can be judged based on the acoustic characteristics of the collected voice data, so as to determine the sound source information sample. The acoustic characteristics may be prosodic, spectral, and voice-quality characteristics, such as sound wave intensity, sound wave propagation direction, pitch, fundamental frequency, energy, formants, duration, and Mel-frequency cepstral coefficients.
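As an illustration only, acoustic characteristics of the kind listed above can be extracted with standard audio tooling. The following is a minimal sketch assuming the librosa library; the choice of features, the frame settings, and the averaging over time are assumptions made for illustration, not requirements of this application.

```python
import librosa
import numpy as np

def extract_acoustic_features(wav_path: str, sr: int = 16000) -> np.ndarray:
    # Load the collected voice data at the assumed sampling rate.
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # Mel-frequency cepstral coefficients
    energy = librosa.feature.rms(y=y)                     # frame-level energy
    f0 = librosa.yin(y, fmin=50, fmax=500, sr=sr)         # fundamental frequency estimate
    # Average each feature over time so that one utterance maps to one vector.
    return np.concatenate([mfcc.mean(axis=1), energy.mean(axis=1), [f0.mean()]])
```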
Accordingly, in one example, the text sample can be obtained by performing speech recognition processing, such as Automatic Speech Recognition (ASR), on the voice data, e.g., by inputting the voice data into a pre-trained speech recognition model. It should be appreciated that both the text sample and the sound source information sample correspond to the same voice data, i.e., the voice data used to determine the text sample is at least partially identical to the voice data from which the acoustic characteristics of the sound source information sample are determined.
In addition, the dialogue attribute label indicates whether the voice data comes from a human-machine dialogue subject (for example, a user addressing the device). If so, natural language processing is performed on the text of the voice data collected by the terminal device to generate a corresponding instruction, which may include, for example, a reply text for the voice data; if not, the voice data collected by the terminal device can be ignored, in other words, the terminal device does not react to the voice data. The subject that ignores the voice data or generates the instruction may be the terminal device or a background server; for example, the background server may deploy the dialogue attribute prediction model and generate the instruction based on its prediction result.
The dialogue attribute labels may be assigned when the voice data is collected: for example, when the user speaks toward the terminal device, the dialogue attribute of the collected voice data may be labeled as human-machine dialogue, and when the user does not speak toward the terminal device, the dialogue attribute of the collected voice data may be labeled as non-human-machine dialogue.
S120: Fusing at least the text features of the text sample and the sound source features of the sound source information sample to obtain fused features.
It should be understood that other features, such as speech features, may be fused in addition to the text features and sound source features. When fusing the features, a feature vector or feature matrix of the text features and a feature vector or feature matrix of the sound source features can be determined. For the fusion itself, the feature vector or feature matrix of the text features and that of the sound source features may be concatenated (spliced), or they may be combined by weighting (e.g., weighted addition), that is, the elements of the feature vectors or feature matrices are weighted and summed.
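For illustration, the two fusion strategies mentioned above (concatenation and weighted addition) can be sketched as follows; the dimensions and weights are assumed values.

```python
import numpy as np

def fuse_by_concat(text_feat: np.ndarray, source_feat: np.ndarray) -> np.ndarray:
    # Splicing: the fused feature keeps both vectors end to end.
    return np.concatenate([text_feat, source_feat])

def fuse_by_weighting(text_feat: np.ndarray, source_feat: np.ndarray,
                      w_text: float = 0.6, w_source: float = 0.4) -> np.ndarray:
    # Weighted addition: requires both features to share the same dimension,
    # e.g. after projecting them into a common embedding space.
    return w_text * text_feat + w_source * source_feat
```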
S130: and training a conversation attribute prediction model based on the fusion features as input and the conversation attribute labels as supervision conditions.
The dialogue attribute prediction model here refers to a supervised classification model, which outputs the probability that the dialogue attribute of the input voice data belongs to human-machine dialogue. During training, the fused features are input into the model and the dialogue attribute labels are used as the supervision condition.
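A minimal training sketch is given below, assuming the dialogue attribute prediction model is a small feed-forward binary classifier over the fused features; the layer sizes, the binary cross-entropy loss, and the optimizer are assumptions, not details specified by this application.

```python
import torch
import torch.nn as nn

class DialogueAttributeModel(nn.Module):
    def __init__(self, fused_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(fused_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),   # logit for "human-machine dialogue"
        )

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        # Output the probability that the dialogue attribute is human-machine dialogue.
        return torch.sigmoid(self.net(fused)).squeeze(-1)

def train_step(model, optimizer, fused_batch, label_batch):
    # fused_batch: (B, fused_dim) fused features; label_batch: (B,) dialogue attribute labels (0/1).
    criterion = nn.BCELoss()
    optimizer.zero_grad()
    prob = model(fused_batch)
    loss = criterion(prob, label_batch.float())   # the labels act as the supervision condition
    loss.backward()
    optimizer.step()
    return loss.item()
```

For example, `train_step(model, torch.optim.Adam(model.parameters()), fused, labels)` performs one supervised update.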
In the solutions of the embodiments of the present invention, the text features and the sound source features are fused, so that the dialogue attribute prediction model can learn sound source features of reference value during training and more accurately predict whether the user is speaking to the machine, which improves the user experience of voice interaction.
In one possible implementation, acquiring the text sample corresponding to the voice data includes: extracting a voice information sample from the voice data; and inputting the voice information sample into a text recognition model to obtain the text sample.
Because the voice information is input into the text recognition model to obtain the text sample, the acquisition of the text sample is independent of the acquisition of the sound source information sample; that is, the two can be acquired in parallel, which improves the efficiency of voice data processing.
In one possible implementation, acquiring the sound source information sample includes: extracting the sound source information sample from the voice data.
The sound source information sample here may be a kind of sound information. By analyzing the sound information, it can be judged whether an object that can currently be interacted with exists. Specifically, the sound information is analyzed to determine whether it contains recognizable speech; if it does, this indicates that a person is within the interaction range of the voice interaction entity, that is, an interaction object is present. Further, whether the sound information contains a user's voice is analyzed, the specific semantics of the user's voice are parsed, and it is identified whether the user's voice expresses a willingness to interact with the voice interaction entity, which would indicate human-machine interaction. For example, if the semantics of the user's voice indicate that the user is talking to other people, the user currently has no interaction intention. If the semantics indicate that the user is addressing the voice interaction entity, for example asking it what the time is now, then the user currently has a willingness to interact.
By extracting the sound source information sample from the voice data, blank and useless information in the voice data can be eliminated, making voice data processing more efficient.
In one possible implementation, fusing at least the text features of the text sample and the sound source features of the sound source information sample to obtain fused features includes: fusing the speech features of the voice information sample, the text features of the text sample, and the sound source features of the sound source information sample to obtain the fused features.
The sound source features are features corresponding to the spatial position information of the sound source, the speech features are features corresponding to the physical attributes of the speech signal, and the text features are features corresponding to the text of the speech. Fusing the three improves the accuracy with which the fused features can be used to judge voice interaction.
In one possible implementation, the model training method further includes: acquiring a user face information sample corresponding to the voice data; and fusing at least the text features of the text sample and the sound source features of the sound source information sample to obtain fused features includes: fusing the facial features of the user face information sample, the text features of the text sample, and the sound source features of the sound source information sample to obtain the fused features.
Here, the user face information sample may be image information reflecting the face orientation. By analyzing the image information, it can be judged whether an object that can currently be interacted with exists. Specifically, the image information is analyzed for facial features; if facial features exist, a user, i.e., an object that can be interacted with, is present within the visual range of the voice interaction entity. Further, to ensure the correctness of the analysis result, the facial features also need to be verified so that virtual images such as photographs and pictures are not misrecognized as people. The facial features here are the features of the face information obtained by the video sensor of a specific voice interaction entity in physical space, and they can be used to determine whether the user is performing voice interaction toward that voice interaction entity. Fusing the facial features on top of the text features and sound source features further improves the accuracy with which the fused features can be used to judge voice interaction.
Because the output of the prediction model is a probability value that is compared with a probability threshold to decide whether the voice interaction entity judges the voice to be directed at it by the user, fusing the face information can lower that probability threshold: the threshold is negatively correlated with how directly the face is oriented toward the voice interaction entity, so user interaction can be recognized at a lower probability threshold. For example, if no face information sample is introduced, the prediction model may use a probability threshold of 0.7 for user interaction, i.e., if the predicted probability for the voice data is greater than or equal to 0.7, it is determined that the user is performing human-machine interaction. When a face information sample is introduced, the probability threshold for judging user interaction may be adjusted to 0.6, i.e., if the predicted probability for the voice data is greater than or equal to 0.6, it is determined that the user is performing human-machine interaction.
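For illustration only, the threshold adjustment described above can be sketched as follows; the mapping from the face orientation angle to the threshold and the 0.7/0.6 values are assumptions based on the example above.

```python
from typing import Optional

def adjust_threshold(base_threshold: float = 0.7,
                     face_angle_deg: Optional[float] = None,
                     max_angle_deg: float = 90.0,
                     max_reduction: float = 0.1) -> float:
    # No face information sample: keep the base probability threshold.
    if face_angle_deg is None:
        return base_threshold
    # The matching degree is 1.0 when the face points straight at the device (angle 0)
    # and falls to 0.0 at max_angle_deg; the threshold drops as the matching degree rises.
    matching = max(0.0, 1.0 - abs(face_angle_deg) / max_angle_deg)
    return base_threshold - max_reduction * matching

def is_human_machine_dialogue(prob: float, threshold: float) -> bool:
    return prob >= threshold
```

With these assumed values, a face oriented directly at the device lowers the threshold from 0.7 to 0.6, matching the example above.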
In one possible implementation, the model training method further includes: acquiring a user face information sample corresponding to the voice data; and fusing at least the text features of the text sample and the sound source features of the sound source information sample to obtain fused features includes: fusing the facial features of the user face information sample, the text features of the text sample, the sound source features of the sound source information sample, and the text confidence of the text sample to obtain the fused features.
Errors usually occur in the text information extracted from speech data, that is, errors appear in the text sentences when the speech information is converted into text. By introducing a text confidence, the impact of such errors can be kept within a range that still meets the recognition accuracy required for voice interaction. Fusing the text confidence on top of the text features, sound source features, and facial features further improves the accuracy with which the fused features can be used to judge voice interaction.
In a preferred approach, if it cannot be concluded from the user's speech whether the user currently has an intention to interact, e.g., the user is humming a song whose specific semantics the voice interaction entity cannot recognize, or no user speech is currently present, the user's behavior is further analyzed. That is, image information is acquired by a camera (e.g., one provided in the acquisition apparatus), and the image information is analyzed to determine whether it contains a user action. When the image information contains a user action, the interaction intention is judged from the user action. Specifically, the specific meaning of the user action is analyzed to determine whether it expresses a willingness to interact with the voice interaction entity. For example, if the specific meaning of the user action indicates that the user is busy doing something unrelated to the voice interaction entity, e.g., typing, then the user currently has no willingness to interact. The user currently has a willingness to interact if the specific meaning of the user action indicates that the user is acting toward the voice interaction entity, e.g., the user waving a hand at the voice interaction entity to beckon it over. In actual voice interaction, if the user actively sends an interaction request, i.e., makes a sound or action with an interactive meaning toward the voice interaction entity, the user can be directly regarded as having an interaction intention; if the user's behavior clearly indicates that the user is busy with something else, the user can be directly regarded as having no willingness to interact.
Fig. 2A is a schematic diagram of the dialogue attribute prediction model in the embodiment of Fig. 1. The dialogue attribute prediction model in this example comprises an encoder 21, a decoder 22, a language model 23, a classification module 30, a threshold adjustment module 40, and an acquisition device 50. The classification module 30 is implemented as a feature fusion layer and a classification layer, where the classification layer may be a neural network layer such as a CNN classification layer.
FIG. 2B is an exemplary flow chart of a model training method according to an embodiment of the present application. The method of this embodiment includes the following steps:
S210: Acquiring voice data and user face information corresponding to the voice data.
The user face information corresponding to the voice data can be obtained through a camera on the voice interaction entity (for example, the acquisition device 50). The user face information can be used as an input for model training, but this is optional: the voice data alone can also be used as the model training input. The user face information may be processed to derive facial features (e.g., facial feature vectors).
In addition, the voice data may include voice information and sound source information. That is, the voice data here includes, in addition to the dialogue content uttered by the user, environmental noise and the physical-space position information of the sound source. By inputting the voice data into the encoder 21, environmental noise and useless voice information can be removed.
S220: the speech information in the speech data is input to the encoder 21, and the speech characteristics of the speech information are obtained.
For example, a vector (embedding) of speech information is input to the encoder 21, and the encoded speech feature is obtained.
S230: the speech features of the speech information are input into the decoder 22 to obtain the semantic features of the speech information.
For example, the encoded speech features are input to the decoder 22 to obtain semantic features.
S240: and carrying out normalization processing on the semantic features to obtain the text features of the text sample.
For example, a neural network layer composed of a linear operator and a softmax operator from a machine learning framework is used to normalize the semantic features (e.g., semantic feature vectors) to obtain the text features of the text sample.
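As a sketch of this step, the linear-plus-softmax layer can be written as follows; the semantic feature dimension and vocabulary size are assumed values.

```python
import torch.nn as nn

# Project the decoder's semantic features to vocabulary logits, then normalize
# them into token probabilities that serve as the text features of the text sample.
text_head = nn.Sequential(
    nn.Linear(512, 8000),   # 512: assumed semantic feature dimension; 8000: assumed vocabulary size
    nn.Softmax(dim=-1),
)
```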
S250: and performing characteristic extraction on the sound source information in the voice data to obtain the sound source characteristics of the sound source information.
Specifically, the sound source information may be extracted based on a microphone array; for example, the sound source direction may be calculated using a microphone array consisting of a plurality of microphones. The time differences at which the microphones receive the sound waves from the sound source are calculated, and the differences between the microphone-to-source distances are derived from these time differences and the speed of sound. Then, based on the known positions and spacing of the microphones and the distance differences, the azimuth of the sound source relative to the microphone array is calculated.
Further, the sound wave signals (from the same sound source) received by the individual microphones are matched, in other words aligned, based on the time differences with which the microphones receive the sound waves, so that the voiceprint information of the individual signals is aligned. It will be understood that the aligned voiceprint information yields voiceprint features of higher confidence than a single microphone (an example of a microphone at one spatial location) would.
For example, if the plurality of microphones includes a first microphone and a second microphone, and the time difference between the first microphone and the second microphone receiving the sound wave of the sound source is 0.1s, the sound wave curve (including the time dimension and the amplitude dimension) received by the first microphone may be shifted by 0.1s in the time dimension to be aligned with the sound wave curve received by the second microphone.
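For illustration, the two-microphone case above can be sketched as follows: the arrival-time difference is estimated by cross-correlation, converted into an azimuth with the far-field relation between path difference and microphone spacing, and then used to align the two waveforms. The microphone spacing, sample rate, and the use of simple cross-correlation are assumptions.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def estimate_lag_and_azimuth(sig1: np.ndarray, sig2: np.ndarray,
                             mic_distance: float = 0.1, sr: int = 16000):
    corr = np.correlate(sig1, sig2, mode="full")
    lag = int(np.argmax(corr)) - (len(sig2) - 1)        # arrival-time difference in samples
    tdoa = lag / sr                                      # time difference in seconds
    path_diff = tdoa * SPEED_OF_SOUND                    # difference between mic-to-source distances
    cos_theta = np.clip(path_diff / mic_distance, -1.0, 1.0)
    azimuth = float(np.degrees(np.arccos(cos_theta)))    # angle relative to the microphone axis
    return lag, azimuth

def align(sig1: np.ndarray, sig2: np.ndarray, lag: int):
    # Shift the first waveform by the estimated lag so the voiceprints line up in time
    # (circular shift; edge effects are ignored in this sketch).
    return np.roll(sig1, -lag), sig2
```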
S260: and inputting the text features of the text sample into the preset language model 23 to obtain the confidence of the text sample.
Here, the text features may be processed by the preset language model 23 to obtain a confidence thereof, which indicates the reference accuracy of the text features.
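As an illustration of such a confidence, an off-the-shelf causal language model can score the recognized text; using GPT-2 from the transformers library is an assumption here, standing in for the preset language model 23, and the mapping from loss to confidence is a simple choice made for the sketch.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def text_confidence(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss          # mean token cross-entropy under the LM
    return float(torch.exp(-loss))                # geometric-mean token probability in (0, 1]
```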
S270: and fusing the voice characteristics of the voice information, the text characteristics of the text sample and the confidence coefficient of the text characteristics to obtain a fusion result.
Specifically, the above fusion may be performed by the feature fusion layer in the classification module 30. The fusion includes either concatenating the matrices of the speech features, the text features, and the confidence values, or weighting them, in which case the corresponding elements of each matrix are weighted and combined.
In the classification layer, the fusion result can first be input into a self-attention layer; the output of the self-attention layer is then input into a cross-attention layer, processed, and finally passed to a convolutional neural network classification layer. For example, the speech features, the text features, the confidence of the recognized text, and the sound source features of the voice information are concatenated, and the concatenation result is fed to the classification module (e.g., input first to the self-attention layer and then to the CNN classification layer). Because the sound source features are fused into the input of the classification layer, the reliability of the classification module is improved.
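A minimal PyTorch sketch of this classification module is given below (self-attention, then cross-attention, then a CNN classification layer). Treating the sound source features as the key/value of the cross-attention layer, as well as all dimensions and head counts, are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class ClassificationModule(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cnn_head = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(dim, 1), nn.Sigmoid(),   # probability of human-machine dialogue
        )

    def forward(self, fused: torch.Tensor, source_feat: torch.Tensor) -> torch.Tensor:
        # fused: (B, T, dim) fusion result; source_feat: (B, S, dim) sound source features.
        x, _ = self.self_attn(fused, fused, fused)
        x, _ = self.cross_attn(x, source_feat, source_feat)
        x = x.transpose(1, 2)                  # (B, dim, T) for the Conv1d layer
        return self.cnn_head(x).squeeze(-1)
```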
The attention mechanism here stems from the study of human vision. In cognitive science, because of bottlenecks in information processing, humans selectively focus on a portion of all available information while ignoring the rest; this is commonly referred to as the attention mechanism. Different parts of the human retina have different information processing abilities, i.e., different acuity, and only the fovea has the highest acuity. To make reasonable use of limited visual processing resources, a human needs to select a specific portion of the visual region and then focus on it.
S280: Judging the dialogue attribute indicated by the probability output by the classification module, using a preset probability threshold adjusted based on the facial features.
For example, the human-machine interaction probability output by the convolutional neural network classification layer is combined with the prediction probability threshold derived from the user face information to determine whether the voice data constitutes a human-machine interaction dialogue. The voice data is obtained through the microphone array, and the user face information is obtained through the camera on the voice interaction entity.
If the voice interaction entity (an example of the acquisition device) has video capture capability, that is, it can obtain information about the user's face, the prediction probability threshold associated with the user face information can be calculated from the orientation angle of the user's face.
The acquisition device may be a terminal device such as an embedded device or an internet-of-things device, or a non-embedded device such as a desktop computer or a server. An embedded operating system, such as a real-time operating system, may be installed on the embedded device to communicate with the dialogue server through a network communication model. As an internet-of-things device, the acquisition device may be implemented as a smart device, such as a smart appliance, including but not limited to a smart watch, smart speaker, smart air conditioner, or smart doorbell. Such a smart device can carry out intelligent dialogue with the user, such as voice interaction and computer-vision interaction, through its human-computer interaction module, and it can perform initial processing of the user's dialogue instruction and send the result to the dialogue server for further processing, or forward the instruction directly to the dialogue server.
Further, based on the dialogue attributes (e.g., when the dialogue attributes indicate human-machine interaction), the dialogue server invokes a dialogue model (a natural language processing model) to predict the reply text, or determines the reply text based on the text features.
FIG. 3 is an exemplary flow diagram of a dialogue recognition method according to an embodiment of the present application. The dialogue recognition method of this embodiment includes:
S310: Acquiring text information and sound source information corresponding to the voice data.
S320: Fusing at least the text features of the text information and the sound source features of the sound source information to obtain fused features.
S330: Inputting the fused features into a dialogue attribute prediction model to obtain a dialogue attribute prediction result. The dialogue attribute prediction model may be trained according to the method of the embodiment shown in Fig. 1.
The dialogue attribute classifies the voice data; for example, the voice data may be classified into a voice interaction attribute, a user interaction attribute, or an environmental noise attribute.
FIG. 4 is an exemplary flowchart of a voice interaction method according to an embodiment of the present application. The voice interaction method of this embodiment includes the following steps:
S410: Sending the acquired voice data.
S420: Receiving a dialogue attribute prediction result. The dialogue attribute prediction result is determined based on the method according to Fig. 3.
S430: Determining the dialogue attribute of the voice data based on the dialogue attribute prediction result.
In one possible implementation, determining the dialogue attribute of the voice data based on the dialogue attribute prediction result includes: comparing the dialogue attribute probability indicated by the dialogue attribute prediction result with a preset probability threshold, and judging the dialogue attribute of the voice data.
In one possible implementation, the voice interaction method further includes: acquiring user face information corresponding to the voice data; and determining the preset probability threshold based on the user face information.
In one possible implementation, determining the preset probability threshold based on the user face information includes: determining the human-machine interaction direction indicated by the user face information; and determining, based on the matching degree between the human-machine interaction direction and a reference human-machine interaction direction, the probability threshold for judging the dialogue attribute to be a human-machine interaction dialogue, wherein the matching degree is inversely related to the probability threshold.
In other words, combining the human-machine interaction probability predicted from the voice data with a probability threshold derived from the user face information can lower the threshold for determining that the voice data is a human-machine interaction dialogue. For example, when the face-information-based prediction probability threshold is not introduced, the threshold for judging that the voice data is a human-machine interaction dialogue is 0.7; when it is introduced, the threshold is adjusted to 0.6. Correspondingly, when a human-machine interaction dialogue is determined, the text features of the text sample are input into the natural language processing model to obtain the corresponding dialogue reply text; when a non-human-machine-interaction dialogue is determined, the text features of the text sample are discarded.
In one possible implementation manner, the voice interaction method further includes: and judging whether to perform voice recognition on the voice data according to the dialogue attribute of the voice data.
Referring to fig. 5, a schematic structural diagram of an electronic device according to another embodiment of the present invention is shown, and the specific embodiment of the present invention does not limit the specific implementation of the electronic device.
As shown in fig. 5, the electronic device may include: a processor (processor) 502, a communication interface 504, a memory 506 storing a program 510, and a communication bus 508.
The processor, the communication interface, and the memory communicate with each other via a communication bus.
And the communication interface is used for communicating with other electronic equipment or servers.
And the processor is used for executing the program, and particularly can execute the relevant steps in the method embodiment.
In particular, the program may include program code comprising computer operating instructions.
The processor may be a CPU, an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention. The electronic device comprises one or more processors, which may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.
And the memory is used for storing programs. The memory may comprise high-speed RAM memory, and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
The program may specifically be adapted to cause the processor to perform the following operations: acquiring a text sample, a sound source information sample, and a dialogue attribute label corresponding to voice data; fusing at least the text features of the text sample and the sound source features of the sound source information sample to obtain fused features; and training a dialogue attribute prediction model with the fused features as input and the dialogue attribute labels as the supervision condition.
Alternatively, the program may specifically be adapted to cause the processor to perform the following operations: acquiring text information and sound source information corresponding to the voice data; fusing at least the text features of the text information and the sound source features of the sound source information to obtain fused features; and inputting the fused features into a dialogue attribute prediction model to obtain a dialogue attribute prediction result.
Alternatively, the program may specifically be adapted to cause the processor to perform the following operations: sending the acquired voice data; receiving a dialogue attribute prediction result; and determining the dialogue attribute of the voice data based on the dialogue attribute prediction result.
The above embodiments are only used for illustrating the embodiments of the present application, and not for limiting the embodiments of the present application, and those skilled in the relevant art can make various changes and modifications without departing from the spirit and scope of the embodiments of the present application, so that all equivalent technical solutions also belong to the scope of the embodiments of the present application, and the scope of patent protection of the embodiments of the present application should be defined by the claims. The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being divided into various units by function, respectively. Of course, the functionality of the various elements may be implemented in the same one or more pieces of software and/or hardware in the practice of the present application.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media) such as modulated data signals and carrier waves.
The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrases "comprising a," "8230," "8230," or "comprising" does not exclude the presence of additional identical elements in the process, method, article, or apparatus comprising the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including memory storage devices.
All the embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.

Claims (12)

1. A model training method, comprising:
acquiring a text sample, a sound source information sample, and a dialogue attribute label corresponding to voice data;
fusing at least the text features of the text sample and the sound source features of the sound source information sample to obtain fused features; and
training a dialogue attribute prediction model with the fused features as input and the dialogue attribute labels as the supervision condition.
2. The method of claim 1, wherein acquiring the text sample corresponding to the voice data comprises:
extracting a voice information sample from the voice data; and
inputting the voice information sample into a text recognition model to obtain the text sample.
3. The method of claim 2, wherein acquiring the sound source information sample comprises:
extracting the sound source information sample from the voice data.
4. The method according to claim 2, wherein fusing at least the text features of the text sample and the sound source features of the sound source information sample to obtain fused features comprises:
fusing the speech features of the voice information sample, the text features of the text sample, and the sound source features of the sound source information sample to obtain the fused features.
5. The method of claim 1, wherein the method further comprises:
acquiring a user face information sample corresponding to the voice data;
wherein fusing at least the text features of the text sample and the sound source features of the sound source information sample to obtain fused features comprises:
fusing the facial features of the user face information sample, the text features of the text sample, and the sound source features of the sound source information sample to obtain the fused features.
6. A dialogue recognition method, comprising:
acquiring text information and sound source information corresponding to voice data;
fusing at least the text features of the text information and the sound source features of the sound source information to obtain fused features; and
inputting the fused features into a dialogue attribute prediction model to obtain a dialogue attribute prediction result, wherein the dialogue attribute prediction model is trained according to the method of any one of claims 1-5.
7. A voice interaction method, comprising:
sending the acquired voice data;
receiving a dialogue attribute prediction result, the dialogue attribute prediction result being determined based on the method of claim 6; and
determining the dialogue attribute of the voice data based on the dialogue attribute prediction result.
8. The method of claim 7, wherein determining the dialogue attribute of the voice data based on the dialogue attribute prediction result comprises:
comparing the dialogue attribute probability indicated by the dialogue attribute prediction result with a preset probability threshold, and judging the dialogue attribute of the voice data.
9. The method of claim 7, wherein the method further comprises:
acquiring user face information corresponding to the voice data;
determining the preset probability threshold based on the user face information.
10. The method of claim 9, wherein the determining the preset probability threshold based on the user facial information comprises:
determining a human-machine interaction direction indicated by the user face information; and
determining, based on the matching degree between the human-machine interaction direction and a reference human-machine interaction direction, the probability threshold for judging the dialogue attribute to be a human-machine interaction dialogue, wherein the matching degree is inversely related to the probability threshold.
11. An electronic device, comprising: a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with one another through the communication bus; and the memory is configured to store at least one executable instruction that causes the processor to perform the operations corresponding to the method of any one of claims 1-10.
12. A computer storage medium having stored thereon a computer program which, when executed by a processor, carries out the method of any one of claims 1-10.
CN202210782504.8A 2022-07-05 2022-07-05 Model training, dialogue recognition and voice interaction method, device and storage medium Pending CN115424617A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210782504.8A CN115424617A (en) 2022-07-05 2022-07-05 Model training, dialogue recognition and voice interaction method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210782504.8A CN115424617A (en) 2022-07-05 2022-07-05 Model training, dialogue recognition and voice interaction method, device and storage medium

Publications (1)

Publication Number Publication Date
CN115424617A true CN115424617A (en) 2022-12-02

Family

ID=84196569

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210782504.8A Pending CN115424617A (en) 2022-07-05 2022-07-05 Model training, dialogue recognition and voice interaction method, device and storage medium

Country Status (1)

Country Link
CN (1) CN115424617A (en)

Similar Documents

Publication Publication Date Title
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
CN110428808B (en) Voice recognition method and device
US11503155B2 (en) Interactive voice-control method and apparatus, device and medium
US20220165288A1 (en) Audio signal processing method and apparatus, electronic device, and storage medium
US11769492B2 (en) Voice conversation analysis method and apparatus using artificial intelligence
CN110853617B (en) Model training method, language identification method, device and equipment
CN111564164A (en) Multi-mode emotion recognition method and device
CN111309883A (en) Man-machine conversation method based on artificial intelligence, model training method and device
JP7178394B2 (en) Methods, apparatus, apparatus, and media for processing audio signals
CN114913590B (en) Data emotion recognition method, device and equipment and readable storage medium
CN111383138B (en) Restaurant data processing method, device, computer equipment and storage medium
CN113571078A (en) Noise suppression method, device, medium, and electronic apparatus
CN113889091A (en) Voice recognition method and device, computer readable storage medium and electronic equipment
CN112910761B (en) Instant messaging method, device, equipment, storage medium and program product
CN113886644A (en) Digital human video generation method and device, electronic equipment and storage medium
CN115424617A (en) Model training, dialogue recognition and voice interaction method, device and storage medium
US11670294B2 (en) Method of generating wakeup model and electronic device therefor
CN112037772A (en) Multi-mode-based response obligation detection method, system and device
CN111899718A (en) Method, apparatus, device and medium for recognizing synthesized speech
CN111951807A (en) Voice content detection method, apparatus, medium, and system thereof
CN116705013B (en) Voice wake-up word detection method and device, storage medium and electronic equipment
WO2022222056A1 (en) Synthetic speech detection
US20220399016A1 (en) Presence-based application invocation
CN116959421B (en) Method and device for processing audio data, audio data processing equipment and medium
B. Wong et al. Toward Speech Articulation Detection through Smartphone

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 311121 room 801, building 2, No. 2699, yuhangtang Road, Cangqian street, Yuhang District, Hangzhou, Zhejiang Province

Applicant after: Zhejiang Aikesi Elf Artificial Intelligence Technology Co.,Ltd.

Address before: 311121 room 801, building 2, No. 2699, yuhangtang Road, Cangqian street, Yuhang District, Hangzhou, Zhejiang Province

Applicant before: Zhejiang Maojing Artificial Intelligence Technology Co.,Ltd.