WO2020007129A1 - Context acquisition method and device based on voice interaction (基于语音交互的上下文获取方法及设备) - Google Patents

Context acquisition method and device based on voice interaction (基于语音交互的上下文获取方法及设备)

Info

Publication number
WO2020007129A1
WO2020007129A1 (PCT/CN2019/087203; CN2019087203W)
Authority
WO
WIPO (PCT)
Prior art keywords
conversation
face
voice
user
database
Prior art date
Application number
PCT/CN2019/087203
Other languages
English (en)
French (fr)
Inventor
梁阳
刘昆
乔爽爽
林湘粤
韩超
朱名发
郭江亮
李旭
刘俊
李硕
尹世明
Original Assignee
北京百度网讯科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京百度网讯科技有限公司
Priority to KR1020197034483A priority Critical patent/KR20200004826A/ko
Priority to EP19802029.9A priority patent/EP3617946B1/en
Priority to JP2019563817A priority patent/JP6968908B2/ja
Publication of WO2020007129A1 publication Critical patent/WO2020007129A1/zh
Priority to US16/936,967 priority patent/US20210012777A1/en


Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • G06F3/167Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • G06V40/173Classification, e.g. identification face re-identification, e.g. recognising unknown faces across different face tracks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/24Speech recognition using non-acoustical features
    • G10L15/25Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • G10L25/87Detection of discrete points within a voice signal
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00Data switching networks
    • H04L12/02Details
    • H04L12/16Arrangements for providing special services to substations
    • H04L12/18Arrangements for providing special services to substations for broadcast or conference, e.g. multicast
    • H04L12/1813Arrangements for providing special services to substations for broadcast or conference, e.g. multicast for computer conferences, e.g. chat rooms
    • H04L12/1831Tracking arrangements for later retrieval, e.g. recording contents, participants activities or behavior, network status
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M3/00Automatic or semi-automatic exchanges
    • H04M3/42Systems providing special services or facilities to subscribers
    • H04M3/42221Conversation recording systems
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M3/00Automatic or semi-automatic exchanges
    • H04M3/42Systems providing special services or facilities to subscribers
    • H04M3/56Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
    • H04M3/568Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants
    • H04M3/569Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants using the instant speaker's algorithm
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/226Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics
    • G10L2015/228Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics of application context
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/07User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail characterised by the inclusion of specific contents
    • H04L51/10Multimedia information
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M2201/00Electronic components, circuits, software, systems or apparatus used in telephone systems
    • H04M2201/50Telephonic communication in combination with video communication
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M2203/00Aspects of automatic or semi-automatic exchanges
    • H04M2203/30Aspects of automatic or semi-automatic exchanges related to audio recordings in general
    • H04M2203/301Management of recordings
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M2203/00Aspects of automatic or semi-automatic exchanges
    • H04M2203/60Aspects of automatic or semi-automatic exchanges related to security aspects in telephonic communication systems
    • H04M2203/6045Identity confirmation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M2203/00Aspects of automatic or semi-automatic exchanges
    • H04M2203/60Aspects of automatic or semi-automatic exchanges related to security aspects in telephonic communication systems
    • H04M2203/6054Biometric subscriber identification
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M3/00Automatic or semi-automatic exchanges
    • H04M3/42Systems providing special services or facilities to subscribers
    • H04M3/56Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
    • H04M3/567Multimedia conference systems

Definitions

  • Embodiments of the present invention relate to the field of voice interaction technologies, and in particular, to a method and device for acquiring context based on voice interaction.
  • intelligent voice interaction is an interaction mode based on voice input. Users can input their own requests through voice, and the product can respond to the corresponding content according to the intention of the request.
  • Embodiments of the present invention provide a method and a device for acquiring a context based on a voice interaction, so as to overcome a problem of high error rate of acquiring a context of a voice conversation.
  • An embodiment of the present invention provides a method for acquiring context based on voice interaction, including:
  • acquiring a scene image collected by an image acquisition device at the voice start point of the current conversation, and extracting the face feature of each user in the scene image;
  • if it is determined, according to the face feature of each user and a face database, that a second face feature matching a first face feature exists, obtaining a first user identifier corresponding to the second face feature from the face database, where the first face feature is the face feature of one user, and the second face feature is the face feature of a user in a conversation state stored in the face database;
  • if it is determined that a stored conversation corresponding to the first user identifier exists in a voice database, determining the context of the voice interaction according to the current conversation and the stored conversation, and storing the current conversation in the voice database after the voice end point of the current conversation is obtained.
  • In a possible design, if it is determined, according to the face feature of each user and the face database, that no second face feature matching the first face feature exists, the method further includes:
  • analyzing parameters including the face feature of each user, obtaining a target user in a conversation state, and generating a second user identifier for the target user;
  • when the voice end point is detected, storing the current conversation in the voice database in association with the second user identifier, and storing the face feature of the target user in the face database in association with the second user identifier.
  • In a possible design, determining the context of the voice interaction according to the current conversation and the stored conversation includes:
  • obtaining, from the voice database according to the first user identifier, the voice start point and the voice end point of the previous conversation corresponding to the first user identifier;
  • if the time interval between the voice end point of the previous conversation and the voice start point of the current conversation is less than a preset interval, determining the context of the voice interaction according to the current conversation and the stored conversation.
  • If the time interval is less than the preset interval, the previous conversation and the current conversation are likely to be contextual; if the time interval is greater than or equal to the preset interval, the previous conversation was the user's earlier conversation on a topic and cannot be counted as context for the current conversation. Judging this interval therefore allows the context of the current conversation to be determined more accurately.
  • In a possible design, if the time interval between the voice end point of the previous conversation and the voice start point of the current conversation is greater than or equal to the preset interval, the method further includes:
  • deleting the first user identifier and the corresponding stored conversation that are stored in association in the voice database. Because such a conversation was the user's earlier conversation on a topic and cannot be counted as context for the current conversation, deleting it keeps the data in the voice database up to date.
  • In a possible design, the method further includes:
  • deleting, from the face database, third user identifiers that have not been matched within a preset time period and the corresponding face features. In this way, user identifiers and face features stored in association can be deleted in batches, which improves deletion efficiency, keeps the data in the face database up to date, and avoids redundancy in the face database.
  • In a possible design, extracting the face feature of each user in the scene image includes:
  • cropping the scene image to obtain a face picture of each face;
  • inputting the multiple face pictures sequentially into a preset face feature model, and obtaining the face feature of each user sequentially output by the face feature model.
  • In a possible design, before the multiple face pictures are sequentially input into the preset face feature model, the method further includes:
  • obtaining face training samples, where the face training samples include face pictures and labels;
  • obtaining a trained initial face feature model according to the face training samples, where the initial face feature model includes an input layer, a feature layer, a classification layer, and an output layer;
  • deleting the classification layer from the initial face feature model to obtain the preset face feature model.
  • Through the above model training process, an initial face feature model is obtained, and the classification layer in the initial face feature model is deleted to obtain the preset face feature model. Because the classification layer is deleted, when a face picture cropped from a scene image is input into the preset face feature model, the model directly outputs face features instead of classification results.
  • the face feature model is a deep convolutional neural network model
  • the feature layer includes a convolution layer, a pooling layer, and a fully connected layer.
  • an embodiment of the present invention provides a voice interaction-based context acquisition device, including:
  • An extraction module configured to acquire a scene image collected by the image acquisition device at the voice starting point of the conversation, and extract facial features of each user in the scene image;
  • a matching module, configured to: if it is determined, according to the face feature of each user and a face database, that a second face feature matching a first face feature exists, obtain a first user identifier corresponding to the second face feature from the face database, where the first face feature is the face feature of one user, and the second face feature is the face feature of a user in a conversation state stored in the face database;
  • an obtaining module, configured to: if it is determined that a stored conversation corresponding to the first user identifier exists in a voice database, determine the context of the voice interaction according to the current conversation and the stored conversation, and store the current conversation in the voice database after the voice end point of the current conversation is obtained.
  • the matching module is further configured to:
  • if it is determined, according to the face feature of each user and the face database, that no second face feature matching the first face feature exists, analyze parameters including the face feature of each user, obtain a target user in a conversation state, and generate a second user identifier for the target user;
  • when the voice end point is detected, store the current conversation in the voice database in association with the second user identifier, and store the face feature of the target user in the face database in association with the second user identifier.
  • the obtaining module is specifically configured to:
  • obtain, from the voice database according to the first user identifier, the voice start point and the voice end point of the previous conversation corresponding to the first user identifier; and if the time interval between the voice end point of the previous conversation and the voice start point of the current conversation is less than a preset interval, determine the context of the voice interaction according to the current conversation and the stored conversation.
  • the obtaining module is further configured to: if the time interval between the voice end point of the previous conversation and the voice start point of the current conversation is greater than or equal to the preset interval, delete the first user identifier and the corresponding stored conversation that are stored in association in the voice database.
  • the matching module is further configured to: delete, from the face database, third user identifiers that have not been matched within a preset time period and the corresponding face features.
  • the extraction module is specifically configured to:
  • crop the scene image to obtain a face picture of each face; and input the multiple face pictures sequentially into a preset face feature model, and obtain the face feature of each user sequentially output by the face feature model.
  • the method further includes: a modeling module;
  • the modeling module is configured to: before the multiple face pictures are sequentially input into the preset face feature model, obtain face training samples, where the face training samples include face pictures and labels;
  • obtain a trained initial face feature model according to the face training samples, where the initial face feature model includes an input layer, a feature layer, a classification layer, and an output layer;
  • delete the classification layer from the initial face feature model to obtain the preset face feature model.
  • the face feature model is a deep convolutional neural network model
  • the feature layer includes a convolution layer, a pooling layer, and a fully connected layer.
  • an embodiment of the present invention provides a voice interaction-based context acquisition device, including: at least one processor and a memory;
  • the memory stores computer-executable instructions;
  • the at least one processor executes computer-executable instructions stored in the memory, so that the at least one processor executes the voice interaction-based context acquisition method according to the first aspect or various possible designs of the first aspect.
  • an embodiment of the present invention provides a computer-readable storage medium.
  • the computer-readable storage medium stores computer-executable instructions.
  • when the processor executes the computer-executable instructions, the method for acquiring context based on voice interaction according to the first aspect or the various possible designs of the first aspect is implemented.
  • In the method and device for acquiring context based on voice interaction provided by the embodiments, a scene image collected by an image acquisition device at the voice start point of the current conversation is acquired, and the face feature of each user in the scene image is extracted; if it is determined, according to the face feature of each user and the face database, that a second face feature matching a first face feature exists, a first user identifier corresponding to the second face feature is obtained from the face database, where the first face feature is the face feature of one user and the second face feature is the face feature of a user in a conversation state stored in the face database, so that the user's identity is accurately recognized through face recognition; if it is determined that a stored conversation corresponding to the first user identifier exists in the voice database, the context of the voice interaction is determined according to the current conversation and the stored conversation, and the current conversation is stored in the voice database after its voice end point is obtained. Through the user identifier, stored conversations belonging to the same user as the current conversation can be obtained, and the context of the voice interaction is obtained from the conversations of the same user, which avoids treating different users' conversations as context and improves the accuracy of context acquisition.
  • FIG. 1 is a system architecture diagram of a voice interaction-based context acquisition method according to an embodiment of the present invention
  • FIG. 2 is a first flowchart of a method for acquiring a context based on voice interaction according to an embodiment of the present invention
  • FIG. 3 is a second flowchart of a method for acquiring a context based on voice interaction according to an embodiment of the present invention
  • FIG. 4 is a schematic structural diagram of a face feature model according to an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a voice interaction-based context acquisition device according to an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of a hardware structure of a context acquisition device based on a voice interaction according to an embodiment of the present invention.
  • FIG. 1 is a system architecture diagram of a voice interaction-based context acquisition method according to an embodiment of the present invention.
  • the system includes a terminal 110 and a server 120.
  • the terminal 110 may be a device with a voice interaction function, such as a story machine, a mobile phone, a tablet, a vehicle-mounted terminal, a welcome robot, a police robot, and the like.
  • the implementation manner of the terminal 110 is not particularly limited in this embodiment, as long as the terminal 110 can perform voice interaction with a user.
  • the terminal 110 further includes an image acquisition device, and the image acquisition device may collect an image of a user who has a conversation with the terminal 110.
  • the image acquisition device may be a camera, a video camera, or the like.
  • the server 120 can provide various online services and can return corresponding question-and-answer results for users' questions.
  • The embodiments of the present invention are also applicable to the process in which multiple users converse with the terminal 110.
  • The process in which multiple users converse with the terminal 110 in this embodiment may be as follows: while user A is talking with the terminal 110, user B cuts in during a gap in the conversation between user A and the terminal 110 and starts talking with the terminal 110. At this point, user A and user B alternately converse with the terminal 110, forming a multi-person conversation scene.
  • In the embodiment of the present invention, the user is identified on the basis of the voiceprint, and the user's context can be obtained; for example, during the simultaneous interaction of user A and user B with the terminal, the context of user A and the context of user B can each be obtained, thereby reducing the error rate of context acquisition.
  • After the context of the same user's voice interaction is obtained, the question-and-answer result is fed back to the user in combination with the context, improving the user experience.
  • the execution subject of this embodiment of the present invention may be the above-mentioned server. After acquiring the conversation input by the user, the terminal sends the conversation to the server, and the server returns a question and answer result of the conversation. Those skilled in the art can understand that when the function of the terminal is sufficiently powerful, the terminal may also feedback the question and answer result by itself after acquiring the conversation.
  • the following uses a server as an execution subject to describe in detail a method for acquiring a context based on a voice interaction provided by an embodiment of the present invention.
  • FIG. 2 is a first flowchart of a voice interaction-based context acquisition method according to an embodiment of the present invention. As shown in FIG. 2, the method includes:
  • S201 Acquire a scene image collected by an image acquisition device at a voice start point of the conversation, and extract a facial feature of each user in the scene image.
  • voice endpoint detection technology is a very important technology, which is often also called voice activity detection (VAD).
  • Voice endpoint detection refers to finding the start point and end point of speech in the speech part in a continuous sound signal.
  • the executor of the voice activity detection technology may be the terminal described above, or the terminal may send voice to the server in real time for the server to execute.
  • The current conversation and the stored conversations in this embodiment refer to a continuous piece of voice input by the user to the terminal, that is, one utterance. When describing the act of conversing, "conversation" can be understood as an action being performed; in some scenarios in this embodiment it may also be used as a noun, and its part of speech can be determined from the context of the description.
  • When the voice start point of the current conversation is detected, the scene image collected by the image acquisition device at the voice start point is acquired. That is, when it is determined that a user is interacting with the terminal and speaking to it, the scene image collected in the current scene is acquired. If multiple people are facing the terminal, then because a conversation is taking place, the current scene image contains a user who faces the terminal's microphone with a speaking mouth shape, and may also contain users who are sideways or otherwise oriented relative to the terminal's microphone.
  • the facial features of each user in the scene image are extracted.
  • the facial features can be extracted through a facial feature model.
  • each user is used as a unit to extract facial features for the user.
  • the scene image is cropped to obtain a face picture of each face; the multiple face pictures are input sequentially into a preset face feature model, and the face feature of each user sequentially output by the face feature model is obtained.
  • the face feature may be a multi-dimensional feature, such as a multi-dimensional vector, and a vector of each dimension represents a feature, such as an eyebrow feature, an eye feature, a nose feature, etc., which are not described in detail in this embodiment.
  • the terminal may also schedule the server according to the load of each server, that is, the server with a lighter load performs the steps of this embodiment.
  • the face feature of each user is matched against the face features in the face database to determine whether the first face feature of some user matches a second face feature in the face database.
  • the matching in this embodiment can be understood as the two face features with the highest similarity under the premise that the similarity of the face features is greater than a preset value, and the similarity can be the cosine similarity of the two face features.
  • When a user's face feature matches a second face feature in a conversation state in the face database, a first user identifier corresponding to the second face feature is obtained from the face database, and then S204, S205, and S206 are performed in sequence.
  • the face database stores the face features and corresponding user identifiers in a conversation state in association.
  • S205 Determine the context of the voice interaction according to the conversation and the existing conversation, and after obtaining the voice end point of the conversation, store the conversation in a voice database;
  • When a user's face feature can be matched with a second face feature in a conversation state (open-mouth speaking state) in the face database, it is determined whether a stored conversation corresponding to the first user identifier is stored in the voice database.
  • the voice database stores a user ID and a corresponding conversation in association.
  • If a stored conversation corresponding to the first user identifier is stored in the voice database, the current conversation is not the first utterance the user has input to the terminal within a preset time period, and the context of the voice interaction is determined according to the current conversation and the stored conversation, that is, the context of the current conversation is determined from the stored conversations.
  • natural language understanding can be combined to obtain the existing conversations related to the conversation, that is, to obtain the context.
  • If no stored conversation corresponding to the first user identifier is stored in the voice database, the current conversation is the first utterance input by the user to the terminal within a preset time period, where the preset time period is a preset period before the current moment, for example the half hour before the current moment. In this case, the conversation is considered to have no context, and the conversation is stored in the voice database in association with the first user identifier.
  • the voice database and the face database may also be combined into a database, that is, a user identifier, corresponding facial features, and user dialogue are stored in a database in association.
  • facial features and corresponding user conversations may also be directly stored in the database.
  • the stored conversation corresponding to the second face feature is obtained from the database, the context of the voice interaction is determined according to the current conversation and the stored conversation, and after the voice end point of the current conversation is obtained, the current conversation is stored in the voice database.
  • S207 Analyze parameters including the facial features of each user, obtain a target user in a conversation state, and generate a second user identifier of the target user.
  • the parameters including the facial features of each user are analyzed to obtain a target user in a conversation state, and a second user identifier of the target user is generated, and the user identifier may be a number, a letter, or the like or a combination thereof.
  • the user identification of the target user can also be generated by a hash algorithm.
  • the implementation manner of the user identifier is not particularly limited.
  • the face feature of the target user is stored in the face database in association with the second user identifier, and the current conversation is stored in the voice database in association with the second user identifier, so that when the user interacts with the terminal by voice again, the context can be obtained from the stored conversations based on the contents of the face database and the voice database.
  • In the method for acquiring context based on voice interaction provided by this embodiment, a scene image collected by an image acquisition device at the voice start point of the current conversation is acquired, and the face feature of each user in the scene image is extracted; if it is determined, according to the face feature of each user and the face database, that a second face feature matching a first face feature exists, a first user identifier corresponding to the second face feature is obtained from the face database, where the first face feature is the face feature of one user and the second face feature is the face feature of a user in a conversation state stored in the face database, so that the user's identity is accurately recognized through face recognition; if it is determined that a stored conversation corresponding to the first user identifier exists in the voice database, the context of the voice interaction is determined according to the current conversation and the stored conversation, and the current conversation is stored in the voice database after its voice end point is obtained. Through the user identifier, stored conversations belonging to the same user as the current conversation can be obtained, and the context of the voice interaction is obtained from the conversations of the same user, which avoids treating different users' conversations as context and improves the accuracy of context acquisition.
  • FIG. 3 is a second flowchart of a voice interaction-based context acquisition method according to an embodiment of the present invention. As shown in Figure 3, the method includes:
  • the voice database stores a user ID and each sentence corresponding to the user ID, that is, the user ID is stored in association with at least one conversation of the user.
  • each conversation is stored, it corresponds to the time at which the voice start point and the voice end point of the conversation are stored.
  • the voice start point and the voice end point of the previous conversation corresponding to the first user identifier are obtained from the voice database according to the first user identifier.
  • the time interval between the voice end point of the previous conversation and the voice start point of the conversation is obtained.
  • the preset interval may be 10 minutes, 30 minutes, and so on.
  • the implementation method is not particularly limited.
  • the time interval is greater than or equal to the preset interval, it indicates that the dialog is the user's previous dialog on a topic and cannot be counted as the current context dialog. Therefore, the first user identifier and the corresponding stored conversation stored in the voice database are deleted, and there is no context for the conversation.
  • the associated first user identifier and the corresponding stored conversation are deleted from the voice database
  • the associated first user identifier and the corresponding face feature may also be deleted from the face database.
  • the two may also be deleted asynchronously, and the third user identifier and the corresponding facial feature that are not matched in the face database within a preset time period may be deleted.
  • the user identifiers and face features stored in association can be deleted in batches, which improves the deletion efficiency.
  • each time a user's conversation is acquired the above operation is performed, so that multiple conversations of each user stored in the voice database are conversations with a time interval less than a preset interval. Therefore, the context of the conversation is obtained based on all the existing conversations of the user and the conversation. For example, the user's current conversation and all existing conversations can be used as the context of voice interaction, or the conversation of the same user can be obtained from all existing conversations based on natural language understanding.
  • the context of the conversation can be judged more accurately, and the accuracy of context acquisition is improved.
  • the embodiment of the present invention obtains a face feature of each user by using a face feature model.
  • the following uses a detailed embodiment to explain the process of constructing a facial feature model.
  • FIG. 4 is a schematic structural diagram of a face feature model according to an embodiment of the present invention.
  • the face feature model can use Deep Convolutional Neural Networks (Deep CNN).
  • the model includes an input layer, a feature layer, a classification layer, and an output layer.
  • the feature layer includes a convolution layer, a pooling layer, and a fully connected layer. Among them, there can be multiple alternate convolution layers and pooling layers in the feature layer.
  • a face training sample is obtained, and the face training sample includes a face picture and a label.
  • the label is a classification result of each feature in a pre-calibrated face picture, and the label may be a vector in the form of a matrix.
  • The face picture is input at the input layer; the input is actually a vector composed of a matrix. The convolution layer then scans and convolves the original image or feature map with convolution kernels of different weights, extracts features of various meanings, and outputs them to feature maps. The pooling layer is sandwiched between consecutive convolution layers and is used to compress the amount of data and parameters and reduce overfitting, that is, to reduce the dimensionality of the feature map while retaining its main features. All neurons between two layers are connected with weights, and the fully connected layer is usually at the tail of the convolutional neural network. Finally, the features pass through the classification layer and the result is output.
  • Through the above model training process, an initial face feature model is obtained, and the classification layer in the initial face feature model is deleted to obtain the preset face feature model. Because the classification layer is deleted, when a face picture cropped from a scene image is input into the preset face feature model, the model directly outputs face features instead of classification results.
  • This embodiment uses a deep convolutional neural network model to extract facial features and perform identity recognition, which can accurately distinguish the source of a conversation, find the conversation context of each person, and improve the conversation experience in a multi-person scenario.
  • FIG. 5 is a schematic structural diagram of a voice interaction-based context acquisition device according to an embodiment of the present invention.
  • the voice interaction-based context acquisition device 50 includes an extraction module 501, a matching module 502, and an acquisition module 503.
  • a modeling module 504 is further included.
  • An extraction module 501 configured to acquire a scene image collected by an image acquisition device at a voice starting point of the conversation, and extract a facial feature of each user in the scene image;
  • The matching module 502 is configured to: if it is determined, according to the face feature of each user and a face database, that a second face feature matching a first face feature exists, obtain a first user identifier corresponding to the second face feature from the face database.
  • An obtaining module 503 is configured to: if it is determined that a stored conversation corresponding to the first user identifier exists in the voice database, determine the context of the voice interaction according to the current conversation and the stored conversation, and store the current conversation in the voice database after the voice end point of the current conversation is obtained.
  • the matching module 502 is further configured to:
  • if it is determined, according to the face feature of each user and the face database, that no second face feature matching the first face feature exists, analyze parameters including the face feature of each user, obtain a target user in a conversation state, and generate a second user identifier for the target user;
  • the conversation is stored in a voice database in association with the second user identifier, and the face feature of the target user is associated with the second user identifier in a face database.
  • the obtaining module 503 is specifically configured to:
  • the context of the voice interaction is determined according to the current conversation and the existing conversation.
  • the obtaining module 503 is further configured to: if the time interval between the voice end point of the previous conversation and the voice start point of the current conversation is greater than or equal to the preset interval, delete the first user identifier and the corresponding stored conversation that are stored in association in the voice database.
  • the matching module 502 is further configured to: delete, from the face database, third user identifiers that have not been matched within a preset time period and the corresponding face features.
  • the extraction module 501 is specifically configured to:
  • crop the scene image to obtain a face picture of each face; and input the multiple face pictures sequentially into a preset face feature model, and obtain the face feature of each user sequentially output by the face feature model.
  • the modeling module 504 is configured to obtain a face training sample before the multiple face regions are sequentially input into a preset face feature model, and the face training sample includes a face picture and a label;
  • the initial face feature model includes an input layer, a feature layer, a classification layer, and an output layer;
  • the classification layer in the initial facial feature model is deleted to obtain the preset facial feature model.
  • the face feature model is a deep convolutional neural network model
  • the feature layer includes a convolution layer, a pooling layer, and a fully connected layer.
  • the voice interaction-based context acquisition device provided in this embodiment has implementation principles and technical effects similar to those of the foregoing method embodiments, and is not described herein again in this embodiment.
  • FIG. 6 is a schematic diagram of a hardware structure of a context acquisition device based on a voice interaction according to an embodiment of the present invention.
  • the voice interaction-based context acquisition device 60 includes: at least one processor 601 and a memory 602.
  • the voice interaction context obtaining device 60 further includes a communication component 603.
  • the processor 601, the memory 602, and the communication component 603 are connected through a bus 604.
  • At least one processor 601 executes the computer-executable instructions stored in the memory 602, so that the at least one processor 601 performs the above context acquisition method based on voice interaction.
  • the communication component 603 can perform data interaction with other devices.
  • the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or the like.
  • a general-purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the invention can be directly embodied as being executed by a hardware processor, or executed and completed by a combination of hardware and software modules in the processor.
  • the memory may include high-speed RAM memory, and may also include non-volatile storage NVM, such as at least one disk memory.
  • the bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like.
  • the bus can be divided into an address bus, a data bus, a control bus, and the like.
  • the bus in the drawings of the present application is not limited to only one bus or one type of bus.
  • the present application also provides a computer-readable storage medium that stores computer-executable instructions.
  • the processor executes the computer-executable instructions, the method for acquiring a context based on voice interaction as described above is implemented.
  • the foregoing computer-readable storage medium may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as a static random access memory (SRAM), an electrically erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a programmable read-only memory (PROM), a read-only memory (ROM), a magnetic memory, a flash memory, a magnetic disk, or an optical disc.
  • a readable storage medium may be any available medium that can be accessed by a general purpose or special purpose computer.
  • An exemplary readable storage medium is coupled to the processor such that the processor can read information from, and write information to, the readable storage medium.
  • the readable storage medium may also be part of the processor.
  • the processor and the readable storage medium may be located in an application-specific integrated circuit (ASIC).
  • the processor and the readable storage medium may reside in the device as discrete components.
  • the division of the unit is only a kind of logical function division. In actual implementation, there may be another division manner. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, which may be electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objective of the solution of this embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the functions are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • the technical solutions of the present invention essentially, or the part contributing to the prior art, or part of the technical solutions, may be embodied in the form of a software product.
  • the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention.
  • the foregoing storage media include: U disk, mobile hard disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), magnetic disks or optical disks and other media that can store program codes .
  • a person of ordinary skill in the art may understand that all or part of the steps of implementing the foregoing method embodiments may be implemented by a program instructing related hardware.
  • the aforementioned program may be stored in a computer-readable storage medium.
  • the steps including the foregoing method embodiments are performed; and the foregoing storage medium includes various media that can store program codes, such as a ROM, a RAM, a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • User Interface Of Digital Computer (AREA)
  • Image Analysis (AREA)
  • Telephonic Communication Services (AREA)
  • Collating Specific Patterns (AREA)

Abstract

Embodiments of the present invention provide a context acquisition method and device based on voice interaction. The method includes: acquiring a scene image collected by an image acquisition device at the voice start point of the current conversation, and extracting the face feature of each user in the scene image; if it is determined, according to the face feature of each user and a face database, that a second face feature matching a first face feature exists, obtaining a first user identifier corresponding to the second face feature from the face database, where the first face feature is the face feature of one user, and the second face feature is the face feature of a user in a conversation state stored in the face database; and if it is determined that a stored conversation corresponding to the first user identifier exists in a voice database, determining the context of the voice interaction according to the current conversation and the stored conversation, and storing the current conversation in the voice database after the voice end point of the current conversation is obtained. This embodiment can improve the accuracy of acquiring the context of a voice interaction.

Description

Context acquisition method and device based on voice interaction
This application claims priority to Chinese Patent Application No. 201810709792.8, filed with the Chinese Patent Office on July 2, 2018 by 北京百度网讯科技有限公司 and entitled "基于语音交互的上下文获取方法及设备" ("Context acquisition method and device based on voice interaction"), which is incorporated herein by reference in its entirety.
Technical Field
Embodiments of the present invention relate to the field of voice interaction technologies, and in particular, to a context acquisition method and device based on voice interaction.
Background
With the development of artificial intelligence technology, the development and use of intelligent voice interaction products have attracted much attention. Intelligent voice interaction is an interaction mode based on voice input: a user can input a request by voice, and the product responds with corresponding content according to the intent of the request.
In the prior art, in application scenarios of intelligent service robots, such as welcome robots and police robots, multiple people often interact with the robot at the same time. When multiple people talk with the robot, if the source of the conversation content cannot be identified, the conversation context cannot be obtained accurately, accurate services cannot be provided to the users, and the conversation experience is poor. At present, on the assumptions that the conversation content of one user does not span different topics and that the topics of two users' conversations do not overlap, natural language understanding is used to identify users according to the meaning of the conversations, so as to obtain the conversation context of the same user.
However, in practical applications the assumptions underlying natural language understanding do not always hold, resulting in a high error rate in acquiring the context of a voice conversation.
Summary
Embodiments of the present invention provide a context acquisition method and device based on voice interaction, so as to overcome the problem of a high error rate in acquiring the context of a voice conversation.
In a first aspect, an embodiment of the present invention provides a context acquisition method based on voice interaction, including:
acquiring a scene image collected by an image acquisition device at the voice start point of the current conversation, and extracting the face feature of each user in the scene image;
if it is determined, according to the face feature of each user and a face database, that a second face feature matching a first face feature exists, obtaining a first user identifier corresponding to the second face feature from the face database, where the first face feature is the face feature of one user, and the second face feature is the face feature of a user in a conversation state stored in the face database;
if it is determined that a stored conversation corresponding to the first user identifier exists in a voice database, determining the context of the voice interaction according to the current conversation and the stored conversation, and storing the current conversation in the voice database after the voice end point of the current conversation is obtained.
In a possible design, if it is determined, according to the face feature of each user and the face database, that no second face feature matching the first face feature exists, the method further includes:
analyzing parameters including the face feature of each user, obtaining a target user in a conversation state, and generating a second user identifier for the target user;
when the voice end point is detected, storing the current conversation in the voice database in association with the second user identifier, and storing the face feature of the target user in the face database in association with the second user identifier.
By storing the current conversation in the voice database in association with the second user identifier and storing the face feature of the target user in the face database in association with the second user identifier, the context can be obtained from the stored conversations based on the contents of the face database and the voice database when the user interacts with the terminal by voice again. Keeping the face database and the voice database separate facilitates their separate storage and maintenance.
In a possible design, determining the context of the voice interaction according to the current conversation and the stored conversation includes:
obtaining, from the voice database according to the first user identifier, the voice start point and the voice end point of the previous conversation corresponding to the first user identifier;
if it is determined that the time interval between the voice end point of the previous conversation and the voice start point of the current conversation is less than a preset interval, determining the context of the voice interaction according to the current conversation and the stored conversation.
If the time interval is less than the preset interval, the previous conversation and the current conversation are likely to be contextual; if the time interval is greater than or equal to the preset interval, the previous conversation was the user's earlier conversation on a topic and cannot be counted as context for the current conversation. By judging whether the time interval between the voice end point of the previous conversation and the voice start point of the current conversation is less than the preset interval, the context of the current conversation can be determined more accurately, improving the accuracy of context acquisition.
In a possible design, if it is determined that the time interval between the voice end point of the previous conversation and the voice start point of the current conversation is greater than or equal to the preset interval, the method further includes:
deleting the first user identifier and the corresponding stored conversation that are stored in association in the voice database.
If the time interval is greater than or equal to the preset interval, the previous conversation was the user's earlier conversation on a topic and cannot be counted as context for the current conversation. Therefore, the first user identifier and the corresponding stored conversation are deleted from the voice database, so that the data in the voice database remains up to date.
In a possible design, the method further includes:
deleting, from the face database, third user identifiers that have not been matched within a preset time period and the corresponding face features.
With this deletion method, user identifiers and face features stored in association can be deleted in batches, which improves deletion efficiency, keeps the data in the face database up to date, and avoids redundancy in the face database.
In a possible design, extracting the face feature of each user in the scene image includes:
cropping the scene image to obtain a face picture of each face;
inputting the multiple face pictures sequentially into a preset face feature model, and obtaining the face feature of each user sequentially output by the face feature model.
Obtaining users' face features through the face feature model is not only fast but also highly accurate.
In a possible design, before the multiple face regions are sequentially input into the preset face feature model, the method further includes:
obtaining face training samples, where the face training samples include face pictures and labels;
obtaining a trained initial face feature model according to the face training samples, where the initial face feature model includes an input layer, a feature layer, a classification layer, and an output layer;
deleting the classification layer from the initial face feature model to obtain the preset face feature model.
Through the above model training process, an initial face feature model is obtained, and the classification layer in the initial face feature model is deleted to obtain the preset face feature model. Because the classification layer is deleted, when a face picture cropped from a scene image is input into the preset face feature model, the model directly outputs face features instead of classification results.
In a possible design, the face feature model is a deep convolutional neural network model, and the feature layer includes a convolution layer, a pooling layer, and a fully connected layer.
A deep neural network model with convolution and pooling operations is highly robust to image deformation, blur, noise, and the like, and generalizes better for classification tasks.
In a second aspect, an embodiment of the present invention provides a context acquisition device based on voice interaction, including:
an extraction module, configured to acquire a scene image collected by an image acquisition device at the voice start point of the current conversation, and extract the face feature of each user in the scene image;
a matching module, configured to: if it is determined, according to the face feature of each user and a face database, that a second face feature matching a first face feature exists, obtain a first user identifier corresponding to the second face feature from the face database, where the first face feature is the face feature of one user, and the second face feature is the face feature of a user in a conversation state stored in the face database;
an obtaining module, configured to: if it is determined that a stored conversation corresponding to the first user identifier exists in a voice database, determine the context of the voice interaction according to the current conversation and the stored conversation, and store the current conversation in the voice database after the voice end point of the current conversation is obtained.
In a possible design, the matching module is further configured to:
if it is determined, according to the face feature of each user and the face database, that no second face feature matching the first face feature exists, analyze parameters including the face feature of each user, obtain a target user in a conversation state, and generate a second user identifier for the target user;
when the voice end point is detected, store the current conversation in the voice database in association with the second user identifier, and store the face feature of the target user in the face database in association with the second user identifier.
In a possible design, the obtaining module is specifically configured to:
obtain, from the voice database according to the first user identifier, the voice start point and the voice end point of the previous conversation corresponding to the first user identifier;
if it is determined that the time interval between the voice end point of the previous conversation and the voice start point of the current conversation is less than a preset interval, determine the context of the voice interaction according to the current conversation and the stored conversation.
In a possible design, the obtaining module is further configured to:
if it is determined that the time interval between the voice end point of the previous conversation and the voice start point of the current conversation is greater than or equal to the preset interval, delete the first user identifier and the corresponding stored conversation that are stored in association in the voice database.
In a possible design, the matching module is further configured to:
delete, from the face database, third user identifiers that have not been matched within a preset time period and the corresponding face features.
In a possible design, the extraction module is specifically configured to:
crop the scene image to obtain a face picture of each face;
input the multiple face pictures sequentially into a preset face feature model, and obtain the face feature of each user sequentially output by the face feature model.
In a possible design, the device further includes a modeling module;
the modeling module is configured to, before the multiple face regions are sequentially input into the preset face feature model,
obtain face training samples, where the face training samples include face pictures and labels;
obtain a trained initial face feature model according to the face training samples, where the initial face feature model includes an input layer, a feature layer, a classification layer, and an output layer;
delete the classification layer from the initial face feature model to obtain the preset face feature model.
In a possible design, the face feature model is a deep convolutional neural network model, and the feature layer includes a convolution layer, a pooling layer, and a fully connected layer.
In a third aspect, an embodiment of the present invention provides a context acquisition device based on voice interaction, including: at least one processor and a memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory, so that the at least one processor performs the context acquisition method based on voice interaction according to the first aspect or the various possible designs of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing computer-executable instructions; when a processor executes the computer-executable instructions, the context acquisition method based on voice interaction according to the first aspect or the various possible designs of the first aspect is implemented.
In the context acquisition method and device based on voice interaction provided by the embodiments, a scene image collected by an image acquisition device at the voice start point of the current conversation is acquired, and the face feature of each user in the scene image is extracted; if it is determined, according to the face feature of each user and the face database, that a second face feature matching a first face feature exists, a first user identifier corresponding to the second face feature is obtained from the face database, where the first face feature is the face feature of one user and the second face feature is the face feature of a user in a conversation state stored in the face database, so that the user's identity is accurately recognized through face recognition; if it is determined that a stored conversation corresponding to the first user identifier exists in the voice database, the context of the voice interaction is determined according to the current conversation and the stored conversation, and the current conversation is stored in the voice database after its voice end point is obtained. Through the user identifier, stored conversations belonging to the same user as the current conversation can be obtained, and the context of the voice interaction is obtained from the conversations of the same user, which avoids treating different users' conversations as context and improves the accuracy of context acquisition.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the following briefly introduces the accompanying drawings needed for describing the embodiments or the prior art. Apparently, the accompanying drawings in the following description show some embodiments of the present invention, and a person of ordinary skill in the art may still derive other drawings from these accompanying drawings without creative effort.
FIG. 1 is a system architecture diagram of a context acquisition method based on voice interaction according to an embodiment of the present invention;
FIG. 2 is a first flowchart of a context acquisition method based on voice interaction according to an embodiment of the present invention;
FIG. 3 is a second flowchart of a context acquisition method based on voice interaction according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a face feature model according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a context acquisition device based on voice interaction according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of the hardware structure of a context acquisition device based on voice interaction according to an embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the following clearly and completely describes the technical solutions in the embodiments of the present invention with reference to the accompanying drawings in the embodiments. Apparently, the described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
FIG. 1 is a system architecture diagram of a context acquisition method based on voice interaction according to an embodiment of the present invention. As shown in FIG. 1, the system includes a terminal 110 and a server 120. The terminal 110 may be a device with a voice interaction function, such as a story machine, a mobile phone, a tablet, a vehicle-mounted terminal, a welcome robot, or a police robot.
The implementation of the terminal 110 is not particularly limited in this embodiment, as long as the terminal 110 can perform voice interaction with users. In this embodiment, the terminal 110 further includes an image acquisition device, which can collect images of the users conversing with the terminal 110. The image acquisition device may be a camera, a video camera, or the like. The server 120 can provide various online services and can return corresponding question-and-answer results for users' questions.
The embodiments of the present invention are also applicable to the process in which multiple users converse with the terminal 110. The process may be as follows: while user A is talking with the terminal 110, user B cuts in during a gap in the conversation between user A and the terminal 110 and starts talking with the terminal 110. At this point, user A and user B alternately converse with the terminal 110, forming a multi-person conversation scene.
The embodiment of the present invention identifies the user based on the voiceprint, and the user's context can be obtained; for example, during the simultaneous interaction of user A and user B with the terminal, the context of user A and the context of user B can each be obtained, thereby reducing the error rate of context acquisition. After the context of the same user's voice interaction is obtained, the question-and-answer result is fed back to the user in combination with the context, improving the user experience.
The execution body of the embodiments of the present invention may be the above server: after acquiring the conversation input by the user, the terminal sends the conversation to the server, and the server returns the question-and-answer result for the conversation. A person skilled in the art can understand that, when the terminal is sufficiently powerful, the terminal may also return the question-and-answer result by itself after acquiring the conversation. The following describes in detail, with the server as the execution body, the context acquisition method based on voice interaction provided by the embodiments of the present invention.
FIG. 2 is a first flowchart of a context acquisition method based on voice interaction according to an embodiment of the present invention. As shown in FIG. 2, the method includes:
S201: Acquire a scene image collected by an image acquisition device at the voice start point of the current conversation, and extract the face feature of each user in the scene image.
With the development of human-computer interaction technology, speech recognition technology has shown its importance. In a speech recognition system, voice endpoint detection is a very important technique, usually also called voice activity detection (VAD). Voice endpoint detection refers to finding the voice start point and voice end point of the speech portion in a continuous sound signal. The specific implementation of voice activity detection is not particularly limited in this embodiment; the detection may be performed by the above terminal, or the terminal may send the voice to the server in real time for the server to perform the detection.
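The disclosure does not prescribe a particular VAD algorithm. Purely as an illustration, a simple short-term-energy detector of the kind commonly used for voice endpoint detection might look like the following Python sketch; the frame length, energy threshold, and hangover count are assumed values, not parameters taken from this disclosure.

```python
import numpy as np

def detect_endpoints(samples: np.ndarray, sample_rate: int = 16000,
                     frame_ms: int = 30, energy_threshold: float = 1e-3,
                     hangover_frames: int = 10):
    """Return (start_index, end_index) of the speech portion, or None if no speech.

    A frame is treated as voiced when its mean energy exceeds the threshold; the
    voice end point is declared after `hangover_frames` consecutive unvoiced frames.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    start, end, silence = None, None, 0
    for i in range(0, len(samples) - frame_len, frame_len):
        frame = samples[i:i + frame_len]
        voiced = float(np.mean(frame ** 2)) > energy_threshold
        if voiced:
            if start is None:
                start = i                      # voice start point
            end, silence = i + frame_len, 0
        elif start is not None:
            silence += 1
            if silence >= hangover_frames:     # voice end point confirmed
                break
    return (start, end) if start is not None else None
```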
The current conversation and the stored conversations in this embodiment refer to a continuous piece of voice input by the user to the terminal, that is, one utterance. When describing the act of conversing, "conversation" can be understood as an action being performed; in some scenarios in this embodiment it may also be used as a noun, and its part of speech can be determined from the context of the description.
When the voice start point of the current conversation is detected, the scene image collected by the image acquisition device at the voice start point is acquired. That is, when it is determined that a user is interacting with the terminal by voice and speaking to it, the scene image collected in the current scene is acquired. If multiple people are facing the terminal, then because a conversation is taking place, the current scene image contains a user who faces the terminal's microphone with a speaking mouth shape, and may also contain users who are sideways or otherwise oriented relative to the terminal's microphone.
After the scene image is obtained, the face feature of each user in the scene image is extracted, for example through a face feature model.
During extraction, face features are extracted per user. Specifically, the scene image is cropped to obtain a face picture of each face; the multiple face pictures are input sequentially into the preset face feature model, and the face feature of each user sequentially output by the face feature model is obtained.
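A minimal sketch of this step is given below: detect and crop each face in the scene image, then run the crops through the preset face feature model one by one. The face detector (OpenCV's Haar cascade), the 112x112 crop size, and the generic `feature_model` callable are illustrative assumptions; the disclosure only requires cropping plus a preset feature model.

```python
import cv2
import numpy as np

def extract_face_features(scene_image: np.ndarray, feature_model) -> list:
    """Crop every detected face and return one feature vector per user."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(scene_image, cv2.COLOR_BGR2GRAY)
    features = []
    for (x, y, w, h) in detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5):
        face_picture = cv2.resize(scene_image[y:y + h, x:x + w], (112, 112))
        # The preset face feature model outputs a feature vector directly,
        # because its classification layer has been removed (see FIG. 4 below).
        features.append(feature_model(face_picture))
    return features
```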
The face feature may be a multi-dimensional feature, for example a multi-dimensional vector, where each dimension represents one feature, such as an eyebrow feature, an eye feature, or a nose feature; details are not repeated here.
In this embodiment, the terminal may also schedule the servers according to the load of each server, that is, a server with a lighter load performs the steps of this embodiment.
S202: Determine, according to the face feature of each user and the face database, whether a second face feature matching a first face feature exists, where the first face feature is the face feature of one user and the second face feature is the face feature of a user in a conversation state stored in the face database; if yes, perform S203; if no, perform S207.
S203: Obtain, from the face database, the first user identifier corresponding to the second face feature.
After the face feature of each user is obtained, the face feature of each user is matched against the face features in the face database to determine whether the first face feature of some user matches a second face feature in the face database.
A person skilled in the art can understand that when one user directly faces the microphone, the other users cannot directly face it at the same time; therefore, only one user in the collected scene image is in a conversation state with the microphone, and it can thus be judged whether one user's first face feature matches the second face feature. Matching in this embodiment can be understood as the two face features with the highest similarity, provided that the similarity of the face features is greater than a preset value; the similarity may be the cosine similarity of the two face features.
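As a concrete illustration of this matching rule (the highest-similarity pair, subject to a preset similarity value), a minimal Python sketch is shown below; the 0.75 preset value, the dictionary layout of the face database, and the function names are assumptions made for the example only.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two face feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_face(first_feature: np.ndarray, face_db: dict, preset_value: float = 0.75):
    """Return the user identifier whose stored (second) face feature best matches
    the first face feature, or None when no similarity exceeds the preset value."""
    best_id, best_sim = None, preset_value
    for user_id, second_feature in face_db.items():
        sim = cosine_similarity(first_feature, second_feature)
        if sim > best_sim:              # keep the highest similarity above the threshold
            best_id, best_sim = user_id, sim
    return best_id
```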
When a user's face feature can be matched with a second face feature in a conversation state (open-mouth speaking state) in the face database, the first user identifier corresponding to the second face feature is obtained from the face database, and then S204, S205, and S206 are performed in sequence. The face database stores, in association, the face features in a conversation state and the corresponding user identifiers.
When no user's face feature can be matched with a second face feature in a conversation state (open-mouth speaking state) in the face database, S207 and S208 are performed in sequence.
S204: Determine whether a stored conversation corresponding to the first user identifier is stored in the voice database; if yes, perform S205; if no, perform S206.
S205: Determine the context of the voice interaction according to the current conversation and the stored conversation, and after the voice end point of the current conversation is obtained, store the current conversation in the voice database.
S206: Store the current conversation in the voice database in association with the first user identifier.
When a user's face feature can be matched with a second face feature in a conversation state (open-mouth speaking state) in the face database, it is determined whether a stored conversation corresponding to the first user identifier is stored in the voice database. The voice database stores user identifiers and the corresponding conversations in association.
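For concreteness, the two associations described here (face feature with user identifier in the face database; user identifier with conversations and their voice start/end times in the voice database) could be modeled as in the sketch below. The in-memory dictionaries and field names are assumptions for illustration, not the storage format of the disclosure.

```python
from dataclasses import dataclass
from typing import Dict, List
import numpy as np

@dataclass
class ConversationRecord:
    text: str
    start_time: float   # time of the voice start point
    end_time: float     # time of the voice end point

# Face database: user identifier -> face feature of a user in a conversation state
face_db: Dict[str, np.ndarray] = {}

# Voice database: user identifier -> stored conversations of that user
voice_db: Dict[str, List[ConversationRecord]] = {}

def store_conversation(user_id: str, record: ConversationRecord) -> None:
    """Associate the current conversation with the user identifier."""
    voice_db.setdefault(user_id, []).append(record)
```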
If a stored conversation corresponding to the first user identifier is stored in the voice database, the current conversation is not the first utterance the user has input to the terminal within a preset time period, and the context of the voice interaction is determined according to the current conversation and the stored conversation, that is, the context of the current conversation is determined from the stored conversations. At this point, within this limited number of conversations, natural language understanding can be combined to obtain the stored conversations related to the current conversation, that is, to obtain the context. After the voice end point of the current conversation is obtained, the current conversation is stored in the voice database, and an association between the current conversation and the first user identifier in the voice database is established.
If no stored conversation corresponding to the first user identifier is stored in the voice database, the current conversation is the first utterance input by the user to the terminal within a preset time period, where the preset time period is a preset period before the current moment, for example the half hour before the current moment. In this case, the current conversation is considered to have no context, and the current conversation is stored in the voice database in association with the first user identifier.
Optionally, in this embodiment, the voice database and the face database may also be combined into one database, that is, user identifiers, the corresponding face features, and user conversations are stored in association in one database. Optionally, face features and the corresponding user conversations may also be stored in direct association in the database.
In that case, if it is determined, according to the face feature of each user and the database, that a second face feature matching the first face feature exists, the stored conversation corresponding to the second face feature is obtained from the database, the context of the voice interaction is determined according to the current conversation and the stored conversation, and after the voice end point of the current conversation is obtained, the current conversation is stored in the voice database.
In this embodiment, keeping the face database and the voice database separate facilitates their separate storage and maintenance.
S207: Analyze parameters including the face feature of each user, obtain a target user in a conversation state, and generate a second user identifier for the target user.
S208: When the voice end point is detected, store the face feature of the target user in the face database in association with the second user identifier, and store the current conversation in the voice database in association with the second user identifier.
When no user's face feature can be matched with a second face feature in a conversation state (open-mouth speaking state) in the face database, the current user has never interacted with the terminal by voice before. In this case, parameters including the face feature of each user are analyzed to obtain the target user in a conversation state, and a second user identifier is generated for the target user; the user identifier may be digits, letters, or the like, or a combination thereof. As another example, the user identifier of the target user may also be generated by a hash algorithm. The implementation of the user identifier is not particularly limited in this embodiment.
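As one example of the hash-based option just mentioned, a second user identifier could be derived by hashing the target user's face feature bytes together with a timestamp; the choice of SHA-1 and the 16-character truncation below are assumptions made for illustration.

```python
import hashlib
import time
import numpy as np

def generate_user_id(face_feature: np.ndarray) -> str:
    """Generate a second user identifier (digits/letters) for a newly seen target user."""
    payload = face_feature.tobytes() + str(time.time()).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()[:16]
```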
Thus, when the voice end point is detected, the face feature of the target user is stored in the face database in association with the second user identifier, and the current conversation is stored in the voice database in association with the second user identifier, so that when the user interacts with the terminal by voice again, the context can be obtained from the stored conversations based on the contents of the face database and the voice database.
In the context acquisition method based on voice interaction provided by this embodiment, a scene image collected by an image acquisition device at the voice start point of the current conversation is acquired, and the face feature of each user in the scene image is extracted; if it is determined, according to the face feature of each user and the face database, that a second face feature matching a first face feature exists, a first user identifier corresponding to the second face feature is obtained from the face database, where the first face feature is the face feature of one user and the second face feature is the face feature of a user in a conversation state stored in the face database, so that the user's identity is accurately recognized through face recognition; if it is determined that a stored conversation corresponding to the first user identifier exists in the voice database, the context of the voice interaction is determined according to the current conversation and the stored conversation, and the current conversation is stored in the voice database after its voice end point is obtained. Through the user identifier, stored conversations belonging to the same user as the current conversation can be obtained, and the context of the voice interaction is obtained from the conversations of the same user, which avoids treating different users' conversations as context and improves the accuracy of context acquisition.
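Pulling steps S201 to S208 together, the per-utterance flow could be composed roughly as below. This is a schematic sketch only: it reuses the hypothetical helpers introduced above (extract_face_features, match_face, generate_user_id, ConversationRecord, face_db, voice_db), the 30-minute interval is an example value, and the conversation-state (mouth-shape) analysis of S207 is reduced to a placeholder.

```python
PRESET_INTERVAL = 30 * 60  # seconds; example value for the preset interval

def handle_current_conversation(text, start_time, end_time, scene_image,
                                feature_model, face_db, voice_db):
    """Schematic flow of S201-S208 for one utterance; returns (user_id, context)."""
    # S201: extract the face feature of each user in the scene image
    features = extract_face_features(scene_image, feature_model)
    if not features:
        return None, []                       # nothing to associate with

    # S202/S203: look for a second face feature matching some first face feature
    user_id = None
    for first_feature in features:
        user_id = match_face(first_feature, face_db)
        if user_id is not None:
            break

    record = ConversationRecord(text, start_time, end_time)

    if user_id is None:
        # S207/S208: unknown speaker -> generate a second user identifier and register it
        target_feature = features[0]          # placeholder for the conversation-state analysis
        user_id = generate_user_id(target_feature)
        face_db[user_id] = target_feature
        voice_db.setdefault(user_id, []).append(record)
        return user_id, []                    # first utterance, no context yet

    # S204-S206: build the context from the same user's stored conversations
    stored = voice_db.get(user_id, [])
    context = [r for r in stored if start_time - r.end_time < PRESET_INTERVAL]
    voice_db.setdefault(user_id, []).append(record)   # store after the voice end point
    return user_id, context
```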
The following describes an implementation of determining the context of the voice interaction. FIG. 3 is a second flowchart of the context acquisition method based on voice interaction according to an embodiment of the present invention. As shown in FIG. 3, the method includes:
S301: Obtain, from the voice database according to the first user identifier, the voice start point and the voice end point of the previous conversation corresponding to the first user identifier.
S302: Determine whether the time interval between the voice end point of the previous conversation and the voice start point of the current conversation is less than a preset interval; if yes, perform S303; if no, perform S304.
S303: Determine the context of the voice interaction according to the current conversation and the stored conversation.
S304: Delete the first user identifier and the corresponding stored conversation that are stored in association in the voice database.
In a specific implementation, the voice database stores the user identifier and each utterance corresponding to the user identifier, that is, the user identifier is stored in association with at least one conversation of the user. When each conversation is stored, the time of its voice start point and the time of its voice end point are stored correspondingly.
After the first user identifier is obtained according to the face feature of the target user, the voice start point and the voice end point of the previous conversation corresponding to the first user identifier are obtained from the voice database according to the first user identifier.
Then, according to the time at which the voice end point of the previous conversation occurred and the time at which the voice start point of the current conversation occurred, the time interval between the voice end point of the previous conversation and the voice start point of the current conversation is obtained.
If the time interval is less than the preset interval, the previous conversation and the current conversation are likely to be contextual; the preset interval may be, for example, 10 minutes or 30 minutes, and its implementation is not particularly limited in this embodiment.
If the time interval is greater than or equal to the preset interval, the previous conversation was the user's earlier conversation on a topic and cannot be counted as context for the current conversation. Therefore, the first user identifier and the corresponding stored conversation that are stored in association are deleted from the voice database, and the current conversation has no context.
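A compact sketch of the S301-S304 decision, reusing the ConversationRecord/voice_db structures assumed earlier, is shown below; the 30-minute preset interval is one of the example values given above, not a mandated parameter.

```python
PRESET_INTERVAL = 30 * 60  # seconds; example value ("10 minutes, 30 minutes, and so on")

def determine_context(user_id: str, current_start: float, voice_db: dict):
    """S301-S304: return the stored conversations usable as context, or delete stale ones."""
    stored = voice_db.get(user_id, [])
    if not stored:
        return []                                        # no previous conversation
    previous = stored[-1]                                # S301: previous conversation
    if current_start - previous.end_time < PRESET_INTERVAL:
        return list(stored)                              # S303: stored conversations are the context
    voice_db.pop(user_id, None)                          # S304: stale association is deleted
    return []
```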
Optionally, when the first user identifier and the corresponding stored conversation that are stored in association are deleted from the voice database, the first user identifier and the corresponding face feature that are stored in association may also be deleted from the face database.
Optionally, the two deletions may instead be performed asynchronously: third user identifiers that have not been matched within a preset time period, together with the corresponding face features, may be deleted from the face database. With this deletion method, user identifiers and face features stored in association can be deleted in batches, which improves deletion efficiency.
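The asynchronous batch deletion could be as simple as the following sketch. It assumes the face database also keeps a last-matched timestamp per user identifier; that bookkeeping dictionary and the 24-hour preset period are illustrative assumptions.

```python
import time

def purge_unmatched(face_db: dict, last_matched: dict, preset_period: float = 24 * 3600):
    """Delete third user identifiers (and their face features) not matched within the preset period."""
    now = time.time()
    stale = [uid for uid, t in last_matched.items() if now - t >= preset_period]
    for uid in stale:                 # batch deletion keeps the face database from growing stale
        face_db.pop(uid, None)
        last_matched.pop(uid, None)
    return stale
```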
A person skilled in the art can understand that the above operations are performed each time a user's conversation is acquired, so that the multiple conversations of each user stored in the voice database are all conversations whose time intervals are less than the preset interval. Therefore, the context of the current conversation is obtained based on all of the user's stored conversations and the current conversation. For example, the user's current conversation and all stored conversations may be taken as the context of the voice interaction, or, for the conversations of the same user, the context of the current conversation may be obtained from all stored conversations based on natural language understanding.
In this embodiment, by judging whether the time interval between the voice end point of the previous conversation and the voice start point of the current conversation is less than the preset interval, the context of the current conversation can be determined more accurately, improving the accuracy of context acquisition.
In the above embodiments, the face feature of each user is obtained through a face feature model. The following uses a detailed embodiment to describe the process of building the face feature model.
FIG. 4 is a schematic structural diagram of a face feature model according to an embodiment of the present invention. As shown in FIG. 4, the face feature model may be a deep convolutional neural network (Deep CNN). The model includes an input layer, a feature layer, a classification layer, and an output layer. Optionally, the feature layer includes a convolution layer, a pooling layer, and a fully connected layer, and the feature layer may contain multiple alternating convolution and pooling layers.
In a specific implementation, for different usage scenarios, deep neural network models with different depths, different numbers of neurons, and different organizations of convolution and pooling may be designed based on this face feature model.
When the model is trained, face training samples are obtained; each face training sample includes a face picture and a label, where the label is the pre-calibrated classification result of each feature in the face picture, and the label may be a vector in matrix form.
The face picture is input at the input layer; the input is actually a vector composed of a matrix. The convolution layer then scans and convolves the original image or feature map with convolution kernels of different weights, extracts features of various meanings, and outputs them to feature maps. The pooling layer is sandwiched between consecutive convolution layers and is used to compress the amount of data and parameters and reduce overfitting, that is, to reduce the dimensionality of the feature map while retaining its main features. All neurons between two layers are connected with weights, and the fully connected layer is usually at the tail of the convolutional neural network. Finally, the features pass through the classification layer and the result is output.
Training stops when the error between the model's output and the label is smaller than a preset threshold that meets the business requirements. A deep neural network model with convolution and pooling operations is highly robust to image deformation, blur, noise, and the like, and generalizes better for classification tasks.
Through the above model training process, an initial face feature model is obtained, and the classification layer in the initial face feature model is deleted to obtain the preset face feature model. Because the classification layer is deleted, when a face picture cropped from a scene image is input into the preset face feature model, the model directly outputs face features instead of classification results.
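A hedged PyTorch-style sketch of this idea follows: build a small deep CNN with a classification head, train it on labeled face pictures, then drop the classification layer so that the remaining network outputs face features directly. The layer sizes, the 128-dimensional feature, the 112x112 input, and the use of torch/nn are illustrative assumptions; FIG. 4 only fixes the input/feature/classification/output structure.

```python
import torch
import torch.nn as nn

class FaceFeatureModel(nn.Module):
    """Input layer -> feature layer (conv/pool/fully connected) -> classification layer."""
    def __init__(self, num_identities: int, feature_dim: int = 128):
        super().__init__()
        self.feature_layer = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # convolution + pooling
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # alternating conv/pool
            nn.Flatten(),
            nn.Linear(64 * 28 * 28, feature_dim),                         # fully connected layer
        )
        self.classification_layer = nn.Linear(feature_dim, num_identities)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classification_layer(self.feature_layer(x))

# After training on (face picture, label) samples with a classification loss,
# the classification layer is deleted so the preset model outputs face features.
initial_model = FaceFeatureModel(num_identities=1000)
# ... train initial_model here, e.g. with nn.CrossEntropyLoss() ...
preset_face_feature_model = initial_model.feature_layer      # classification layer removed
with torch.no_grad():
    feature = preset_face_feature_model(torch.randn(1, 3, 112, 112))  # 1 x 128 feature vector
```

Such a feature extractor could, for example, stand in for the `feature_model` callable assumed in the cropping sketch earlier, after the cropped image is converted to a normalized tensor.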
By using a deep convolutional neural network model to extract facial features and identify users, this embodiment can accurately distinguish the sources of conversations, find each person's conversation context, and improve the conversation experience in multi-user scenarios.
FIG. 5 is a schematic structural diagram of a context acquisition device based on voice interaction according to an embodiment of the present invention. As shown in FIG. 5, the context acquisition device 50 based on voice interaction includes an extraction module 501, a matching module 502, and an acquisition module 503, and optionally further includes a modeling module 504.
The extraction module 501 is configured to obtain the scene image captured by the image capture device at the voice starting point of the current conversation, and to extract the facial feature of each user in the scene image.
The matching module 502 is configured to: if it is determined, according to each user's facial feature and the face database, that a second facial feature matching a first facial feature exists, obtain from the face database the first user identifier corresponding to the second facial feature, where the first facial feature is the facial feature of one user and the second facial feature is a facial feature, stored in the face database, of a user in the conversation state.
The acquisition module 503 is configured to: if it is determined that the voice database stores an existing conversation corresponding to the first user identifier, determine the context of the voice interaction according to the current conversation and the existing conversation, and after the voice endpoint of the current conversation is obtained, store the current conversation in the voice database.
Optionally, the matching module 502 is further configured to:
if it is determined, according to each user's facial feature and the face database, that no second facial feature matching the first facial feature exists, analyze parameters including each user's facial feature, obtain the target user who is in the conversation state, and generate a second user identifier for the target user; and
when the voice endpoint is detected, store the current conversation in the voice database in association with the second user identifier, and store the target user's facial feature in the face database in association with the second user identifier.
Optionally, the acquisition module 503 is specifically configured to:
obtain, from the voice database according to the first user identifier, the voice starting point and voice endpoint of the previous conversation corresponding to the first user identifier; and
if it is determined that the time interval between the voice endpoint of the previous conversation and the voice starting point of the current conversation is shorter than the preset interval, determine the context of the voice interaction according to the current conversation and the existing conversation.
Optionally, the acquisition module 503 is further configured to:
if it is determined that the time interval between the voice endpoint of the previous conversation and the voice starting point of the current conversation is greater than or equal to the preset interval, delete from the voice database the first user identifier and the corresponding existing conversation stored in association.
Optionally, the matching module 502 is further configured to:
delete, from the face database, a third user identifier that has not been matched within the preset time period, together with the corresponding facial feature.
Optionally, the extraction module 501 is specifically configured to:
crop the scene image to obtain a face image of each face; and
input the multiple face images one by one into the preset facial feature model, and obtain the facial feature of each user output in turn by the facial feature model.
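For illustration only, this extraction step could be sketched as follows (Python); the use of OpenCV's bundled Haar cascade as the face detector is an assumption (the embodiment does not name a detector), and `feature_model` stands for a preset facial feature model such as the embedding network sketched earlier.

```python
# A sketch of the extraction step (not the disclosed implementation): faces are cropped
# from the scene image with OpenCV's bundled Haar cascade detector (an assumed choice),
# and each crop is fed to a preset facial feature model that outputs an embedding.
import cv2
import numpy as np
import torch

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_face_features(scene_bgr: np.ndarray, feature_model) -> list:
    """Return one facial feature vector per detected face in the scene image."""
    gray = cv2.cvtColor(scene_bgr, cv2.COLOR_BGR2GRAY)
    boxes = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    features = []
    for (x, y, w, h) in boxes:
        crop = cv2.resize(scene_bgr[y:y + h, x:x + w], (112, 112))       # one face image
        tensor = torch.from_numpy(crop).permute(2, 0, 1).float().unsqueeze(0) / 255.0
        with torch.no_grad():
            features.append(feature_model(tensor).squeeze(0))            # facial feature
    return features
```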
The modeling module 504 is configured to, before the multiple face regions are input one by one into the preset facial feature model: obtain face training samples, the face training samples including face images and labels;
obtain a trained initial facial feature model according to the face training samples, the initial facial feature model including an input layer, feature layers, a classification layer, and an output layer; and
delete the classification layer from the initial facial feature model to obtain the preset facial feature model.
Optionally, the facial feature model is a deep convolutional neural network model, and the feature layers include convolutional layers, pooling layers, and fully connected layers.
The implementation principle and technical effects of the context acquisition device based on voice interaction provided in this embodiment are similar to those of the foregoing method embodiments, and are not repeated here.
FIG. 6 is a schematic diagram of the hardware structure of a context acquisition device based on voice interaction according to an embodiment of the present invention. As shown in FIG. 6, the context acquisition device 60 based on voice interaction includes at least one processor 601 and a memory 602, and optionally further includes a communication component 603. The processor 601, the memory 602, and the communication component 603 are connected through a bus 604.
In a specific implementation, the at least one processor 601 executes the computer-executable instructions stored in the memory 602, causing the at least one processor 601 to perform the context acquisition method based on voice interaction described above.
The communication component 603 can exchange data with other devices.
For the specific implementation of the processor 601, reference may be made to the foregoing method embodiments; the implementation principle and technical effects are similar and are not repeated here.
In the embodiment shown in FIG. 6 above, it should be understood that the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or the like. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in the present invention may be embodied as being executed directly by a hardware processor, or by a combination of hardware and software modules in the processor.
The memory may include high-speed RAM, and may further include non-volatile memory (NVM), for example at least one disk storage.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, the buses in the drawings of this application are not limited to a single bus or a single type of bus.
The present application further provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the context acquisition method based on voice interaction described above.
The above computer-readable storage medium may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random-access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc. The readable storage medium may be any available medium accessible to a general-purpose or special-purpose computer.
An exemplary readable storage medium is coupled to the processor so that the processor can read information from, and write information to, the readable storage medium. Of course, the readable storage medium may also be a component of the processor. The processor and the readable storage medium may reside in an application-specific integrated circuit (ASIC), or they may exist in the device as discrete components.
The division into units is merely a division by logical function; in actual implementation there may be other divisions. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random-access memory (RAM), a magnetic disk, or an optical disc.
Those of ordinary skill in the art will understand that all or some of the steps of the foregoing method embodiments may be carried out by hardware related to program instructions. The aforementioned program may be stored in a computer-readable storage medium; when executed, it performs the steps of the foregoing method embodiments. The aforementioned storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that the foregoing embodiments are intended only to illustrate, not to limit, the technical solutions of the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent substitutions for some or all of the technical features therein, and such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (18)

  1. A context acquisition method based on voice interaction, comprising:
    obtaining a scene image captured by an image capture device at the voice starting point of a current conversation, and extracting a facial feature of each user in the scene image;
    if it is determined, according to the facial feature of each user and a face database, that a second facial feature matching a first facial feature exists, obtaining, from the face database, a first user identifier corresponding to the second facial feature, wherein the first facial feature is the facial feature of one user, and the second facial feature is a facial feature, stored in the face database, of a user in a conversation state;
    if it is determined that a voice database stores an existing conversation corresponding to the first user identifier, determining a context of the voice interaction according to the current conversation and the existing conversation, and after the voice endpoint of the current conversation is obtained, storing the current conversation in the voice database.
  2. The method according to claim 1, wherein, if it is determined, according to the facial feature of each user and the face database, that no second facial feature matching the first facial feature exists, the method further comprises:
    analyzing parameters including the facial feature of each user, obtaining a target user in the conversation state, and generating a second user identifier for the target user;
    when the voice endpoint is detected, storing the current conversation in the voice database in association with the second user identifier, and storing the facial feature of the target user in the face database in association with the second user identifier.
  3. The method according to claim 1, wherein the determining a context of the voice interaction according to the current conversation and the existing conversation comprises:
    obtaining, from the voice database according to the first user identifier, a voice starting point and a voice endpoint of a previous conversation corresponding to the first user identifier;
    if it is determined that a time interval between the voice endpoint of the previous conversation and the voice starting point of the current conversation is shorter than a preset interval, determining the context of the voice interaction according to the current conversation and the existing conversation.
  4. The method according to claim 3, wherein, if it is determined that the time interval between the voice endpoint of the previous conversation and the voice starting point of the current conversation is greater than or equal to the preset interval, the method further comprises:
    deleting, from the voice database, the first user identifier and the corresponding existing conversation stored in association.
  5. The method according to claim 1, further comprising:
    deleting, from the face database, a third user identifier that has not been matched within a preset time period and the corresponding facial feature.
  6. The method according to claim 1, wherein the extracting a facial feature of each user in the scene image comprises:
    cropping the scene image to obtain a face image of each face;
    inputting the multiple face images one by one into a preset facial feature model, and obtaining the facial feature of each user output in turn by the facial feature model.
  7. The method according to claim 6, wherein, before the inputting the multiple face regions one by one into the preset facial feature model, the method further comprises:
    obtaining face training samples, the face training samples comprising face images and labels;
    obtaining a trained initial facial feature model according to the face training samples, the initial facial feature model comprising an input layer, feature layers, a classification layer, and an output layer;
    deleting the classification layer from the initial facial feature model to obtain the preset facial feature model.
  8. The method according to claim 7, wherein the facial feature model is a deep convolutional neural network model, and the feature layers comprise convolutional layers, pooling layers, and fully connected layers.
  9. A context acquisition device based on voice interaction, comprising:
    an extraction module, configured to obtain a scene image captured by an image capture device at the voice starting point of a current conversation, and extract a facial feature of each user in the scene image;
    a matching module, configured to: if it is determined, according to the facial feature of each user and a face database, that a second facial feature matching a first facial feature exists, obtain, from the face database, a first user identifier corresponding to the second facial feature, wherein the first facial feature is the facial feature of one user, and the second facial feature is a facial feature, stored in the face database, of a user in a conversation state;
    an acquisition module, configured to: if it is determined that a voice database stores an existing conversation corresponding to the first user identifier, determine a context of the voice interaction according to the current conversation and the existing conversation, and after the voice endpoint of the current conversation is obtained, store the current conversation in the voice database.
  10. The device according to claim 9, wherein the matching module is further configured to:
    if it is determined, according to the facial feature of each user and the face database, that no second facial feature matching the first facial feature exists, analyze parameters including the facial feature of each user, obtain a target user in the conversation state, and generate a second user identifier for the target user;
    when the voice endpoint is detected, store the current conversation in the voice database in association with the second user identifier, and store the facial feature of the target user in the face database in association with the second user identifier.
  11. The device according to claim 9, wherein the acquisition module is specifically configured to:
    obtain, from the voice database according to the first user identifier, a voice starting point and a voice endpoint of a previous conversation corresponding to the first user identifier;
    if it is determined that a time interval between the voice endpoint of the previous conversation and the voice starting point of the current conversation is shorter than a preset interval, determine the context of the voice interaction according to the current conversation and the existing conversation.
  12. The device according to claim 11, wherein the acquisition module is further configured to:
    if it is determined that the time interval between the voice endpoint of the previous conversation and the voice starting point of the current conversation is greater than or equal to the preset interval, delete, from the voice database, the first user identifier and the corresponding existing conversation stored in association.
  13. The device according to claim 9, wherein the matching module is further configured to:
    delete, from the face database, a third user identifier that has not been matched within a preset time period and the corresponding facial feature.
  14. The device according to claim 9, wherein the extraction module is specifically configured to:
    crop the scene image to obtain a face image of each face;
    input the multiple face images one by one into a preset facial feature model, and obtain the facial feature of each user output in turn by the facial feature model.
  15. The device according to claim 14, further comprising a modeling module;
    the modeling module is configured to, before the multiple face regions are input one by one into the preset facial feature model:
    obtain face training samples, the face training samples comprising face images and labels;
    obtain a trained initial facial feature model according to the face training samples, the initial facial feature model comprising an input layer, feature layers, a classification layer, and an output layer;
    delete the classification layer from the initial facial feature model to obtain the preset facial feature model.
  16. The device according to claim 15, wherein the facial feature model is a deep convolutional neural network model, and the feature layers comprise convolutional layers, pooling layers, and fully connected layers.
  17. A context acquisition device based on voice interaction, comprising: at least one processor and a memory;
    wherein the memory stores computer-executable instructions; and
    the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the context acquisition method based on voice interaction according to any one of claims 1 to 8.
  18. A computer-readable storage medium, wherein the computer-readable storage medium stores computer-executable instructions which, when executed by a processor, implement the context acquisition method based on voice interaction according to any one of claims 1 to 8.
PCT/CN2019/087203 2018-07-02 2019-05-16 基于语音交互的上下文获取方法及设备 WO2020007129A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
KR1020197034483A KR20200004826A (ko) 2018-07-02 2019-05-16 음성 대화 기반 콘텍스트 획득 방법 및 기기
EP19802029.9A EP3617946B1 (en) 2018-07-02 2019-05-16 Context acquisition method and device based on voice interaction
JP2019563817A JP6968908B2 (ja) 2018-07-02 2019-05-16 コンテキスト取得方法及びコンテキスト取得デバイス
US16/936,967 US20210012777A1 (en) 2018-07-02 2020-07-23 Context acquiring method and device based on voice interaction

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810709792.8 2018-07-02
CN201810709792.8A CN108920639B (zh) 2018-07-02 2018-07-02 基于语音交互的上下文获取方法及设备

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/936,967 Continuation US20210012777A1 (en) 2018-07-02 2020-07-23 Context acquiring method and device based on voice interaction

Publications (1)

Publication Number Publication Date
WO2020007129A1 true WO2020007129A1 (zh) 2020-01-09

Family

ID=64424805

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/087203 WO2020007129A1 (zh) 2018-07-02 2019-05-16 基于语音交互的上下文获取方法及设备

Country Status (6)

Country Link
US (1) US20210012777A1 (zh)
EP (1) EP3617946B1 (zh)
JP (1) JP6968908B2 (zh)
KR (1) KR20200004826A (zh)
CN (1) CN108920639B (zh)
WO (1) WO2020007129A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10553203B2 (en) 2017-11-09 2020-02-04 International Business Machines Corporation Training data optimization for voice enablement of applications
US10565982B2 (en) 2017-11-09 2020-02-18 International Business Machines Corporation Training data optimization in a service computing system for voice enablement of applications

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108920639B (zh) * 2018-07-02 2022-01-18 北京百度网讯科技有限公司 基于语音交互的上下文获取方法及设备
CN109559761A (zh) * 2018-12-21 2019-04-02 广东工业大学 一种基于深度语音特征的脑卒中风险预测方法
CN109462546A (zh) * 2018-12-28 2019-03-12 苏州思必驰信息科技有限公司 一种语音对话历史消息记录方法、装置及系统
CN111475206B (zh) * 2019-01-04 2023-04-11 优奈柯恩(北京)科技有限公司 用于唤醒可穿戴设备的方法及装置
CN110210307B (zh) * 2019-04-30 2023-11-28 中国银联股份有限公司 人脸样本库部署方法、基于人脸识别业务处理方法及装置
CN110223718B (zh) * 2019-06-18 2021-07-16 联想(北京)有限公司 一种数据处理方法、装置及存储介质
CN110825765B (zh) * 2019-10-23 2022-10-04 中国建设银行股份有限公司 一种人脸识别的方法和装置
CN112598840A (zh) * 2020-12-16 2021-04-02 广州云从鼎望科技有限公司 基于人脸识别和语音交互的通行设备控制方法、装置、机器可读介质及设备
CN114356275B (zh) * 2021-12-06 2023-12-29 上海小度技术有限公司 交互控制方法、装置、智能语音设备及存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160379107A1 (en) * 2015-06-24 2016-12-29 Baidu Online Network Technology (Beijing) Co., Ltd. Human-computer interactive method based on artificial intelligence and terminal device
CN106683680A (zh) * 2017-03-10 2017-05-17 百度在线网络技术(北京)有限公司 说话人识别方法及装置、计算机设备及计算机可读介质
CN106782563A (zh) * 2016-12-28 2017-05-31 上海百芝龙网络科技有限公司 一种智能家居语音交互系统
CN108920639A (zh) * 2018-07-02 2018-11-30 北京百度网讯科技有限公司 基于语音交互的上下文获取方法及设备

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001331799A (ja) * 2000-03-16 2001-11-30 Toshiba Corp 画像処理装置および画像処理方法
US20030154084A1 (en) * 2002-02-14 2003-08-14 Koninklijke Philips Electronics N.V. Method and system for person identification using video-speech matching
US9053750B2 (en) * 2011-06-17 2015-06-09 At&T Intellectual Property I, L.P. Speaker association with a visual representation of spoken content
US9318129B2 (en) * 2011-07-18 2016-04-19 At&T Intellectual Property I, Lp System and method for enhancing speech activity detection using facial feature detection
JP5845686B2 (ja) * 2011-07-26 2016-01-20 ソニー株式会社 情報処理装置、フレーズ出力方法及びプログラム
US9214157B2 (en) * 2011-12-06 2015-12-15 At&T Intellectual Property I, L.P. System and method for machine-mediated human-human conversation
US10509829B2 (en) * 2015-01-21 2019-12-17 Microsoft Technology Licensing, Llc Contextual search using natural language
TWI526879B (zh) * 2015-01-30 2016-03-21 原相科技股份有限公司 互動系統、遙控器及其運作方法
WO2016173326A1 (zh) * 2015-04-30 2016-11-03 北京贝虎机器人技术有限公司 基于主题的交互系统及方法
US10521354B2 (en) * 2015-06-17 2019-12-31 Intel Corporation Computing apparatus and method with persistent memory
KR20170000748A (ko) * 2015-06-24 2017-01-03 삼성전자주식회사 얼굴 인식 방법 및 장치
EP3312762B1 (en) * 2016-10-18 2023-03-01 Axis AB Method and system for tracking an object in a defined area
CN108154153B (zh) * 2016-12-02 2022-02-22 北京市商汤科技开发有限公司 场景分析方法和系统、电子设备
CN106782545B (zh) * 2016-12-16 2019-07-16 广州视源电子科技股份有限公司 一种将音视频数据转化成文字记录的系统和方法
CN107086041A (zh) * 2017-03-27 2017-08-22 竹间智能科技(上海)有限公司 基于加密计算的语音情感分析方法及装置
CN107799126B (zh) * 2017-10-16 2020-10-16 苏州狗尾草智能科技有限公司 基于有监督机器学习的语音端点检测方法及装置
CN107808145B (zh) * 2017-11-13 2021-03-30 河南大学 基于多模态智能机器人的交互身份鉴别与跟踪方法及系统
CN108172225A (zh) * 2017-12-27 2018-06-15 浪潮金融信息技术有限公司 语音交互方法及机器人、计算机可读存储介质、终端
CN110309691B (zh) * 2018-03-27 2022-12-27 腾讯科技(深圳)有限公司 一种人脸识别方法、装置、服务器及存储介质
CN108920640B (zh) * 2018-07-02 2020-12-22 北京百度网讯科技有限公司 基于语音交互的上下文获取方法及设备

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160379107A1 (en) * 2015-06-24 2016-12-29 Baidu Online Network Technology (Beijing) Co., Ltd. Human-computer interactive method based on artificial intelligence and terminal device
CN106782563A (zh) * 2016-12-28 2017-05-31 上海百芝龙网络科技有限公司 一种智能家居语音交互系统
CN106683680A (zh) * 2017-03-10 2017-05-17 百度在线网络技术(北京)有限公司 说话人识别方法及装置、计算机设备及计算机可读介质
CN108920639A (zh) * 2018-07-02 2018-11-30 北京百度网讯科技有限公司 基于语音交互的上下文获取方法及设备

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3617946A4

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10553203B2 (en) 2017-11-09 2020-02-04 International Business Machines Corporation Training data optimization for voice enablement of applications
US10565982B2 (en) 2017-11-09 2020-02-18 International Business Machines Corporation Training data optimization in a service computing system for voice enablement of applications

Also Published As

Publication number Publication date
EP3617946B1 (en) 2024-01-03
JP2020529033A (ja) 2020-10-01
CN108920639B (zh) 2022-01-18
US20210012777A1 (en) 2021-01-14
JP6968908B2 (ja) 2021-11-17
CN108920639A (zh) 2018-11-30
EP3617946A1 (en) 2020-03-04
KR20200004826A (ko) 2020-01-14
EP3617946A4 (en) 2020-12-30

Similar Documents

Publication Publication Date Title
WO2020007129A1 (zh) 基于语音交互的上下文获取方法及设备
CN111488433B (zh) 一种适用于银行的提升现场体验感的人工智能交互系统
CN108920640B (zh) 基于语音交互的上下文获取方法及设备
US10270736B2 (en) Account adding method, terminal, server, and computer storage medium
WO2020140665A1 (zh) 双录视频质量检测方法、装置、计算机设备和存储介质
CN110444198B (zh) 检索方法、装置、计算机设备和存储介质
JP6951712B2 (ja) 対話装置、対話システム、対話方法、およびプログラム
WO2019000832A1 (zh) 一种声纹创建与注册方法及装置
US20200005673A1 (en) Method, apparatus, device and system for sign language translation
CN112233698B (zh) 人物情绪识别方法、装置、终端设备及存储介质
WO2019134580A1 (zh) 一种用于管理游戏用户的方法与设备
JP6732703B2 (ja) 感情インタラクションモデル学習装置、感情認識装置、感情インタラクションモデル学習方法、感情認識方法、およびプログラム
WO2020253128A1 (zh) 基于语音识别的通信服务方法、装置、计算机设备及存储介质
WO2022174699A1 (zh) 图像更新方法、装置、电子设备及计算机可读介质
US20230206928A1 (en) Audio processing method and apparatus
CN108986825A (zh) 基于语音交互的上下文获取方法及设备
WO2022257452A1 (zh) 表情回复方法、装置、设备及存储介质
CN111144369A (zh) 一种人脸属性识别方法和装置
CN112632248A (zh) 问答方法、装置、计算机设备和存储介质
CN113434670A (zh) 话术文本生成方法、装置、计算机设备和存储介质
CN114138960A (zh) 用户意图识别方法、装置、设备及介质
CN112151027B (zh) 基于数字人的特定人询问方法、装置和存储介质
CN111506183A (zh) 一种智能终端及用户交互方法
CN109961152B (zh) 虚拟偶像的个性化互动方法、系统、终端设备及存储介质
US20210166685A1 (en) Speech processing apparatus and speech processing method

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2019563817

Country of ref document: JP

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 20197034483

Country of ref document: KR

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 2019802029

Country of ref document: EP

Effective date: 20191122

NENP Non-entry into the national phase

Ref country code: DE