WO2021104110A1 - Voice matching method and related device - Google Patents

Voice matching method and related device

Info

Publication number
WO2021104110A1
WO2021104110A1 · PCT/CN2020/129464 · CN2020129464W
Authority
WO
WIPO (PCT)
Prior art keywords
information
voice
user
recognized
lip motion
Prior art date
Application number
PCT/CN2020/129464
Other languages
English (en)
French (fr)
Inventor
刘恒
李志刚
于明雨
车慧敏
张红蕾
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to EP20893436.4A priority Critical patent/EP4047598B1/en
Priority to US17/780,384 priority patent/US20230008363A1/en
Publication of WO2021104110A1 publication Critical patent/WO2021104110A1/zh

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/24Speech recognition using non-acoustical features
    • G10L15/25Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/06Decision making techniques; Pattern matching strategies
    • G10L17/10Multimodal systems, i.e. based on the integration of multiple recognition engines or fusion of expert systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • G06V40/176Dynamic expression
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/18Artificial neural networks; Connectionist approaches
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00Aspects of pattern recognition specially adapted for signal processing
    • G06F2218/08Feature extraction
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/04Training, enrolment or model building
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223Execution procedure of a spoken command
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225Feedback of the input speech

Definitions

  • the present invention relates to the technical field of human-computer interaction, in particular to a voice matching method and related equipment.
  • Human-Computer Interaction mainly studies the information exchange between humans and computers. It mainly includes two parts: human-to-computer and computer-to-human information exchange. It is a comprehensive discipline closely related to cognitive psychology, ergonomics, multimedia technology, and virtual reality technology.
  • multi-modal interaction devices are interactive devices that have multiple interaction modes such as voice interaction, somatosensory interaction, and touch interaction in parallel.
  • Human-computer interaction based on multi-modal interactive devices collects user information through multiple tracking modules (face, gesture, posture, voice, and rhythm) in the interactive device and, after understanding, processing, and management, forms a virtual user expression module; interactive dialogue with the computer in this way can greatly enhance the user's interactive experience.
  • the robot "what is the weather tomorrow"; after the robot receives the voice command, it will only answer the result of semantic recognition based on the received voice information: "It will be sunny tomorrow with 2-3 breeze.” It will not consider asking questions at all. The mood and feelings of Tom, therefore, it is impossible to realize intelligent and personalized human-computer interaction.
  • the embodiments of the present invention provide a voice matching method, a neural network training method, and related equipment to improve voice matching efficiency and human-computer interaction experience in a multi-person scenario.
  • an embodiment of the present invention provides a voice matching method, which may include: acquiring audio data and video data; and extracting voice information to be recognized from the audio data, where the voice information to be recognized includes a voice waveform sequence in a target time period;
  • the lip motion information of N users is extracted from the video data, and the lip motion information of each user in the lip motion information of the N users includes an image sequence of the corresponding user's lip movement within the target time period, where N is an integer greater than 1.
  • audio data and video data can be collected, and the voice information in the audio data can be matched against the lip motion information in the video data to determine the target user to which the voice information to be recognized belongs;
  • that is, in a multi-person scene, the voice feature is matched against the lip motion features of multiple users to identify the specific user who uttered a given piece of voice information to be recognized, so that further control or operation can be performed based on the recognition result.
  • the embodiments of the invention do not rely on characteristics of the human voice (a voiceprint is easily affected by physical condition, age, emotion, etc.) and are not subject to environmental interference (such as noise in the environment), so they provide strong anti-interference ability and high recognition efficiency and accuracy.
  • the voice information to be recognized includes the voice waveform sequence in a specific time period
  • the lip motion information of the N users includes the lip motion image sequences (that is, the lip movement video) of multiple users in the same scene within that time period, which is convenient for subsequent feature extraction and feature matching.
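  • for clarity, the following minimal sketch (in Python) shows one way the two kinds of input described above could be represented in practice; the sampling rate, frame count, and crop size are illustrative assumptions, not values specified by this publication.

```python
import numpy as np

# Illustrative shapes only -- the publication does not fix concrete dimensions.
SAMPLE_RATE = 16000          # assumed audio sampling rate (Hz)
TARGET_SECONDS = 2.0         # assumed length of the target time period
FRAMES = 50                  # assumed number of video frames covering the same period
H, W = 64, 64                # assumed size of the cropped lip region
N = 3                        # number of users visible in the scene

# Voice information to be recognized: a voice waveform sequence in the target time period.
voice_waveform = np.zeros(int(SAMPLE_RATE * TARGET_SECONDS), dtype=np.float32)

# Lip motion information of the N users: per user, an image sequence of the lip
# region over the same time period (grayscale crops in this sketch).
lip_sequences = np.zeros((N, FRAMES, H, W), dtype=np.float32)

print(voice_waveform.shape, lip_sequences.shape)  # (32000,) (3, 50, 64, 64)
```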
  • the voice information to be recognized and the lip motion information of the N users are used as the input of the target feature matching model, the matching degrees between the lip motion information of the N users and the voice information to be recognized are used as the output of the target feature matching model, and the target user to which the voice information to be recognized belongs is then determined according to the matching degrees.
  • the target feature matching model is a neural network model.
  • the target feature matching model includes a first model, a second model, and a third model; the inputting of the voice information to be recognized and the lip motion information of the N users into the target feature matching model to obtain the matching degrees between the lip motion information of the N users and the voice information to be recognized includes: inputting the voice information to be recognized into the first model to obtain a voice feature, where the voice feature is a K-dimensional voice feature and K is an integer greater than 0; inputting the lip motion information of the N users into the second model to obtain N image sequence features, each of which is a K-dimensional image sequence feature; and inputting the voice feature and the N image sequence features into the third model to obtain the matching degree between each of the N image sequence features and the voice feature.
  • feature extraction (which can also be regarded as a dimensionality-reduction process) is performed on the voice information to be recognized and on the lip motion information of the N users by the first model and the second model respectively, so that after this feature extraction both types of input yield features of the same dimension, achieving a feature-normalization effect across different types of information. That is, after the feature extraction of the above models, different types of raw data (the voice information to be recognized and the lip motion information of the N users) are transformed into dimensionless index values (namely the K-dimensional voice feature and the N K-dimensional image sequence features in the embodiment of the invention); each index value is then at the same quantitative level, so that comprehensive evaluation and analysis (that is, the feature matching in the embodiment of the present invention) can be performed.
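  • as an illustration of the first/second/third-model arrangement described above, the following PyTorch-style sketch assumes illustrative layer choices and a shared feature dimension K = 128 that are not specified by this publication: a speech encoder and a lip encoder each map their input to a K-dimensional feature, and a matcher scores each user's lip feature against the voice feature; the user with the highest score would be taken as the speaker.

```python
import torch
import torch.nn as nn

K = 128  # assumed shared feature dimension

class SpeechEncoder(nn.Module):
    """Stands in for the "first model": maps MFCC frames to a K-dimensional voice feature."""
    def __init__(self, in_dim=13, k=K):
        super().__init__()
        self.gru = nn.GRU(in_dim, k, batch_first=True)

    def forward(self, x):                 # x: (B, T, in_dim)
        _, h = self.gru(x)
        return h[-1]                      # (B, K)

class LipEncoder(nn.Module):
    """Stands in for the "second model": maps a lip image sequence to a K-dimensional feature."""
    def __init__(self, k=K):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv3d(1, 16, kernel_size=3, padding=1),
                                  nn.ReLU(),
                                  nn.AdaptiveAvgPool3d(1))
        self.fc = nn.Linear(16, k)

    def forward(self, x):                 # x: (B, 1, T, H, W)
        return self.fc(self.conv(x).flatten(1))   # (B, K)

class Matcher(nn.Module):
    """Stands in for the "third model": scores each lip feature against the voice feature."""
    def forward(self, voice_feat, lip_feats):      # (1, K) and (N, K)
        return torch.sigmoid(torch.cosine_similarity(voice_feat, lip_feats, dim=-1))  # (N,)

# Usage: the user whose lip feature receives the highest matching degree is taken as the speaker.
voice_feat = SpeechEncoder()(torch.randn(1, 200, 13))       # one utterance, 200 MFCC frames
lip_feats = LipEncoder()(torch.randn(3, 1, 50, 64, 64))     # N = 3 users, 50 lip frames each
scores = Matcher()(voice_feat, lip_feats)
speaker_index = int(scores.argmax())
```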
  • the target feature matching model is a feature matching model obtained by training with the lip motion information of a training user and M pieces of voice information as input, and with the matching degrees between the lip motion information of the training user and the M pieces of voice information as M labels; optionally, the M pieces of voice information include voice information that matches the lip motion information of the training user and (M-1) pieces of voice information that do not match the lip motion information of the training user.
  • that is, the lip motion information of a certain training user, together with the matching voice information and multiple pieces of non-matching voice information, is used as the input of an initial neural network model, the actual matching degrees between the above M pieces of voice information and the lip motion information of the training user are used as the labels, and the target feature matching model is obtained by training the initial neural network model. For example, the matching degree corresponding to a perfect match has a label of 1, and the matching degree corresponding to a non-match has a label of 0.
  • the method further includes: determining user information of the target user, the user information including one or more of character attribute information, facial expression information corresponding to the voice information to be recognized, and environmental information corresponding to the voice information to be recognized; and generating, based on the user information, a control instruction matching the user information.
  • the character attribute information is, for example, the user's gender, age, or personality;
  • the facial expression information is, for example, the expression the target user makes when uttering the voice information to be recognized;
  • the environmental information is, for example, whether the target user is currently in an office environment, a home environment, or an entertainment environment.
  • the intelligent machine is controlled to issue a voice response or an operation matching the facial expression data and character attribute information to the target user, including the robot's voice, the turning of the robot's head, and the content of the robot's reply.
  • the extracting of the lip motion information of the N users from the video data includes: recognizing N face regions in the video data based on a face recognition algorithm, and extracting the lip motion video in each of the N face regions; and determining the lip motion information of the N users based on the lip motion video in each face region.
  • that is, the face regions are first identified from the video data, the lip motion video in each face region is then extracted, and the lip motion information of the N users, i.e. the sequence of motion images corresponding to each user's lips, is determined from the lip motion videos.
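  • a minimal sketch of the face-region and lip-region extraction step described above, assuming an OpenCV Haar-cascade face detector and a simple "lower third of the face" crop rule; the publication does not prescribe a particular detector or crop.

```python
import cv2

# Assumed components: an OpenCV Haar-cascade face detector and a fixed crop rule
# (lower third of each detected face) standing in for a dedicated lip detector.
face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_lip_crops(frame_bgr):
    """Return one resized lip-region crop per face detected in a single video frame;
    stacking the crops frame by frame yields each user's lip motion image sequence."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    crops = []
    for (x, y, w, h) in faces:
        lip_roi = frame_bgr[y + 2 * h // 3 : y + h, x : x + w]   # lower third of the face
        if lip_roi.size:
            crops.append(cv2.resize(lip_roi, (64, 64)))
    return crops
```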
  • the extracting of the voice information to be recognized from the audio data includes: identifying audio data of different frequency spectrums in the audio data based on a spectrum recognition algorithm, and recognizing the audio data of the target frequency spectrum as the voice information to be recognized. Since the frequency spectra of the sounds emitted by different users generally differ, the embodiment of the present invention first recognizes audio data of different frequency spectrums from the audio data and then recognizes the audio data of the target spectrum as the voice information to be recognized, thereby realizing the function of extracting the voice information to be recognized from the audio data.
  • an embodiment of the present invention provides a neural network training method, which may include:
  • training samples are obtained, the training samples including lip movement information of the training user and M pieces of voice information, the M pieces of voice information including voice information matching the lip movement information of the training user, and (M-1) pieces of voice information that do not match the lip movement information of the training user;
  • the initialized neural network is trained to obtain the target feature matching model.
  • that is, the lip motion information of a certain training user, together with the matching voice information and multiple pieces of non-matching voice information, is used as the input of the initialized neural network, the actual matching degrees between the above M pieces of voice information and the lip motion information of the training user are used as the labels, and the target feature matching model is obtained by training the above initialized neural network.
  • for example, the matching degree corresponding to a perfect match has a label of 1, and the matching degree corresponding to a non-match has a label of 0; the closer the matching degrees between the lip motion information of the training user and the M pieces of voice information calculated by the trained initialized neural network are to the M labels, the closer the trained network is to the target feature matching model.
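  • the following sketch illustrates one possible training step consistent with the labelling scheme described above (label 1 for the matching voice clip, 0 for the non-matching ones); the encoders, optimizer, loss, and data layout are illustrative assumptions, not choices stated in this publication.

```python
import torch
import torch.nn as nn

K = 128  # assumed shared feature dimension

class SpeechEncoder(nn.Module):           # illustrative stand-in for the first model
    def __init__(self):
        super().__init__()
        self.gru = nn.GRU(13, K, batch_first=True)
    def forward(self, x):                 # x: (M, T, 13) MFCC frames per voice clip
        _, h = self.gru(x)
        return h[-1]                      # (M, K)

class LipEncoder(nn.Module):              # illustrative stand-in for the second model
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv3d(1, 16, 3, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool3d(1), nn.Flatten(),
                                 nn.Linear(16, K))
    def forward(self, x):                 # x: (1, 1, T, H, W)
        return self.net(x)                # (1, K)

speech_enc, lip_enc = SpeechEncoder(), LipEncoder()
optimizer = torch.optim.Adam(
    list(speech_enc.parameters()) + list(lip_enc.parameters()), lr=1e-4)
loss_fn = nn.BCELoss()

def training_step(lip_seq, voice_clips):
    """One training step on a sample of one lip sequence and M candidate voice clips,
    where index 0 is the matching clip (label 1) and the others do not match (label 0)."""
    M = voice_clips.shape[0]
    labels = torch.zeros(M)
    labels[0] = 1.0
    lip_feat = lip_enc(lip_seq)                                    # (1, K)
    voice_feats = speech_enc(voice_clips)                          # (M, K)
    scores = torch.sigmoid(
        torch.cosine_similarity(lip_feat, voice_feats, dim=-1))    # (M,) matching degrees
    loss = loss_fn(scores, labels)                                 # compare with the M labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example call with random stand-in data: M = 4 voice clips, one lip sequence.
print(training_step(torch.randn(1, 1, 50, 64, 64), torch.randn(4, 200, 13)))
```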
  • the lip motion information of the training user includes a lip motion image sequence of the training user
  • the M pieces of voice information include a voice waveform sequence that matches the training user's lip motion image sequence.
  • the using of the lip motion information of the training user and the M pieces of voice information as training input, and of the matching degrees between the lip motion information of the training user and the M pieces of voice information as M labels, to train the initialized neural network to obtain the target feature matching model includes:
  • comparing the calculated matching degrees between the M pieces of voice information and the lip motion information of the training user with the M labels, and training the initialized neural network accordingly to obtain the target feature matching model.
  • the target feature matching model includes a first model, a second model, and a third model
  • the calculating of the matching degrees includes:
  • each of the M voice features is a K-dimensional voice feature, and K is an integer greater than 0;
  • the M voice features and the image sequence features of the training user are input into a third model, and the matching degrees between the M voice features and the image sequence features of the training user are calculated.
  • an embodiment of the present invention provides a smart device, which may include a processor, a microphone and a camera coupled with the processor:
  • the microphone is used to collect audio data
  • the camera is used to collect video data, and the audio data and the video data are collected for the same scene;
  • the lip motion information of N users is extracted from the video data, and the lip motion information of each user in the lip motion information of the N users includes an image sequence of the corresponding user's lip motion in the target time period, where N is an integer greater than 1;
  • the voice information to be recognized and the lip motion information of the N users are input into the target feature matching model to obtain the matching degrees between the lip motion information of the N users and the voice information to be recognized;
  • the user corresponding to the lip motion information of the user with the highest matching degree is determined as the target user to which the voice information to be recognized belongs.
  • the target feature matching model includes a first model, a second model, and a third model; the processor is specifically configured to:
  • each of the N image sequence features is a K-dimensional image sequence feature
  • the voice features and the N image sequence features are input into a third model to obtain the matching degrees between the N image sequence features and the voice features.
  • the target feature matching model is a feature matching model obtained by training with the lip motion information of a training user and M pieces of voice information as input, and with the matching degrees between the lip motion information of the training user and the M pieces of voice information as M labels; optionally, the M pieces of voice information include voice information that matches the lip motion information of the training user.
  • the processor is further configured to:
  • the user information includes one or more of character attribute information, facial expression information corresponding to the voice information to be recognized, and environmental information corresponding to the voice information to be recognized;
  • the processor is specifically configured to:
  • the processor is specifically configured to:
  • the audio data of different frequency spectrums in the audio data are recognized, and the audio data of the target frequency spectrum is recognized as the voice information to be recognized.
  • an embodiment of the present invention provides a smart device, which may include a processor and a microphone, a camera, and a neural network processor coupled with the processor:
  • the microphone is used to collect audio data
  • the camera is used to collect video data, and the audio data and the video data are collected for the same scene;
  • the processor is used for:
  • the lip motion information of N users is extracted from the video data, and the lip motion information of each user in the lip motion information of the N users includes an image sequence of the corresponding user's lip motion in the target time period, where N is an integer greater than 1;
  • the neural network processor is used for:
  • the voice information to be recognized and the lip motion information of the N users are input into the target feature matching model to obtain the matching degrees between the lip motion information of the N users and the voice information to be recognized;
  • the user corresponding to the lip motion information of the user with the highest matching degree is determined as the target user to which the voice information to be recognized belongs.
  • the target feature matching model includes a first model, a second model, and a third model; the neural network processor is specifically configured to:
  • each of the N image sequence features is a K-dimensional image sequence feature
  • the voice features and the N image sequence features are input into a third model to obtain the matching degrees between the N image sequence features and the voice features.
  • the target feature matching model is a feature matching model obtained by training with the lip motion information of a training user and M pieces of voice information as input, and with the matching degrees between the lip motion information of the training user and the M pieces of voice information as M labels; optionally, the M pieces of voice information include voice information that matches the lip motion information of the training user.
  • the processor is further configured to:
  • the user information includes one or more of character attribute information, facial expression information corresponding to the voice information to be recognized, and environmental information corresponding to the voice information to be recognized;
  • the processor is specifically configured to:
  • the processor is specifically configured to:
  • the audio data of different frequency spectrums in the audio data are recognized, and the audio data of the target frequency spectrum is recognized as the voice information to be recognized.
  • an embodiment of the present invention provides a voice matching device, which may include:
  • the acquiring unit is used to acquire audio data and video data
  • the first extraction unit is configured to extract voice information to be recognized from the audio data, where the voice information to be recognized includes a voice waveform sequence in a target time period;
  • the second extraction unit is configured to extract lip motion information of N users from the video data, and the lip motion information of each user in the lip motion information of the N users includes the lip motion information of the corresponding user in the The image sequence of the lip movement in the target time period, where N is an integer greater than 1;
  • the matching unit is configured to input the voice information to be recognized and the lip motion information of the N users into a target feature matching model to obtain the matching degrees between the lip motion information of the N users and the voice information to be recognized;
  • the determining unit is configured to determine the user corresponding to the lip motion information of the user with the highest matching degree as the target user to which the voice information to be recognized belongs.
  • the target feature matching model includes a first network, a second network, and a third network; the matching unit is specifically configured to:
  • the voice information to be recognized is input into the first network to obtain a voice feature, where the voice feature is a K-dimensional voice feature, and K is an integer greater than 0;
  • each of the N image sequence features is a K-dimensional image sequence feature
  • the voice features and the N image sequence features are input into a third network, and the matching degrees between the N image sequence features and the voice features are obtained.
  • the target feature matching model is a feature matching model obtained by training with the lip motion information of a training user and M pieces of voice information as input, and with the matching degrees between the lip motion information of the training user and the M pieces of voice information as M labels; optionally, the M pieces of voice information include voice information that matches the lip motion information of the training user and (M-1) pieces of voice information that do not match the lip motion information of the training user.
  • the device further includes:
  • the determining unit is configured to determine user information of the target user, the user information including one or more of character attribute information, facial expression information corresponding to the voice information to be recognized, and environmental information corresponding to the voice information to be recognized.
  • the control unit is configured to generate a control instruction matching the user information based on the user information.
  • the first extraction unit is specifically configured to:
  • the audio data of different frequency spectrums in the audio data are recognized, and the audio data of the target frequency spectrum is recognized as the voice information to be recognized.
  • the second extraction unit is specifically configured to:
  • an embodiment of the present invention provides a neural network training device, which may include:
  • the acquiring unit is configured to acquire training samples, the training samples including lip movement information of the training user and M pieces of voice information, the M pieces of voice information including voice information matching the lip movement information of the training user, and (M-1) pieces of voice information that do not match the lip movement information of the training user;
  • the training unit is configured to use the lip motion information of the training user and the M pieces of voice information as training input, and use the matching degrees between the lip motion information of the training user and the M pieces of voice information as M labels, to train the initialized neural network to obtain the target feature matching model.
  • the lip motion information of the training user includes the lip motion image sequence of the training user; optionally, the M pieces of voice information include a voice waveform sequence that matches the lip motion image sequence of the training user.
  • the training unit is specifically used for:
  • the calculated matching degrees between the M pieces of voice information and the lip motion information of the training user are compared with the M labels, and the initialized neural network is trained to obtain the target feature matching model.
  • the target feature matching model includes a first model, a second model, and a third model; the training unit is specifically used for:
  • each of the M voice features is a K-dimensional voice feature, and K is an integer greater than 0;
  • the calculated matching degrees between the M voice features and the image sequence features of the training user are compared with the M labels, and the initialized neural network is trained to obtain the target feature matching model.
  • an embodiment of the present invention provides a voice matching method, which may include:
  • the voice information to be recognized includes the voice waveform sequence in the target time period, and the lip motion information of each user in the lip motion information of the N users includes an image sequence of the corresponding user's lip movement in the target time period, where N is an integer greater than 1;
  • the voice information to be recognized and the lip motion information of the N users are input into the target feature matching model to obtain the matching degrees between the lip motion information of the N users and the voice information to be recognized;
  • the user corresponding to the lip motion information of the user with the highest matching degree is determined as the target user to which the voice information to be recognized belongs.
  • the target feature matching model includes a first model, a second model, and a third model
  • the inputting of the voice information to be recognized and the lip motion information of the N users into the target feature matching model to obtain the matching degrees between the lip motion information of the N users and the voice information to be recognized includes:
  • each of the N image sequence features is a K-dimensional image sequence feature
  • the voice features and the N image sequence features are input into a third model to obtain the matching degrees between the N image sequence features and the voice features.
  • the target feature matching model is a feature matching model obtained by training with the lip motion information of a training user and M pieces of voice information as input, and with the matching degrees between the lip motion information of the training user and the M pieces of voice information as M labels.
  • the method further includes:
  • the user information includes one or more of character attribute information, facial expression information corresponding to the voice information to be recognized, and environmental information corresponding to the voice information to be recognized;
  • the method further includes: extracting lip motion information of N users from video data; further, the extracting of the lip motion information of N users from the video data includes:
  • the method further includes: extracting the voice information to be recognized from the audio data; further, the extracting the voice information to be recognized from the audio data includes:
  • the audio data of different frequency spectrums in the audio data are recognized, and the audio data of the target frequency spectrum is recognized as the voice information to be recognized.
  • an embodiment of the present invention provides a service device, which may include a processor; the processor is configured to:
  • the voice information to be recognized includes the voice waveform sequence in the target time period, and the lip motion information of each user in the lip motion information of the N users includes an image sequence of the corresponding user's lip movement in the target time period, where N is an integer greater than 1;
  • the voice information to be recognized and the lip motion information of the N users are input into the target feature matching model to obtain the matching degrees between the lip motion information of the N users and the voice information to be recognized;
  • the user corresponding to the lip motion information of the user with the highest matching degree is determined as the target user to which the voice information to be recognized belongs.
  • the target feature matching model includes a first model, a second model, and a third model; the processor is specifically configured to:
  • each of the N image sequence features is a K-dimensional image sequence feature
  • the voice features and the N image sequence features are input into a third model to obtain the matching degrees between the N image sequence features and the voice features.
  • the target feature matching model is a feature matching model obtained by training with the lip motion information of a training user and M pieces of voice information as input, and with the matching degrees between the lip motion information of the training user and the M pieces of voice information as M labels.
  • the server further includes a processor; the processor is configured to:
  • the user information includes one or more of character attribute information, facial expression information corresponding to the voice information to be recognized, and environmental information corresponding to the voice information to be recognized;
  • the server further includes a processor; the processor is further configured to:
  • the server further includes a processor; the processor is further configured to:
  • the audio data of different frequency spectrums in the audio data are recognized, and the audio data of the target frequency spectrum is recognized as the voice information to be recognized.
  • the processor is a neural network processor; optionally, the function performed by the processor may be completed by a plurality of different processors in cooperation with each other, that is, the processor may be composed of multiple processors with different functions.
  • an embodiment of the present application also provides a voice matching device, including a processor and a memory, where the memory is used to store a program and the processor executes the program stored in the memory; when the program stored in the memory is executed, the processor is caused to implement any method shown in the first aspect or any method shown in the seventh aspect.
  • an embodiment of the present application also provides a neural network training device, including a processor and a memory, where the memory is used to store a program and the processor executes the program stored in the memory; when the program stored in the memory is executed, the processor is caused to implement any one of the methods shown in the second aspect.
  • an embodiment of the present application also provides a computer-readable storage medium, where the computer-readable medium is used to store program code, and the program code includes instructions for executing any one of the methods described in the first, second, or seventh aspects.
  • the embodiments of the present application also provide a computer program product containing instructions.
  • when the computer program product is run on a computer, the computer is caused to execute any one of the methods of the aforementioned first, second, or seventh aspects.
  • in a thirteenth aspect, a chip is provided, which includes a processor and a data interface.
  • the processor reads, through the data interface, instructions stored in a memory, and executes any one of the methods of the first, second, or seventh aspects.
  • the chip may further include a memory in which instructions are stored, and the processor is configured to execute instructions stored on the memory.
  • the processor is configured to execute any one of the methods described in the first aspect, the second aspect, or the seventh aspect.
  • this application provides a chip system that includes a processor, configured to support a smart device in implementing the functions involved in the first aspect, or to support a voice matching device in implementing the functions involved in the first aspect or the seventh aspect.
  • the chip system further includes a memory, and the memory is used to store necessary program instructions and data for the smart device.
  • the chip system can be composed of chips, or include chips and other discrete devices.
  • the present application provides a chip system, which includes a processor, and is used to support a neural network training device to implement the functions involved in the second aspect.
  • the chip system further includes a memory, and the memory is used to store program instructions and data necessary for the training device of the neural network.
  • the chip system can be composed of chips, or include chips and other discrete devices.
  • an electronic device which includes any one of the voice matching devices in the fifth aspect.
  • an electronic device which includes any one of the neural network training devices in the sixth aspect.
  • a cloud server is provided, and the cloud server includes any one of the service devices in the eighth aspect.
  • in a nineteenth aspect, a server is provided, which includes any one of the service devices in the eighth aspect.
  • FIG. 1 is a schematic diagram of a scene of interaction between a robot and multiple people according to an embodiment of the present invention.
  • Fig. 2 is a schematic diagram of a scene of interaction between a smart speaker and multiple persons provided by an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of a scenario where a smart phone/smart watch interacts with multiple people according to an embodiment of the present invention.
  • FIG. 4 provides a system architecture 100 according to an embodiment of the present invention.
  • Fig. 5 is a schematic diagram of a convolutional neural network provided by an embodiment of the present invention.
  • Fig. 6 is a hardware structure diagram of a neural network processor provided by an embodiment of the present invention.
  • Fig. 7A is a schematic flowchart of a neural network training method provided by an embodiment of the present invention.
  • FIG. 7B is a schematic diagram of training sample sampling provided by an embodiment of the present invention.
  • FIG. 8A is a schematic flowchart of a voice matching method provided by an embodiment of the present invention.
  • Fig. 8B is an example diagram of a sound waveform provided by an embodiment of the present invention.
  • FIG. 8C is a schematic diagram of using Mel frequency cepstrum coefficients for speech feature extraction according to an embodiment of the present invention.
  • FIG. 8D is a schematic diagram of a scene of interaction between a robot and family members according to an embodiment of the present invention.
  • FIG. 9A is a schematic flowchart of another voice matching method provided by an embodiment of the present invention.
  • FIG. 9B is a schematic structural diagram of a first model and a second model provided by an embodiment of the present invention.
  • FIG. 9C is a schematic structural diagram of a third model provided by an embodiment of the present invention.
  • FIG. 9D is a schematic diagram of a robot interaction scene provided by an embodiment of the present invention.
  • FIG. 9E is a schematic diagram of a functional module and the overall process provided by an embodiment of the present invention.
  • FIG. 9F is a structural diagram of a voice matching system provided by an embodiment of the present invention.
  • Fig. 10A is a schematic structural diagram of a smart device provided by an embodiment of the present invention.
  • FIG. 10B is a schematic structural diagram of another smart device provided by an embodiment of the present invention.
  • FIG. 11 is a schematic structural diagram of a voice matching device provided by an embodiment of the present invention.
  • Fig. 12 is a schematic structural diagram of a neural network training device provided by an embodiment of the present invention.
  • Fig. 13 is a schematic structural diagram of another smart device provided by an embodiment of the present invention.
  • FIG. 14 is a schematic structural diagram of a service device provided by an embodiment of the present invention.
  • Figure 15 is another voice matching system provided by an embodiment of the present invention.
  • the term "component" used in this specification is used to denote a computer-related entity: hardware, firmware, a combination of hardware and software, software, or software in execution.
  • the component may be, but is not limited to, a process, a processor, an object, an executable file, an execution thread, a program, and/or a computer running on a processor.
  • the application running on the computing device and the computing device can be components.
  • One or more components may reside in processes and/or threads of execution, and components may be located on one computer and/or distributed among two or more computers.
  • these components can be executed from various computer readable media having various data structures stored thereon.
  • a component may communicate through local and/or remote processes based on a signal having one or more data packets (for example, data from two components interacting with another component in a local system or distributed system, and/or across a network, such as the Internet, interacting with other systems through signals).
  • Bitmap: also known as a raster or dot-matrix image, it is an image represented by a pixel array. According to the bit depth, bitmaps can be divided into 1-, 4-, 8-, 16-, 24-, and 32-bit images. The more information bits used by each pixel, the more colors are available, the more realistic the color representation, and the greater the amount of corresponding data. For example, a bitmap with a bit depth of 1 has only two possible pixel values (black and white), so it is also called a binary bitmap. An image with a bit depth of 8 has 2^8 (that is, 256) possible values; a grayscale-mode image with a bit depth of 8 has 256 possible gray values. An RGB image is composed of three color channels.
  • RGB image: each channel in an 8-bit-per-channel (bpc) RGB image has 256 possible values, which means that the image has more than 16 million possible color values.
  • a bitmap represented by the 24 combined RGB data bits is usually called a true-color bitmap.
  • Speech recognition technology also known as automatic speech recognition, whose goal is to convert the vocabulary content of human speech into computer-readable input, such as keystrokes, binary codes, or character sequences.
  • Voiceprint is a sound wave spectrum that carries speech information displayed by an electroacoustic instrument. It is a biological feature composed of more than a hundred characteristic dimensions such as wavelength, frequency, and intensity. Voiceprint recognition achieves the purpose of distinguishing unknown sounds by analyzing the characteristics of one or more speech signals. Simply put, it is a technology to distinguish whether a certain sentence is spoken by a certain person. Through the voiceprint, the identity of the speaker can be determined and targeted answers can be given.
  • Mel-frequency cepstrum: a linear transformation of the log energy spectrum based on the non-linear mel scale of sound frequency. Mel-frequency cepstral coefficients (MFCCs) are widely used in speech recognition.
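  • as an illustration, MFCC features can be computed from a waveform with a library such as librosa; the frame parameters below are common defaults, not values taken from this publication.

```python
import librosa

def mfcc_features(waveform, sample_rate=16000, n_mfcc=13):
    """Return a (frames, n_mfcc) MFCC matrix for a mono waveform."""
    mfcc = librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=n_mfcc)
    return mfcc.T
```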
  • Multi-class cross-entropy loss: cross-entropy describes the distance between two probability distributions; the smaller the cross-entropy, the closer the two distributions are.
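  • a small numeric example of cross-entropy between a one-hot target distribution and two predicted distributions (the values are arbitrary):

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_x p(x) * log(q(x))."""
    return float(-np.sum(p * np.log(q + eps)))

p = np.array([0.0, 1.0, 0.0])          # target distribution (one-hot)
q_good = np.array([0.1, 0.8, 0.1])     # confident, correct prediction
q_bad = np.array([0.6, 0.2, 0.2])      # poor prediction

print(cross_entropy(p, q_good))        # ~0.223 -- closer to the target
print(cross_entropy(p, q_bad))         # ~1.609 -- farther from the target
```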
  • a neural network can be composed of neural units. A neural unit can refer to an arithmetic unit that takes x_s and an intercept of 1 as inputs, and the output of the arithmetic unit can be h_{W,b}(x) = f(W^T x) = f(∑_{s=1}^{n} W_s · x_s + b), where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, b is the bias of the neural unit, and f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal of the neural unit into an output signal.
  • the output signal of the activation function can be used as the input of the next convolutional layer.
  • the activation function can be a sigmoid function.
  • a neural network is a network formed by connecting many of the above-mentioned single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the characteristics of the local receptive field.
  • the local receptive field can be a region composed of several neural units.
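  • a minimal numeric illustration of a single neural unit as defined above, using a sigmoid activation (the input values and weights are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neural_unit(x, W, b):
    """Output of a single neural unit: f(sum_s W_s * x_s + b) with a sigmoid activation f."""
    return sigmoid(np.dot(W, x) + b)

x = np.array([0.5, -1.0, 2.0])         # inputs x_1..x_n
W = np.array([0.2, 0.4, -0.1])         # weights W_s
print(neural_unit(x, W, b=0.3))        # a value in (0, 1), usable as input to the next layer
```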
  • Deep neural network also known as multi-layer neural network
  • a DNN can be understood as a neural network with many hidden layers; there is no special metric for "many" here. According to the positions of the different layers, the layers of a DNN can be divided into three categories: input layer, hidden layers, and output layer. Generally, the first layer is the input layer, the last layer is the output layer, and all layers in between are hidden layers. The layers are fully connected, that is, any neuron in the i-th layer is connected to every neuron in the (i+1)-th layer.
  • although a DNN looks complicated, the work of each individual layer is not complicated.
  • the coefficient from the k-th neuron of layer L-1 to the j-th neuron of layer L is defined as W^L_{jk}. It should be noted that the input layer has no W parameters.
  • more hidden layers make the network more capable of portraying complex situations in the real world.
  • a model with more parameters is more complex and has a greater "capacity", which means that it can complete more complex learning tasks.
  • Training the deep neural network is also the process of learning the weight matrix, and its ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (the weight matrix formed by the vector W of many layers).
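  • a small illustration of the fully connected layer structure and weight notation described above (layer sizes and the tanh activation are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [4, 8, 8, 2]             # input layer, two hidden layers, output layer
# weights[L][j, k] connects the k-th neuron of layer L to the j-th neuron of layer L+1;
# the input layer itself has no W parameters.
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def forward(x):
    a = x
    for W, b in zip(weights, biases):
        a = np.tanh(W @ a + b)         # each layer is fully connected to the previous one
    return a

print(forward(rng.normal(size=4)))     # 2-dimensional output of the network
```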
  • Convolutional neural network (CNN, convolutional neuron network) is a deep neural network with a convolutional structure.
  • the convolutional neural network contains a feature extractor composed of a convolutional layer and a sub-sampling layer.
  • the feature extractor can be seen as a filter, and the convolution process can be seen as using a trainable filter to convolve with an input image or convolution feature map.
  • the convolutional layer refers to the neuron layer that performs convolution processing on the input signal in the convolutional neural network.
  • a neuron can be connected to only part of the neighboring neurons.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some rectangularly arranged neural units.
  • Neural units in the same feature plane share weights, and the shared weights here are the convolution kernels. Sharing weight can be understood as the way of extracting image information has nothing to do with location. The underlying principle is that the statistical information of a certain part of the image is the same as that of other parts. This means that the image information learned in one part can also be used in another part. Therefore, the image information obtained by the same learning can be used for all positions on the image. In the same convolution layer, multiple convolution kernels can be used to extract different image information. Generally, the more the number of convolution kernels, the richer the image information reflected by the convolution operation.
  • the convolution kernel can be initialized in the form of a matrix of random size.
  • the convolution kernel can obtain reasonable weights through learning.
  • the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, and at the same time reduce the risk of overfitting.
  • Convolutional neural networks can use backpropagation (BP) algorithms to modify the size of the parameters in the convolutional neural network during the training process, so that the reconstruction error loss of the convolutional neural network becomes smaller and smaller. Specifically, forwarding the input signal to the output will cause error loss, and the parameters in the convolutional neural network are updated by backpropagating the error loss information, so that the error loss can be converged.
  • the backpropagation algorithm is a backpropagation motion dominated by error loss, and aims to obtain the optimal parameters of the convolutional neural network, such as the weight matrix.
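  • a brief PyTorch illustration of weight sharing and backpropagation in a convolutional layer (the layer sizes and loss are arbitrary choices for the example):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# The same 3x3 kernels (shared weights) are applied at every spatial position of the
# input, and backpropagation of the error loss updates those shared kernels.
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
image = torch.randn(1, 3, 64, 64)       # one RGB input image
target = torch.randn(1, 8, 64, 64)      # arbitrary regression target for the example

output = conv(image)                     # forward pass
loss = F.mse_loss(output, target)        # error loss
loss.backward()                          # backpropagation: gradients w.r.t. the shared kernels
print(conv.weight.grad.shape)            # torch.Size([8, 3, 3, 3]) -- one kernel set for all positions
```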
  • the pixel value of the image can be a red-green-blue (RGB) color value, and the pixel value can be a long integer representing the color.
  • the pixel value is 256*Red + 100*Green + 76*Blue, where Blue represents the blue component, Green represents the green component, and Red represents the red component. In each color component, the smaller the value, the lower the brightness, and the larger the value, the higher the brightness.
  • the pixel values can be grayscale values.
  • voice matching and recognition in a multi-person scene can be achieved by a variety of technical solutions; the following exemplarily lists two commonly used solutions. Among them,
  • Voiceprint is a sound wave spectrum that carries verbal information displayed by an electroacoustic instrument. It is a biological feature composed of more than a hundred characteristic dimensions such as wavelength, frequency, and intensity. Voiceprint recognition achieves the purpose of distinguishing unknown sounds by analyzing the characteristics of one or more speech signals. Simply put, it is a technology to distinguish whether a certain sentence is spoken by a certain person. Through the voiceprint, the identity of the speaker can be determined and targeted answers can be given. It is mainly divided into two phases: registration phase and verification phase.
  • in the registration stage, a corresponding voiceprint model is established according to the voiceprint characteristics of the speaker's voice;
  • in the verification stage, the speaker's voice is received, its voiceprint features are extracted and matched against the registered voiceprint model; if the matching succeeds, the speaker is verified to be the originally registered speaker.
  • voiceprint recognition has some shortcomings.
  • for example, the voice of the same person is variable and easily affected by physical condition, age, and emotion; different microphones and channels have an impact on recognition performance; and environmental noise interferes with recognition;
  • Another example is that in the case of mixed speakers, it is not easy to extract the voiceprint features of a person.
  • Sound source localization technology is a technology that uses acoustic and electronic devices to receive target sound field information to determine the location of the target sound source.
  • the sound source localization of the microphone array refers to the use of the microphone array to pick up the sound source signal.
  • one or more sound source planes or spatial coordinates are determined in the spatial domain to obtain the sound source position;
  • a further step is to control the beam of the microphone array to aim at the speaker.
  • the movement of the microphone array brought about by the robot movement is the main difference between the robot hearing and the traditional sound source localization technology.
  • a moving microphone array faces a constantly changing acoustic environment, which requires the sound source localization system to have high real-time performance.
  • Most sound source localization systems now have a large number of sensors, which leads to high computational complexity of the algorithm. A small number of microphones and low-complexity positioning algorithms need to be further explored.
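  • as background illustration only, the following sketch estimates the time difference of arrival between two microphone channels with GCC-PHAT, one common building block of microphone-array sound source localization; it is not part of the method of this publication.

```python
import numpy as np

def gcc_phat_delay(sig, ref, fs):
    """Estimate the time delay (in seconds) between two microphone channels with GCC-PHAT."""
    n = len(sig) + len(ref)
    R = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    R /= np.abs(R) + 1e-12                      # PHAT weighting: keep phase, discard magnitude
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs

fs = 16000
ref = np.random.default_rng(0).standard_normal(fs)
sig = np.roll(ref, 8)                           # simulate an 8-sample later arrival
print(gcc_phat_delay(sig, ref, fs))             # approximately 8 / 16000 = 0.0005 s
```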
  • the technical problems to be solved by this application include the following aspects: how to accurately distinguish between different speakers when robots or other smart devices interact with multiple users face-to-face (multi-person teaching, games, family daily life, etc.), so as to achieve targeted and personalized interaction.
  • the voice matching method provided by the embodiments of the present application can be applied to human-computer interaction scenarios of smart devices such as smart robots, smart speakers, smart wearable devices, and smart phones.
  • the following exemplarily enumerate the human-computer interaction scenarios applied by the voice matching method in this application, which may include the following three scenarios.
  • Scenario 1 Human-computer interaction through robots:
  • Figure 1 is a schematic diagram of a robot interacting with multiple people according to an embodiment of the present invention.
  • the application scenario includes an intelligent robot and multiple users (in Figure 1, a group of friends, user A, user B, user C, user D, and user E, attending a party is taken as an example).
  • suppose that user C in the group of friends wants to ask the smart robot to help bring a glass of juice; user C then sends the voice request "Xiaozhi Xiaozhi, please help me get a glass of juice, thank you."
  • the smart robot Xiaozhi first needs to collect the voice data and video data in the current party scene.
  • the voice data contains the voice request from user C above "Xiaozhi Xiaozhi, please help me get a glass of juice, thank you",
  • the video data contains the lip motion information of all users during the same time period as the voice request; based on the voice information to be recognized and the lip motion information of the multiple users, the robot processes and analyzes the data and determines which of user A, user B, user C, user D, and user E issued the voice request.
  • the intelligent robot "Xiaozhi" controls itself according to the judgment result to send the designated juice or the juice selected according to the judgment to the user In front of user C.
  • Scenario 2 Human-computer interaction through a smart speaker:
  • Figure 2 is a schematic diagram of a smart speaker interacting with multiple people according to an embodiment of the present invention.
  • the application scenario includes a smart speaker and multiple users (in Figure 2, a group of children, child A, child B, child C, child D, child E, and child F, playing games on a playground is taken as an example). Suppose that child B wants to ask the smart speaker to play a song he likes to liven up the atmosphere; child B then sends the voice request "Xiaozhi Xiaozhi, please play the "Little Sun" I heard last time."
  • the voice data contains the voice request from the above-mentioned child B, "Xiaozhi Xiaozhi, please play the "Little Sun" I heard last time";
  • the video data contains the lip motion information of all the children during the same time period as the voice request; based on the above voice information to be recognized and the lip motion information of the multiple children, the smart speaker processes and analyzes the data and determines which of child A, child B, child C, child D, child E, and child F made the above voice request.
  • the smart speaker "Xiao Zhi" searches for the child according to the judgment result.
  • B’s play record of "Little Sun", and control the playback of the song "Little Sun”.
  • Scenario 3 Human-computer interaction of smart phone/smart watch:
  • FIG. 3 is a schematic diagram of a smart phone/smart watch interacting with multiple people according to an embodiment of the present invention.
•   the application scenario includes a smart phone/smart watch and multiple users (in FIG. 3, classmates, classmate A, classmate B, and classmate C, are eating together as an example).
•   Suppose classmate A, classmate B, and classmate C want to order a meal by voice through a smart phone or smart watch. They can place the order through the smart phone or smart watch of one of them, classmate B (for example, because there may be an ordering discount, and it is convenient to place a single order), but all three need to give voice ordering instructions to classmate B's smart phone or smart watch. For example, classmate B says to his own smart phone: "Xiaozhi Xiaozhi, I want to order a sweet and sour pork ribs from restaurant XX." Classmate A says to classmate B's smart phone: "Xiaozhi Xiaozhi, I want to order a pickled fish from restaurant XXXX."
  • the smart phone first needs to collect the voice data and video data of the current table sharing scene.
•   the voice data includes the voice ordering instructions issued by the above-mentioned classmate A, classmate B, and classmate C, respectively, and
  • the video data contains the lip movement information of all three students during the same time period when the three students issued the above voice commands.
•   the smart phone then, based on the above voice information to be recognized (the three ordering commands) and the lip motion information of the three classmates, determines which classmate issued each ordering command.
  • the application scenarios in Figs. 1, 2 and 3 are only a few exemplary implementations in the embodiment of the present invention, and the application scenarios in the embodiment of the present invention include but are not limited to the above application scenarios.
•   the voice matching method in this application can also be applied to scenarios such as the interaction between a smart computer and a multi-person meeting, the interaction between a smart computer and multiple players in a game, or the interaction between a smart RV and multiple people; other scenarios and examples will not be listed or repeated one by one.
•   Any neural network training method provided in this application involves the fused processing of computer hearing and vision. It can be specifically applied to data processing methods such as data training, machine learning, and deep learning, which perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, and so on on training data (such as the lip motion information of the training user and the M pieces of voice information in this application), and finally obtain a trained target feature matching model.
•   Moreover, any voice matching method provided by this application can use the above trained target feature matching model: input data (such as the voice information to be recognized and the lip motion information of the N users in this application) is input into the trained target feature matching model to obtain output data (such as the matching degrees between the lip motion information of the N users and the voice information to be recognized in this application).
•   It should be noted that the neural network training method and the voice matching method provided by the embodiments of the present application are inventions based on the same concept, and can also be understood as two parts of a system, or as two stages of an overall process: a model training stage and a model application stage.
  • FIG. 4 is a system architecture 100 provided by an embodiment of the present invention.
  • the data collection device 160 is used to collect training data.
  • the data collection device 160 may include a microphone and a camera.
  • the training data (that is, the input data on the model training side) in the embodiment of the present invention may include: video sample data and voice sample data, which are respectively the lip movement information and M voice information of the training user in the embodiment of the present invention, where The M pieces of voice information may include voice information that matches the lip movement information of the training user.
•   For example, the video sample data is the lip motion image sequence of a training user when speaking "Today's weather is particularly good, where are we going to play?", while the voice sample data contains the voice waveform sequence of the above training user speaking "Today's weather is particularly good, where are we going to play?" (as the positive voice sample) and (M-1) other voice waveform sequences (as negative voice samples).
  • the above-mentioned video sample data and audio sample data may be collected by the data collection device 160 or downloaded from the cloud.
•   It should be noted that FIG. 4 shows only an exemplary architecture and does not constitute a limitation on this.
•   the data collection device 160 stores the training data in the database 130, and the training device 120 trains, based on the training data maintained in the database 130, to obtain the target feature matching model/rule 101 (the target feature matching model 101 here is, for example, the target feature matching model in the embodiment of the present invention, that is, a model obtained through the above-described training phase, which can be used as a neural network model for feature matching between speech and lip motion trajectories).
•   the target feature matching model/rule 101 can be used to implement any voice matching method provided by the embodiment of the present invention, that is, the audio data and video data acquired by the data collection device 160 are pre-processed and input into the target feature matching model/rule 101 to obtain the matching degrees/confidences between the image sequence features of the lip motions of multiple users and the voice feature to be recognized.
  • the target feature matching model/rule 101 in the embodiment of the present invention may specifically be a spatiotemporal convolutional network (STCNN).
  • the spatiotemporal convolutional network may be obtained by training a convolutional neural network.
  • the training data maintained in the database 130 may not all come from the collection of the data collection device 160, and may also be received from other devices.
  • the training device 120 does not necessarily perform the training of the target feature matching model/rule 101 based on the training data maintained by the database 130. It may also obtain training data from the cloud or other places for model training. The above description should not be used as Limitations of the embodiments of the present invention.
•   the target feature matching model/rule 101 obtained by training by the training device 120 can be called an audiovisual cross convolutional neural network (V&A Cross CNN) in the embodiment of the present invention.
  • the target feature matching model provided by the embodiment of the present invention may include: a first model, a second model, and a third model.
•   the first model is used for speech feature extraction;
•   the second model is used for the extraction of the image sequence features of the lip motions of multiple users (in this application, the N users);
•   the third model is used for the calculation of the matching degrees/confidences between the above voice feature and the image sequence features of the N users.
•   the first model, the second model, and the third model can all be convolutional neural networks; it can also be understood that the target feature matching model/rule 101 itself is a spatio-temporal convolutional neural network that contains multiple independent sub-networks, such as the first model, the second model, and the third model mentioned above.
•   the target feature matching model/rule 101 obtained by training by the training device 120 can be applied to different systems or devices, such as the execution device 110 shown in FIG. 4, which can be a terminal, such as a mobile phone, a tablet computer, a laptop, an augmented reality (AR)/virtual reality (VR) device, a smart wearable device, a smart robot, or a vehicle-mounted terminal, and can also be a server or a cloud.
  • the execution device 110 is configured with an I/O interface 112 for data interaction with external devices, and the user can use the client device 140 (the client device in this application may also include data collection devices such as microphones and cameras) Input data to the I/O interface 112.
•   the input data may include: the voice information to be recognized and the lip motion information of the N users, where the voice information to be recognized includes the voice waveform sequence in the target time period, and the lip motion information of each of the N users includes an image sequence of that user's lip motion in the target time period.
  • the input data here may be input by a user or provided by a related database, and it may be specifically different according to different application scenarios, which is not specifically limited in the embodiment of the present invention.
  • the client device 140 and the execution device 110 may be on the same device, and the data collection device 160, the database 130, and the training device 120 may also be on the same device as the execution device 110 and the client device 140.
•   the robot processes the collected audio data and video data through the client device 140 (including a microphone, a camera, and a processor) to extract the voice information to be recognized and the lip motion information of the N users.
•   the execution device 110 inside the robot can be used to further perform feature matching between the extracted voice information and the lip motion information, and finally output the result to the client device 140; the processor in the client device 140 then analyzes the result to obtain the target user, among the N users, to which the voice information to be recognized belongs.
•   when the equipment on the model training side (the data collection device 160, the database 130, and the training device 120) is also located inside the robot, the robot has the ability to perform model training or model update and optimization; in that case, the robot has both the functions of the model training side and the functions of the model application side. When that equipment is in the cloud, the robot side can be considered to have only the functions of the model application side.
  • the client device 140 and the execution device 110 may not be on the same device, that is, collecting audio data and video data, and extracting the voice information to be recognized and the lip motion information of N users may be performed by the client device 140 (for example, Smartphones, smart robots, etc.), and the process of feature matching between the voice information to be recognized and the lip motion information of N users can be executed by the execution device 110 (such as a cloud server, a server, etc.).
•   Alternatively, the collection of audio data and video data is performed by the client device 140, while the extraction of the voice information to be recognized and the lip motion information of the N users, as well as the feature matching between the voice information to be recognized and the lip motion information of the N users, are all completed by the execution device 110.
  • the user can manually set input data, and the manual setting can be operated through the interface provided by the I/O interface 112.
  • the client device 140 can automatically send input data to the I/O interface 112. If the client device 140 is required to automatically send the input data and the user's authorization is required, the user can set the corresponding authority in the client device 140.
  • the user can view the result output by the execution device 110 on the client device 140, and the specific presentation form may be a specific manner such as display, sound, and action.
  • the client device 140 can also be used as a data collection terminal (for example, a microphone, a camera) to collect the input data of the input I/O interface 112 and the output result of the output I/O interface 112 as new sample data as shown in the figure, and store it in Database 130.
•   Alternatively, without going through the client device 140, the I/O interface 112 may directly store the input data input to the I/O interface 112 and the output result of the I/O interface 112, as shown in the figure, as new sample data in the database 130.
  • the preprocessing module 113 is used to preprocess the input data (such as the voice data) received by the I/O interface 112.
•   For example, the preprocessing module 113 may be used to preprocess the voice data, for example, to extract the voice information to be recognized from the voice data.
•   the preprocessing module 114 is used to preprocess input data received by the I/O interface 112, such as the video data; for example, the preprocessing module 114 may be used to extract, from the video data, the lip motion information of the N users corresponding to the above voice information to be recognized.
•   When the execution device 110 preprocesses the input data, or when the calculation module 111 of the execution device 110 performs calculations and other related processing, the execution device 110 can call the data, code, and the like in the data storage system 150 for the corresponding processing.
  • the data, instructions, etc. obtained by corresponding processing may also be stored in the data storage system 150.
•   Finally, the I/O interface 112 returns the output result, such as the matching degrees between the lip motion information of the N users and the voice information to be recognized in the embodiment of the present invention, or the target user ID with the highest matching degree among them, to the client device 140; the client device 140 determines the user information of the target user based on the above matching degree and generates a control instruction matching that user information.
  • the training device 120 can generate corresponding target feature matching models/rules 101 based on different training data for different goals or tasks, and the corresponding target feature matching models/rules 101 can be used to achieve The above goals or the completion of the above tasks, so as to provide users with the desired results.
•   It should be noted that FIG. 4 is only a schematic diagram of a system architecture provided by an embodiment of the present invention, and the positional relationship between the devices, modules, etc. shown in the figure does not constitute any limitation. For example, in FIG. 4 the data storage system 150 is an external memory relative to the execution device 110; in other cases, the data storage system 150 may also be placed inside the execution device 110.
  • the convolutional neural network CNN is a deep neural network with a convolutional structure.
  • the network is a deep learning architecture.
  • the deep learning architecture refers to the use of machine learning algorithms to perform multiple levels of learning at different levels of abstraction.
  • CNN is a feed-forward artificial neural network. Each neuron in the feed-forward artificial neural network responds to overlapping regions in the input image.
  • FIG. 5 is a schematic diagram of a convolutional neural network provided by an embodiment of the present invention.
  • a convolutional neural network (CNN) 200 may include an input layer 210, a convolutional layer/pooling layer 220, and a neural network layer 230, where the pooling layer is optional.
•   the convolutional layer/pooling layer 220 may include layers 221-226 as shown in the example. In one example, layer 221 is a convolutional layer, layer 222 is a pooling layer, layer 223 is a convolutional layer, layer 224 is a pooling layer, layer 225 is a convolutional layer, and layer 226 is a pooling layer. In another example, layers 221 and 222 are convolutional layers, layer 223 is a pooling layer, layers 224 and 225 are convolutional layers, and layer 226 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
  • the convolutional layer 221 may include many convolution operators.
  • the convolution operator is also called a kernel. Its function in image processing is equivalent to a filter that extracts specific information from the input image matrix.
•   the convolution operator can be a weight matrix, and this weight matrix is usually predefined. In the process of performing convolution on the image, the weight matrix is usually slid across the input image in the horizontal direction one pixel at a time (or two pixels at a time, and so on, depending on the value of the stride), so as to complete the work of extracting specific features from the image.
  • the size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix and the depth dimension of the input image are the same.
•   the weight matrix extends over the entire depth of the input image. Therefore, convolution with a single weight matrix produces a convolution output with a single depth dimension; in most cases, however, a single weight matrix is not used, but multiple weight matrices of the same dimensions are applied, and the output of each weight matrix is stacked to form the depth dimension of the convolved image. Different weight matrices can be used to extract different features of the image: for example, one weight matrix is used to extract edge information of the image, another weight matrix to extract a specific color of the image, and yet another to blur unwanted noise in the image. The dimensions of these multiple weight matrices are the same, so the dimensions of the feature maps they extract are also the same, and the extracted feature maps of the same dimensions are then combined to form the output of the convolution operation.
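•   As an illustration of the multi-kernel convolution described above, the following minimal sketch (PyTorch is used here as an assumed framework, and all sizes are arbitrary examples rather than values from this application) shows several weight matrices being slid over a 3-channel input with a stride of 2, with their outputs stacked into the depth dimension of the convolved feature map:

```python
import torch
import torch.nn as nn

# A toy input "image": batch of 1, 3 channels (e.g. RGB), 32 x 32 pixels.
x = torch.randn(1, 3, 32, 32)

# 8 weight matrices (kernels), each spanning the full input depth of 3 channels,
# slid across the image with a stride of 2 pixels; the 8 single-depth outputs
# are stacked to form the depth dimension of the convolved image.
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=5, stride=2, padding=2)

y = conv(x)
print(y.shape)  # torch.Size([1, 8, 16, 16]): depth 8, spatial size halved by the stride
```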
  • weight values in these weight matrices need to be obtained through a lot of training in practical applications, and each weight matrix formed by the weight values obtained through training can extract information from the input image, thereby helping the convolutional neural network 200 to make correct predictions.
•   the initial convolutional layers (such as 221) often extract more general features, which can also be called low-level features; as the depth of the convolutional neural network increases, the features extracted by the later convolutional layers (such as 226) become more and more complex, for example features with high-level semantics, and features with higher-level semantics are more suitable for the problem to be solved.
•   a pooling layer can follow a single convolutional layer, or multiple convolutional layers can be followed by one or more pooling layers.
  • the pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to obtain an image with a smaller size.
  • the average pooling operator can calculate the pixel values in the image within a specific range to generate an average value.
  • the maximum pooling operator can take the pixel with the largest value within a specific range as the result of the maximum pooling.
  • the operators in the pooling layer should also be related to the image size.
  • the size of the image output after processing by the pooling layer can be smaller than the size of the image of the input pooling layer, and each pixel in the image output by the pooling layer represents the average value or the maximum value of the corresponding sub-region of the image input to the pooling layer.
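•   As a brief illustration of the two pooling operators just described (again using PyTorch as an assumed framework with arbitrary sizes), each operator below reduces the spatial size of a feature map, one keeping the maximum and the other the average of every 2x2 region:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 16, 16)           # feature map output by a convolutional layer

max_pool = nn.MaxPool2d(kernel_size=2)  # keeps the largest pixel value in each 2x2 region
avg_pool = nn.AvgPool2d(kernel_size=2)  # keeps the average pixel value in each 2x2 region

print(max_pool(x).shape, avg_pool(x).shape)  # both torch.Size([1, 8, 8, 8])
```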
•   After processing by the convolutional layer/pooling layer 220, the convolutional neural network 200 is not yet able to output the required output information, because, as mentioned above, the convolutional layer/pooling layer 220 only extracts features and reduces the number of parameters brought by the input image. To generate the final output information (the required class information or other related information), the convolutional neural network 200 needs to use the neural network layer 230 to generate one output, or a group of outputs, for the required classes. Therefore, the neural network layer 230 may include multiple hidden layers (231, 232 to 23n as shown in FIG. 5) and an output layer 240, and the parameters contained in the multiple hidden layers can be obtained by pre-training based on the relevant training data of a specific task type; for example, the task type can include image recognition, image classification, image super-resolution reconstruction, and so on.
•   After the multiple hidden layers in the neural network layer 230, the final layer of the entire convolutional neural network 200 is the output layer 240.
  • the output layer 240 has a loss function similar to the classification cross entropy, which is specifically used to calculate the prediction error.
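•   A minimal sketch of such an output layer loss, assuming PyTorch and a 10-class task purely for illustration (the class count and batch size are not taken from this application), computes the prediction error with a softmax-plus-cross-entropy loss and propagates it backwards:

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 10, requires_grad=True)  # outputs of the final fully connected layer (4 samples, 10 classes)
labels = torch.tensor([3, 1, 0, 7])              # ground-truth class indices

# CrossEntropyLoss combines a softmax with the categorical cross entropy, i.e. a loss
# of the kind the output layer uses to calculate the prediction error during training.
loss = nn.CrossEntropyLoss()(logits, labels)
loss.backward()                                  # the error is then propagated backwards to update weights
```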
  • the convolutional neural network 200 shown in FIG. 5 is only used as an example of a convolutional neural network.
•   the convolutional neural network may also exist in the form of other network models, for example with multiple convolutional layers/pooling layers in parallel, whose separately extracted features are all input to the neural network layer 230 for processing.
•   the normalization layer in this application can, in principle, be placed after or before any layer in the above CNN, taking the feature matrix output by the previous layer as its input, and its output can likewise serve as the input of any functional layer in the CNN.
  • the normalization layer is generally performed after the convolutional layer, and the feature matrix output by the previous convolutional layer is used as the input matrix.
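•   The placement of a normalization layer directly after a convolutional layer can be sketched as follows (PyTorch is an assumed framework, and batch normalization is only one possible choice; the application does not fix a specific type of normalization):

```python
import torch
import torch.nn as nn

# The feature matrix output by the convolutional layer is used as the input of the
# normalization layer, whose output can in turn feed any later functional layer of the CNN.
block = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),   # normalization layer placed after the convolutional layer
    nn.ReLU(),
)

y = block(torch.randn(1, 3, 32, 32))
print(y.shape)  # torch.Size([1, 8, 32, 32])
```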
  • Figure 6 is a hardware structure diagram of a neural network processor provided by an embodiment of the present invention, in which,
  • the neural network processor NPU 302 is mounted on the CPU (such as the Host CPU) 301 as a coprocessor, and the Host CPU 301 assigns tasks.
•   the CPU 301 may be located in the client device 140 and is used to extract the voice information to be recognized and the lip motion information of the N users from the voice data and video data; the NPU 302 may be located in the calculation module 111 and is used to perform feature extraction and feature matching on the voice information to be recognized and the lip motion information of the N users extracted by the CPU 301, and to send the matching result to the CPU 301 for further calculation and processing, which will not be described in detail here.
  • the foregoing CPU and NPU may be located in different devices, and different settings may be made according to the actual requirements of the product.
  • the NPU is located on the cloud server, and the CPU is located on the user equipment (such as smart phones, smart robots); or, both the CPU and NPU are located on the client equipment (such as smart phones, smart robots, etc.).
  • the core part of the NPU 302 is the arithmetic circuit 3023, and the controller 3024 controls the arithmetic circuit 3023 to extract matrix data from the memory and perform multiplication operations.
  • the arithmetic circuit 3023 includes multiple processing units (Process Engine, PE). In some implementations, the arithmetic circuit 3023 is a two-dimensional systolic array. The arithmetic circuit 3023 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 3023 is a general-purpose matrix processor.
  • the arithmetic circuit fetches the corresponding data of matrix B from the weight memory 3022 and caches it on each PE in the arithmetic circuit.
  • the arithmetic circuit fetches matrix A data and matrix B from the input memory 3021 to perform matrix operations, and the partial result or final result of the obtained matrix is stored in the accumulator 3028.
  • the unified memory 3026 is used to store input data and output data.
•   the weight data is transferred to the weight memory 3022 through the direct memory access controller (DMAC) 30212.
  • the input data is also transferred to the unified memory 3026 through the DMAC.
  • the BIU is the Bus Interface Unit, that is, the bus interface unit 1210, which is used for the interaction of the AXI bus with the DMAC and the instruction fetch buffer 3029.
  • the bus interface unit 1210 (Bus Interface Unit, BIU for short) is used for the instruction fetch memory 3029 to obtain instructions from the external memory, and is also used for the storage unit access controller 30212 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • the DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 3026 or to transfer the weight data to the weight memory 3022 or to transfer the input data to the input memory 3021.
  • the vector calculation unit 3027 has multiple arithmetic processing units, if necessary, further processing the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison and so on.
  • the vector calculation unit 3027 can store the processed output vector in the unified buffer 3026.
  • the vector calculation unit 3027 may apply a nonlinear function to the output of the arithmetic circuit 3023, such as a vector of accumulated values, to generate the activation value.
  • the vector calculation unit 3027 generates normalized values, combined values, or both.
  • the processed output vector can be used as an activation input to the arithmetic circuit 3023, for example for use in a subsequent layer in a neural network.
  • the instruction fetch buffer 3029 connected to the controller 3024 is used to store instructions used by the controller 3024;
  • the unified memory 3026, the input memory 3021, the weight memory 3022, and the fetch memory 3029 are all On-Chip memories.
  • the external memory is private to the NPU hardware architecture.
•   the processing related to the voice information and the lip motion information of the training user may all be performed by the relevant functional units in the neural network processor (NPU) 302 mentioned above, or may be implemented by the above processor 301 and the above neural network processor 302 in cooperation, which will not be detailed here.
  • the following describes the embodiments of the neural network training method and voice matching method provided in this application from the model training side and the model application side in combination with the above application scenarios, system architecture, convolutional neural network structure, and neural network processor structure. As well as specific analysis and resolution of the technical problems raised in this application.
•   FIG. 7A is a schematic flowchart of a neural network training method provided by an embodiment of the present invention. This method can be applied to the application scenarios and system architectures described in FIG. 1, FIG. 2 or FIG. 3, and can specifically be applied to the above training device 120 of FIG. 4. In the following, the description is made with reference to FIG. 7A, taking the execution subject as the training device 120 in FIG. 4 as an example. The method may include the following steps S701-S702.
  • S701 Obtain training samples, where the training samples include lip movement information of the trained user and M pieces of voice information.
•   For example, the M voice messages include the voice message "Hello, my name is Xiaofang, from Hunan, China, how about you?" as the positive voice sample, and other voice messages, such as "Hello, have you had dinner tonight?", "The weather is really good today, where did you go to play?" and "Help find a tourist route from Hunan to Beijing?", as negative samples.
  • the M pieces of voice information include voice information that matches the lip movement information of the training user and (M-1) pieces of voice information that do not match the lip movement information of the training user.
  • the lip motion information of the training user includes a lip motion image sequence of the training user
  • the M voice information includes a voice waveform sequence that matches the lip motion image sequence of the training user And (M-1) speech waveform sequences that do not match the lip motion image sequence of the training user.
•   For example, the above lip motion information is the lip motion image sequence (a lip-shape video) of the training user Xiaofang when speaking the voice message "Hello, my name is Xiaofang, from Hunan, China, how about you?", and the above M pieces of voice information include the voice waveform sequence of this positive voice sample and the voice waveform sequences of the (M-1) negative samples. It can be understood that the foregoing M pieces of voice information may also include multiple positive samples and multiple negative samples; that is, the numbers of positive and negative samples are not specifically limited, as long as both are included.
  • FIG. 7B is a schematic diagram of a training sample sampling provided by an embodiment of the present invention.
  • the initialized neural network model can be trained based on a self-supervised training method without additional labeling.
•   the sampling strategy of the training process is shown in Figure 7B: the solid rectangle shows the audio segment corresponding to the speaker's face above it (that is, the positive sample), and the dotted rectangles show audio segments that do not match (that is, negative samples; a negative sample is generated from the positive sample by applying an audio offset Δt).
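•   A minimal sketch of this sampling strategy is given below; the window length, offset range, and data layout are illustrative assumptions rather than values taken from this application:

```python
import random

def sample_training_pair(audio, video, t0, win, num_neg, max_offset):
    """Build one self-supervised training sample from synchronized audio/video streams.

    audio: sequence of waveform samples; video: sequence of lip-region frames (assumed
    here, for simplicity, to be indexed on the same time axis as the audio);
    t0: start index of the window; win: window length; num_neg: number of mismatched
    (negative) audio segments to draw; max_offset: largest audio offset delta_t.
    """
    lip_clip = video[t0:t0 + win]            # lip image sequence for the target time period
    positive = audio[t0:t0 + win]            # audio aligned with the speaker's lips (label 1)

    negatives = []
    for _ in range(num_neg):                 # negatives are generated from the positive by shifting
        delta_t = random.randint(1, max_offset) * random.choice([-1, 1])
        s = min(max(t0 + delta_t, 0), len(audio) - win)
        negatives.append(audio[s:s + win])   # misaligned audio segment (label 0)

    return lip_clip, [positive] + negatives  # M = 1 + num_neg voice segments in total
```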
•   S702 Use the lip motion information of the training user and the M pieces of voice information as training input, and use the matching degrees between the lip motion information of the training user and the M pieces of voice information as M labels, to train the initialized neural network and obtain the target feature matching model.
•   For example, the matching degree between the above training user's lip motion information and the positive voice sample corresponds to the label 1, while the matching degrees between that lip motion information and the other negative-sample voice messages, such as "Hello, have you had dinner tonight?", "The weather is really good today, where did you go to play?" and "Help find a tourist route from Hunan to Beijing?", correspond to the label 0.
•   the trained target feature matching model can then be used to determine the matching relationship between the voice information to be recognized and the lip motion information of multiple users, so as to implement any voice matching method in this application.
•   In a possible implementation, using the lip motion information of the training user and the M pieces of voice information as training input, and using the matching degrees between the lip motion information of the training user and the M pieces of voice information as M labels, to train the initialized neural network and obtain the target feature matching model includes: inputting the lip motion information of the training user and the M pieces of voice information into the initialized neural network, and calculating the matching degrees between the M pieces of voice information and the lip motion information of the training user; and comparing the calculated matching degrees with the M labels, and training the initialized neural network accordingly to obtain the target feature matching model.
•   In a possible implementation, the target feature matching model includes a first model, a second model, and a third model; inputting the lip motion information of the training user and the M pieces of voice information into the initialized neural network and calculating the matching degrees between the M pieces of voice information and the lip motion information of the training user includes: inputting the M pieces of voice information into the first model to obtain M voice features, each of the M voice features being a K-dimensional voice feature, where K is an integer greater than 0; inputting the lip motion information of the training user into the second model to obtain the image sequence feature of the training user, the image sequence feature of the training user being a K-dimensional image sequence feature; and inputting the M voice features and the image sequence feature of the training user into the third model, and calculating the matching degrees between the M voice features and the image sequence feature of the training user.
•   That is, the lip motion information of a certain training user, together with the matching voice information and multiple pieces of non-matching voice information, is used as the input of the initialized neural network, and the actual matching degrees between the above M pieces of voice information and the training user's lip motion information are used as the labels with which the above initialized neural network model is trained to obtain the target feature matching model.
•   For example, the matching degree corresponding to a perfect match is the label 1, and the matching degree corresponding to a non-match is the label 0; the closer the matching degrees between the training user's lip motion information and the M pieces of voice information calculated by the trained network are to the M labels, the closer the trained network is to the target feature matching model.
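•   The following sketch shows what one training step of S702 could look like, assuming PyTorch and treating first_model, second_model, and third_model as placeholder modules with the interfaces described above (the exact shapes and the use of cross entropy over the M candidates are illustrative assumptions):

```python
import torch
import torch.nn as nn

def train_step(first_model, second_model, third_model, optimizer, lip_clip, voices, match_index):
    """One illustrative training step: lip_clip is the training user's lip image sequence
    (1, C, T, H, W), voices holds the M voice segments, and match_index marks which of
    the M voices truly matches the lips (its label is 1, all others 0)."""
    audio_feats = first_model(voices)                    # (M, K) voice features
    video_feat = second_model(lip_clip)                  # (1, K) image sequence feature

    # The third model scores every (voice, lip) pair and returns M matching logits.
    logits = third_model(audio_feats, video_feat.expand_as(audio_feats))   # (M,)

    # The M labels reduce to the index of the single matching pair under cross entropy.
    target = torch.tensor([match_index])
    loss = nn.CrossEntropyLoss()(logits.unsqueeze(0), target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```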
  • Figure 8A is a schematic flowchart of a voice matching method provided by an embodiment of the present invention.
  • the method can be applied to the application scenarios and system architectures described in Figure 1, Figure 2 or Figure 3 above, and can be specifically Applied to the client device 140 and the execution device 110 of FIG. 4, it can be understood that the client device 140 and the execution device 110 can be on the same physical device, such as a smart robot, a smart phone, a smart terminal, a smart wearable device, etc. .
  • the following describes an example in which the execution subject is a smart device including the client device 140 and the execution device 110 described above with reference to FIG. 8A.
  • the method may include the following steps S801-S805.
  • Step S801 Acquire audio data and video data.
•   Specifically, the smart device acquires voice data through a microphone and video data through a camera, that is, the original audio data and video data within a certain period of time (the audio data source and the video data source).
  • the audio data and the video data are collected for the same time period in the same scene, that is, the audio data is the audio data corresponding to the video data.
  • the robot obtains voice data in a certain time period in a certain scene collected by a microphone through the processor, and collects video data in this time period in the scene through a camera.
  • Step S802 Extract the voice information to be recognized from the audio data.
  • the voice information to be recognized includes a voice waveform sequence in the target time period. That is, in the embodiment of the present invention, the voice information to be recognized includes a voice waveform sequence in a specific time period.
•   Optionally, the audio data is in the uncompressed wav format; wav is one of the most common sound file formats, a standard digital audio file format that can record various mono or stereo sound information and can ensure that the sound is not distorted.
•   the quality of the sound restored from a wav file depends on the sampling size of the sound card: the higher the sampling frequency, the better the sound quality, but also the greater the overhead and the larger the wav file, as shown in FIG. 8B, which is an example diagram of a sound waveform provided by an embodiment of the present invention.
•   If the audio data contains one piece of voice information spoken by a certain user, that piece of voice information is the voice information to be recognized above; if the audio data includes multiple pieces of voice information spoken by a certain user and the smart device only needs to recognize one particular piece of voice information, that piece of voice information is the voice information to be recognized.
  • the smart device extracts audio features from the audio data acquired by the microphone array in S801.
  • FIG. 8C is a method of using Mel frequency cepstral coefficients for voice feature extraction according to an embodiment of the present invention.
•   In a possible implementation, extracting the voice information to be recognized from the audio data includes: recognizing, based on a spectrum recognition algorithm, audio data of different frequency spectrums in the audio data, and recognizing the audio data of the target frequency spectrum as the voice information to be recognized. Since the frequency spectrums of the sounds emitted by different users generally differ, the embodiment of the present invention first separates the audio data of different frequency spectrums from the audio data, and then recognizes the audio data of the target frequency spectrum as the voice information to be recognized, thereby realizing the function of extracting the voice information to be recognized from the audio data.
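•   A possible audio front end, using the librosa library purely as an illustrative assumption (the application mentions wav input and MFCC-based feature extraction but does not prescribe a toolkit), could look like this:

```python
import librosa  # assumed third-party library, not named in this application

def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
    """Load an uncompressed .wav file and compute MFCC features for the voice
    information to be recognized (cf. FIG. 8C); sr and n_mfcc are assumed values."""
    waveform, sr = librosa.load(wav_path, sr=sr)                  # voice waveform sequence
    mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=n_mfcc)
    return mfcc                                                   # shape: (n_mfcc, number of frames)
```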
  • Step S803 Extract lip motion information of N users from the video data.
  • the lip motion information of each user in the lip motion information of the N users includes an image sequence of the corresponding user's lip motion in the target time period, and N is an integer greater than 1. That is, the lip video of each user extracted from the original video data, that is, the continuous lip motion image sequence, contains the continuous lip-shape change characteristics of the corresponding user.
  • the users may include humans, or other robots, biochemical people, toy pets, real pets, etc. that can send out voice messages.
  • the format of each frame of image in the video data collected by the camera is a 24-bit BMP bitmap.
•   the BMP image file (Bitmap-File) format is the image file storage format adopted by Windows; a 24-bit image uses 3 bytes to store the color value of each pixel, each byte representing one color, arranged as red (R), green (G), and blue (B), and the RGB color image is then converted into a grayscale image.
•   the smart device obtains at least one face region from the video data collected by the above camera based on a face recognition algorithm, and further assigns a face ID to each face region (taking the scene described in Figure 8D as an example, where Figure 8D is a schematic diagram of a scene of interaction between a robot and family members according to an embodiment of the present invention, 6 face IDs can be extracted from the obtained video), and then extracts the video sequence stream of the mouth region for each face.
•   Frame rate = number of frames (Frames) / time (Time), in units of frames per second (f/s, fps).
  • Nine consecutive image frames form a 0.3 second video stream.
  • Each channel is a 60 ⁇ 100 grayscale image of the oral cavity area (2d spatial feature). In this way, the N image sequences corresponding to the lip motion within 0.3s are used as the input of the video feature, and 0.3s is the target time period.
  • specifically how to extract the lip motion information of N users from the video data includes: based on a face recognition algorithm, recognizing the N face regions in the video data, and extracting the Lip motion videos in each face region in the N face regions; determining the lip motion information of the N users based on the lip motion videos in each face region.
•   That is, the face regions are first identified from the video data, the lip motion video in each face region is then extracted, and the lip motion information of the N users, i.e. the image sequences corresponding to the users' lip motions, is determined from these lip motion videos.
  • the smart device acquires at least one face region from the video data acquired by the camera based on the face recognition algorithm, and further extracts the video sequence stream of the mouth region using each face region as a unit.
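•   A simplified sketch of the mouth-region extraction for a single face is shown below, using OpenCV's Haar cascade face detector as an assumed implementation (the application does not prescribe a particular face recognition algorithm, and cropping the lower third of the face box as the mouth area is a simplifying assumption); the 9-frame, 60x100 grayscale sizes follow the example in the embodiment:

```python
import cv2          # assumed library for face detection and image processing
import numpy as np

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def lip_sequence(frames):
    """Extract one user's mouth-region image sequence (up to 9 frames of 60x100
    grayscale) from consecutive video frames; per-face-ID tracking of N users
    would repeat this for every detected face region."""
    clips = []
    for frame in frames[:9]:                              # 9 consecutive frames (~0.3 s of video)
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_detector.detectMultiScale(gray, 1.1, 5)
        if len(faces) == 0:
            continue
        x, y, w, h = faces[0]                             # one detected face region
        mouth = gray[y + 2 * h // 3: y + h, x: x + w]     # lower part of the face as the mouth area
        clips.append(cv2.resize(mouth, (100, 60)))        # 60 x 100 grayscale mouth image
    return np.stack(clips) if clips else None             # (T, 60, 100) lip motion image sequence
```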
  • Step S804 Input the voice information to be recognized and the lip motion information of the N users into the target feature matching model to obtain the lip motion information of the N users and the voice information to be recognized respectively The degree of match between.
  • the voice waveform sequence in the target time period and the image sequence of each of the N users' lip movements in the target time period are respectively used as the input of the audio feature and the input of the video feature, and input to the target feature matching model.
  • Step S805 Determine the user corresponding to the lip motion information of the user with the highest matching degree as the target user to which the voice information to be recognized belongs.
•   That is, the above voice information to be recognized is matched against the lip motion information of the above N users (that is, the mouth shapes of the N users) to find the mouth shape, i.e. the lip motion information, most likely to have produced the above voice information to be recognized.
•   By implementing the embodiment of the present invention, audio data and video data can be collected, and the voice information in the audio data can be matched against the lip motion information in the video data to determine the target user to which the voice information to be recognized belongs. That is, in a multi-person scene, the voice feature is matched with the lip motion features of multiple users to identify the specific user who issued a certain piece of voice information to be recognized, so that further control or operations can be performed based on the recognition result.
•   the embodiments of the invention do not rely on voiceprint recognition (a voiceprint is easily affected by physical condition, age, emotion, etc.) and are not subject to environmental interference (such as noise in the environment), so they have strong anti-interference ability and high recognition efficiency and accuracy.
•   the voice information to be recognized includes the voice waveform sequence in a specific time period, and the lip motion information of the N users includes the lip motion image sequences (that is, the lip motion videos) of multiple users in the same scene during that time period, which facilitates subsequent feature extraction and feature matching.
•   the voice information to be recognized and the lip motion information of the N users are used as the input of the target feature matching model, the matching degrees between the lip motion information of the N users and the voice information to be recognized are used as the output of the target feature matching model, and the target user to which the voice information to be recognized belongs is then determined according to the matching degrees.
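•   Reduced to code, steps S804-S805 amount to scoring each user's lip sequence against the voice and taking the best match; the sketch below assumes a trained model object with the input/output interface described above:

```python
import torch

def pick_target_user(model, voice_waveform, lip_sequences, user_ids):
    """Return the user whose lip motion best matches the voice to be recognized.
    model, voice_waveform, lip_sequences and user_ids are assumed placeholders."""
    scores = model(voice_waveform, lip_sequences)   # matching degree per user, shape (N,)
    best = torch.argmax(scores).item()              # index of the highest matching degree
    return user_ids[best]                           # target user to which the voice belongs
```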
  • the target feature matching model is a neural network model.
  • Figure 9A is a schematic flow chart of another voice matching method provided by an embodiment of the present invention.
  • This method can be applied to the application scenarios and system architectures described in Figure 1, Figure 2 or Figure 3, as well as specific It can be applied to the client device 140 and the execution device 110 of FIG. 4 above. It is understood that the client device 140 and the execution device 110 can be on the same physical device, such as a smart robot, a smart phone, a smart terminal, or a smart wearable device. Wait.
  • the following describes an example in which the execution subject is a smart device including the client device 140 and the execution device 110 with reference to FIG. 9A.
  • the method may include the following steps S901-S905. Optionally, it may also include step S906-step S907.
  • Step S901 Acquire audio data and video data, where the audio data and the video data are collected for the same scene.
  • Step S902 Extract the voice information to be recognized from the audio data.
  • Step S903 Extract lip motion information of N users from the video data, where N is an integer greater than 1;
  • step S901-step S903 For the functions of step S901-step S903, reference may be made to the related description of step S801-step S803 above, which will not be repeated here.
  • Step S904 Input the voice information to be recognized and the lip motion information of the N users into the target feature matching model to obtain the lip motion information of the N users and the voice information to be recognized respectively The degree of match between.
  • two neural network encoders can be used to extract features of input sequences of different modalities layer by layer to obtain high-level feature expressions.
  • the target feature matching model includes a first model, a second model, and a third model; the voice information to be recognized and the lip motion information of the N users are input into the target feature matching model , Obtaining the matching degree between the lip motion information of the N users and the voice information to be recognized, including the following steps S904-A to S904-C:
•   Step S904-A Input the voice information to be recognized into the first model to obtain a voice feature, where the voice feature is a K-dimensional voice feature, and K is an integer greater than 0;
  • Step S904-B Input the lip motion information of the N users into the second model to obtain N image sequence features; each of the N image sequence features is a K-dimensional image Sequence characteristics.
  • the smart device takes the extracted video sequence and audio sequence as input, and characterizes the input video sequence and audio sequence through the first model and the second model, respectively.
•   That is, normalization: the video feature and the audio feature are extracted into features of the same dimension (both are K-dimensional features), so as to facilitate subsequent feature matching and the further determination of the face ID whose lip motion information matches the voice information to be recognized.
  • the process is as follows:
  • the target feature matching network in this application is composed of two parts, namely a feature analysis sub-network and a feature matching sub-network.
•   the first model and the second model in the embodiments of the invention are both feature analysis sub-networks, and the third model is the feature matching sub-network. Among them,
•   FIG. 9B is a schematic structural diagram of a first model and a second model provided by an embodiment of the present invention. Their function is to normalize the video sequence stream and the audio features, and the network type used for this normalization is an STCNN (spatio-temporal convolutional network).
•   the second model, on the left, handles the video stream; its specific composition is 3 3D convolution plus pooling layers (conv+pooling layers) followed by 2 fully connected layers (FC layers), and its output is a 64-dimensional feature representation of a continuous video sequence.
•   the first model, on the right, handles the audio stream; its specific composition is 2 3D convolution plus pooling layers (conv+pooling layers), 2 3D convolution layers (conv layers), and a fully connected layer (FC layer), and its output is a 64-dimensional feature representation of a continuous speech sequence, that is, K is equal to 64.
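•   The two feature analysis branches can be sketched as follows; only the layer counts and the 64-dimensional outputs follow the description above, while the channel sizes, kernel sizes, and the 2D time-frequency representation assumed for the audio input are illustrative assumptions (PyTorch is the assumed framework):

```python
import torch
import torch.nn as nn

class VideoBranch(nn.Module):
    """Second model: 3 x (3D convolution + pooling) followed by 2 fully connected
    layers, producing a 64-dimensional feature for a lip image sequence."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d((1, 2, 2)),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
        self.fc = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))

    def forward(self, x):                 # x: (batch, 1, T, 60, 100) grayscale mouth frames
        return self.fc(self.features(x))  # (batch, 64) image sequence feature

class AudioBranch(nn.Module):
    """First model: 2 x (convolution + pooling) + 2 convolution layers + 1 fully
    connected layer, producing a 64-dimensional feature for a voice segment."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.fc = nn.Linear(64, 64)

    def forward(self, x):                 # x: (batch, 1, frequency bins, time frames)
        return self.fc(self.features(x))  # (batch, 64) voice feature
```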
  • Step S904-C Input the voice feature and the N image sequence features into a third model to obtain the matching degree between the N image sequence features and the voice feature.
  • FIG. 9C is a schematic structural diagram of a third model provided by an embodiment of the present invention.
•   the model training process is on the left, and a multi-way cross-entropy loss is used as the loss function; the basic network part is composed of 1 splicing layer (concat layer), 2 FC layers and 1 softmax layer (softmax function: normalized exponential function).
•   Here, end-to-end means: some data processing systems or learning systems used to require multiple stages of processing; end-to-end deep learning ignores all these different stages and replaces them with a single neural network.
•   As shown in FIG. 9C, its function is to compare all facial lip sequence features in the field of view with the separated audio sequence feature at the same time, so as to find the best matching face ID.
•   the right side of Figure 9C shows the model application process. Based on the equivalence of the audio and video features (referring to the 64-dimensional features obtained after the audio and video data pass through the feature analysis network), the 1-to-N problem is transformed into an N-to-1 problem: during inference, video and audio are used as input in an N:1 ratio (N segments of video, 1 segment of audio). The input is the 64-dimensional audio and video features output by the feature analysis sub-network in Step 3, and the output is the most similar audio-video pair.
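•   A sketch of such a feature matching sub-network is given below, with the concat layer, 2 FC layers, and softmax taken from the description above and the hidden width chosen arbitrarily (PyTorch assumed):

```python
import torch
import torch.nn as nn

class MatchingHead(nn.Module):
    """Third model (sketch): concatenate the 64-d audio feature with each of the N
    64-d video features, score every pair with 2 FC layers, and apply a softmax
    over the N candidates to obtain the matching degrees."""
    def __init__(self, k=64):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(2 * k, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, audio_feat, video_feats):
        # audio_feat: (1, 64) for the voice to be recognized; video_feats: (N, 64)
        pairs = torch.cat([audio_feat.expand_as(video_feats), video_feats], dim=1)
        scores = self.fc(pairs).squeeze(1)     # one logit per (audio, lips) pair
        return torch.softmax(scores, dim=0)    # matching degrees over the N users
```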
  • Step S905 Determine the user corresponding to the lip motion information of the user with the highest matching degree as the target user to which the voice information to be recognized belongs.
  • Step S906 Determine user information of the target user.
  • the user information includes one of character attribute information, facial expression information corresponding to the voice information to be recognized, and environment information corresponding to the voice information to be recognized Or multiple.
  • Step S907 Based on the user information, a control instruction matching the user information is generated.
•   That is, after determining which target user in the current scene issued the specific voice information to be recognized, a control instruction (such as a voice instruction or an operation instruction) matching the above user information is determined based on the user's character attribute information (such as gender, age, personality, etc.), facial expression information (for example, the expression of the target user when issuing the voice information to be recognized), and corresponding environment information (for example, whether the target user is currently in an office environment, a home environment, or an entertainment environment).
  • the intelligent machine is controlled to send a voice or operation matching the facial expression data and character attribute information to the target user, including the voice of the robot, the turning of the robot's head, and the content of the robot's reply.
  • FIG. 9D is a schematic diagram of a robot interaction scene provided by an embodiment of the present invention.
•   the most similar audio-video pair consists of the video including Tom's face and the voice sent by Tom; therefore, in this step, it can be determined that the face ID that issued the instruction is the face ID corresponding to Tom.
  • the smart device obtains the user's detailed user profile (user profile, or user portrait) according to the determined face ID and the knowledge graph stored in the storage module;
•   the knowledge graph includes the user's demographic characteristics, dialogue history (in the embodiment of the present invention, for the scene example shown in Figure 9D, this includes a contextual conversation between the older sister Chris (telling him there will be a rainstorm tomorrow) and the younger brother Jerry (arguing to go with her)), the user's likes, and so on.
  • the smart device uses the facial expression network to obtain the latest real-time facial expression data according to the face area corresponding to the face ID;
•   the smart device converts the bounding box (bbox) of the face and the width of the camera field of view into the horizontal and vertical angle values of the robot's mechanical structure (as shown in the scene example in Figure 9F) and passes them to the steering gear control system to drive the robot to turn toward that user.
  • the specific conversion formula is:
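•   The specific formula is not reproduced in this text; purely as an illustrative assumption, a linear mapping from the face bounding box centre and the camera field of view to horizontal and vertical turning angles could look like this:

```python
def turn_angles(bbox, image_w, image_h, h_fov_deg, v_fov_deg):
    """Map a face bounding box (x, y, w, h) in the camera image to horizontal and
    vertical turning angles for the steering gear control system. This linear
    mapping over the field of view is an assumption, not the formula of this application."""
    x, y, w, h = bbox
    cx, cy = x + w / 2.0, y + h / 2.0         # centre of the target face
    yaw = (cx / image_w - 0.5) * h_fov_deg    # positive means turn right (assumed convention)
    pitch = (0.5 - cy / image_h) * v_fov_deg  # positive means tilt up (assumed convention)
    return yaw, pitch

# For example, a face centred at the right edge of a 640x480 image with a 60-degree
# horizontal field of view yields a yaw of roughly +30 degrees.
```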
  • Fig. 9E is a schematic diagram of a functional module and an overall process provided by an embodiment of the present invention, including multi-modal input, feature extraction, feature matching, matching result post-processing, dialogue and control system processing.
  • the focus of the embodiment of the present invention lies in the feature extraction module, the multi-modal feature matching network and the post-processing module.
  • the multi-modal input module includes a user-side terminal device (a robot is taken as an example in the embodiment of the present invention).
  • a camera which is used to obtain images/videos.
  • the obtained images/videos include information of multiple persons;
•   the microphone is used to pick up voice; in the present invention, the picked-up voice may likewise include the voices of multiple people;
•   the feature extraction module can be a part of the processing module of the user-side terminal device (not shown in the figure above, for example a neural network processor) and can be implemented in software, such as a piece of code; it is used to obtain at least one face region from the video data obtained by the camera based on the face recognition algorithm, and to further extract the video sequence stream of the mouth region with each face region as a unit; the feature extraction module is also used to extract audio features from the audio data obtained by the microphone array;
•   the multi-modal feature matching module is used to take the video sequence stream and audio features extracted by the feature extraction module and to use the V&A Cross CNN to obtain the face ID whose mouth shape matches the voice;
•   the matching result post-processing module is used to obtain the user's detailed user profile (user portrait) according to the face ID obtained by the multi-modal feature matching module and the knowledge graph stored in the storage module (not shown in the figure above; the storage module can be on a cloud server or deployed on the robot, or partly on the cloud and partly on the robot);
  • the matching result post-processing module is also used to obtain the latest real-time facial expression data based on the facial region corresponding to the face ID and acquired by the feature extraction module, using the facial expression network;
  • the matching result post-processing module is also used to convert the movement angle of the mechanical structure of the robot according to the bounding box/bbox of the face and a preset algorithm;
  • the processing module of the dialogue and control system includes a dialogue module and a control module.
•   the dialogue module is used to obtain a voice response according to the user portrait and facial expression data obtained by the matching result post-processing module; the control module is used to control, according to the movement angle obtained by the matching result post-processing module, a steering gear (a steering gear is also called a servo motor; it was first used on ships to realize steering, and since its rotation angle can be continuously controlled by a program, it is widely used in the various joint motions of robots) to drive the robot to turn toward that user.
  • the embodiment of the present invention also provides another voice matching method, which can be applied to the application scenarios and system architectures described in FIG. 1, FIG. 2 or FIG. 3, and can be specifically applied to the execution device 110 of FIG. 4 above. It can be understood that, at this time, the client device 140 and the execution device 110 may not be on the same physical device, as shown in FIG. 9F.
  • FIG. 9F is an architecture diagram of a voice matching system provided by an embodiment of the present invention.
•   the client device 140 may be a smart robot, a smart phone, a smart speaker, a smart wearable device, etc., serving as the audio data and video data collection device; further, it can also serve as the device that extracts the voice information to be recognized and the lip motion information of the N users, while the matching between the extracted voice information to be recognized and the lip motion information of the N users can be performed on the server/service device/cloud service device where the execution device 110 is located.
  • the aforementioned extraction of the voice information to be recognized and the lip information of the N users may also be performed on the device side where the execution device 110 is located, which is not specifically limited in the embodiment of the present invention.
  • the following describes an example in which the execution subject is a cloud service device including the execution device 110 described above with reference to FIG. 9F.
  • the method may include the following steps S1001-step S1003.
  • Step S1001: Acquire voice information to be recognized and lip motion information of N users; the voice information to be recognized includes a voice waveform sequence within a target time period, the lip motion information of each of the N users includes an image sequence of the corresponding user's lip movement in the target time period, and N is an integer greater than 1.
  • Step S1002: Input the voice information to be recognized and the lip motion information of the N users into the target feature matching model, and obtain the matching degrees between the lip motion information of the N users and the voice information to be recognized.
  • Step S1003: Determine the user corresponding to the lip motion information with the highest matching degree as the target user to which the voice information to be recognized belongs.
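A minimal sketch of steps S1001 to S1003 is given below, assuming the target feature matching model is available as a callable that scores one (voice, lip sequence) pair; the tensor shapes and function names are illustrative, not from the patent.

```python
import torch

def match_speaker(model, voice_waveform, lip_sequences):
    """Steps S1001-S1003: score each user's lip-motion sequence against the
    voice to be recognized and return the index of the best-matching user.

    voice_waveform: tensor holding the voice waveform sequence of the target period.
    lip_sequences:  list of N tensors, one lip-motion image sequence per user.
    """
    scores = []
    with torch.no_grad():
        for lips in lip_sequences:                      # one user at a time
            score = model(voice_waveform.unsqueeze(0),  # add a batch dimension
                          lips.unsqueeze(0))
            scores.append(score.item())
    target_user = int(torch.tensor(scores).argmax())    # highest matching degree
    return target_user, scores
```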
  • the target feature matching model includes a first model, a second model, and a third model
  • the voice information to be recognized and the lip motion information of the N users are input into the target feature matching model to obtain the matching degrees between the lip motion information of the N users and the voice information to be recognized; specifically, the voice information to be recognized is input into the first model to obtain a voice feature, the voice feature being a K-dimensional voice feature, where K is an integer greater than 0;
  • the lip motion information of the N users is input into the second model to obtain N image sequence features, each of the N image sequence features being a K-dimensional image sequence feature;
  • the voice feature and the N image sequence features are input into the third model to obtain the matching degrees between the N image sequence features and the voice feature.
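The following is a minimal sketch of a three-submodel arrangement of this kind (voice encoder, lip-sequence encoder, and matching head that maps both K-dimensional features to matching degrees); the layer sizes, the shared dimension K, and the concatenation-plus-sigmoid fusion are assumptions, since the patent does not fix the internal structure.

```python
import torch
import torch.nn as nn

K = 128  # shared feature dimension assumed for both modalities

class VoiceEncoder(nn.Module):            # "first model": audio frames -> K-dim feature
    def __init__(self, in_channels=40):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1))
        self.fc = nn.Linear(64, K)

    def forward(self, x):                 # x: (batch, 40, frames)
        return self.fc(self.conv(x).squeeze(-1))

class LipEncoder(nn.Module):              # "second model": lip image sequence -> K-dim feature
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1))
        self.fc = nn.Linear(16, K)

    def forward(self, x):                 # x: (batch, 1, frames, H, W)
        return self.fc(self.conv(x).flatten(1))

class MatchingHead(nn.Module):            # "third model": (voice, lip) features -> matching degree
    def __init__(self):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(2 * K, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, voice_feat, lip_feats):   # voice_feat: (batch, K), lip_feats: (batch, N, K)
        v = voice_feat.unsqueeze(1).expand_as(lip_feats)
        return torch.sigmoid(self.fc(torch.cat([v, lip_feats], dim=-1))).squeeze(-1)
```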
  • the target feature matching model is a feature matching model obtained by training with the lip motion information of a training user and M pieces of voice information as input, and with the matching degrees between the lip motion information of the training user and the M pieces of voice information as M labels.
  • the method further includes:
  • determining user information of the target user, where the user information includes one or more of character attribute information, facial expression information corresponding to the voice information to be recognized, and environment information corresponding to the voice information to be recognized; and generating, based on the user information, a control instruction matching the user information;
  • the method further includes: extracting the lip motion information of the N users from video data; further, the extracting the lip motion information of the N users from the video data includes: identifying N face regions in the video data based on a face recognition algorithm, extracting the lip motion video in each of the N face regions, and determining the lip motion information of the N users based on the lip motion video in each face region;
  • the method further includes: extracting the voice information to be recognized from the audio data; further, the extracting the voice information to be recognized from the audio data includes:
  • the audio data of different frequency spectrums in the audio data are recognized, and the audio data of the target frequency spectrum is recognized as the voice information to be recognized.
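The spectrum-based extraction above is not tied to a particular algorithm in the text; the following is a minimal sketch assuming a simple band-energy heuristic over non-overlapping FFT frames, where the band limits, frame length, and energy threshold are illustrative parameters rather than values from the patent.

```python
import numpy as np

def extract_target_spectrum(audio, sample_rate, band_hz=(300.0, 3400.0),
                            frame_len=400, energy_ratio=0.5):
    """Crude stand-in for the "spectrum recognition algorithm": keep the
    non-overlapping frames in which at least `energy_ratio` of the spectral
    energy lies inside the target band, and zero out everything else."""
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    in_band = (freqs >= band_hz[0]) & (freqs <= band_hz[1])
    window = np.hanning(frame_len)
    out = np.zeros(len(audio), dtype=float)
    for start in range(0, len(audio) - frame_len + 1, frame_len):
        frame = audio[start:start + frame_len].astype(float)
        power = np.abs(np.fft.rfft(frame * window)) ** 2
        if power.sum() > 0 and power[in_band].sum() / power.sum() >= energy_ratio:
            out[start:start + frame_len] = frame
    return out
```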
  • FIG. 10A is a schematic structural diagram of a smart device provided by an embodiment of the present invention
  • the smart device may be a smart robot, a smart phone, a smart speaker, a smart wearable device, etc.
  • the smart device 40A may include a processor 401A, a microphone 402A, a camera 403A, and a neural network processor 404A coupled to the processor 401A; wherein,
  • the microphone 402A is used to collect audio data;
  • the camera 403A is used to collect video data, and the audio data and the video data are collected for the same scene;
  • the processor 401A is configured to: obtain the audio data and the video data; extract the voice information to be recognized from the audio data; and extract the lip motion information of N users from the video data, where N is an integer greater than 1;
  • the neural network processor 404A is configured to determine the target user to which the voice information to be recognized belongs from among the N users based on the voice information to be recognized and the lip movement information of the N users.
  • the voice information to be recognized includes a voice waveform sequence within a target time period; the lip motion information of each user in the lip motion information of the N users includes an image sequence of the corresponding user's lip movement in the target time period.
  • the neural network processor 404A is specifically configured to: input the voice information to be recognized and the lip motion information of the N users into the target feature matching model to obtain the matching degrees between the lip motion information of the N users and the voice information to be recognized; and determine the user corresponding to the lip motion information with the highest matching degree as the target user to which the voice information to be recognized belongs.
  • the target feature matching model includes a first model, a second model, and a third model;
  • the neural network processor 404A is specifically configured to: input the voice information to be recognized into the first model to obtain a voice feature, the voice feature being a K-dimensional voice feature, where K is an integer greater than 0;
  • input the lip motion information of the N users into the second model to obtain N image sequence features, each of the N image sequence features being a K-dimensional image sequence feature;
  • input the voice feature and the N image sequence features into the third model to obtain the matching degrees between the N image sequence features and the voice feature.
  • the target feature matching model is a feature matching model obtained by training with the lip motion information of a training user and M pieces of voice information as input, and with the matching degrees between the lip motion information of the training user and the M pieces of voice information as M labels, wherein the M pieces of voice information include voice information that matches the lip motion information of the training user.
  • the processor 401A is further configured to: determine user information of the target user, where the user information includes one or more of character attribute information, facial expression information corresponding to the voice information to be recognized, and environment information corresponding to the voice information to be recognized; and generate, based on the user information, a control instruction matching the user information.
  • the processor 401A is specifically configured to: identify N face regions in the video data based on a face recognition algorithm, extract the lip motion video in each of the N face regions, and determine the lip motion information of the N users based on the lip motion video in each face region.
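As an illustration of the face-region and lip-video extraction just described, below is a minimal sketch; the patent does not fix the face recognition algorithm, so the use of OpenCV's stock Haar face detector, the lower-third mouth crop, and the left-to-right identity ordering are all assumptions.

```python
import cv2

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_lip_sequences(frames, size=(100, 60)):
    """For each video frame, detect face regions and crop a mouth patch
    (lower third of each face box), grouped per face index.

    Returns a dict: face index -> list of grayscale mouth crops (one per frame).
    Tracking identities across frames is simplified to left-to-right ordering.
    """
    sequences = {}
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = sorted(face_detector.detectMultiScale(gray, 1.1, 5),
                       key=lambda box: box[0])                # left-to-right order
        for idx, (x, y, w, h) in enumerate(faces):
            mouth = gray[y + 2 * h // 3 : y + h, x : x + w]   # lower third of the face box
            sequences.setdefault(idx, []).append(cv2.resize(mouth, size))
    return sequences
```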
  • the processor 401A is specifically configured to: recognize audio data of different spectra in the audio data based on a spectrum recognition algorithm, and recognize the audio data of the target spectrum as the voice information to be recognized.
  • Each unit in FIG. 10A can be implemented by software, hardware, or a combination thereof.
  • the hardware-implemented units can include circuits, such as arithmetic circuits or analog circuits.
  • a unit implemented in software may include program instructions, which are regarded as a software product, are stored in a memory, and can be run by a processor to implement related functions. For details, refer to the previous introduction.
  • FIG. 10B is a schematic structural diagram of another smart device provided by an embodiment of the present invention
  • the smart device may be a smart robot, a smart phone, a smart speaker, a smart wearable device, etc.
  • the smart device 40B may include a processor 401B, a microphone 402B and a camera 403B coupled to the processor 401B; wherein,
  • the microphone 402B is used to collect audio data;
  • the camera 403B is used to collect video data, and the audio data and the video data are collected for the same scene;
  • the processor 401B is used to: obtain the audio data and the video data; extract the voice information to be recognized from the audio data; extract the lip motion information of N users from the video data, where N is an integer greater than 1; and determine, based on the voice information to be recognized and the lip motion information of the N users, the target user to which the voice information to be recognized belongs from among the N users.
  • the voice information to be recognized includes a voice waveform sequence within a target time period; the lip motion information of each user in the lip motion information of the N users includes an image sequence of the corresponding user's lip movement in the target time period.
  • the processor 401B is specifically configured to: input the voice information to be recognized and the lip motion information of the N users into the target feature matching model to obtain the matching degrees between the lip motion information of the N users and the voice information to be recognized; and determine the user corresponding to the lip motion information with the highest matching degree as the target user to which the voice information to be recognized belongs.
  • the target feature matching model includes a first model, a second model, and a third model;
  • the processor 401B is specifically configured to: input the voice information to be recognized into the first model to obtain a voice feature, the voice feature being a K-dimensional voice feature, where K is an integer greater than 0;
  • input the lip motion information of the N users into the second model to obtain N image sequence features, each of the N image sequence features being a K-dimensional image sequence feature;
  • input the voice feature and the N image sequence features into the third model to obtain the matching degrees between the N image sequence features and the voice feature.
  • the target feature matching model is a feature matching model obtained by training with the lip motion information of a training user and M pieces of voice information as input, and with the matching degrees between the lip motion information of the training user and the M pieces of voice information as M labels, wherein the M pieces of voice information include voice information that matches the lip motion information of the training user.
  • the processor 401B is further configured to: determine user information of the target user, where the user information includes one or more of character attribute information, facial expression information corresponding to the voice information to be recognized, and environment information corresponding to the voice information to be recognized; and generate, based on the user information, a control instruction matching the user information.
  • the processor 401B is specifically configured to: identify N face regions in the video data based on a face recognition algorithm, extract the lip motion video in each of the N face regions, and determine the lip motion information of the N users based on the lip motion video in each face region.
  • the processor 401B is specifically configured to: recognize audio data of different spectra in the audio data based on a spectrum recognition algorithm, and recognize the audio data of the target spectrum as the voice information to be recognized.
  • Each unit in FIG. 10B can be implemented by software, hardware, or a combination thereof.
  • the hardware-implemented units can include circuits, such as arithmetic circuits or analog circuits.
  • a unit implemented in software may include program instructions, which are regarded as a software product, are stored in a memory, and can be run by a processor to implement related functions. For details, refer to the previous introduction.
  • FIG. 11 is a schematic structural diagram of a voice matching apparatus provided by an embodiment of the present invention
  • the voice matching apparatus may be deployed in a smart robot, a smart phone, a smart speaker, a smart wearable device, etc.
  • the voice matching device 50 may include an acquiring unit 501, a first extracting unit 502, a second extracting unit 503, a matching unit 504, and a determining unit 505; among them,
  • the obtaining unit 501 is configured to obtain audio data and video data
  • the first extraction unit 502 is configured to extract voice information to be recognized from the audio data, where the voice information to be recognized includes a voice waveform sequence in a target time period;
  • the second extraction unit 503 is configured to extract lip motion information of N users from the video data, where the lip motion information of each user in the lip motion information of the N users includes an image sequence of the corresponding user's lip movement in the target time period, and N is an integer greater than 1;
  • the matching unit 504 is configured to input the voice information to be recognized and the lip motion information of the N users into a target feature matching model to obtain the matching degrees between the lip motion information of the N users and the voice information to be recognized;
  • the user determining unit 505 is configured to determine the user corresponding to the lip motion information with the highest matching degree as the target user to which the voice information to be recognized belongs.
  • the target feature matching model includes a first network, a second network, and a third network; the matching unit 504 is specifically configured to:
  • input the voice information to be recognized into the first network to obtain a voice feature, where the voice feature is a K-dimensional voice feature and K is an integer greater than 0;
  • input the lip motion information of the N users into the second network to obtain N image sequence features, each of the N image sequence features being a K-dimensional image sequence feature;
  • input the voice feature and the N image sequence features into the third network to obtain the matching degrees between the N image sequence features and the voice feature.
  • the target feature matching model is a feature matching model obtained by training with the lip motion information of a training user and M pieces of voice information as input, and with the matching degrees between the lip motion information of the training user and the M pieces of voice information as M labels; optionally, the M pieces of voice information include one piece of voice information that matches the lip motion information of the training user and (M-1) pieces of voice information that do not match the lip motion information of the training user.
  • the device further includes:
  • the information determining unit 506 is configured to determine user information of the target user, the user information including one or more of character attribute information, facial expression information corresponding to the voice information to be recognized, and environment information corresponding to the voice information to be recognized;
  • the control unit 507 is configured to generate a control instruction matching the user information based on the user information.
  • the first extraction unit 502 is specifically configured to:
  • the audio data of different frequency spectrums in the audio data are recognized, and the audio data of the target frequency spectrum is recognized as the voice information to be recognized.
  • the second extraction unit 503 is specifically configured to: identify N face regions in the video data based on a face recognition algorithm, extract the lip motion video in each of the N face regions, and determine the lip motion information of the N users based on the lip motion video in each face region.
  • Each unit in FIG. 11 can be implemented by software, hardware, or a combination thereof.
  • the hardware-implemented units can include circuits, such as arithmetic circuits or analog circuits.
  • a unit implemented in software may include program instructions, which are regarded as a software product, are stored in a memory, and can be run by a processor to implement related functions. For details, refer to the previous introduction.
  • FIG. 12 is a schematic structural diagram of a neural network training device provided by an embodiment of the present invention
  • the training device of the neural network can be a smart robot, a smart phone, a smart speaker, a smart wearable device, etc.
  • the neural network training device 60 may include an acquisition unit 601 and a training unit 602; among them,
  • the acquiring unit 601 is configured to acquire training samples, where the training samples include lip motion information of a training user and M pieces of voice information; optionally, the M pieces of voice information include voice information that matches the lip motion information of the training user and (M-1) pieces of voice information that do not match the lip motion information of the training user;
  • the training unit 602 is configured to use the lip motion information of the training user and the M voice information as training input, and use the matching degree between the lip motion information of the training user and the M voice information respectively For M tags, train the initialized neural network to obtain the target feature matching model.
  • the lip motion information of the training user includes a lip motion image sequence of the training user
  • the M pieces of voice information include one voice waveform sequence that matches the training user's lip motion image sequence and (M-1) voice waveform sequences that do not match the training user's lip motion image sequence.
  • the training unit 602 is specifically configured to:
  • the lip motion information of the training user and the M pieces of voice information are input into the initialized neural network to compute the matching degrees between the M pieces of voice information and the lip motion information of the training user; the computed matching degrees are compared with the M labels, and the initialized neural network is trained to obtain the target feature matching model.
  • the target feature matching model includes a first model, a second model, and a third model; the training unit 602 is specifically configured to:
  • the M pieces of voice information are input into the first model to obtain M voice features, each of the M voice features being a K-dimensional voice feature, where K is an integer greater than 0; the lip motion information of the training user is input into the second model to obtain the image sequence feature of the training user, which is a K-dimensional image sequence feature; the M voice features and the image sequence feature of the training user are input into the third model to compute the matching degrees between the M voice features and the image sequence feature of the training user;
  • the computed matching degrees between the M voice features and the image sequence feature of the training user are compared with the M labels, and the initialized neural network is trained to obtain the target feature matching model.
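A minimal sketch of one such training step follows; it assumes the model outputs a matching degree in (0, 1) for a (voice, lip sequence) pair and that the M labels are 1 for the matching voice and 0 for the mismatched ones, with a binary cross-entropy loss standing in for the loss actually used, which the text does not pin down beyond mentioning a multi-way cross-entropy.

```python
import torch
import torch.nn as nn

def training_step(model, optimizer, lip_sequence, voice_candidates, labels):
    """One training step: score the training user's lip sequence against the M
    candidate voices and pull the predicted matching degrees toward the M labels."""
    optimizer.zero_grad()
    scores = torch.stack([model(voice.unsqueeze(0), lip_sequence.unsqueeze(0)).squeeze()
                          for voice in voice_candidates])      # shape (M,)
    loss = nn.functional.binary_cross_entropy(scores, labels)  # labels: float tensor, shape (M,)
    loss.backward()
    optimizer.step()
    return loss.item()
```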
  • Each unit in FIG. 12 can be implemented by software, hardware, or a combination thereof.
  • the hardware-implemented units can include circuits, such as arithmetic circuits or analog circuits.
  • a unit implemented in software may include program instructions, which are regarded as a software product, are stored in a memory, and can be run by a processor to implement related functions. For details, refer to the previous introduction.
  • FIG. 13 is a schematic structural diagram of another smart device provided by an embodiment of the present invention.
  • the smart device may be a smart robot, a smart phone, a smart speaker, a smart wearable device, and the like.
  • the smart device 70 may include a processor 701, a microphone 702 and a camera 703 coupled to the processor 701; wherein,
  • the microphone 702 is used to collect audio data
  • the camera 703 is used to collect video data
  • the processor 701 is used to obtain audio data and video data
  • the lip motion information of N users is extracted from the video data, where the lip motion information of each user in the lip motion information of the N users includes an image sequence of the corresponding user's lip movement in the target time period, and N is an integer greater than 1;
  • FIG. 14 is a schematic structural diagram of a service device provided by an embodiment of the present invention.
  • the service device may be a server, a cloud server, or the like.
  • the service device 80 may include a processor; optionally, the processor may be composed of a neural network processor 801 and a processor 802 coupled to the neural network processor 801; wherein,
  • Neural network processor 801 for:
  • the voice information to be recognized and the lip motion information of the N users are input into the target feature matching model to obtain the matching degrees between the lip motion information of the N users and the voice information to be recognized;
  • the user corresponding to the lip motion information of the user with the highest matching degree is determined as the target user to which the voice information to be recognized belongs.
  • the target feature matching model includes a first model, a second model, and a third model; the neural network processor 801 is specifically configured to:
  • the voice information to be recognized is input into the first model to obtain a voice feature, the voice feature being a K-dimensional voice feature, where K is an integer greater than 0; the lip motion information of the N users is input into the second model to obtain N image sequence features, each of the N image sequence features being a K-dimensional image sequence feature;
  • the voice features and the N image sequence features are input into a third model to obtain the matching degrees between the N image sequence features and the voice features.
  • the target feature matching model is a feature matching model obtained by training with the lip motion information of a training user and M pieces of voice information as input, and with the matching degrees between the lip motion information of the training user and the M pieces of voice information as M labels.
  • the server further includes a processor 802; the processor 802 is configured to:
  • determine user information of the target user, where the user information includes one or more of character attribute information, facial expression information corresponding to the voice information to be recognized, and environment information corresponding to the voice information to be recognized; and generate, based on the user information, a control instruction matching the user information;
  • the server further includes a processor 802; the processor 802 is further configured to: identify N face regions in video data based on a face recognition algorithm, extract the lip motion video in each of the N face regions, and determine the lip motion information of the N users based on the lip motion video in each face region;
  • the server further includes a processor 802; the processor 802 is further configured to:
  • the audio data of different frequency spectrums in the audio data are recognized, and the audio data of the target frequency spectrum is recognized as the voice information to be recognized.
  • FIG. 15 is another voice matching system 200 provided by an embodiment of the present invention.
  • the system includes the above-mentioned smart device 70 and a service device 80.
  • the smart device 70 and the service device 80 interact with each other to complete all the steps in this application.
  • the functions of the system can be referred to the relevant method embodiments described in FIGS. 1 to 9F, which will not be repeated here.
  • An embodiment of the present invention further provides a computer storage medium, wherein the computer storage medium may store a program, and the program includes part or all of the steps of any one of the above method embodiments when executed.
  • the embodiment of the present invention also provides a computer program, the computer program includes instructions, when the computer program is executed by a computer, the computer can execute part or all of the steps of any one of the above method embodiments.
  • the disclosed device may be implemented in other ways.
  • the device embodiments described above are merely illustrative; for example, the division of the above-mentioned units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical or other forms.
  • the units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • the above integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer readable storage medium.
  • the technical solution of the present application, in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device, etc., and specifically a processor in a computer device) to execute all or part of the steps of the above methods of the various embodiments of the present application.
  • the aforementioned storage media may include: U disk, mobile hard disk, magnetic disk, optical disk, read-only memory (Read-Only Memory, abbreviation: ROM) or Random Access Memory (Random Access Memory, abbreviation: RAM), etc.


Abstract

The embodiments of the present invention disclose a voice matching method and related devices, which can be applied to a number of technical fields in the artificial intelligence (AI) domain such as intelligent robots, intelligent terminals, intelligent control, and human-computer interaction. The voice matching method includes: acquiring audio data and video data; extracting voice information to be recognized from the audio data; extracting lip motion information of N users from the video data, N being an integer greater than 1; inputting the voice information to be recognized and the lip motion information of the N users into a target feature matching model to obtain the matching degrees between the lip motion information of the N users and the voice information to be recognized; and determining the user corresponding to the lip motion information with the highest matching degree as the target user to which the voice information to be recognized belongs. The present application can improve voice matching efficiency and the human-computer interaction experience in multi-person scenarios.

Description

一种语音匹配方法及相关设备
本申请要求于2019年11月30日提交中国专利局、申请号为201911209345.7、申请名称为“一种语音匹配方法及相关设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本发明涉及人机交互技术领域,尤其涉及一种语音匹配方法及相关设备。
背景技术
人机交互(Human-Computer Interaction,HCI)主要是研究人和计算机之间的信息交换,它主要包括人到计算机和计算机到人的信息交换两部分。是与认知心理学、人机工程学、多媒体技术、虚拟现实技术等密切相关的综合学科。在人机交互技术中,多模态交互设备是语音交互、体感交互、及触控交互等多种交互模式并行的交互设备。基于多模态交互设备的人机交互:通过交互设备中的多种跟踪模块(人脸、手势、姿态、语音、及韵律)采集用户信息,并理解、处理、及管理后形成虚拟用户表达模块,与计算机进行交互对话,能够极大提升用户的交互体验。
例如,目前市面上已有一些能够与人进行互动的智能机器人产品,并提出利用机器人来代替人进行老年、儿童的精神陪伴。然而,人机交互中的友好性则体现了人作为服务对象对机器人系统提出的更高要求,即通过自然的,更接近人与人之间交流的交流方式来实现人机对话。机器人若要真正达到对老年、儿童进行情感陪护的功能要求,更好的融入其生活,首先要充分理解用户的意图,了解其心理情感变化,针对不同用户的特点和需求进行个性化交互。
但在相对复杂场景下,在机器人与多人面对面交流时(多人教学,游戏,家庭日常生活等),现阶段市场上的机器人由于无法快速准确的确定被交互对象身份,就只能按照既定的程式化交互方式进行交互,极大影响了交互体验。比如,一家人的三个小孩与一个机器人的交互场景,哥哥汤姆明天打算出去郊游,弟弟杰瑞吵着要和他一起去,但姐姐克里丝告诉他明天有暴雨,所以汤姆十分沮丧地问机器人“明天天气如何”;机器人接收到该语音指令后,仅根据接收到的语音信息进行语义识别的结果进行回答:“明天天气晴,有2-3级微风。”而完全不会考虑到发问者汤姆的心情感受,因此,无法实现智能化、个性化的人机交互。
发明内容
本发明实施例提供一种语音匹配方法、神经网络的训练方法及相关设备,以提升多人场景中的语音匹配效率和人机交互体验。
第一方面,本发明实施例提供了一种语音匹配方法,可包括:获取音频数据以及视频数据;从所述音频数据中提取待识别的语音信息,所述待识别的语音信息包括目标时间段内的语音波形序列;从所述视频数据中提取N个用户的唇部运动信息,所述N个用户的唇 部运动信息中的每一个用户的唇部运动信息包括对应的用户在所述目标时间段内唇部运动的图像序列,N为大于1的整数;将所述待识别的语音信息和所述N个用户的唇部运动信息输入到目标特征匹配模型中,得到所述N个用户的唇部运动信息分别与所述待识别的语音信息之间的匹配度;将匹配度最高的用户的唇部运动信息对应的用户,确定为所述待识别的语音信息所属的目标用户。
本发明实施例,当应用于多人对话场景中时,可以通过采集音频数据和视频数据,并通过对音频数据中的语音信息和视频数据中的唇部运行信息进行匹配,则可以确定出待识别的语音信息所属的目标用户。即在多人的场景中通过语音特征与多个用户的唇部运动特征进行匹配,识别某段待识别的语音信息具体是由哪个用户发出的,从而可以基于该识别结果进行进一步的控制或操作。区别于现有技术中的通过声纹识别技术或者声源定位技术,即发明实施例不依赖于人的声音(声纹易受身体状况、年龄、情绪等的影响),不受环境干扰(如环境中的噪声干扰等),抗干扰能力强,识别效率和准确度高。其中,待识别的语音信息包括在具体某个时间段内的语音波形序列,而N个用户的唇部运动信息则包括多个用户在同一场景下的该时间段内的唇部运动的图像序列(即唇部运动的视频),便于后续进行相关的特征提取和特征匹配。而采用目标特匹配模型,将待识别的语音信息以及N个用户的唇部运动信息作为该目标特征匹配模型的输入,并且将N个用户的唇部运动信息分别与待识别的语音信息之间的匹配度作为该目标特征匹配模型的输出,进而根据匹配度确定出该待识别的语音信息所属的目标用户。可选的,该目标特征匹配模型为神经网络模型。
在一种可能的实现方式中,所述目标特征匹配模型包括第一模型、第二模型和第三模型;所述将所述待识别的语音信息和所述N个用户的唇部运动信息输入到目标特征匹配模型中,得到所述N个用户的唇部运动信息分别与所述待识别的语音信息之间的匹配度,包括:将所述待识别的语音信息输入到所述第一模型中,得到语音特征,所述语音特征为K维语音特征,K为大于0的整数;将所述N个用户的唇部运动信息输入到所述第二模型中,得到N个图像序列特征;所述N个图像序列特征中的每一个图像序列特征均为K维图像序列特征;将所述语音特征和所述N个图像序列特征输入到第三模型中,得到所述N个图像序列特征分别与所述语音特征之间的匹配度。本发明实施例,通过在目标特征匹配模型中,分别利用第一模型和第二模型对待识别的语音信息和N个用户的唇部运动信息进行特征提取(也可认为是降维过程),使得待识别的语音信息和N个用户的唇部运动信息在分别经过第一模型和第二网路的特征提取之后,均能够得到相同维度的特征,从而使得不同类型的信息可以实现特征归一化的效果。即经过上述网络的特征提取处理后,不同类型的原始数据(待识别的语音信息和N个用户的唇部运动信息)之间可转化为无量纲化指标值(即发明实施例中均为K维的语音特征和N个图像序列特征),各指标值处于同一数量级别,可进行综合测评分析(即本发明实施例中的特征匹配)。
在一种可能的实现方式中,所述目标特征匹配模型为以训练用户的唇部运动信息以及M个语音信息为输入、以所述训练用户的唇部运动信息分别与所述M个语音信息之间的匹配度为M个标签,训练得到的特征匹配模型;可选的,所述M个语音信息包括与所述训练用户的唇部运动信息所匹配的语音信息以及(M-1)个与所述训练用户的唇部运动信息不匹配的语音信息。本发明实施例,通过将某个训练用户的唇部运动信息,以及与之匹配 的语音信息和多个不匹配的语音信息作为目标特征匹配模型输入,并基于上述M个语音信息与该训练用户的唇部运动信息的实际匹配度作为标签,对初始的神经网络模型进行训练得到的目标特征匹配模型,例如,完全匹配对应的匹配度即标签为1,不匹配对应的匹配度即标签为0。
在一种可能的实现方式中,所述方法还包括:确定所述目标用户的用户信息,所述用户信息包括人物属性信息、与所述待识别的语音信息对应面部表情信息、与所述待识别的语音信息对应的环境信息中的一种或多种;基于所述用户信息,生成与所述用户信息匹配的控制指令。本发明实施例,在确定了待识别的语音信息具体是由当前的场景中的哪个目标用户发出的之后,则可以根据该用户的属性信息(如性别、年龄、性格等)、面部表情信息(如该目标用户发出待识别的语音信息所对应的表情)以及对应的环境信息(如目标用户当前处于办公环境、家庭环境、或娱乐环境等),来确定与上述用户信息匹配的控制指令(如语音指令、操作指令等)。例如,控制智能机器朝着目标用户发出与所述表情数据和人物属性信息等匹配的语音或操作等,包括机器人的语气、机器人的头的转向以及机器人的回话内容等等。
在一种可能的实现方式中,所述从所述视频数据中提取N个用户的唇部运动信息,包括:基于人脸识别算法,识别所述视频数据中的N个人脸区域,提取所述N个人脸区域中每个人脸区域中的唇部运动视频;基于所述每个人脸区域中的唇部运动视频确定所述N个用户的唇部运动信息。本发明实施例,通过从视频数据中先识别出人脸区域,然后再基于该人脸区域提取每个人脸区域中的唇部运动视频,然后再依据唇部运动视频确定N个用户的唇部运动信息也即是对应用户的唇部运动图像序列。
在一种可能的实现方式中,所述从所述音频数据中提取待识别的语音信息,包括:基于频谱识别算法,识别所述音频数据中的不同频谱的音频数据,并将目标频谱的音频数据识别为所述待识别的语音信息。由于不同用户发出的声音对应的频谱在一般情况下是不同的,因此,本发明实施例,通过从音频数据中先识别出不同频谱的音频数据,然后再将目标频谱的音频数据识别为待识别的语音信息,进而实现从音频数据中提取待识别语音信息的功能。
第二方面,本发明实施例提供了一种神经网络的训练方法,可包括:
获取训练样本,所述训练样本包括训练用户的唇部运动信息以及M个语音信息,所述M个语音信息包括与所述训练用户的唇部运动信息所匹配的语音信息以及(M-1)个与所述训练用户的唇部运动信息不匹配的语音信息;
以所述训练用户的唇部运动信息以及所述M个语音信息为训练输入,以所述训练用户的唇部运动信息分别与所述M个语音信息之间的匹配度为M个标签,对初始化的神经网络进行训练,得到目标特征匹配模型。
本发明实施例,通过将某个训练用户的唇部运动信息,以及与之匹配的语音信息和多个不匹配的语音信息作为初始化的神经网络的输入,并基于上述M个语音信息与该训练用户的唇部运动信息的实际匹配度作为标签,对上述初始的神经网络模型进行训练得到的目标特征匹配模型,例如,完全匹配对应的匹配度即标签为1,不匹配对应的匹配度即标签 为0,当通过训练后的初始化的神经网络计算得到的训练用户的唇部运动信息分别与M个语音信息之间的匹配度越接近所述M个标签,则该训练后的初始化的神经网络越接近所述目标特征匹配模型。
在一种可能的实现方式中,所述训练用户的唇部运动信息包括所述训练用户的唇部运动图像序列,所述M个语音信息包括一个与所述训练用户的唇部运动图像序列匹配的语音波形序列以及(M-1)个与所述训练用户的唇部运动图像序列不匹配的语音波形序列。
在一种可能的实现方式中,所述以所述训练用户的唇部运动信息以及所述M个语音信息为训练输入,以所述训练用户的唇部运动信息分别与所述M个语音信息之间的匹配度为M个标签,对初始化的神经网络进行训练,得到目标特征匹配模型,包括:
将所述训练用户的唇部运动信息以及所述M个语音信息输入到所述初始化的神经网络中,计算得到所述M个语音信息分别与所述训练用户的唇部运动信息之间的匹配度;
将计算得到的所述M个语音信息分别与所述训练用户的唇部运动信息之间的匹配度与所述M个标签进行比较,对初始化的神经网络进行训练,得到目标特征匹配模型。
在一种可能的实现方式中,所述目标特征匹配模型包括第一模型、第二模型和第三模型;
所述将所述训练用户的唇部运动信息以及所述M个语音信息输入到所述初始化的神经网络中,计算得到所述M个语音信息分别与所述训练用户的唇部运动信息之间的匹配度,包括:
将所述M个语音信息输入到所述第一模型中,得到M个语音特征,所述M个语音特征中的每一个语音特征均为K维语音特征,K为大于0的整数;
将所述训练用户的唇部运动信息输入到所述第二模型中,得到所述训练用户的图像序列特征,所述训练用户的图像序列特征为K维图像序列特征;
将所述M个语音特征和所述训练用户的图像序列特征输入到第三模型中,计算得到所述M个语音特征分别与所述训练用户的图像序列特征之间的匹配度。
第三方面,本发明实施例提供了一种智能设备,可包括:处理器以及与所述处理器耦合的麦克风、摄像头:
所述麦克风,用于采集音频数据;
所述摄像头,用于采集视频数据,所述音频数据与所述视频数据为针对同一场景下采集的;
所述处理器,用于
获取所述音频数据以及所述视频数据;
从所述音频数据中提取待识别的语音信息,所述待识别的语音信息包括目标时间段内的语音波形序列;
从所述视频数据中提取N个用户的唇部运动信息,所述N个用户的唇部运动信息中的每一个用户的唇部运动信息包括对应的用户在所述目标时间段内唇部运动的图像序列,N为大于1的整数;
将所述待识别的语音信息和所述N个用户的唇部运动信息输入到目标特征匹配模型中, 得到所述N个用户的唇部运动信息分别与所述待识别的语音信息之间的匹配度;
将匹配度最高的用户的唇部运动信息对应的用户,确定为所述待识别的语音信息所属的目标用户。
在一种可能的实现方式中,所述目标特征匹配模型包括第一模型、第二模型和第三模型;所述处理器,具体用于:
将所述待识别的语音信息输入到所述第一模型中,得到语音特征,所述语音特征为K维语音特征,K为大于0的整数;
将所述N个用户的唇部运动信息输入到所述第二模型中,得到N个图像序列特征;所述N个图像序列特征中的每一个图像序列特征均为K维图像序列特征;
将所述语音特征和所述N个图像序列特征输入到第三模型中,得到所述N个图像序列特征分别与所述语音特征之间的匹配度。
在一种可能的实现方式中,所述目标特征匹配模型为以训练用户的唇部运动信息以及M个语音信息为输入、以所述训练用户的唇部运动信息分别与所述M个语音信息之间的匹配度为M个标签,训练得到的特征匹配模型;可选的,所述M个语音信息包括与所述训练用户的唇部运动信息所匹配的语音信息。
在一种可能的实现方式中,所述处理器还用于:
确定所述目标用户的用户信息,所述用户信息包括人物属性信息、与所述待识别的语音信息对应面部表情信息、与所述待识别的语音信息对应的环境信息中的一种或多种;
基于所述用户信息,生成与所述用户信息匹配的控制指令。
在一种可能的实现方式中,所述处理器,具体用于:
基于人脸识别算法,识别所述视频数据中的N个人脸区域,提取所述N个人脸区域中每个人脸区域中的唇部运动视频;
基于所述每个人脸区域中的唇部运动视频确定所述N个用户的唇部运动信息。
在一种可能的实现方式中,所述处理器,具体用于:
基于频谱识别算法,识别所述音频数据中的不同频谱的音频数据,并将目标频谱的音频数据识别为所述待识别的语音信息。
第四方面,本发明实施例提供了一种智能设备,可包括:处理器以及与所述处理器耦合的麦克风、摄像头和神经网络处理器:
所述麦克风,用于采集音频数据;
所述摄像头,用于采集视频数据,所述音频数据与所述视频数据为针对同一场景下采集的;
所述处理器,用于:
获取所述音频数据以及所述视频数据;
从所述音频数据中提取待识别的语音信息,所述待识别的语音信息包括目标时间段内的语音波形序列;
从所述视频数据中提取N个用户的唇部运动信息,所述N个用户的唇部运动信息中的每一个用户的唇部运动信息包括对应的用户在所述目标时间段内唇部运动的图像序列,N 为大于1的整数;
所述神经网络处理器,用于:
将所述待识别的语音信息和所述N个用户的唇部运动信息输入到目标特征匹配模型中,得到所述N个用户的唇部运动信息分别与所述待识别的语音信息之间的匹配度;
将匹配度最高的用户的唇部运动信息对应的用户,确定为所述待识别的语音信息所属的目标用户。
在一种可能的实现方式中,所述目标特征匹配模型包括第一模型、第二模型和第三模型;所述神经网络处理器,具体用于:
将所述待识别的语音信息输入到所述第一模型中,得到语音特征,所述语音特征为K维语音特征,K为大于0的整数;
将所述N个用户的唇部运动信息输入到所述第二模型中,得到N个图像序列特征;所述N个图像序列特征中的每一个图像序列特征均为K维图像序列特征;
将所述语音特征和所述N个图像序列特征输入到第三模型中,得到所述N个图像序列特征分别与所述语音特征之间的匹配度。
在一种可能的实现方式中,所述目标特征匹配模型为以训练用户的唇部运动信息以及M个语音信息为输入、以所述训练用户的唇部运动信息分别与所述M个语音信息之间的匹配度为M个标签,训练得到的特征匹配模型;可选的,所述M个语音信息包括与所述训练用户的唇部运动信息所匹配的语音信息。
在一种可能的实现方式中,所述处理器还用于:
确定所述目标用户的用户信息,所述用户信息包括人物属性信息、与所述待识别的语音信息对应面部表情信息、与所述待识别的语音信息对应的环境信息中的一种或多种;
基于所述用户信息,生成与所述用户信息匹配的控制指令。
在一种可能的实现方式中,所述处理器,具体用于:
基于人脸识别算法,识别所述视频数据中的N个人脸区域,提取所述N个人脸区域中每个人脸区域中的唇部运动视频;
基于所述每个人脸区域中的唇部运动视频确定所述N个用户的唇部运动信息。
在一种可能的实现方式中,所述处理器,具体用于:
基于频谱识别算法,识别所述音频数据中的不同频谱的音频数据,并将目标频谱的音频数据识别为所述待识别的语音信息。
第五方面,本发明实施例提供了一种语音匹配装置,可包括:
获取单元,用于获取音频数据以及视频数据;
第一提取单元,用于从所述音频数据中提取待识别的语音信息,所述待识别的语音信息包括目标时间段内的语音波形序列;
第二提取单元,用于从所述视频数据中提取N个用户的唇部运动信息,所述N个用户的唇部运动信息中的每一个用户的唇部运动信息包括对应的用户在所述目标时间段内唇部运动的图像序列,N为大于1的整数;
匹配单元,用于将所述待识别的语音信息和所述N个用户的唇部运动信息输入到目标 特征匹配模型中,得到所述N个用户的唇部运动信息分别与所述待识别的语音信息之间的匹配度;
确定单元,用于将匹配度最高的用户的唇部运动信息对应的用户,确定为所述待识别的语音信息所属的目标用户。
在一种可能的实现方式中,所述目标特征匹配模型包括第一网络、第二网络和第三网络;所述匹配单元,具体用于:
将所述待识别的语音信息输入到所述第一网络中,得到语音特征,所述语音特征为K维语音特征,K为大于0的整数;
将所述N个用户的唇部运动信息输入到所述第二网络中,得到N个图像序列特征,所述N个图像序列特征中的每一个图像序列特征均为K维图像序列特征;
将所述语音特征和所述N个图像序列特征输入到第三网络中,得到所述N个图像序列特征分别与所述语音特征之间的匹配度。
在一种可能的实现方式中,所述目标特征匹配模型为以训练用户的唇部运动信息以及M个语音信息为输入、以所述训练用户的唇部运动信息分别与所述M个语音信息之间的匹配度为M个标签,训练得到的特征匹配模型;可选的,所述M个语音信息包括与所述训练用户的唇部运动信息所匹配的语音信息以及(M-1)个与所述训练用户的唇部运动信息不匹配的语音信息。
在一种可能的实现方式中,所述装置还包括:
确定单元,用于确定所述目标用户的用户信息,所述用户信息包括人物属性信息、与所述待识别的语音信息对应面部表情信息、与所述待识别的语音信息对应的环境信息中的一种或多种;
控制单元,用于基于所述用户信息,生成与所述用户信息匹配的控制指令。
在一种可能的实现方式中,所述第一提取单元,具体用于:
基于频谱识别算法,识别所述音频数据中的不同频谱的音频数据,并将目标频谱的音频数据识别为所述待识别的语音信息。
在一种可能的实现方式中,所述第二提取单元,具体用于:
基于人脸识别算法,识别所述视频数据中的N个人脸区域,提取所述N个人脸区域中每个人脸区域中的唇部运动视频;
基于所述每个人脸区域中的唇部运动视频确定所述N个用户的唇部运动信息。
第六方面,本发明实施例提供了一种神经网络的训练装置,可包括:
获取单元,用于获取训练样本,所述训练样本包括训练用户的唇部运动信息以及M个语音信息,所述M个语音信息包括与所述训练用户的唇部运动信息所匹配的语音信息以及(M-1)个与所述训练用户的唇部运动信息不匹配的语音信息;
训练单元,用于以所述训练用户的唇部运动信息以及所述M个语音信息为训练输入,以所述训练用户的唇部运动信息分别与所述M个语音信息之间的匹配度为M个标签,对初始化的神经网络进行训练,得到目标特征匹配模型。
在一种可能的实现方式中,所述训练用户的唇部运动信息包括所述训练用户的唇部运 动图像序列;可选的,所述M个语音信息包括一个与所述训练用户的唇部运动图像序列匹配的语音波形序列以及(M-1)个与所述训练用户的唇部运动图像序列不匹配的语音波形序列。
在一种可能的实现方式中,所述训练单元,具体用于:
将所述训练用户的唇部运动信息以及所述M个语音信息输入到所述初始化的神经网络中,计算得到所述M个语音信息分别与所述训练用户的唇部运动信息之间的匹配度;
将计算得到的所述M个语音信息分别与所述训练用户的唇部运动信息之间的匹配度与所述M个标签进行比较,对初始化的神经网络进行训练,得到目标特征匹配模型。
在一种可能的实现方式中,所述目标特征匹配模型包括第一模型、第二模型和第三模型;所述训练单元,具体用于:
将所述M个语音信息输入到所述第一模型中,得到M个语音特征,所述M个语音特征中的每一个语音特征均为K维语音特征,K为大于0的整数;
将所述训练用户的唇部运动信息输入到所述第二模型中,得到所述训练用户的图像序列特征,所述训练用户的图像序列特征为K维图像序列特征;
将所述M个语音特征和所述训练用户的图像序列特征输入到第三模型中,计算得到所述M个语音特征分别与所述训练用户的图像序列特征之间的匹配度;
将计算得到的所述M个语音特征分别与所述训练用户的图像序列特征之间的匹配度与所述M个标签进行比较,对初始化的神经网络进行训练,得到目标特征匹配模型。
第七方面,本发明实施例提供了一种语音匹配方法,可包括:
获取待识别的语音信息和N个用户的唇部运动信息;所述待识别的语音信息包括目标时间段内的语音波形序列,所述N个用户的唇部运动信息中的每一个用户的唇部运动信息包括对应的用户在所述目标时间段内唇部运动的图像序列,N为大于1的整数;
将所述待识别的语音信息和所述N个用户的唇部运动信息输入到目标特征匹配模型中,得到所述N个用户的唇部运动信息分别与所述待识别的语音信息之间的匹配度;
将匹配度最高的用户的唇部运动信息对应的用户,确定为所述待识别的语音信息所属的目标用户。
在一种可能的实现方式中,所述目标特征匹配模型包括第一模型、第二模型和第三模型;
所述将所述待识别的语音信息和所述N个用户的唇部运动信息输入到目标特征匹配模型中,得到所述N个用户的唇部运动信息分别与所述待识别的语音信息之间的匹配度,包括:
将所述待识别的语音信息输入到所述第一模型中,得到语音特征,所述语音特征为K维语音特征,K为大于0的整数;
将所述N个用户的唇部运动信息输入到所述第二模型中,得到N个图像序列特征,所述N个图像序列特征中的每一个图像序列特征均为K维图像序列特征;
将所述语音特征和所述N个图像序列特征输入到第三模型中,得到所述N个图像序列特征分别与所述语音特征之间的匹配度。
在一种可能的实现方式中,所述目标特征匹配模型为以训练用户的唇部运动信息以及M个语音信息为输入、以所述训练用户的唇部运动信息分别与所述M个语音信息之间的匹配度为M个标签,训练得到的特征匹配模型。
在一种可能的实现方式中,所述方法还包括:
确定所述目标用户的用户信息,所述用户信息包括人物属性信息、与所述待识别的语音信息对应面部表情信息、与所述待识别的语音信息对应的环境信息中的一种或多种;
基于所述用户信息,生成与所述用户信息匹配的控制指令。
在一种可能的实现方式中,所述方法还包括:从视频数据中提取N个用户的唇部运动信息;进一步地,所述从所述视频数据中提取N个用户的唇部运动信息,包括:
基于人脸识别算法,识别所述视频数据中的N个人脸区域,提取所述N个人脸区域中每个人脸区域中的唇部运动视频;
基于所述每个人脸区域中的唇部运动视频确定所述N个用户的唇部运动信息。
在一种可能的实现方式中,所述方法还包括:从所述音频数据中提取待识别的语音信息;进一步地,所述从所述音频数据中提取待识别的语音信息,包括:
基于频谱识别算法,识别所述音频数据中的不同频谱的音频数据,并将目标频谱的音频数据识别为所述待识别的语音信息。
第八方面,本发明实施例提供了一种服务装置,可包括处理器;所述处理器用于:
获取待识别的语音信息和N个用户的唇部运动信息;所述待识别的语音信息包括目标时间段内的语音波形序列,所述N个用户的唇部运动信息中的每一个用户的唇部运动信息包括对应的用户在所述目标时间段内唇部运动的图像序列,N为大于1的整数;
将所述待识别的语音信息和所述N个用户的唇部运动信息输入到目标特征匹配模型中,得到所述N个用户的唇部运动信息分别与所述待识别的语音信息之间的匹配度;
将匹配度最高的用户的唇部运动信息对应的用户,确定为所述待识别的语音信息所属的目标用户。
在一种可能的实现方式中,所述目标特征匹配模型包括第一模型、第二模型和第三模型;所述处理器,具体用于:
将所述待识别的语音信息输入到所述第一模型中,得到语音特征,所述语音特征为K维语音特征,K为大于0的整数;
将所述N个用户的唇部运动信息输入到所述第二模型中,得到N个图像序列特征,所述N个图像序列特征中的每一个图像序列特征均为K维图像序列特征;
将所述语音特征和所述N个图像序列特征输入到第三模型中,得到所述N个图像序列特征分别与所述语音特征之间的匹配度。
在一种可能的实现方式中,所述目标特征匹配模型为以训练用户的唇部运动信息以及M个语音信息为输入、以所述训练用户的唇部运动信息分别与所述M个语音信息之间的匹配度为M个标签,训练得到的特征匹配模型。
在一种可能的实现方式中,所述服务器还包括处理器;所述处理器用于:
确定所述目标用户的用户信息,所述用户信息包括人物属性信息、与所述待识别的语 音信息对应面部表情信息、与所述待识别的语音信息对应的环境信息中的一种或多种;
基于所述用户信息,生成与所述用户信息匹配的控制指令。
在一种可能的实现方式中,所述服务器还包括处理器;所述处理器,还用于:
基于人脸识别算法,识别视频数据中的N个人脸区域,提取所述N个人脸区域中每个人脸区域中的唇部运动视频;
基于所述每个人脸区域中的唇部运动视频确定所述N个用户的唇部运动信息。
在一种可能的实现方式中,所述服务器还包括处理器;所述处理器,还用于:
基于频谱识别算法,识别所述音频数据中的不同频谱的音频数据,并将目标频谱的音频数据识别为所述待识别的语音信息。
在一种可能的实现方式中,所述处理器为神经网络处理器;可选的,所述处理器完成的功能可以是由多个不同的处理器相互配合完成的,即所述处理器可以是由多个不同功能的处理器组合而成。
第九方面,本申请实施例还提供了一种语音匹配装置,包括处理器和存储器,所述存储器用于存储程序,所述处理器执行所述存储器存储的程序,当所述存储器存储的程序被执行时,使得所述处理器实现如第一方面所示的任一种方法或者第七方面所示的任一种方法。
第十方面,本申请实施例还提供了一种神经网络的训练装置,包括处理器和存储器,所述存储器用于存储程序,所述处理器执行所述存储器存储的程序,当所述存储器存储的程序被执行时,使得所述处理器实现如第二方面所示的任一种方法。
第十一方面,本申请实施例还提供了一种计算机可读存储介质,所述计算机可读介质用于存储程序代码,所述程序代码包括用于执行如第一方面、第二方面或第七方面所述的任意一种方法。
第十二方面,本申请实施例还提供了提供一种包含指令的计算机程序产品,当该计算机程序产品在计算机上运行时,使得计算机执行上述第一方面、第二方面或第七方面所述的任意一种方法。
第十三方面,提供一种芯片,所述芯片包括处理器与数据接口,所述处理器通过所述数据接口读取存储器上存储的指令,执行第一方面、第二方面或第七方面所述的任意一种方法。
可选地,作为一种实现方式,所述芯片还可以包括存储器,所述存储器中存储有指令,所述处理器用于执行所述存储器上存储的指令,当所述指令被执行时,所述处理器用于执行第一方面、第二方面或第七方面所述的任意一种方法。
第十四方面,本申请提供了一种芯片系统,该芯片系统包括处理器,用于支持智能设备实现上述第一方面中所涉及的功能,或者用于支持语音匹配装置实现上述第一方面或者第七方面中所涉及的功能。在一种可能的设计中,所述芯片系统还包括存储器,所述存储器,用于保存智能设备必要的程序指令和数据。该芯片系统,可以由芯片构成,也可以包含芯片和其他分立器件。
第十五方面,本申请提供了一种芯片系统,该芯片系统包括处理器,用于支持神经网 络的训练装置实现上述第二方面中所涉及的功能。在一种可能的设计中,所述芯片系统还包括存储器,所述存储器,用于保存神经网络的训练装置必要的程序指令和数据。该芯片系统,可以由芯片构成,也可以包含芯片和其他分立器件。
第十六方面,提供一种电子设备,该电子设备包括上述第五方面中的任意一个语音匹配装置。
第十七方面,提供一种电子设备,该电子设备包括上述第六方面中的任意一个神经网络的训练装置。
第十八方面,提供一种云端服务器,该云端服务器包括上述第八方面中的任意一个服务装置。
第十九方面,提供一种服务器,该服务器包括上述第八方面中的任意一个服务装置。
附图说明
为了更清楚地说明本申请实施例或背景技术中的技术方案,下面将对本申请实施例或背景技术中所需要使用的附图进行说明。
图1为本发明实施例提供的一种机器人与多人交互的场景示意图。
图2为本发明实施例提供的一种智能音箱与多人交互的场景示意图。
图3为本发明实施例提供的一种智能手机/智能手表与多人交互的场景示意图。
图4为本发明实施例提供了一种系统架构100。
图5为本发明实施例提供的一种卷积神经网络示意图。
图6是本发明实施例提供的一种神经网络处理器硬件结构图。
图7A是本发明实施例提供的一种神经网络的训练方法的流程示意图。
图7B为本发明实施例提供的一种训练样本采样示意图。
图8A是本发明实施例提供的一种语音匹配方法的流程示意图。
图8B为本发明实施例提供的一种声音波形示例图。
图8C为本发明实施例提供的一种使用梅尔频率倒谱系数进行语音特征提取的示意图。
图8D为本发明实施例提供的一种机器人与家庭成员交互的场景示意图。
图9A是本发明实施例提供的另一种语音匹配方法的流程示意图。
图9B为本发明实施例提供的一种第一模型和第二模型的结构示意图。
图9C为本发明实施例提供的一种第三模型的结构示意图。
图9D为本发明实施例提供的一种机器人交互场景示意图。
图9E为本发明实施例提供的一种功能模块以及整体流程的示意图。
图9F为本发明实施例提供的一种语音匹配系统架构图。
图10A是本发明实施例提供的一种智能设备的结构示意图。
图10B是本发明实施例提供的另一种智能设备的结构示意图。
图11是本发明实施例提供的一种语音匹配装置的结构示意图。
图12是本发明实施例提供的一种神经网络的训练装置的结构示意图。
图13是本发明实施例提供的另一种智能设备的结构示意图。
图14是本发明实施例提供的一种服务装置的结构示意图。
图15是本发明实施例提供的另一种语音匹配系统。
具体实施方式
下面将结合本发明实施例中的附图,对本发明实施例进行描述。
本申请的说明书和权利要求书及所述附图中的术语“第一”、“第二”、“第三”和“第四”等是用于区别不同对象,而不是用于描述特定顺序。此外,术语“包括”和“具有”以及它们任何变形,意图在于覆盖不排他的包含。例如包含了一系列步骤或单元的过程、方法、系统、产品或设备没有限定于已列出的步骤或单元,而是可选地还包括没有列出的步骤或单元,或可选地还包括对于这些过程、方法、产品或设备固有的其它步骤或单元。
在本文中提及“实施例”意味着,结合实施例描述的特定特征、结构或特性可以包含在本申请的至少一个实施例中。在说明书中的各个位置出现该短语并不一定均是指相同的实施例,也不是与其它实施例互斥的独立的或备选的实施例。本领域技术人员显式地和隐式地理解的是,本文所描述的实施例可以与其它实施例相结合。
在本说明书中使用的术语“部件”、“模块”、“系统”等用于表示计算机相关的实体、硬件、固件、硬件和软件的组合、软件、或执行中的软件。例如,部件可以是但不限于,在处理器上运行的进程、处理器、对象、可执行文件、执行线程、程序和/或计算机。通过图示,在计算设备上运行的应用和计算设备都可以是部件。一个或多个部件可驻留在进程和/或执行线程中,部件可位于一个计算机上和/或分布在2个或更多个计算机之间。此外,这些部件可从在上面存储有各种数据结构的各种计算机可读介质执行。部件可例如根据具有一个或多个数据分组(例如来自与本地系统、分布式系统和/或网络间的另一部件交互的二个部件的数据,例如通过信号与其它系统交互的互联网)的信号通过本地和/或远程进程来通信。
首先,对本申请中的部分用语进行解释说明,以便于本领域技术人员理解。
(1)位图(Bitmap):又称栅格图(Raster graphics)或点阵图,是使用像素阵列(Pixel-array/Dot-matrix点阵)来表示的图像。根据位深度,可将位图分为1、4、8、16、24及32位图像等。每个像素使用的信息位数越多,可用的颜色就越多,颜色表现就越逼真,相应的数据量越大。例如,位深度为1的像素位图只有两个可能的值(黑色和白色),所以又称为二值位图。位深度为8的图像有28(即256)个可能的值。位深度为8的灰度模式图像有256个可能的灰色值。RGB图像由三个颜色通道组成。8位/通道的RGB图像中的每个通道有256个可能的值,这意味着该图像有1600万个以上可能的颜色值。有时将带有8位/通道(bpc)的RGB图像称作24位图像(8位x 3通道=24位数据/像素)。通常将使用24位RGB组合数据位表示的位图称为真彩色位图。
(2)语音识别技术(AutomaticSpeech Recognition,ASR),也被称为自动语音识别,其目标是将人类的语音中的词汇内容转换为计算机可读的输入,例如按键、二进制编码或者字符序列。
(3)声纹(Voiceprint),是用电声学仪器显示的携带言语信息的声波频谱,是由波长、频率以及强度等百余种特征维度组成的生物特征。声纹识别是通过对一种或多种语音信号的特征分析来达到对未知声音辨别的目的,简单的说就是辨别某一句话是否是某一个人说 的技术。通过声纹可以确定出说话人的身份,从而进行有针对性的回答。
(4)梅尔频率倒谱系数(Mel Frequency Cepstrum Coefficient,MFCC)在声音处理领域中,梅尔频率倒谱(Mel-Frequency Cepstrum)是基于声音频率的非线性梅尔刻度(mel scale)的对数能量频谱的线性变换。梅尔频率倒谱系数(MFCC)广泛被应用于语音识别的功能。
(5)多路交叉熵(Multi-way cross-Entropy Loss)交叉熵描述了两个概率分布之间的距离,当交叉熵越小说明二者之间越接近。
(6)神经网络
神经网络可以是由神经单元组成的,神经单元可以是指以x s和截距1为输入的运算单元,该运算单元的输出可以为:
$$h_{W,b}(x)=f(W^{T}x)=f\left(\sum_{s=1}^{n}W_{s}x_{s}+b\right)$$
其中,s=1、2、……n,n为大于1的自然数,Ws为xs的权重,b为神经单元的偏置。f为神经单元的激活函数(activation functions),用于将非线性特性引入神经网络中,来将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层卷积层的输入。激活函数可以是sigmoid函数。神经网络是将许多个上述单一的神经单元联结在一起形成的网络,即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经单元组成的区域。
(7)深度神经网络
深度神经网络(deep neural network,DNN),也称多层神经网络,可以理解为具有很多层隐含层的神经网络,这里的“很多”并没有特别的度量标准。从DNN按不同层的位置划分,DNN内部的神经网络可以分为三类:输入层,隐含层,输出层。一般来说第一层是输入层,最后一层是输出层,中间的层数都是隐含层。层与层之间是全连接的,也就是说,第i层的任意一个神经元一定与第i+1层的任意一个神经元相连。虽然DNN看起来很复杂,但是就每一层的工作来说,其实并不复杂,简单来说就是如下线性关系表达式:
$$\vec{y}=\alpha(W\vec{x}+\vec{b})$$
其中，$\vec{x}$是输入向量，$\vec{y}$是输出向量，$\vec{b}$是偏移向量，W是权重矩阵（也称系数），$\alpha(\cdot)$是激活函数。每一层仅仅是对输入向量$\vec{x}$经过如此简单的操作得到输出向量$\vec{y}$。由于DNN层数多，则系数W和偏移向量$\vec{b}$的数量也就很多了。这些参数在DNN中的定义如下所述：以系数W为例：假设在一个三层的DNN中，第二层的第4个神经元到第三层的第2个神经元的线性系数定义为$W_{24}^{3}$，上标3代表系数W所在的层数，而下标对应的是输出的第三层索引2和输入的第二层索引4。总结就是：第L-1层的第k个神经元到第L层的第j个神经元的系数定义为$W_{jk}^{L}$。
需要注意的是,输入层是没有W参数的。在深度神经网络中,更多的隐含层让网络更能够刻画现实世界中的复杂情形。理论上而言,参数越多的模型复杂度越高,“容量”也就越大,也就意味着它能完成更复杂的学习任务。训练深度神经网络的也就是学习权重矩阵的过程,其最终目的是得到训练好的深度神经网络的所有层的权重矩阵(由很多层的向量W形成的权重矩阵)。
(8)卷积神经网络
卷积神经网络(CNN,convolutional neuron network)是一种带有卷积结构的深度神经网络。卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器。该特征抽取器可以看作是滤波器,卷积过程可以看作是使用一个可训练的滤波器与一个输入的图像或者卷积特征平面(feature map)做卷积。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层中,通常包含若干个特征平面,每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重,这里共享的权重就是卷积核。共享权重可以理解为提取图像信息的方式与位置无关。这其中隐含的原理是:图像的某一部分的统计信息与其他部分是一样的。即意味着在某一部分学习的图像信息也能用在另一部分上。所以对于图像上的所有位置,都能使用同样的学习得到的图像信息。在同一卷积层中,可以使用多个卷积核来提取不同的图像信息,一般地,卷积核数量越多,卷积操作反映的图像信息越丰富。
卷积核可以以随机大小的矩阵的形式初始化,在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。
(9)损失函数
在训练深度神经网络的过程中,因为希望深度神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有过程,即为深度神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断的调整,直到深度神经网络能够预测出真正想要的目标值或与真正想要的目标值非常接近的值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。
(10)反向传播算法
卷积神经网络可以采用误差反向传播(back propagation,BP)算法在训练过程中修正卷积神经网络中参数的大小,使得卷积神经网络的重建误差损失越来越小。具体地,前向 传递输入信号直至输出会产生误差损失,通过反向传播误差损失信息来更新卷积神经网络中参数,从而使误差损失收敛。反向传播算法是以误差损失为主导的反向传播运动,旨在得到最优的卷积神经网络的参数,例如权重矩阵。
(11)像素值
图像的像素值可以是一个红绿蓝(RGB)颜色值,像素值可以是表示颜色的长整数。例如,像素值为256*Red+100*Green+76Blue,其中,Blue代表蓝色分量,Green代表绿色分量,Red代表红色分量。各个颜色分量中,数值越小,亮度越低,数值越大,亮度越高。对于灰度图像来说,像素值可以是灰度值。
首先,为了便于理解本发明实施例,进一步分析并提出本申请所具体要解决的技术问题。在现有技术中,关于多人场景中的语音匹配识别包括多种技术方案,以下示例性的列举如下常用的两种方案。其中,
方案一:
声纹(Voiceprint),是用电声学仪器显示的携带言语信息的声波频谱,是由波长、频率以及强度等百余种特征维度组成的生物特征。声纹识别是通过对一种或多种语音信号的特征分析来达到对未知声音辨别的目的,简单的说就是辨别某一句话是否是某一个人说的技术。通过声纹可以确定出说话人的身份,从而进行有针对性的回答。主要分为两个阶段:注册阶段和验证阶段。其中,注册阶段:根据发音人语音的声纹特征,建立相应的声纹模型;验证阶段:接收发音人的语音,提取其声纹特征并与注册的声纹模型进行匹配,若匹配成功,则证明是原来注册的发音人。
该方案一的缺点:
声纹识别的应用有一些缺点,比如同一个人的声音具有易变性,易受身体状况、年龄、情绪等的影响;比如不同的麦克风和信道对识别性能有影响;比如环境噪音对识别有干扰;又比如混合说话人的情形下人的声纹特征不易提取。
方案二:
声源定位技术,是利用声学和电子装置接收目标声场信息以确定目标声源位置的一种技术。麦克风阵列的声源定位是指用麦克风阵列拾取声源信号,通过对多路声音信号进行分析与处理,在空间域中定取一个或者多个声源平面或空间坐标,即得到声源的位置。近一步控制麦克风阵列的波束对准说话人。
该方案二的缺点:
(1)机器人运动带来的麦克风阵列的运动,是机器人听觉与传统声源定位技术主要的差别所在,运动的麦克风阵列会面临即时变化的声学环境,要求声源定位系统具有较高的实时性。现在大多数声源定位系统的传感器数量较多,导致算法的计算复杂度较高。少量的麦克风和低复杂度的定位算法有待进一步探索。
(2)几乎所有的实用声源定位系统必然面临着复杂的声学环境,存在各种类型的噪声和混响。机器人工作在真实环境中,信号混响和噪声是难以避免的,因此声源定位系统的抗混响和抗噪声能力在很大程度上影响定位性能。现有的抗噪声技术大多只是针对某类或某几类噪声有效,一种鲁棒的、对各种噪声广泛适用的抗噪声技术或方案也还有待进一步 研究。
综上,上述两种方案若应用于多人场景中的语音匹配识别中,则无法准确识别当前讲话内容具体由哪个用户发出,因此也就无法实现更为精准、有效的人机交互。因此,本申请要解决的技术问题包括如下方面:在机器人或者其它智能设备与多用户面对面交互时(多人教学、游戏、家庭日常生活等),如何准确地区分不同说话人、并进行有针对性的个性化交互。
本申请实施例提供的语音匹配方法能够应用在智能机器人、智能音箱、智能穿戴设备、智能手机等智能设备的人机交互场景。以下示例性列举本申请中语音匹配方法所应用的人机交互场景,可以包括如下三个场景。
场景一,通过机器人实现人机交互:
请参阅图1,图1为本发明实施例提供的一种机器人与多人交互的场景示意图,该应用场景中包括智能机器人和多个用户(图1中以一群朋友,用户A、用户B、用户C、用户D和用户E参加聚会为例),假设,此时一群朋友中的用户C想要让智能机器人帮忙端上来一杯果汁,那么用户C发出语音请求“小智小智,请帮忙拿一杯果汁给我,谢谢”。此时,智能机器人小智首先需要采集当前聚会场景下的语音数据和视频数据,其中,语音数据则包含了上述用户C发出的语音请求“小智小智,请帮忙拿一杯果汁,谢谢”,视频数据则包含了在该语音请求同时间段内的所有用户的唇部运动信息,并基于上述待识别的语音信息和多个用户的唇部运动信息,进行处理、分析,确定发出上述语音请求的具体是用户A、用户B、用户C、用户D和用户E用户中的哪位用户,进一步地,智能机器人“小智”根据判断结果控制自身将指定果汁或者根据判断选择的果汁送到该用户C面前。
场景二,智能音箱的人机交互:
请参阅图2,图2为本发明实施例提供的一种智能音箱与多人交互的场景示意图,该应用场景中包括智能音箱和多个用户(图1中以一群小朋友,小朋友A、小朋友B、小朋友C、小朋友D小朋友E、小朋友F在操场上进行游戏为例),假设,此时,小朋友B想要让智能音箱帮忙点一首自己喜欢的歌曲来活跃气氛,那么小朋友B发出语音请求“小智小智,请播放一首我上次听过的《小太阳》”。由于名为《小太阳》的歌曲可能有很多,并且不同小朋友听过的版本可能不一致,因此,智能音箱需要做更多的判断,以确定当前所需要播放的究竟是哪一首的哪一个版本。此时,智能音箱小智首先需要采集当前游戏场景下的语音数据和视频数据,其中,语音数据则包含了上述小朋B发出的语音请求“小智小智,请播放一首我上次听过的《小太阳》”,视频数据则包含了在该语音请求同时间段内的所有小朋友的唇部运动信息,并基于上述待识别的语音信息和多个小朋友的唇部运动信息,进行处理、分析,确定发出上述语音请求的具体是小朋友A、小朋友B、小朋友C、小朋友D、小朋友E、小朋友F中的哪位小朋友,进一步地,智能音箱“小智”根据判断结果查找与该小朋友B的播放记录中的《小太阳》,并控制播放该首《小太阳》。
场景三,智能手机/智能手表的人机交互:
请参阅图3,图3为本发明实施例提供的一种智能手机/智能手表与多人交互的场景示意图,该应用场景中包括智能手机/智能手表和多个用户(图1中以一帮同学,同学A、同 学B、同学C在拼单吃饭为例),假设,同学A、同学B、同学C想通过智能手机或智能手表进行语音点餐,那么此时,可以通过某一个同学B的智能手机或智能手表进行点餐(拼单点餐可能有优惠,且下单方便),但是需要三个人对同学B的智能手机或智能手表发出语音点餐指令,如,同学B对着自己的智能手机说:“小智小智,我想要点一份餐厅XX的糖醋排骨,同学A对着同学B的智能手机说:“小智小智,我想要点一份餐厅XXXX的酸菜鱼”,同学C对着同学B的智能手机说:“小智小智,我想要点一份餐厅XXX的胡辣汤和紫菜包饭”那么在智能手机下单完成后,可能面临需要针对上述三位同学的点餐进行拆分收费的问题。那么此时,智能手机首先需要采集当前拼桌吃饭场景下的语音数据和视频数据,其中,语音数据则包含了上述同学A、同学B和同学C分别发出的语音点餐指令”,视频数据则包含了三位同学在发出上述语音指令同时间段内的所有三位同学的唇部运动信息,智能手机基于上述待识别的语音信息(三条点餐指令)和三个同学的唇部运动信息,进行处理、分析,确定发出上述三条语音点餐指令的分别对应的同学,并依据此,计算每个同学优惠后的点餐费用,并可通过智能手机上相应的客户端向同学A和同学C分别发起对应的用餐费用支付请求。至此,完成了多人语音拼单点餐并收费的功能。
可以理解的是,图1、图2和图3中的应用场景的只是本发明实施例中的几种示例性的实施方式,本发明实施例中的应用场景包括但不仅限于以上应用场景。本申请中的语音匹配方法还可以应用于,例如智能计算机与多人开会的交互、智能计算机与多人游戏的交互或智能房车与多人互动等场景,其它场景及举例将不再一一列举和赘述。
下面从模型训练侧和模型应用侧对本申请提供的方法进行描述:
本申请提供的任意一种神经网络的训练方法,涉及计算机听觉与视觉的融合处理,具体可以应用于数据训练、机器学习、深度学习等数据处理方法,对训练数据(如本申请中的训练用户的唇部运动信息以及M个语音信息)进行符号化和形式化的智能信息建模、抽取、预处理、训练等,最终得到训练好的目标特征匹配模型;并且,本申请提供的任意一种语音匹配方法可以运用上述训练好的目标特征匹配模型,将输入数据(如本申请中的待识别的语音信息以及N个用户的唇部运动信息)输入到所述训练好的目标特征匹配模型中,得到输出数据(如本申请中的N个用户的唇部运动信息分别与所述待识别的语音信息之间的匹配度)。需要说明的是,本申请实施例提供的一种神经网络的训练方法和一种语音匹配方法是基于同一个构思产生的发明,也可以理解为一个系统中的两个部分,或一个整体流程的两个阶段:如模型训练阶段和模型应用阶段。
参见附图4,图4是本发明实施例提供了一种系统架构100。如所述系统架构100所示,数据采集设备160用于采集训练数据,在本申请中,该数据采集设备160可以包括麦克风和摄像头。本发明实施例中训练数据(即模型训练侧的输入数据)可包括:视频样本数据和语音样本数据,即分别为本发明实施例中的训练用户的唇部运动信息以及M个语音信息,其中,所述M个语音信息可包括与所述训练用户的唇部运动信息所匹配的语音信息。例如,视频样本数据为某个训练用户在发出语音为:“今天天气特别好,我们去哪里玩?”时的唇部运动图像序列,而语音样本数据则为包含上述训练用户发出“今天天气特别好,我们去 哪里玩?”的语音波形序列(作为语音正样本)以及(M-1)个其它语音波形序列(作为语音负样本)。而上述视频样本数据和音频样本数据可以是由数据采集设备160采集的,也可以是从云端下载下来的,图1只是一种示例性的架构,并不对此进行限定。进一步地,数据采集设备160将训练数据存入数据库130,训练设备120基于数据库130中维护的训练数据训练得到目标特征匹配模型/规则101(此处的目标特征匹配模型101即为本发明实施例中的所述目标特征匹配模型,例如,为经过上述训练阶段训练得到的模型,可以用于语音和唇部运动轨迹之间的特征匹配的神经网络模型)。
下面将更详细地描述训练设备120如何基于训练数据得到目标特征匹配模型/规则101,该目标特征匹配模型/规则101能够用于实现本发明实施例提供任意一种语音匹配方法,即,将由数据采集设备160获取的音频数据和视频数据通过相关预处理后输入该目标特征匹配模型/规则101,即可得到多个用户的唇部运动的图像序列特征分别与待识别的语音特征之间的匹配度/置信度。本发明实施例中的目标特征匹配模型/规则101具体可以为时空卷积网络(STCNN),在本申请提供的实施例中,该时空卷积网络可以是通过训练卷积神经网络得到的。需要说明的是,在实际的应用中,所述数据库130中维护的训练数据不一定都来自于数据采集设备160的采集,也有可能是从其他设备接收得到的。另外需要说明的是,训练设备120也不一定完全基于数据库130维护的训练数据进行目标特征匹配模型/规则101的训练,也有可能从云端或其他地方获取训练数据进行模型训练,上述描述不应该作为对本发明实施例的限定。
如图1所示,根据训练设备120训练得到目标特征匹配模型/规则101,该目标特征匹配模型/规则101在本发明实施例中可以称之为视听交叉卷积神经网络(V&A Cross CNN)/时空卷积神经网络。具体的,本发明实施例提供的目标特征匹配模型可以包括:第一模型、第二模型和第三模型,其中第一模型用于进行语音特征的提取,第二模型用于多个用户(本申请中为N个用户)唇部运动的图像序列特征的提取,第三模型则用于上述语音特征和N个用户的图像序列特征之间的匹配度/置信度的计算。在本发明实施例提供的目标特征匹配模型中,所述第一模型、所述第二模型和所述第三模型都可以是卷积神经网络即可以理解为目标特征匹配模型/规则101自身可以看作是一个整体的时空卷积神经网络,而该时空卷积神经网络中又包含了多个独立网络,如上述第一模型、第二模型和第三模型,
根据训练设备120训练得到的目标特征匹配模型/规则101可以应用于不同的系统或设备中,如应用于图1所示的执行设备110,所述执行设备110可以是终端,如手机终端,平板电脑,笔记本电脑,增强现实(Augmented Reality,AR)/虚拟现实(Virtual Reality,VR),智能可穿戴设备、智能机器人、车载终端等,还可以是服务器或者云端等。在附图1中,执行设备110配置有I/O接口112,用于与外部设备进行数据交互,用户可以通过客户设备140(本申请中的客户设备也可以包括麦克风、摄像头等数据采集设备)向I/O接口112输入数据,所述输入数据(即模型应用侧的输入数据)在本发明实施例中可以包括:待识别的语音信息和N个用户的唇部运动信息,即分别为本发明实施例中的目标时间段内的语音波形序列和N个用户的唇部运动信息中的每一个用户的唇部运动信息包括对应的用户在所述目标时间段内唇部运动的图像序列。例如,当前需要识别一群人中具体是哪个人讲了“明天天气怎么样,适合到哪里出游”的语音信息,则该“明天天气怎么样,适合到哪里出游” 对应的语音波形序列,以及在场所有人对应的唇部运动的图像序列则作为输入数据。可以理解的是,此处的输入数据,可以是用户输入的,也可以是由相关数据库提供的,具体依据应用场景的不同而不同,本发明实施例对此不作具体限定。
在本发明实施例中,客户设备140可以和执行设备110在同一个设备上,数据采集设备160、数据库130和训练设备120也可以和执行设备110和客户设备140在同一个设备上。以本申请中的执行主体为机器人为例,机器人在通过客户设备140(包括麦克风和摄像头以及处理器)将采集的音频数据和视频数据,进行提取获得待识别的语音信息及N个用户的唇部运动信息之后,则可以通机器人内部的执行设备110,进一步对上述提取的语音信息和唇部运动信息之间进行特征匹配,最终输出结果至客户设备140,由客户设备140中的处理器分析得到所述待识别的语音信息在所述N个用户用所属的目标用户。并且,模型训练侧的设备(数据采集设备160、数据库130和训练设备120)可以在机器人内部,也可以在云端,当在机器人内部时,则可以认为机器人拥有可以实现模型训练或者模型更新优化的功能,此时,机器人既有模型训练侧的功能,又有模型应用侧的功能;当在云端,则可以认为机器人侧仅有模型应用侧的功能。可选的,客户设备140和执行设备110也可以不在同一个设备上,即采集音频数据和视频数据、以及提取待识别的语音信息和N个用户的唇部运动信息可以由客户设备140(例如智能手机、智能机器人等)来执行,而对待识别的语音信息和N个用户的唇部运动信息之间进行特征匹配的过程,则可以由执行设备110(例如云端服务器、服务器等)来执行。或者,可选的,采集音频数据和视频数据由客户设备140来执行,而提取待识别的语音信息和N个用户的唇部运动信息,以及对待识别的语音信息和N个用户的唇部运动信息之间进行特征匹配的过程均由执行设备110来完成。
在附图1中所示情况下,用户可以手动给定输入数据,该手动给定可以通过I/O接口112提供的界面进行操作。另一种情况下,客户设备140可以自动地向I/O接口112发送输入数据,如果要求客户设备140自动发送输入数据需要获得用户的授权,则用户可以在客户设备140中设置相应权限。用户可以在客户设备140查看执行设备110输出的结果,具体的呈现形式可以是显示、声音、动作等具体方式。客户设备140也可以作为数据采集端(例如为麦克风、摄像头),采集如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果作为新的样本数据,并存入数据库130。当然,也可以不经过客户设备140进行采集,而是由I/O接口112直接将如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果,作为新的样本数据存入数据库130。
预处理模块113用于根据I/O接口112接收到的输入数据(如所述语音数据)进行预处理,在本发明实施例中,预处理模块113可以用于对语音数据进行预处理,例如从语音数据中提取待识别的语音信息。
预处理模块114用于根据I/O接口112接收到的输入数据,如(所述视频数据)进行预处理,在本发明实施例中,预处理模块114可以用于对视频数据进行预处理,例如从视频数据中提取与上述待识别的语音信息对应的N个用户的唇部运动信息。
在执行设备110对输入数据进行预处理,或者在执行设备110的计算模块111执行计算等相关的处理过程中,执行设备110可以调用数据存储系统150中的数据、代码等以用于相应的处理,也可以将相应处理得到的数据、指令等存入数据存储系统150中。最后, I/O接口112将输出结果,如本发明实施例中的N个用户的唇部运动信息分别与待识别的语音信息之间的匹配度,或者其中最高的一个匹配度的目标用户ID返回给客户设备140,客户设备140从而根据上述匹配度,确定目标用户的用户信息,从而基于该用户信息生成与该用户信息匹配的控制指令。
值得说明的是,训练设备120可以针对不同的目标或称不同的任务,基于不同的训练数据生成相应的目标特征匹配模型/规则101,该相应的目标特征匹配模型/规则101即可以用于实现上述目标或完成上述任务,从而为用户提供所需的结果。
值得注意的是,附图1仅是本发明实施例提供的一种系统架构的示意图,图中所示设备、器件、模块等之间的位置关系不构成任何限制,例如,在附图1中,数据存储系统150相对执行设备110是外部存储器,在其它情况下,也可以将数据存储系统150置于执行设备110中。
基于上述系统架构的介绍,以下描述本发明实施例中模型训练侧和模型应用侧所涉及的神经网络模型机即卷积神经网络,卷积神经网络CNN是一种带有卷积结构的深度神经网络,是一种深度学习(deep learning)架构,深度学习架构是指通过机器学习的算法,在不同的抽象层级上进行多个层次的学习。作为一种深度学习架构,CNN是一种前馈(feed-forward)人工神经网络,该前馈人工神经网络中的各个神经元对输入其中的图像中的重叠区域作出响应。
如图5所示,图5为本发明实施例提供的一种卷积神经网络示意图,卷积神经网络(CNN)200可以包括输入层210,卷积层/池化层220,以及神经网络层230,其中池化层为可选的。
卷积层/池化层220:
如图1所示卷积层/池化层120可以包括如示例221-226层,在一种实现中,221层为卷积层,222层为池化层,223层为卷积层,224层为池化层,225为卷积层,226为池化层;在另一种实现方式中,221、222为卷积层,223为池化层,224、225为卷积层,226为池化层。即卷积层的输出可以作为随后的池化层的输入,也可以作为另一个卷积层的输入以继续进行卷积操作。
卷积层:
以卷积层221为例,卷积层221可以包括很多个卷积算子,卷积算子也称为核,其在图像处理中的作用相当于一个从输入图像矩阵中提取特定信息的过滤器,卷积算子本质上可以是一个权重矩阵,这个权重矩阵通常被预先定义,在对图像进行卷积操作的过程中,权重矩阵通常在输入图像上沿着水平方向一个像素接着一个像素(或两个像素接着两个像素……,这取决于步长stride的取值)的进行处理,从而完成从图像中提取特定特征的工作。该权重矩阵的大小应该与图像的大小相关,需要注意的是,权重矩阵的纵深维度(depth dimension)和输入图像的纵深维度是相同的,在进行卷积运算的过程中,权重矩阵会延伸到输入图像的整个深度。因此,和一个单一的权重矩阵进行卷积会产生一个单一纵深维度的卷积化输出,但是大多数情况下不使用单一权重矩阵,而是应用维度相同的多个权重矩阵。每个权重矩阵的输出被堆叠起来形成卷积图像的纵深维度。不同的权重矩阵可以用来提取图像中不同的特征,例如一个权重矩阵用来提取图像边缘信息,另一个权重矩阵用来 提取图像的特定颜色,又一个权重矩阵用来对图像中不需要的噪点进行模糊化……该多个权重矩阵维度相同,经过该多个维度相同的权重矩阵提取后的特征图维度也相同,再将提取到的多个维度相同的特征图合并形成卷积运算的输出。
这些权重矩阵中的权重值在实际应用中需要经过大量的训练得到,通过训练得到的权重值形成的各个权重矩阵可以从输入图像中提取信息,从而帮助卷积神经网络200进行正确的预测。
当卷积神经网络200有多个卷积层的时候,初始的卷积层(例如221)往往提取较多的一般特征,该一般特征也可以被称为低级别的特征;随着卷积神经网络200深度的加深,越往后的卷积层(例如226)提取到的特征越来越复杂,比如高级别的语义之类的特征,语义越高的特征越适用于待解决的问题。
池化层:
由于常常需要减少训练参数的数量,因此卷积层之后常常需要周期性的引入池化层,即如图1中220所示例的221-226各层,可以是一层卷积层后面跟一层池化层,也可以是多层卷积层后面接一层或多层池化层。在图像处理过程中,池化层的唯一目的就是减少图像的空间大小。池化层可以包括平均池化算子和/或最大池化算子,以用于对输入图像进行采样得到较小尺寸的图像。平均池化算子可以在特定范围内对图像中的像素值进行计算产生平均值。最大池化算子可以在特定范围内取该范围内值最大的像素作为最大池化的结果。另外,就像卷积层中用权重矩阵的大小应该与图像大小相关一样,池化层中的运算符也应该与图像的大小相关。通过池化层处理后输出的图像尺寸可以小于输入池化层的图像的尺寸,池化层输出的图像中每个像素点表示输入池化层的图像的对应子区域的平均值或最大值。
神经网络层130:
在经过卷积层/池化层220的处理后,卷积神经网络100还不足以输出所需要的输出信息。因为如前所述,卷积层/池化层220只会提取特征,并减少输入图像带来的参数。然而为了生成最终的输出信息(所需要的类信息或别的相关信息),卷积神经网络200需要利用神经网络层230来生成一个或者一组所需要的类的输出。因此,在神经网络层230中可以包括多层隐含层(如图1所示的231、232至23n)以及输出层240,该多层隐含层中所包含的参数可以根据具体的任务类型的相关训练数据进行预先训练得到,例如该任务类型可以包括图像识别,图像分类,图像超分辨率重建等等……
在神经网络层230中的多层隐含层之后,也就是整个卷积神经网络200的最后层为输出层240,该输出层240具有类似分类交叉熵的损失函数,具体用于计算预测误差,一旦整个卷积神经网络200的前向传播(图1中由210至240的传播为前向传播)完成,反向传播(图1中由240至210的传播为反向传播)就会开始更新前面提到的各层的权重值以及偏差,以减少卷积神经网络200的损失及卷积神经网络200通过输出层输出的结果和理想结果之间的误差。
需要说明的是,如图5所示的卷积神经网络200仅作为一种卷积神经网络的示例,在具体的应用中,卷积神经网络还可以以其他网络模型的形式存在,例如,多个卷积层/池化层并行,将分别提取的特征均输入给全神经网络层230进行处理。
本申请中的归一化层,作为CNN的功能层,原则上可以在上述CNN中的任何一层之后,或者任何一层之前进行,并以上一层输出的特征矩阵作为输入,其输出也可以作为CNN中任何一层功能层的输入。但在实际CNN应用中,归一化层一般在卷积层之后进行,并以前面卷积层输出的特征矩阵作为输入矩阵。
基于上述图4和图5中对系统架构100以及对卷积神经网络200的相关功能描述,请参见图6,图6是本发明实施例提供的一种神经网络处理器硬件结构图,其中,
神经网络处理器NPU 302作为协处理器挂载到CPU(如Host CPU)301上,由Host CPU301分配任务。例如,对应到上述系统架构100中,在本申请中CPU 301可位于客户设备140中,用于从语音数据和视频数据中提取待识别的语音信息和N个用户的唇部运动信息;而NPU 302则可以位于计算模块111中,用于对上述CPU 301提取后的待识别的语音信息和N个用户的唇部运动信息进行特征提取以及特征匹配,从而将匹配结果发送至CPU 301中进行进一步的计算处理,此处不作详细描述。可以理解的是,上述CPU和NPU可以位于不同的设备中,其依据产品的实际需求可以进行不同的设置。例如,NPU位于云端服务器上,而CPU则位于用户设备(如智能手机、智能机器人)上;或者,CPU和NPU均位于客户设备上(如智能手机、智能机器人等)。
NPU 302的核心部分为运算电路3023,通过控制器3024控制运算电路3023提取存储器中的矩阵数据并进行乘法运算。
在一些实现中,运算电路3023内部包括多个处理单元(Process Engine,PE)。在一些实现中,运算电路3023是二维脉动阵列。运算电路3023还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路3023是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路从权重存储器3022中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器3021中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器3028 accumulator中。
统一存储器3026用于存放输入数据以及输出数据。权重数据直接通过存储单元访问控制器30212 Direct Memory Access Controller,DMAC被搬运到权重存储器3022中。输入数据也通过DMAC被搬运到统一存储器3026中。
BIU为Bus Interface Unit即,总线接口单元1210,用于AXI总线与DMAC和取指存储器3029 Instruction Fetch Buffer的交互。
总线接口单元1210(Bus Interface Unit,简称BIU),用于取指存储器3029从外部存储器获取指令,还用于存储单元访问控制器30212从外部存储器获取输入矩阵A或者权重矩阵B的原数据。
DMAC主要用于将外部存储器DDR中的输入数据搬运到统一存储器3026或将权重数据搬运到权重存储器3022中或将输入数据搬运到输入存储器3021中。
向量计算单元3027多个运算处理单元,在需要的情况下,对运算电路的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。主要用于神经网络中 非卷积/FC层网络计算,如Pooling(池化),Batch Normalization(批归一化),Local Response Normalization(局部响应归一化)等。
在一些实现种,向量计算单元能3027将经处理的输出的向量存储到统一缓存器3026。例如,向量计算单元3027可以将非线性函数应用到运算电路3023的输出,例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元3027生成归一化的值、合并值,或二者均有。在一些实现中,处理过的输出的向量能够用作到运算电路3023的激活输入,例如用于在神经网络中的后续层中的使用。
控制器3024连接的取指存储器(instruction fetch buffer)3029,用于存储控制器3024使用的指令;
统一存储器3026,输入存储器3021,权重存储器3022以及取指存储器3029均为On-Chip存储器。外部存储器私有于该NPU硬件架构。
可以理解的是,关于本申请中的任一语音匹配方法实施例中针对待识别的语音信息以及N个用户的唇部运动信息的特征提取、以及特征匹配,以及关于本申请中的任一神经网络的训练方法实施例中的训练用户的唇部运动信息以及所M个语音信息的特征提取、以及特征匹配等相关功能,均由可上述神经网络处理器302(NPU)中相关的功能单元进行实现,或者由上述处理器301和上述神经网络处理器302共同协作实现,此处不作详述。
下面结合上述应用场景、系统架构、卷积神经网络的结构、神经网络处理器的结构,从模型训练侧和模型应用侧对本申请提供的神经网络的训练方法、语音匹配方法的实施例进行描述,以及对本申请中提出的技术问题进行具体分析和解决。
参见图7A,图7A是本发明实施例提供的一种神经网络的训练方法的流程示意图,该方法可应用于上述图1、图2或图3中所述的应用场景及系统架构中,具体可应用于上述图4的训练设备120中。下面结合附图7A以执行主体为上述图4中的训练设备120或者包含训练设备120的设备为例进行描述。该方法可以包括以下步骤S701-步骤S702。
S701:获取训练样本,所述训练样本包括训练用户的唇部运动信息以及M个语音信息。
具体地,例如,训练用户的唇部运动信息为用户小方发出语音信息:“你好,我的名字叫小方,来自中国湖南,你呢?”所对应的唇部运动信息也即是唇部运动视频或唇部连续运动的图像序列,那么,所述M个语音信息则包括上述“你好,我的名字叫小方,来自中国湖南,你呢?”的语音信息作为语音正样本,以及其它的语音信息,如“你好,你晚上吃饭了吗”、“今天天气真不错,你去哪里玩了?”“帮忙查找一下从湖南到北京的旅游路线?”等语音信息作为负样本。可选的,所述M个语音信息包括与所述训练用户的唇部运动信息所匹配的语音信息以及(M-1)个与所述训练用户的唇部运动信息不匹配的语音信息。进一步可选的,所述训练用户的唇部运动信息包括所述训练用户的唇部运动图像序列,所述M个语音信息包括一个与所述训练用户的唇部运动图像序列匹配的语音波形序列以及(M-1)个与所述训练用户的唇部运动图像序列不匹配的语音波形序列。例如,上述唇部运动信息为用户小方在发出语音信息:“你好,我的名字叫小方,来自中国湖南,你呢?”所对应的连续的唇部运动的图像序列(即发音口型的视频),而上述M个语音信息则包括上述语音正样本的语音波形序列,和M-1个负样本的语音波形序列。可以理解的是,上述 M个语音信息中也可以包括多个正样本和负样本,即对正样本和负样本的数量不作具体限定,只要均包含即可。
例如,如图7B所示,图7B为本发明实施例提供的一种训练样本采样示意图,该初始化的神经网络模型可以基于自监督的训练方式进行训练,无需额外标注。训练过程采样策略见图7B:实线矩形显示与上面的说话人面相对应的音频段(即为正样本),虚线矩形显示不匹配的音频段(即为负样本,负样本的生成是由正样本的基础上,音频偏移±Δt得到)。
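As an illustration of the sampling strategy sketched in FIG. 7B, the following is a minimal sketch of drawing a positive audio segment aligned with the speaker's lip sequence and a negative segment offset by a random ±Δt; the segment length and maximum shift are illustrative parameters, not values from the patent.

```python
import random

def sample_training_pair(audio, lip_start, seg_len, max_shift, positive=True):
    """Cut the audio segment aligned with the speaker's lip sequence (positive
    sample), or a same-length segment offset by a random +/- delta-t
    (negative sample), mirroring the FIG. 7B sampling strategy."""
    if positive:
        start = lip_start
    else:
        delta = random.randint(1, max_shift) * random.choice([-1, 1])
        start = min(max(lip_start + delta, 0), len(audio) - seg_len)
    label = 1.0 if positive else 0.0
    return audio[start:start + seg_len], label
```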
S702:以所述训练用户的唇部运动信息以及所述M个语音信息为训练输入,以所述训练用户的唇部运动信息分别与所述M个语音信息之间的匹配度为M个标签,对初始化的神经网络进行训练,得到目标特征匹配模型。
具体地,例如,上述训练用户的唇部运动信息与正样本的语音信息“你好,我的名字叫小方,来自中国湖南,你呢?”之间的标签为“匹配度=1”,而上述训练用户的唇部运动信息与其他负样本的语音信息“你好,你晚上吃饭了吗”、“今天天气真不错,你去哪里玩了?”“帮忙查找一下从湖南到北京的旅游路线?”之间的标签为“匹配度=0.2”、“匹配度=0”“匹配度=0”等,此处不再赘述。也即是通过上述训练输入和预先设置的标签,可以将初始化的神经网络模型训练得到本申请中所需要使用的目标特征匹配模型,该目标特征匹配模型可以用于匹配待识别的语音信息与多个用户的唇部运动信息之间的匹配关系,用于实现本申请中的任意一种语音匹配方法。
在一种可能的实现方式中,所述以所述训练用户的唇部运动信息以及所述M个语音信息为训练输入,以所述训练用户的唇部运动信息分别与所述M个语音信息之间的匹配度为M个标签,对初始化的神经网络进行训练,得到目标特征匹配模型,包括:将所述训练用户的唇部运动信息以及所述M个语音信息输入到所述初始化的神经网络中,计算得到所述M个语音信息分别与所述训练用户的唇部运动信息之间的匹配度;将计算得到的所述M个语音信息分别与所述训练用户的唇部运动信息之间的匹配度与所述M个标签进行比较,对初始化的神经网络进行训练,得到目标特征匹配模型。
在一种可能的实现方式中,所述目标特征匹配模型包括第一模型、第二模型和第三模型;所述将所述训练用户的唇部运动信息以及所述M个语音信息输入到所述初始化的神经网络中,计算得到所述M个语音信息分别与所述训练用户的唇部运动信息之间的匹配度,包括:将所述M个语音信息输入到所述第一模型中,得到M个语音特征,所述M个语音特征中的每一个语音特征均为K维语音特征,K为大于0的整数;将所述训练用户的唇部运动信息输入到所述第二模型中,得到所述训练用户的图像序列特征,所述训练用户的图像序列特征为K维图像序列特征;将所述M个语音特征和所述训练用户的图像序列特征输入到第三模型中,计算得到所述M个语音特征分别与所述训练用户的图像序列特征之间的匹配度。
关于上述具体如何从初始化的神经网络模型训练成为本申请中的目标特征匹配模型,在后续图7A-图9F对应的模型应用侧的方法实施例中一并进行描述,此处不作详述。
本发明实施例,通过将某个训练用户的唇部运动信息,以及与之匹配的语音信息和多个不匹配的语音信息作为初始化的神经网络的输入,并以上述M个语音信息与该训练用户的唇部运动信息的实际匹配度作为标签,对上述初始化的神经网络模型进行训练,得到目标特征匹配模型。例如,完全匹配对应的匹配度即标签为1,不匹配对应的匹配度即标签为0。当通过训练后的神经网络计算得到的训练用户的唇部运动信息分别与M个语音信息之间的匹配度越接近所述M个标签,则该训练后的神经网络越接近所述目标特征匹配模型。
参见图8A,图8A是本发明实施例提供的一种语音匹配方法的流程示意图,该方法可应用于上述图1、图2或图3中所述的应用场景及系统架构中,以及具体可应用于上述图4的客户设备140以及执行设备110中,可以理解的是,客户设备140和执行设备110可以在同一个物理设备上,例如智能机器人、智能手机、智能终端、智能可穿戴设备等。下面结合附图8A以执行主体为包含上述客户设备140以及执行设备110的智能设备为例进行描述。该方法可以包括以下步骤S801-步骤S805。
步骤S801:获取音频数据以及视频数据。
具体地,获取音视频数据,例如,智能设备通过麦克风获取语音数据,通过摄像头获取视频数据,即获取某个时间段内的原始音频数据及视频数据(音频数据源和视频数据源)。可选的,该音频数据和视频数据是针对同一个场景下的同一个时间段内所采集的,也即是音频数据为该视频数据对应的音频数据。比如,机器人通过处理器获取通过麦克风采集的某个场景下某个时间段内的语音数据,以及通过摄像头采集的该场景下该时间段内的视频数据。
步骤S802:从所述音频数据中提取待识别的语音信息。
具体地,所述待识别的语音信息包括目标时间段内的语音波形序列。即在本发明实施例中,待识别的语音信息包括在具体某个时间段内的语音波形序列。可选的,音频数据的格式为非压缩wav格式,而wav是最常见的声音文件格式之一,是一种标准数字音频文件,该文件能记录各种单声道或立体声的声音信息,并能保证声音不失真。一般来说,由wav文件还原而成的声音的音质取决于声音卡采样样本的尺寸,采样频率越高,音质就越好,但开销就越大,wav文件也就越大,如图8B所示,图8B为本发明实施例提供的一种声音波形示例图。
在一种可能的实现方式中,假设音频数据中有多个用户在同时讲话,那么此时需要判断其中的某一段语音信息是由哪个用户发出的,则需要先识别、提取音频数据中的目标语音信息即上述待识别的语音信息。或者,假设该音频数据中包括了某一个用户讲的多段语音信息,而智能设备只需要识别其中某一段语音信息,则该段语音信息为待识别的语音信息。例如,智能设备从S801中的麦克风阵列获取的音频数据中提取音频特征,具体方法如图8C所示,图8C为本发明实施例提供的一种使用梅尔频率倒谱系数进行语音特征提取的示意图。使用梅尔频率倒谱系数(MFCC)对帧长为20ms的数据提取40维的特征,帧与帧之间没有重叠(non-overlapping),每15帧(对应0.3秒的音频片段)拼接(concat)为一个维数为15×40×3的cube(其中,15是时序特征,40×3是2d空间特征),以该0.3s内的语音波形序列作为音频特征的输入,其中0.3s则为目标时间段。
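按照上述参数,下面给出一段示意性的音频特征提取草图(基于librosa)。其中“40×3”中的3按MFCC及其一阶、二阶差分来示意,原文未明确给出,属于假设;采样率等参数也仅为示意取值:

```python
import numpy as np
import librosa

def extract_audio_cubes(wav_path, sr=16000):
    y, sr = librosa.load(wav_path, sr=sr)
    hop = int(0.02 * sr)                               # 20ms帧长,帧与帧之间不重叠
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40, n_fft=hop, hop_length=hop)
    d1 = librosa.feature.delta(mfcc)                   # 一阶差分(假设“×3”为MFCC+一阶+二阶差分)
    d2 = librosa.feature.delta(mfcc, order=2)          # 二阶差分
    feats = np.stack([mfcc, d1, d2], axis=-1)          # [40, 帧数, 3]

    cubes = []
    for t in range(0, feats.shape[1] - 15 + 1, 15):    # 每15帧(0.3秒)拼接为一个cube
        cubes.append(feats[:, t:t + 15, :].transpose(1, 0, 2))  # [15, 40, 3]
    return np.array(cubes)
```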
在一种可能的实现方式中,如何从所述音频数据中提取待识别的语音信息,包括:基于频谱识别算法,识别所述音频数据中的不同频谱的音频数据,并将目标频谱的音频数据 识别为所述待识别的语音信息。由于不同用户发出的声音对应的频谱在一般情况下是不同的,因此,本发明实施例,通过从音频数据中先识别出不同频谱的音频数据,然后再将目标频谱的音频数据识别为待识别的语音信息,进而实现从音频数据中提取待识别语音信息的功能。
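上述“频谱识别算法”在本文中未给出具体实现,下面仅给出一个极简的假设性示意:按帧的频谱特征做聚类,把目标簇对应的帧视为目标频谱(待识别语音)所在的帧;实际系统中可以替换为任意说话人分离或频谱识别方案:

```python
import numpy as np
import librosa
from sklearn.cluster import KMeans

def pick_target_frames(y, sr=16000, n_speakers=2, target_cluster=0):
    """返回一个布尔掩码,标记哪些帧属于目标频谱(假设性的简化实现)."""
    S = np.abs(librosa.stft(y, n_fft=512, hop_length=256))   # 幅度谱 [频点, 帧数]
    frame_feats = librosa.amplitude_to_db(S).T                # 每帧的频谱特征 [帧数, 频点]
    labels = KMeans(n_clusters=n_speakers, n_init=10).fit_predict(frame_feats)
    return labels == target_cluster                           # 目标频谱对应的帧
```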
步骤S803:从所述视频数据中提取N个用户的唇部运动信息。
具体地,所述N个用户的唇部运动信息中的每一个用户的唇部运动信息包括对应的用户在所述目标时间段内唇部运动的图像序列,N为大于1的整数。即从原始视频数据中提取的各个用户的唇部视频,即连续的唇部运动的图像序列,包含了对应用户的连续的口型变化特征。可选的,其中的用户可以包括人,也可以包括其它能发出语音信息的机器人、生化人、玩具宠物、真实宠物等。例如,通过摄像头采集的视频数据中的每一帧图像的格式为24位BMP位图,其中,BMP图像文件(Bitmap-File)格式是Windows采用的图像文件存储格式,而24位图像则是使用3字节保存颜色值,每一个字节代表一种颜色,按红(R)、绿(G)、蓝(B)排列,并将RGB彩色图像转换成灰度图。智能设备从上述摄像头采集的视频数据中,基于人脸识别算法,获取至少一个人脸区域,并进一步以每个人脸区域为单位,为每个人脸区域赋予一个人脸ID(以图8D描述的场景为例,图8D为本发明实施例提供的一种机器人与家庭成员交互的场景示意图,获取到的视频可以提取出6个人脸ID),提取嘴部区域的视频序列流,其中,视频的帧速率为30f/s(帧率(Frame rate)=帧数(Frames)/时间(Time),单位为帧每秒(f/s,frames per second,fps))。9个连续的图像帧形成0.3秒的视频流。将9帧图像视频数据(视频速度30fps)拼接(concat)为一个尺寸为9×60×100的cube,其中9是表示时间信息的帧数(时序特征)。每个通道都是口腔区域的60×100灰度图像(2d空间特征)。以此N个用户分别对应的0.3s内的唇部运动的图像序列作为视频特征的输入,其中0.3s则为目标时间段。
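下面给出一段示意性的唇部区域提取草图(基于OpenCV):用Haar级联人脸检测近似代替上文的人脸识别算法,按人脸框下三分之一近似取嘴部区域并缩放为60×100灰度图,9帧拼为一个9×60×100的cube。其中检测器的选择、嘴部区域的取法均为假设,仅用于说明流程:

```python
import cv2
import numpy as np

face_det = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def mouth_cube(frames):
    """frames: 同一人脸ID的9帧BGR图像(30fps下约0.3秒)."""
    crops = []
    for img in frames:
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)           # 彩色图像转换成灰度图
        faces = face_det.detectMultiScale(gray, 1.1, 5)
        if len(faces) == 0:
            continue
        x, y, w, h = faces[0]
        mouth = gray[y + 2 * h // 3: y + h, x: x + w]          # 近似的嘴部区域
        crops.append(cv2.resize(mouth, (100, 60)))             # 60×100灰度图(2d空间特征)
    return np.stack(crops) if len(crops) == 9 else None        # [9, 60, 100]
```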
在一种可能的实现方式中,具体如何从所述视频数据中提取N个用户的唇部运动信息,包括:基于人脸识别算法,识别所述视频数据中的N个人脸区域,提取所述N个人脸区域中每个人脸区域中的唇部运动视频;基于所述每个人脸区域中的唇部运动视频确定所述N个用户的唇部运动信息。本发明实施例,通过从视频数据中先识别出人脸区域,然后再基于该人脸区域提取每个人脸区域中的唇部运动视频,然后再依据唇部运动视频确定N个用户的唇部运动信息也即是对应用户的唇部运动图像序列。例如,智能设备从摄像头获取的视频数据中,基于人脸识别算法,获取至少一个人脸区域,并进一步以每个人脸区域为单位,提取嘴部区域的视频序列流。
步骤S804:将所述待识别的语音信息和所述N个用户的唇部运动信息输入到目标特征匹配模型中,得到所述N个用户的唇部运动信息分别与所述待识别的语音信息之间的匹配度。
具体地,将上述目标时间段内的语音波形序列和N个用户各自在所述目标时间段内唇部运动的图像序列,分别作为音频特征的输入和视频特征的输入,输入到目标特征匹配模型中,分别计算该语音特征分别和N个用户的唇部运动特征之间的匹配度。例如,若相匹配,计算出的匹配度值为1,若不匹配,则计算出的匹配度值为0。
步骤S805:将匹配度最高的用户的唇部运动信息对应的用户,确定为所述待识别的语音信息所属的目标用户。
具体地,将上述待识别的语音信息与上述N个用户的唇部运动信息(也即是N个用户的讲话口型)进行匹配,从中匹配出最有可能发出上述待识别的语音信息的发音口型,也即是对应的唇部运动信息。例如,将上述待识别的语音信息以及N个用户的唇部运动信息输入到训练好的神经网络模型(如本申请中的目标特征匹配模型)中,从所述N个用户中确定所述待识别的语音信息所属的目标用户。
本发明实施例,当应用于多人对话场景中时,可以通过采集音频数据和视频数据,并对音频数据中的语音信息和视频数据中的唇部运动信息进行匹配,从而确定出待识别的语音信息所属的目标用户。即在多人的场景中通过语音特征与多个用户的唇部运动特征进行匹配,识别某段待识别的语音信息具体是由哪个用户发出的,从而可以基于该识别结果进行进一步的控制或操作。区别于现有技术中的声纹识别技术或者声源定位技术,本发明实施例不依赖于人的声音(声纹易受身体状况、年龄、情绪等的影响),不受环境干扰(如环境中的噪声干扰等),抗干扰能力强,识别效率和准确度高。其中,待识别的语音信息包括在具体某个时间段内的语音波形序列,而N个用户的唇部运动信息则包括多个用户在同一场景下的该时间段内的唇部运动的图像序列(即唇部运动的视频),便于后续进行相关的特征提取和特征匹配。而采用目标特征匹配模型,将待识别的语音信息以及N个用户的唇部运动信息作为该目标特征匹配模型的输入,并且将N个用户的唇部运动信息分别与待识别的语音信息之间的匹配度作为该目标特征匹配模型的输出,进而根据匹配度确定出该待识别的语音信息所属的目标用户。可选的,该目标特征匹配模型为神经网络模型。
参见图9A,图9A是本发明实施例提供的另一种语音匹配方法的流程示意图,该方法可应用于上述图1、图2或图3中所述的应用场景及系统架构中,以及具体可应用于上述图4的客户设备140以及执行设备110中,可以理解的是,客户设备140和执行设备110可以在同一个物理设备上,例如智能机器人、智能手机、智能终端、智能可穿戴设备等。下面结合附图9A以执行主体为包含上述客户设备140以及执行设备110的智能设备为例进行描述。该方法可以包括以下步骤S901-步骤S905。可选的,还可以包括步骤S906-步骤S907。
步骤S901:获取音频数据以及视频数据,所述音频数据与所述视频数据为针对同一场景下采集的。
步骤S902:从所述音频数据中提取待识别的语音信息。
步骤S903:从所述视频数据中提取N个用户的唇部运动信息,N为大于1的整数;
具体地,步骤S901-步骤S903的功能可以参照上述步骤S801-步骤S803的相关描述,此处不再赘述。
步骤S904:将所述待识别的语音信息和所述N个用户的唇部运动信息输入到目标特征匹配模型中,得到所述N个用户的唇部运动信息分别与所述待识别的语音信息之间的匹配度。
具体地,由于声音和视觉特征是两种本质上差异很大的模态,而且原始帧速率通常不一样,音频为每秒100帧,而视频为每秒24帧。采用直接拼接的方法会造成信息损失,使 得某一种特征(如语音特征或者是视频特征)在模型训练过程中起到主导作用,造成模型训练难以收敛,最终导致语音信息和唇部运动信息难以准确匹配。因此本发明实施例,可以分别使用两个神经网络编码器对输入的不同模态的序列进行逐层特征抽取,得到高层特征表达。具体方法如下,所述目标特征匹配模型包括第一模型、第二模型和第三模型;所述将所述待识别的语音信息和所述N个用户的唇部运动信息输入到目标特征匹配模型中,得到所述N个用户的唇部运动信息分别与所述待识别的语音信息之间的匹配度,包括以下步骤S904-A至步骤S904-C:
步骤S904-A:将所述待识别的语音信息输入到所述第一模型中,得到语音特征,所述语音特征为K维语音特征,K为大于0的整数;
步骤S904-B:将所述N个用户的唇部运动信息输入到所述第二模型中,得到N个图像序列特征;所述N个图像序列特征中的每一个图像序列特征为K维图像序列特征。
具体地,在上述步骤S904-A和S904-B中,智能设备将上述提取到的视频序列和音频序列作为输入,并分别通过第一模型和第二模型对输入的音频序列和视频序列进行特征归一化,即将音频特征和视频特征提取为相同维度的特征(均为K维特征),以便于进行后续的特征匹配,进而进一步确定与待识别语音信息相匹配的唇部运动信息对应的人脸ID,具体流程如下:
本申请中的目标特征匹配网络分为两部分,分别为特征分析子网络和特征匹配子网络,而本发明实施例中的第一模型和第二模型均为特征分析子网络,第三模型为特征匹配子网络。其中,
1)特征分析子网络(也可称之为特征分析子模型)
如图9B所示,图9B为本发明实施例提供的一种第一模型和第二模型的结构示意图,其功能为将视频序列流和音频特征进行特征归一化,其归一化的网络类型是STCNN(时空卷积网络)。左侧第二模型处理视频流,具体构成是3个3D卷积层加池化层(conv+pooling层),后接2个全连接层(FC层),输出为连续视频序列的64维特征表示。右侧第一模型处理音频流,具体构成是2个3D卷积加池化层(conv+pooling层)、2个3D卷积层(conv层)和一个全连接层(FC层)。音视频均使用时空卷积网络(STCNN)提取2d空间特征和时序特征,输出为连续语音序列的64维特征表示,即K等于64。
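按照上述文字描述,下面给出特征分析子网络的一个示意性PyTorch结构草图。卷积核大小、通道数、池化窗口等超参数原文未给出,均为假设取值,仅保证输入输出形状与上文一致(视频输入9×60×100、音频输入15×40×3,输出均为64维特征):

```python
import torch
import torch.nn as nn

class VideoSTCNN(nn.Module):              # 第二模型:3个3D卷积+池化层,2个全连接层
    def __init__(self, k=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool3d((3, 3, 5)),
        )
        self.fc = nn.Sequential(nn.Flatten(), nn.Linear(64 * 3 * 5 * 5, 256),
                                nn.ReLU(), nn.Linear(256, k))

    def forward(self, x):                  # x: [B, 1, 9, 60, 100] 唇部图像序列
        return self.fc(self.conv(x))       # [B, 64] 图像序列特征

class AudioSTCNN(nn.Module):              # 第一模型:2个3D卷积+池化层,2个3D卷积层,1个全连接层
    def __init__(self, k=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool3d((1, 2, 1)),
            nn.Conv3d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool3d((1, 2, 1)),
            nn.Conv3d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv3d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(nn.Flatten(), nn.Linear(64 * 15 * 10 * 3, k))

    def forward(self, x):                  # x: [B, 1, 15, 40, 3] 音频cube(视为单通道3D输入,属示意性处理)
        return self.fc(self.conv(x))       # [B, 64] 语音特征
```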
步骤S904-C:将所述语音特征和所述N个图像序列特征输入到第三模型中,得到所述N个图像序列特征分别与所述语音特征之间的匹配度。
具体地,如图9C所示,图9C为本发明实施例提供的一种第三模型的结构示意图,左侧为模型训练过程,使用多路交叉熵(Multi-way cross-Entropy Loss)作为损失函数,基础网络部分由1个拼接层(concat层)、2个FC层和1个softmax层(Softmax函数:归一化指数函数)构成,训练时视频和音频以1:N作为输入进行端到端训练(端到端的含义是:以前的一些数据处理系统或者学习系统需要多个阶段的处理,而端到端深度学习则忽略所有这些不同的阶段,用单个神经网络代替它们)。如图9C所示,其功能是将视野内所有的人脸唇部序列特征同时与分离出的音频序列特征进行比较,求出最匹配的人脸ID。图9C右侧为模型应用过程,基于音视频特征层面的等价性(指音视频数据经过特征分析网络之后获取到的都是64维的特征),将1对N的问题转化为N对1,推理时视频和音频以N:1作为输入(N段视频,1段音频),输入内容是上述特征分析子网络输出的音视频64维特征,输出是最相似的音视频对。
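同样作为示意,下面给出特征匹配子网络(第三模型)推理过程的一个假设性PyTorch草图:将分离出的音频64维特征与视野内N个人脸的唇部序列特征逐一拼接,经两个FC层打分,再用softmax在N个候选上归一化,取匹配度最高者作为最匹配的人脸;各层宽度均为示意取值:

```python
import torch
import torch.nn as nn

class MatchNet(nn.Module):                 # 第三模型:拼接层 + 2个FC层 + softmax
    def __init__(self, k=64):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(2 * k, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, audio_feat, lip_feats):
        """audio_feat: [1, K] 语音特征; lip_feats: [N, K] N个人脸的唇部序列特征."""
        pairs = torch.cat([lip_feats, audio_feat.expand_as(lip_feats)], dim=-1)  # 逐一拼接
        scores = self.fc(pairs).squeeze(-1)        # [N] 每个人脸与该段语音的匹配得分
        return torch.softmax(scores, dim=0)        # 在N个候选上归一化后的匹配度

# 推理示例:匹配度最高的人脸即待识别语音所属的目标用户
# match = MatchNet()(audio_feat, lip_feats); target_index = int(match.argmax())
```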
步骤S905:将匹配度最高的用户的唇部运动信息对应的用户,确定为所述待识别的语音信息所属的目标用户。
步骤S906:确定所述目标用户的用户信息,所述用户信息包括人物属性信息、与所述待识别的语音信息对应面部表情信息、与所述待识别的语音信息对应的环境信息中的一种或多种。
步骤S907:基于所述用户信息,生成与所述用户信息匹配的控制指令。
具体地,本发明实施例,在确定了待识别的语音信息具体是由当前的场景中的哪个目标用户发出的之后,则可以根据该用户的属性信息(如性别、年龄、性格等)、面部表情信息(如该目标用户发出待识别的语音信息所对应的表情)以及对应的环境信息(如目标用户当前处于办公环境、家庭环境、或娱乐环境等),来确定与上述用户信息匹配的控制指令(如语音指令、操作指令等)。例如,控制智能机器朝着目标用户发出与所述表情数据和人物属性信息等匹配的语音或操作等,包括机器人的语气、机器人的头的转向以及机器人的回话内容等等。
如图9D所示,图9D为本发明实施例提供的一种机器人交互场景示意图,在图9D所示的场景示例中,得到最相似的音视频对为包括汤姆人脸的视频和汤姆发出的语音。因此在本步骤可以确定发出指令的人脸ID为汤姆对应的人脸ID。
1)得到匹配结果后,智能设备根据确定的人脸ID,以及存储在存储模块中的知识图谱,获取该用户详细的用户个人资料(user profile,或称用户画像);知识图谱包括用户的人口特征、对话历史(在本发明实施例中,包括图9D所示场景示例中姐姐克里丝(告诉他明天有暴雨)和弟弟杰瑞(吵着要和他一起去)的上下文对话)以及用户的喜好;
2)得到匹配结果后,智能设备根据与该人脸ID对应的人脸区域,使用人脸表情网络,获取最新的实时人脸表情数据;
3)得到匹配结果后,智能设备根据该人脸边界框(bounding box/bbox),结合摄像头视野宽度(如图9F场景示例中所示),换算出机器人机械结构的水平和垂直移动角度值,送给舵机控制系统,驱动机器人转向该用户。具体换算公式为:
计算θ角,机器人头部直接向左/右转动θ角度,保证人脸中心与视野中心重合,其中,计算公式:θ=arctan(x/f)。
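按上式,一个简单的换算示意如下,其中x为人脸中心相对视野中心的水平偏移、f为以相同单位(像素)表示的摄像头焦距,二者的定义为根据上下文作出的假设;垂直方向同理:

```python
import math

def head_turn_angle(face_center_x, image_width, focal_length_px):
    x = face_center_x - image_width / 2                    # 人脸中心相对视野中心的水平偏移(像素)
    return math.degrees(math.atan2(x, focal_length_px))    # θ = arctan(x / f),单位为度
```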
如图9E所示为本发明实施例提供的一种功能模块以及整体流程的示意图,包括多模态输入、特征提取、特征匹配、匹配结果后处理、对话和控制系统处理。本发明实施例的重点在于特征提取模块,多模态特征匹配网络和后处理模块。多模态输入模块包括用户侧终端设备(本发明实施例以机器人为例)。
摄像头,摄像头用于获取图像/视频,具体地,在本发明中,获取的图像/视频包括多个人的信息;
麦克风,麦克风用于拾取语音,在本发明中,拾取的语音包括多个人的语音信息;
神经网络处理器,特征提取模块可以是用户侧终端设备的处理模块(上图中未示出) 的一部分,可以是软件实现,如一段代码,用于从摄像头获取的视频数据中,基于人脸识别算法,获取至少一个人脸区域,并进一步以每个人脸区域为单位,提取嘴部区域的视频序列流;特征提取模块还用于从麦克风阵列获取的音频数据中提取音频特征;
多模态特征匹配模块用于对特征提取模块提取到的视频序列流和音频特征,使用V&A Cross CNN获取嘴型与语音相匹配的人脸ID;
匹配结果后处理模块用于根据多模态特征匹配模块得到的人脸ID,以及存储在存储模块中的知识图谱(上图中未示出,存储模块可以是云端服务器,也可以是部署在机器人上的,或者一部分在云端,一部分在机器人上),获取该用户详细的用户个人资料(user profile,或称用户画像);
普通处理器,匹配结果后处理模块还用于根据与该人脸ID对应的、特征提取模块获取的人脸区域,使用人脸表情网络,获取最新实时的人脸表情数据;
匹配结果后处理模块还用于根据该人脸边界/边框(bounding box/bbox),以及预设的算法,换算机器人机械结构的移动角度;
对话和控制系统处理模块包括对话模块和控制模块,对话模块用于根据匹配结果后处理模块得到的用户画像和人脸表情数据,得出语音的答复;控制模块用于根据匹配结果后处理模块得到的移动角度,通过舵机(舵机也叫伺服电机,最早用于船舶上实现其转向功能,由于可以通过程序连续控制其转角,因而被广泛应用于机器人的各类关节运动)控制系统,驱动该机器人转向该用户。
本发明实施例还提供又一种语音匹配方法,该方法可应用于上述图1、图2或图3中所述的应用场景及系统架构中,以及具体可应用于上述图4的执行设备110中,可以理解的是,此时,客户设备140和执行设备110可以不在同一个物理设备上,如图9F所示,图9F为本发明实施例提供的一种语音匹配系统架构图,在该系统中,例如,客户设备140为智能机器人、智能手机、智能音箱、智能可穿戴设备等,作为音频数据和视频数据的采集设备,进一步地,还可以作为待识别语音信息以及N个用户的唇部信息的提取设备;而关于上述提取后的待识别语音信息以及N个用户的唇部信息之间的匹配则可以在执行设备110所在的服务器/服务设备/服务装置/云端服务设备上执行。可选的,上述待识别语音信息以及N个用户的唇部信息的提取也可以在执行设备110所在的设备侧执行,本发明实施例对此不作具体限定。下面结合附图9F以执行主体为包含上述执行设备110的云端服务设备为例进行描述。该方法可以包括以下步骤S1001-步骤S1003。
步骤S1001:获取待识别的语音信息和N个用户的唇部运动信息;所述待识别的语音信息包括目标时间段内的语音波形序列,所述N个用户的唇部运动信息中的每一个用户的唇部运动信息包括对应的用户在所述目标时间段内唇部运动的图像序列,N为大于1的整数。
步骤S1002:将所述待识别的语音信息和所述N个用户的唇部运动信息输入到目标特征匹配模型中,得到所述N个用户的唇部运动信息分别与所述待识别的语音信息之间的匹配度。
步骤S1003:将匹配度最高的用户的唇部运动信息对应的用户,确定为所述待识别的语音信息所属的目标用户。
在一种可能的实现方式中,所述目标特征匹配模型包括第一模型、第二模型和第三模型;
所述将所述待识别的语音信息和所述N个用户的唇部运动信息输入到目标特征匹配模型中,得到所述N个用户的唇部运动信息分别与所述待识别的语音信息之间的匹配度,包括:
将所述待识别的语音信息输入到所述第一模型中,得到语音特征,所述语音特征为K维语音特征,K为大于0的整数;
将所述N个用户的唇部运动信息输入到所述第二模型中,得到N个图像序列特征,所述N个图像序列特征中的每一个图像序列特征均为K维图像序列特征;
将所述语音特征和所述N个图像序列特征输入到第三模型中,得到所述N个图像序列特征分别与所述语音特征之间的匹配度。
在一种可能的实现方式中,所述目标特征匹配模型为以训练用户的唇部运动信息以及M个语音信息为输入、以所述训练用户的唇部运动信息分别与所述M个语音信息之间的匹配度为M个标签,训练得到的特征匹配模型。
在一种可能的实现方式中,所述方法还包括:
确定所述目标用户的用户信息,所述用户信息包括人物属性信息、与所述待识别的语音信息对应面部表情信息、与所述待识别的语音信息对应的环境信息中的一种或多种;
基于所述用户信息,生成与所述用户信息匹配的控制指令。
在一种可能的实现方式中,所述方法还包括:从视频数据中提取N个用户的唇部运动信息;进一步地,所述从所述视频数据中提取N个用户的唇部运动信息,包括:
基于人脸识别算法,识别所述视频数据中的N个人脸区域,提取所述N个人脸区域中每个人脸区域中的唇部运动视频;
基于所述每个人脸区域中的唇部运动视频确定所述N个用户的唇部运动信息。
在一种可能的实现方式中,所述方法还包括:从所述音频数据中提取待识别的语音信息;进一步地,所述从所述音频数据中提取待识别的语音信息,包括:
基于频谱识别算法,识别所述音频数据中的不同频谱的音频数据,并将目标频谱的音频数据识别为所述待识别的语音信息。
需要说明的是,本发明实施例中所描述的云端服务设备所执行的方法流程可参见上述图1-图9F中所述的相关方法实施例,此处不再赘述。
请参见图10A,图10A是本发明实施例提供的一种智能设备的结构及功能原理示意图。该智能设备可以为智能机器人、智能手机、智能音箱、智能可穿戴设备等。该智能设备40A中可包括处理器401A,以及耦合于该处理器401A的麦克风402A、摄像头403A和神经网络处理器404A;其中,
麦克风402A,用于采集音频数据;
摄像头403A,用于采集视频数据,所述音频数据与所述视频数据为针对同一场景下采集的;
处理器401A,用于获取所述音频数据以及所述视频数据;从所述音频数据中提取待识别的语音信息;从所述视频数据中提取N个用户的唇部运动信息,N为大于1的整数;
神经网络处理器404A,用于基于所述待识别的语音信息以及所述N个用户的唇部运动信息,从所述N个用户中确定所述待识别的语音信息所属的目标用户。
在一种可能的实现方式中,所述待识别的语音信息包括目标时间段内的语音波形序列;所述N个用户的唇部运动信息中的每一个用户的唇部运动信息包括对应的用户在所述目标时间段内唇部运动的图像序列。
在一种可能的实现方式中,神经网络处理器404A,具体用于:将所述待识别的语音信息和所述N个用户的唇部运动信息输入到目标特征匹配模型中,得到所述N个用户的唇部运动信息分别与所述待识别的语音信息之间的匹配度;将匹配度最高的用户的唇部运动信息对应的用户,确定为所述待识别的语音信息所属的目标用户。
在一种可能的实现方式中,所述目标特征匹配模型包括第一模型、第二模型和第三模型;神经网络处理器404A,具体用于:将所述待识别的语音信息输入到所述第一模型中,得到语音特征,所述语音特征为K维语音特征,K为大于0的整数;将所述N个用户的唇部运动信息输入到所述第二模型中,得到N个图像序列特征,所述N个图像序列特征中的每一个图像序列特征均为K维图像序列特征;将所述语音特征和所述N个图像序列特征输入到第三模型中,得到所述N个图像序列特征分别与所述语音特征之间的匹配度。
在一种可能的实现方式中,所述目标特征匹配模型为以训练用户的唇部运动信息以及M个语音信息为输入、以所述训练用户的唇部运动信息分别与所述M个语音信息之间的匹配度为M个标签,训练得到的特征匹配模型,其中,所述M个语音信息包括与所述训练用户的唇部运动信息所匹配的语音信息。
在一种可能的实现方式中,处理器401A还用于:确定所述目标用户的用户信息,所述用户信息包括人物属性信息、与所述待识别的语音信息对应面部表情信息、与所述待识别的语音信息对应的环境信息中的一种或多种;基于所述用户信息,生成与所述用户信息匹配的控制指令。
在一种可能的实现方式中,处理器401A,具体用于:基于人脸识别算法,识别所述视频数据中的N个人脸区域,提取所述N个人脸区域中每个人脸区域中的唇部运动视频;基于所述每个人脸区域中的唇部运动视频确定所述N个用户的唇部运动信息。
在一种可能的实现方式中,处理器401A,具体用于:基于频谱识别算法,识别所述音频数据中的不同频谱的音频数据,并将目标频谱的音频数据识别为所述待识别的语音信息。
需要说明的是,本发明实施例中所描述的智能设备40A中相关模块的功能可参见上述图1-图9F中所述的相关方法实施例,此处不再赘述。
图10A中每个单元可以以软件、硬件、或其结合实现。以硬件实现的单元可以包括电路、算法电路或模拟电路等。以软件实现的单元可以包括程序指令,被视为是一种软件产品,被存储于存储器中,并可以被处理器运行以实现相关功能,具体参见之前的介绍。
请参见图10B,图10B是本发明实施例提供的另一种智能设备的结构及功能原理示意图。该智能设备可以为智能机器人、智能手机、智能音箱、智能可穿戴设备等。该智能设备40B中可包括处理器401B,以及耦合于该处理器401B的麦克风402B、摄像头403B;其中,
麦克风402B,用于采集音频数据;
摄像头403B,用于采集视频数据,所述音频数据与所述视频数据为针对同一场景下采集的;
处理器401B,用于:
获取所述音频数据以及所述视频数据;从所述音频数据中提取待识别的语音信息;从所述视频数据中提取N个用户的唇部运动信息,N为大于1的整数;
基于所述待识别的语音信息以及所述N个用户的唇部运动信息,从所述N个用户中确定所述待识别的语音信息所属的目标用户。
在一种可能的实现方式中,所述待识别的语音信息包括目标时间段内的语音波形序列;所述N个用户的唇部运动信息中的每一个用户的唇部运动信息包括对应的用户在所述目标时间段内唇部运动的图像序列。
在一种可能的实现方式中,处理器401B,具体用于:将所述待识别的语音信息和所述N个用户的唇部运动信息输入到目标特征匹配模型中,得到所述N个用户的唇部运动信息分别与所述待识别的语音信息之间的匹配度;将匹配度最高的用户的唇部运动信息对应的用户,确定为所述待识别的语音信息所属的目标用户。
在一种可能的实现方式中,所述目标特征匹配模型包括第一模型、第二模型和第三模型;处理器401B,具体用于:将所述待识别的语音信息输入到所述第一模型中,得到语音特征,所述语音特征为K维语音特征,K为大于0的整数;将所述N个用户的唇部运动信息输入到所述第二模型中,得到N个图像序列特征,所述N个图像序列特征中的每一个图像序列特征均为K维图像序列特征;将所述语音特征和所述N个图像序列特征输入到第三模型中,得到所述N个图像序列特征分别与所述语音特征之间的匹配度。
在一种可能的实现方式中,所述目标特征匹配模型为以训练用户的唇部运动信息以及M个语音信息为输入、以所述训练用户的唇部运动信息分别与所述M个语音信息之间的匹配度为M个标签,训练得到的特征匹配模型,其中,所述M个语音信息包括与所述训练用户的唇部运动信息所匹配的语音信息。
在一种可能的实现方式中,处理器401B还用于:确定所述目标用户的用户信息,所述用户信息包括人物属性信息、与所述待识别的语音信息对应面部表情信息、与所述待识别的语音信息对应的环境信息中的一种或多种;基于所述用户信息,生成与所述用户信息匹配的控制指令。
在一种可能的实现方式中,处理器401B,具体用于:基于人脸识别算法,识别所述视频数据中的N个人脸区域,提取所述N个人脸区域中每个人脸区域中的唇部运动视频;基于所述每个人脸区域中的唇部运动视频确定所述N个用户的唇部运动信息。
在一种可能的实现方式中,处理器401B,具体用于:基于频谱识别算法,识别所述音频数据中的不同频谱的音频数据,并将目标频谱的音频数据识别为所述待识别的语音信息。
需要说明的是,本发明实施例中所描述的智能设备40B中相关模块的功能可参见上述图1-图9F中所述的相关方法实施例,此处不再赘述。
图10B中每个单元可以以软件、硬件、或其结合实现。以硬件实现的单元可以包括电路、算法电路或模拟电路等。以软件实现的单元可以包括程序指令,被视为是一种软件产品,被存储于存储器中,并可以被处理器运行以实现相关功能,具体参见之前的介绍。
请参见图11,图11是本发明实施例提供的一种语音匹配装置的结构示意图。该语音匹配装置可应用于智能设备中,该智能设备可以为智能机器人、智能手机、智能音箱、智能可穿戴设备等。该语音匹配装置50中可包括获取单元501、第一提取单元502、第二提取单元503、匹配单元504和用户确定单元505;其中,
获取单元501,用于获取音频数据以及视频数据;
第一提取单元502,用于从所述音频数据中提取待识别的语音信息,所述待识别的语音信息包括目标时间段内的语音波形序列;
第二提取单元503,用于从所述视频数据中提取N个用户的唇部运动信息,所述N个用户的唇部运动信息中的每一个用户的唇部运动信息包括对应的用户在所述目标时间段内唇部运动的图像序列,N为大于1的整数;
匹配单元504,用于将所述待识别的语音信息和所述N个用户的唇部运动信息输入到目标特征匹配模型中,得到所述N个用户的唇部运动信息分别与所述待识别的语音信息之间的匹配度;
用户确定单元505,用于将匹配度最高的用户的唇部运动信息对应的用户,确定为所述待识别的语音信息所属的目标用户。
在一种可能的实现方式中,所述目标特征匹配模型包括第一网络、第二网络和第三网络;匹配单元504,具体用于:
将所述待识别的语音信息输入到所述第一网络中,得到语音特征,所述语音特征为K维语音特征,K为大于0的整数;
将所述N个用户的唇部运动信息输入到所述第二网络中,得到N个图像序列特征,所述N个图像序列特征中的每一个图像序列特征均为K维图像序列特征;
将所述语音特征和所述N个图像序列特征输入到第三网络中,得到所述N个图像序列特征分别与所述语音特征之间的匹配度。
在一种可能的实现方式中,所述目标特征匹配模型为以训练用户的唇部运动信息以及M个语音信息为输入、以所述训练用户的唇部运动信息分别与所述M个语音信息之间的匹配度为M个标签,训练得到的特征匹配模型;可选的,所述M个语音信息包括与所述训练用户的唇部运动信息所匹配的语音信息以及(M-1)个与所述训练用户的唇部运动信息不匹配的语音信息。
在一种可能的实现方式中,所述装置还包括:
信息确定单元506,用于确定所述目标用户的用户信息,所述用户信息包括人物属性信息、与所述待识别的语音信息对应面部表情信息、与所述待识别的语音信息对应的环境信息中的一种或多种;
控制单元507,用于基于所述用户信息,生成与所述用户信息匹配的控制指令。
在一种可能的实现方式中,第一提取单元502,具体用于:
基于频谱识别算法,识别所述音频数据中的不同频谱的音频数据,并将目标频谱的音频数据识别为所述待识别的语音信息。
在一种可能的实现方式中,所述第二提取单元503,具体用于:
基于人脸识别算法,识别所述视频数据中的N个人脸区域,提取所述N个人脸区域中每个人脸区域中的唇部运动视频;
基于所述每个人脸区域中的唇部运动视频确定所述N个用户的唇部运动信息。
需要说明的是,本发明实施例中所描述的语音匹配装置50中相关模块的功能可参见上述图1-图9F中所述的相关方法实施例,此处不再赘述。
图11中每个单元可以以软件、硬件、或其结合实现。以硬件实现的单元可以包括电路、算法电路或模拟电路等。以软件实现的单元可以包括程序指令,被视为是一种软件产品,被存储于存储器中,并可以被处理器运行以实现相关功能,具体参见之前的介绍。
请参见图12,图12是本发明实施例提供的一种神经网络的训练装置的结构示意图。该神经网络的训练装置可以为智能机器人、智能手机、智能音箱、智能可穿戴设备等。该神经网络的训练装置60中可包括获取单元601和训练单元602;其中,
获取单元601,用于获取训练样本,所述训练样本包括训练用户的唇部运动信息以及M个语音信息;可选的,所述M个语音信息包括与所述训练用户的唇部运动信息所匹配的语音信息以及(M-1)个与所述训练用户的唇部运动信息不匹配的语音信息;
训练单元602,用于以所述训练用户的唇部运动信息以及所述M个语音信息为训练输入,以所述训练用户的唇部运动信息分别与所述M个语音信息之间的匹配度为M个标签,对初始化的神经网络进行训练,得到目标特征匹配模型。
在一种可能的实现方式中,所述训练用户的唇部运动信息包括所述训练用户的唇部运动图像序列,所述M个语音信息包括一个与所述训练用户的唇部运动图像序列匹配的语音波形序列以及(M-1)个与所述训练用户的唇部运动图像序列不匹配的语音波形序列。
在一种可能的实现方式中,训练单元602,具体用于:
将所述训练用户的唇部运动信息以及所述M个语音信息输入到所述初始化的神经网络中,计算得到所述M个语音信息分别与所述训练用户的唇部运动信息之间的匹配度;
将计算得到的所述M个语音信息分别与所述训练用户的唇部运动信息之间的匹配度与所述M个标签进行比较,对初始化的神经网络进行训练,得到目标特征匹配模型。
在一种可能的实现方式中,所述目标特征匹配模型包括第一模型、第二模型和第三模型;训练单元602,具体用于:
将所述M个语音信息输入到所述第一模型中,得到M个语音特征,所述M个语音特征中的每一个语音特征均为K维语音特征,K为大于0的整数;
将所述训练用户的唇部运动信息输入到所述第二模型中,得到所述训练用户的图像序列特征,所述训练用户的图像序列特征为K维图像序列特征;
将所述M个语音特征和所述训练用户的图像序列特征输入到第三模型中,计算得到所述M个语音特征分别与所述训练用户的图像序列特征之间的匹配度;
将计算得到的所述M个语音特征分别与所述训练用户的图像序列特征之间的匹配度与所述M个标签进行比较,对初始化的神经网络进行训练,得到目标特征匹配模型。
需要说明的是,本发明实施例中所描述的神经网络的训练装置60中相关模块的功能可参见上述图1-图9F中所述的相关方法实施例,此处不再赘述。
图12中每个单元可以以软件、硬件、或其结合实现。以硬件实现的单元可以包括电路、算法电路或模拟电路等。以软件实现的单元可以包括程序指令,被视为是一种软件产品,被存储于存储器中,并可以被处理器运行以实现相关功能,具体参见之前的介绍。
请参见图13,图13是本发明实施例提供的另一种智能设备的结构示意图,该智能设备可以为智能机器人、智能手机、智能音箱、智能可穿戴设备等。该智能设备70中可包括处理器701,以及耦合于该处理器701的麦克风702、摄像头703;其中,
麦克风702,用于采集音频数据;
摄像头703,用于采集视频数据;
处理器701,用于获取音频数据以及视频数据;
从所述音频数据中提取待识别的语音信息,所述待识别的语音信息包括目标时间段内的语音波形序列;
从所述视频数据中提取N个用户的唇部运动信息,所述N个用户的唇部运动信息中的每一个用户的唇部运动信息包括对应的用户在所述目标时间段内唇部运动的图像序列,N为大于1的整数;
需要说明的是,本发明实施例中所描述的智能设备70中相关模块的功能可参见上述图1-图9F中所述的相关方法实施例,此处不再赘述。
请参见图14,图14是本发明实施例提供的一种服务装置的结构示意图,该服务装置可以为服务器、云端服务器等。该服务装置80中可包括处理器;可选的,该处理器可由神经网络处理器801和与该神经网络处理器801耦合的处理器802组成;其中,
神经网络处理器801,用于:
获取待识别的语音信息和N个用户的唇部运动信息;
将所述待识别的语音信息和所述N个用户的唇部运动信息输入到目标特征匹配模型中,得到所述N个用户的唇部运动信息分别与所述待识别的语音信息之间的匹配度;
将匹配度最高的用户的唇部运动信息对应的用户,确定为所述待识别的语音信息所属的目标用户。
在一种可能的实现方式中,所述目标特征匹配模型包括第一模型、第二模型和第三模型;神经网络处理器801,具体用于:
将所述待识别的语音信息输入到所述第一模型中,得到语音特征,所述语音特征为K维语音特征,K为大于0的整数;
将所述N个用户的唇部运动信息输入到所述第二模型中,得到N个图像序列特征,所述N个图像序列特征中的每一个图像序列特征均为K维图像序列特征;
将所述语音特征和所述N个图像序列特征输入到第三模型中,得到所述N个图像序列 特征分别与所述语音特征之间的匹配度。
在一种可能的实现方式中,所述目标特征匹配模型为以训练用户的唇部运动信息以及M个语音信息为输入、以所述训练用户的唇部运动信息分别与所述M个语音信息之间的匹配度为M个标签,训练得到的特征匹配模型。
在一种可能的实现方式中,所述服务器还包括处理器802;处理器802用于:
确定所述目标用户的用户信息,所述用户信息包括人物属性信息、与所述待识别的语音信息对应面部表情信息、与所述待识别的语音信息对应的环境信息中的一种或多种;
基于所述用户信息,生成与所述用户信息匹配的控制指令。
在一种可能的实现方式中,所述服务器还包括处理器802;处理器802,还用于:
基于人脸识别算法,识别视频数据中的N个人脸区域,提取所述N个人脸区域中每个人脸区域中的唇部运动视频;
基于所述每个人脸区域中的唇部运动视频确定所述N个用户的唇部运动信息。
在一种可能的实现方式中,所述服务器还包括处理器802;处理器802,还用于:
基于频谱识别算法,识别所述音频数据中的不同频谱的音频数据,并将目标频谱的音频数据识别为所述待识别的语音信息。
需要说明的是,本发明实施例中所描述的服务装置80中相关模块的功能可参见上述图1-图9F中所述的相关方法实施例,此处不再赘述。
请参见图15,图15是本发明实施例提供的另一种语音匹配系统200,该系统包括上述智能设备70和服务装置80,该智能设备70和服务装置80通过交互完成本申请中的所述任意一种语音匹配方法,关于该系统的功能可参见上述图1-图9F中所述的相关方法实施例,此处不再赘述。
本发明实施例还提供一种计算机存储介质,其中,该计算机存储介质可存储有程序,该程序执行时包括上述方法实施例中记载的任意一种的部分或全部步骤。
本发明实施例还提供一种计算机程序,该计算机程序包括指令,当该计算机程序被计算机执行时,使得计算机可以执行上述方法实施例中记载的任意一种的部分或全部步骤。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其它实施例的相关描述。
需要说明的是,对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本申请并不受所描述的动作顺序的限制,因为依据本申请,某些步骤可能可以采用其它顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作和模块并不一定是本申请所必须的。
在本申请所提供的几个实施例中,应该理解到,所揭露的装置,可通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如上述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之 间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性或其它的形式。
上述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
上述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以为个人计算机、服务器或者网络设备等,具体可以是计算机设备中的处理器)执行本申请各个实施例上述方法的全部或部分步骤。其中,而前述的存储介质可包括:U盘、移动硬盘、磁碟、光盘、只读存储器(Read-Only Memory,缩写:ROM)或者随机存取存储器(Random Access Memory,缩写:RAM)等各种可以存储程序代码的介质。
以上所述,以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围。

Claims (21)

  1. 一种语音匹配方法,其特征在于,包括:
    获取音频数据以及视频数据;
    从所述音频数据中提取待识别的语音信息,所述待识别的语音信息包括目标时间段内的语音波形序列;
    从所述视频数据中提取N个用户的唇部运动信息,所述N个用户的唇部运动信息中的每一个用户的唇部运动信息包括对应的用户在所述目标时间段内唇部运动的图像序列,N为大于1的整数;
    将所述待识别的语音信息和所述N个用户的唇部运动信息输入到目标特征匹配模型中,得到所述N个用户的唇部运动信息分别与所述待识别的语音信息之间的匹配度;
    将匹配度最高的用户的唇部运动信息对应的用户,确定为所述待识别的语音信息所属的目标用户。
  2. 根据权利要求1所述的方法,其特征在于,所述目标特征匹配模型包括第一模型、第二模型和第三模型;
    所述将所述待识别的语音信息和所述N个用户的唇部运动信息输入到目标特征匹配模型中,得到所述N个用户的唇部运动信息分别与所述待识别的语音信息之间的匹配度,包括:
    将所述待识别的语音信息输入到所述第一模型中,得到语音特征,所述语音特征为K维语音特征,K为大于0的整数;
    将所述N个用户的唇部运动信息输入到所述第二模型中,得到N个图像序列特征,所述N个图像序列特征中的每一个图像序列特征均为K维图像序列特征;
    将所述语音特征和所述N个图像序列特征输入到第三模型中,得到所述N个图像序列特征分别与所述语音特征之间的匹配度。
  3. 根据权利要求1或2所述的方法,其特征在于,所述目标特征匹配模型为以训练用户的唇部运动信息以及M个语音信息为输入、以所述训练用户的唇部运动信息分别与所述M个语音信息之间的匹配度为M个标签,训练得到的特征匹配模型。
  4. 根据权利要求1-3任意一项所述的方法,其特征在于,所述方法还包括:
    确定所述目标用户的用户信息,所述用户信息包括人物属性信息、与所述待识别的语音信息对应面部表情信息、与所述待识别的语音信息对应的环境信息中的一种或多种;
    基于所述用户信息,生成与所述用户信息匹配的控制指令。
  5. 根据权利要求1-4任意一项所述的方法,其特征在于,所述从所述视频数据中提取N个用户的唇部运动信息,包括:
    基于人脸识别算法,识别所述视频数据中的N个人脸区域,提取所述N个人脸区域中每个人脸区域中的唇部运动视频;
    基于所述每个人脸区域中的唇部运动视频确定所述N个用户的唇部运动信息。
  6. 根据权利要求1-5任意一项所述的方法,其特征在于,所述从所述音频数据中提取待识别的语音信息,包括:
    基于频谱识别算法,识别所述音频数据中的不同频谱的音频数据,并将目标频谱的音频数据识别为所述待识别的语音信息。
  7. 一种智能设备,其特征在于,包括:处理器以及与所述处理器耦合的麦克风、摄像头:
    所述麦克风,用于采集音频数据;
    所述摄像头,用于采集视频数据;
    所述处理器,用于
    获取所述音频数据以及所述视频数据;
    从所述音频数据中提取待识别的语音信息,所述待识别的语音信息包括目标时间段内的语音波形序列;
    从所述视频数据中提取N个用户的唇部运动信息,所述N个用户的唇部运动信息中的每一个用户的唇部运动信息包括对应的用户在所述目标时间段内唇部运动的图像序列,N为大于1的整数;
    将所述待识别的语音信息和所述N个用户的唇部运动信息输入到目标特征匹配模型中,得到所述N个用户的唇部运动信息分别与所述待识别的语音信息之间的匹配度;
    将匹配度最高的用户的唇部运动信息对应的用户,确定为所述待识别的语音信息所属的目标用户。
  8. 根据权利要求7所述的智能设备,其特征在于,所述目标特征匹配模型包括第一模型、第二模型和第三模型;所述处理器,具体用于:
    将所述待识别的语音信息输入到所述第一模型中,得到语音特征,所述语音特征为K维语音特征,K为大于0的整数;
    将所述N个用户的唇部运动信息输入到所述第二模型中,得到N个图像序列特征,所述N个图像序列特征中的每一个图像序列特征均为K维图像序列特征;
    将所述语音特征和所述N个图像序列特征输入到第三模型中,得到所述N个图像序列特征分别与所述语音特征之间的匹配度。
  9. 根据权利要求7或8所述的智能设备,其特征在于,所述目标特征匹配模型为以训练用户的唇部运动信息以及M个语音信息为输入、以所述训练用户的唇部运动信息分别与所述M个语音信息之间的匹配度为M个标签,训练得到的特征匹配模型。
  10. 根据权利要求7-9任意一项所述的智能设备,其特征在于,所述处理器还用于:
    确定所述目标用户的用户信息,所述用户信息包括人物属性信息、与所述待识别的语音信息对应面部表情信息、与所述待识别的语音信息对应的环境信息中的一种或多种;
    基于所述用户信息,生成与所述用户信息匹配的控制指令。
  11. 根据权利要求7-10任意一项所述的智能设备,其特征在于,所述处理器,具体用于:
    基于人脸识别算法,识别所述视频数据中的N个人脸区域,提取所述N个人脸区域中每个人脸区域中的唇部运动视频;
    基于所述每个人脸区域中的唇部运动视频确定所述N个用户的唇部运动信息。
  12. 根据权利要求7-11任意一项所述的智能设备,其特征在于,所述处理器,具体用于:
    基于频谱识别算法,识别所述音频数据中的不同频谱的音频数据,并将目标频谱的音频数据识别为所述待识别的语音信息。
  13. 一种语音匹配方法,其特征在于,包括:
    获取待识别的语音信息和N个用户的唇部运动信息;所述待识别的语音信息包括目标时间段内的语音波形序列,所述N个用户的唇部运动信息中的每一个用户的唇部运动信息包括对应的用户在所述目标时间段内唇部运动的图像序列,N为大于1的整数;
    将所述待识别的语音信息和所述N个用户的唇部运动信息输入到目标特征匹配模型中,得到所述N个用户的唇部运动信息分别与所述待识别的语音信息之间的匹配度;
    将匹配度最高的用户的唇部运动信息对应的用户,确定为所述待识别的语音信息所属的目标用户。
  14. 根据权利要求13所述的方法,其特征在于,所述目标特征匹配模型包括第一模型、第二模型和第三模型;
    所述将所述待识别的语音信息和所述N个用户的唇部运动信息输入到目标特征匹配模型中,得到所述N个用户的唇部运动信息分别与所述待识别的语音信息之间的匹配度,包括:
    将所述待识别的语音信息输入到所述第一模型中,得到语音特征,所述语音特征为K维语音特征,K为大于0的整数;
    将所述N个用户的唇部运动信息输入到所述第二模型中,得到N个图像序列特征,所述N个图像序列特征中的每一个图像序列特征均为K维图像序列特征;
    将所述语音特征和所述N个图像序列特征输入到第三模型中,得到所述N个图像序列特征分别与所述语音特征之间的匹配度。
  15. 根据权利要求13或14所述的方法,其特征在于,所述目标特征匹配模型为以训练用户的唇部运动信息以及M个语音信息为输入、以所述训练用户的唇部运动信息分别与所述M个语音信息之间的匹配度为M个标签,训练得到的特征匹配模型。
  16. 一种服务装置,其特征在于,包括处理器;所述处理器用于:
    获取待识别的语音信息和N个用户的唇部运动信息;所述待识别的语音信息包括目标时间段内的语音波形序列,所述N个用户的唇部运动信息中的每一个用户的唇部运动信息包括对应的用户在所述目标时间段内唇部运动的图像序列,N为大于1的整数;
    将所述待识别的语音信息和所述N个用户的唇部运动信息输入到目标特征匹配模型中,得到所述N个用户的唇部运动信息分别与所述待识别的语音信息之间的匹配度;
    将匹配度最高的用户的唇部运动信息对应的用户,确定为所述待识别的语音信息所属的目标用户。
  17. 根据权利要求16所述的装置,其特征在于,所述目标特征匹配模型包括第一模型、第二模型和第三模型;所述处理器,具体用于:
    将所述待识别的语音信息输入到所述第一模型中,得到语音特征,所述语音特征为K维语音特征,K为大于0的整数;
    将所述N个用户的唇部运动信息输入到所述第二模型中,得到N个图像序列特征,所述N个图像序列特征中的每一个图像序列特征均为K维图像序列特征;
    将所述语音特征和所述N个图像序列特征输入到第三模型中,得到所述N个图像序列特征分别与所述语音特征之间的匹配度。
  18. 根据权利要求16或17所述的装置,其特征在于,所述目标特征匹配模型为以训练用户的唇部运动信息以及M个语音信息为输入、以所述训练用户的唇部运动信息分别与所述M个语音信息之间的匹配度为M个标签,训练得到的特征匹配模型。
  19. 一种芯片系统,其特征在于,所述芯片系统包括至少一个处理器,存储器和接口电路,所述存储器、所述接口电路和所述至少一个处理器通过线路互联,所述至少一个存储器中存储有指令;所述指令被所述处理器执行时,权利要求1-6或权利要求13-15中任意一项所述的方法得以实现。
  20. 一种计算机可读存储介质,其特征在于,所述计算机可读介质用于存储程序代码,所述程序代码包括用于执行如权利要求1-6或者权利要求13-15任一项所述的方法。
  21. 一种计算机程序,其特征在于,所述计算机程序包括指令,当所述计算机程序被执行时,使得如权利要求1-6或者权利要求13-15中的任意一项所述的方法得以实现。
PCT/CN2020/129464 2019-11-30 2020-11-17 一种语音匹配方法及相关设备 WO2021104110A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP20893436.4A EP4047598B1 (en) 2019-11-30 2020-11-17 Voice matching method and related device
US17/780,384 US20230008363A1 (en) 2019-11-30 2020-11-17 Audio matching method and related device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911209345.7A CN111091824B (zh) 2019-11-30 2019-11-30 一种语音匹配方法及相关设备
CN201911209345.7 2019-11-30

Publications (1)

Publication Number Publication Date
WO2021104110A1 true WO2021104110A1 (zh) 2021-06-03

Family

ID=70393888

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/129464 WO2021104110A1 (zh) 2019-11-30 2020-11-17 一种语音匹配方法及相关设备

Country Status (4)

Country Link
US (1) US20230008363A1 (zh)
EP (1) EP4047598B1 (zh)
CN (1) CN111091824B (zh)
WO (1) WO2021104110A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113625662A (zh) * 2021-07-30 2021-11-09 广州玺明机械科技有限公司 用于摇饮料机器人的数据采集传输的律动动感控制系统
CN113835065A (zh) * 2021-09-01 2021-12-24 深圳壹秘科技有限公司 基于深度学习的声源方向确定方法、装置、设备及介质
CN114466179A (zh) * 2021-09-09 2022-05-10 马上消费金融股份有限公司 语音与图像同步性的衡量方法及装置
CN114666639A (zh) * 2022-03-18 2022-06-24 海信集团控股股份有限公司 视频播放方法及显示设备
CN114821622A (zh) * 2022-03-10 2022-07-29 北京百度网讯科技有限公司 文本抽取方法、文本抽取模型训练方法、装置及设备

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114616620A (zh) * 2019-10-18 2022-06-10 谷歌有限责任公司 端到端多讲话者视听自动语音识别
CN111091824B (zh) * 2019-11-30 2022-10-04 华为技术有限公司 一种语音匹配方法及相关设备
CN111583916B (zh) * 2020-05-19 2023-07-25 科大讯飞股份有限公司 一种语音识别方法、装置、设备及存储介质
CN111640438B (zh) * 2020-05-26 2023-09-05 同盾控股有限公司 音频数据处理方法、装置、存储介质及电子设备
CN111756814A (zh) * 2020-05-29 2020-10-09 北京京福安科技股份有限公司 居家养老精准化服务系统
CN113963692A (zh) * 2020-07-03 2022-01-21 华为技术有限公司 一种车舱内语音指令控制方法及相关设备
CN111860335A (zh) * 2020-07-22 2020-10-30 安徽兰臣信息科技有限公司 一种基于人脸识别的智能穿戴设备
CN112069897B (zh) * 2020-08-04 2023-09-01 华南理工大学 基于知识图谱的语音和微表情识别自杀情绪感知方法
CN112633136B (zh) * 2020-12-18 2024-03-22 深圳追一科技有限公司 视频分析方法、装置、电子设备及存储介质
CN112770062B (zh) * 2020-12-22 2024-03-08 北京奇艺世纪科技有限公司 一种图像生成方法及装置
CN112906650B (zh) * 2021-03-24 2023-08-15 百度在线网络技术(北京)有限公司 教学视频的智能处理方法、装置、设备和存储介质
CN113158917B (zh) * 2021-04-26 2024-05-14 维沃软件技术有限公司 行为模式识别方法及装置
CN113571060B (zh) * 2021-06-10 2023-07-11 西南科技大学 一种基于视听觉融合的多人对话点餐方法及系统
CN113506578A (zh) * 2021-06-30 2021-10-15 中汽创智科技有限公司 一种语音与图像的匹配方法、装置、存储介质及设备
CN113241060B (zh) * 2021-07-09 2021-12-17 明品云(北京)数据科技有限公司 一种安保预警方法及系统
CN113852851B (zh) * 2021-08-12 2023-04-18 国网浙江省电力有限公司营销服务中心 一种基于并行流模型的快速唇动-语音对齐方法
CN114494930B (zh) * 2021-09-09 2023-09-22 马上消费金融股份有限公司 语音与图像同步性衡量模型的训练方法及装置
WO2023090057A1 (ja) * 2021-11-17 2023-05-25 ソニーグループ株式会社 情報処理装置、情報処理方法および情報処理プログラム
CN114408115A (zh) * 2022-01-19 2022-04-29 中国人民解放军海军特色医学中心 一种便于人机交互的船舶用操作台
CN114708642B (zh) * 2022-05-24 2022-11-18 成都锦城学院 商务英语仿真实训装置、系统、方法及存储介质
CN114786033B (zh) * 2022-06-23 2022-10-21 中译文娱科技(青岛)有限公司 一种基于人工智能的视听数据智能分析管理系统
CN117793607A (zh) * 2022-09-28 2024-03-29 华为技术有限公司 一种播放控制方法及装置
CN116362596A (zh) * 2023-03-10 2023-06-30 领途教育咨询(北京)有限公司 一种基于元宇宙vr的大数据人才评测系统


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102682273A (zh) * 2011-03-18 2012-09-19 夏普株式会社 嘴唇运动检测设备和方法
CN107293300A (zh) * 2017-08-01 2017-10-24 珠海市魅族科技有限公司 语音识别方法及装置、计算机装置及可读存储介质
US10621991B2 (en) * 2018-05-06 2020-04-14 Microsoft Technology Licensing, Llc Joint neural network for speaker recognition
CN110517295A (zh) * 2019-08-30 2019-11-29 上海依图信息技术有限公司 一种结合语音识别的实时人脸轨迹跟踪方法及装置

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102194456A (zh) * 2010-03-11 2011-09-21 索尼公司 信息处理设备、信息处理方法和程序
WO2019150708A1 (ja) * 2018-02-01 2019-08-08 ソニー株式会社 情報処理装置、情報処理システム、および情報処理方法、並びにプログラム
CN108564943A (zh) * 2018-04-27 2018-09-21 京东方科技集团股份有限公司 语音交互方法及系统
CN110276259A (zh) * 2019-05-21 2019-09-24 平安科技(深圳)有限公司 唇语识别方法、装置、计算机设备及存储介质
CN110475093A (zh) * 2019-08-16 2019-11-19 北京云中融信网络科技有限公司 一种活动调度方法、装置及存储介质
CN111091824A (zh) * 2019-11-30 2020-05-01 华为技术有限公司 一种语音匹配方法及相关设备

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4047598A4

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113625662A (zh) * 2021-07-30 2021-11-09 广州玺明机械科技有限公司 用于摇饮料机器人的数据采集传输的律动动感控制系统
CN113625662B (zh) * 2021-07-30 2022-08-30 广州玺明机械科技有限公司 用于摇饮料机器人的数据采集传输的律动动感控制系统
CN113835065A (zh) * 2021-09-01 2021-12-24 深圳壹秘科技有限公司 基于深度学习的声源方向确定方法、装置、设备及介质
CN113835065B (zh) * 2021-09-01 2024-05-17 深圳壹秘科技有限公司 基于深度学习的声源方向确定方法、装置、设备及介质
CN114466179A (zh) * 2021-09-09 2022-05-10 马上消费金融股份有限公司 语音与图像同步性的衡量方法及装置
CN114821622A (zh) * 2022-03-10 2022-07-29 北京百度网讯科技有限公司 文本抽取方法、文本抽取模型训练方法、装置及设备
CN114666639A (zh) * 2022-03-18 2022-06-24 海信集团控股股份有限公司 视频播放方法及显示设备
CN114666639B (zh) * 2022-03-18 2023-11-03 海信集团控股股份有限公司 视频播放方法及显示设备

Also Published As

Publication number Publication date
CN111091824B (zh) 2022-10-04
EP4047598B1 (en) 2024-03-06
EP4047598A1 (en) 2022-08-24
US20230008363A1 (en) 2023-01-12
CN111091824A (zh) 2020-05-01
EP4047598A4 (en) 2022-11-30

Similar Documents

Publication Publication Date Title
WO2021104110A1 (zh) 一种语音匹配方法及相关设备
Zhang et al. Empowering things with intelligence: a survey of the progress, challenges, and opportunities in artificial intelligence of things
Zheng et al. Recent advances of deep learning for sign language recognition
CN108363973B (zh) 一种无约束的3d表情迁移方法
Deng et al. MVF-Net: A multi-view fusion network for event-based object classification
WO2019062931A1 (zh) 图像处理装置及方法
WO2015158017A1 (zh) 智能交互及心理慰藉机器人服务系统
WO2023284435A1 (zh) 生成动画的方法及装置
Bencherif et al. Arabic sign language recognition system using 2D hands and body skeleton data
CN108491808B (zh) 用于获取信息的方法及装置
WO2021203880A1 (zh) 一种语音增强方法、训练神经网络的方法以及相关设备
US20230129816A1 (en) Speech instruction control method in vehicle cabin and related device
Ocquaye et al. Dual exclusive attentive transfer for unsupervised deep convolutional domain adaptation in speech emotion recognition
Vasudevan et al. Introduction and analysis of an event-based sign language dataset
CN111581470A (zh) 用于对话系统情景匹配的多模态融合学习分析方法和系统
Ding et al. Designs of human–robot interaction using depth sensor-based hand gesture communication for smart material-handling robot operations
Vasudevan et al. SL-Animals-DVS: event-driven sign language animals dataset
CN113611318A (zh) 一种音频数据增强方法及相关设备
CN109961152B (zh) 虚拟偶像的个性化互动方法、系统、终端设备及存储介质
Ahmed et al. Two person interaction recognition based on effective hybrid learning
CN116758451A (zh) 基于多尺度和全局交叉注意力的视听情感识别方法及系统
CN116417008A (zh) 一种跨模态音视频融合语音分离方法
CN113763925B (zh) 语音识别方法、装置、计算机设备及存储介质
CN116312512A (zh) 面向多人场景的视听融合唤醒词识别方法及装置
CN111949773A (zh) 一种阅读设备、服务器以及数据处理的方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20893436

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020893436

Country of ref document: EP

Effective date: 20220520

NENP Non-entry into the national phase

Ref country code: DE