WO2022001347A1 - Voice command control method in vehicle cabin and related device - Google Patents

Voice command control method in vehicle cabin and related device

Info

Publication number
WO2022001347A1
Authority
WO
WIPO (PCT)
Prior art keywords
vehicle
instruction
motion information
lip motion
type
Prior art date
Application number
PCT/CN2021/091138
Other languages
English (en)
French (fr)
Inventor
邱梅清
蒋慧颖
黄怡
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to EP21833146.0A (published as EP4163913A4)
Priority to KR1020237002403A (published as KR20230027252A)
Publication of WO2022001347A1
Priority to US18/146,662 (published as US20230129816A1)


Classifications

    • G — PHYSICS
        • G06 — COMPUTING; CALCULATING OR COUNTING
            • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
                • G06V 20/00 — Scenes; Scene-specific elements
                    • G06V 20/50 — Context or environment of the image
                        • G06V 20/59 — Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
                • G06V 40/00 — Recognition of biometric, human-related or animal-related patterns in image or video data
                    • G06V 40/10 — Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
                        • G06V 40/16 — Human faces, e.g. facial parts, sketches or expressions
        • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
                • G10L 15/00 — Speech recognition
                    • G10L 15/02 — Feature extraction for speech recognition; Selection of recognition unit
                    • G10L 15/06 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
                        • G10L 15/063 — Training
                    • G10L 15/08 — Speech classification or search
                        • G10L 15/16 — Speech classification or search using artificial neural networks
                    • G10L 15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
                        • G10L 2015/223 — Execution procedure of a spoken command
                    • G10L 15/24 — Speech recognition using non-acoustical features
                        • G10L 15/25 — Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
                    • G10L 15/28 — Constructional details of speech recognition systems

Definitions

  • the invention relates to the technical field of human-computer interaction, and in particular, to a voice command control method and related equipment in a vehicle cabin.
  • For the execution of some special commands, the smart device in the car needs to obtain the specific location of the command issuer, and thereby determine how to execute the corresponding instruction.
  • Some vehicle commands need to be executed with respect to a specific position in the cabin, such as adjusting the air volume of a specific air outlet or adjusting the volume of a specific speaker.
  • Embodiments of the present invention provide a voice command control method and related equipment, so as to improve the execution accuracy of voice commands in a multi-person scene in a vehicle and reduce misoperation or misrecognition.
  • In a first aspect, an embodiment of the present invention provides a voice control method in a vehicle cabin, including: acquiring a first type of instruction and the lip motion information of vehicle members located at N positions in the vehicle cabin within a target time period, where the first type of instruction is obtained according to target audio data collected in the vehicle cabin, the lip motion information of the in-vehicle members is obtained when the first type of instruction is identified from the target audio data, and the target time period is the time period in the audio data corresponding to the first type of instruction; matching the first type of instruction with the lip motion information of the vehicle members at the N positions in the cabin, and acquiring a target position according to the matching results between the lip motion information of the occupants at the N positions and the first type of instruction, where the target position is the position of the in-vehicle member whose lip motion information matches the first type of instruction; and sending instruction information indicating that the first type of instruction is to be executed for the target position.
  • The embodiment of the present invention can be applied to refined voice command control in the vehicle cabin.
  • Here the voice command is an instruction whose execution target needs to be further determined, such as an operation command for a device in the vehicle cabin, for example playing a video, adjusting a loudspeaker, adjusting the air conditioning, or adjusting a seat. For such commands, the specific location to which the instruction applies must be determined so that a targeted local operation can be performed. By obtaining the lip motion information of the members at the various positions in the car together with the identified instruction information, it is determined which member at which position issued the instruction, so that operation control can be carried out for the specific location area in a targeted manner. A minimal illustrative sketch of this flow is given below.
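  • For illustration only, the overall flow just described can be sketched in Python as follows; all of the names (asr, lip_extractor, matcher, executor, detect_instruction, dispatch) are hypothetical placeholders and not part of the disclosed implementation.

```python
from typing import Sequence

def handle_cabin_audio(audio_frame,
                       camera,                 # in-cabin camera handle (assumed interface)
                       seats: Sequence[str],   # e.g. ["driver", "front-right", "rear-left"]
                       asr, lip_extractor, matcher, executor) -> None:
    """Hypothetical end-to-end flow of the first-aspect method."""
    # 1. Recognize a first-type instruction and its time span from cabin audio.
    instruction, t_start, t_end = asr.detect_instruction(audio_frame)
    if instruction is None:
        return

    # 2. Collect lip motion information of the members at the N seats
    #    for the target time period [t_start, t_end].
    frames = camera.frames(t_start, t_end)
    lip_motion = {seat: lip_extractor(frames, seat) for seat in seats}

    # 3. Match the instruction against each seat's lip motion and take the
    #    seat with the highest matching degree as the target position.
    scores = {seat: matcher(instruction, motion) for seat, motion in lip_motion.items()}
    target_seat = max(scores, key=scores.get)

    # 4. Send the instruction information to be executed for the target position.
    executor.dispatch(instruction, target_seat)
```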
  • Because the above voice control method involves processing video data and analyzing the degree of matching between its local features and the instruction information, it can be executed locally, that is, by the vehicle or by a smart device on the vehicle, or the above video data processing and matching can be executed in the cloud.
  • The method of the embodiment of the present invention can be applied to a vehicle scene with any number of people, and is especially suitable for a scene where there are multiple members in the vehicle and multiple people are talking at the same time.
  • By combining the matching of lip motion information and instruction information with the in-vehicle position distribution information, the position of the member who issued the voice command can be accurately located, and the location area for which the command is executed can then be determined.
  • the N mentioned in the method involved in the embodiment of the present invention does not necessarily represent all members in the vehicle, and may be all members or some members in the vehicle.
  • The acquiring of the first type of instruction and the lip motion information of the vehicle members located at the N positions in the vehicle cabin is specifically: acquiring the target audio data in the vehicle cabin; when it is identified that the target audio data includes a first type of instruction, acquiring in-vehicle image data; and extracting, from the in-vehicle image data, the lip motion information of the occupants at the N positions in the cabin.
  • The extraction of the lip motion information of the members at the N positions from the in-cabin image data may specifically be: identifying the N face regions in the video data based on a face recognition algorithm; extracting the lip motion video or a sampled video frame sequence from each of the N face regions; and determining the lip motion information of the N members based on the lip motion video or video frame sequence of each face region.
  • The image data is usually acquired by one or more cameras in the vehicle, and the cameras can be of various types. The image data mentioned in the embodiments of the present invention may be images obtained by one camera from one angle, image data from different viewing angles of the same camera, image data from multiple cameras, or other combinations of the above.
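  • The patent does not name a specific face recognition algorithm; as a hedged illustration, the sketch below uses OpenCV's bundled Haar cascade to locate faces and crops the lower third of each face box as a rough lip region (both the detector choice and the lower-third heuristic are assumptions).

```python
import cv2

# Illustrative stand-in for the face recognition step.
_face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def lip_region_sequence(frames):
    """For each video frame, return rough lip-region crops (lower third of
    every detected face box), resized to a fixed size for later processing."""
    sequences = []
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = _face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        crops = []
        for (x, y, w, h) in faces:
            mouth = frame[y + 2 * h // 3 : y + h, x : x + w]   # heuristic lip area
            crops.append(cv2.resize(mouth, (64, 32)))
        sequences.append(crops)
    return sequences
```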
  • In a possible implementation, the lip motion information of the vehicle occupants at the N positions is extracted from the in-cabin image data only when more than one vehicle member is identified.
  • In order to avoid wasting computing resources, the number of people in the vehicle cabin may be determined according to the acquired in-vehicle image data. When there is only one person in the cabin, the lip motion information is not extracted; only when there is more than one person is the lip motion information of the persons at the occupied positions extracted.
  • Matching the first type of instruction with the lip motion information of the occupants at the N positions in the cabin, and obtaining the target position according to the matching results between the members' lip motion information and the first type of instruction, is specifically: obtaining, according to the first type of instruction and the lip motion information of the N members located in the cabin, the matching degree between the lip motion information of the in-vehicle member at each of the N positions and the instruction information, where N is an integer greater than 1; and taking the position of the member corresponding to the lip motion information with the highest matching degree as the target position.
  • The matching degree can be obtained through a target feature matching model. The target feature matching model is a feature matching model obtained by training with the lip motion information of a training user and one or more pieces of voice information as input (the voice information can be a voice waveform sequence or the text corresponding to the speech), with the matching degrees between the lip motion information of the training user and the M pieces of voice information serving as the M labels.
  • The inference method of the model differs according to the sample data used for training. For example, when multiple sets of lip motion information are used, such as 5 pieces of lip motion information, the 5 pieces of lip motion information and each of the M pieces of voice information are constructed into M groups of samples as input, the matching degrees between the 5 pieces of lip motion information and the voice information in each sample are output, and the label information is the target output of the training.
  • A model trained in this way also takes 5 pieces of lip motion information and the target instruction information as input during inference; the lip information of a vacant position can be filled in with a default value, such as an all-zero sequence, and the matching degree between each input and the instruction information is output.
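  • A minimal sketch of the padding convention just described, assuming a fixed layout of 5 seats and an assumed shape for one lip-motion image sequence (both are illustrative choices, not specified by the patent):

```python
import numpy as np

NUM_SEATS = 5                 # number of lip-motion inputs the model expects
FRAMES, H, W = 20, 32, 64     # assumed shape of one lip-motion image sequence

def build_model_input(lip_motion_by_seat: dict, instruction_feature: np.ndarray):
    """Pad vacant seats with an all-zero sequence so the model always receives
    NUM_SEATS lip-motion inputs alongside the instruction feature."""
    lip_inputs = []
    for seat in range(NUM_SEATS):
        seq = lip_motion_by_seat.get(seat)
        if seq is None:                                    # vacant position
            seq = np.zeros((FRAMES, H, W), dtype=np.float32)
        lip_inputs.append(seq)
    return np.stack(lip_inputs), instruction_feature       # (5, FRAMES, H, W), (...)
```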
  • the first type of instruction is a speech waveform sequence extracted from the audio data or text instruction information identified according to the audio data.
  • The lip motion information of the occupants at the N positions in the vehicle cabin is an image sequence of the lip movements of the occupants at the N positions within the target time period.
  • The obtained voice command is matched with the lip information of the in-vehicle member at each position to determine from which position the member who issued the command-related speech spoke; the matching degree between the command information and the members' lip motion information is obtained through a matching model.
  • Depending on the form of the model input, the command information to be obtained also differs. Different forms of lip motion information can be extracted from the lip motion video; for example, it can be an image sequence of the lip movements within the target time period, or a vector parameter representing the temporal change of the distance between the upper and lower lips.
  • The first type of instruction may likewise take various forms: it may be a speech waveform sequence, or it may be the text information corresponding to the instruction.
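  • The "distance between the upper and lower lips over time" variant could be computed from per-frame lip landmarks as in the sketch below; the landmark convention (index 0 = an upper-lip point, index 1 = the opposing lower-lip point) is an assumption for illustration, and any landmark detector could supply the points.

```python
import numpy as np

def lip_opening_series(landmarks_per_frame):
    """landmarks_per_frame: list of (K, 2) arrays of lip landmark coordinates.
    Returns the per-frame upper/lower lip distance, normalized to [0, 1] so it
    is comparable across speakers and camera distances."""
    distances = np.array([
        np.linalg.norm(frame[0] - frame[1]) for frame in landmarks_per_frame
    ])
    rng = distances.max() - distances.min()
    return (distances - distances.min()) / rng if rng > 0 else np.zeros_like(distances)
```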
  • the audio data in the vehicle cabin is obtained according to audio data collected by a microphone in a designated location area in the vehicle cabin.
  • the in-vehicle audio data is obtained based on target audio data selected from audio data collected by multiple microphones in the vehicle cabin.
  • The in-cabin audio data mentioned in the embodiments of the present invention may be obtained after comprehensive processing of multiple pieces of collected audio data; it may also be the audio data whose parameters are the best according to a preset rule, that is, with the best recorded voice quality, selected by comparing the audio data collected by multiple microphones in the cabin; or it may be the audio data collected by a microphone at a designated position, such as a microphone set in the central area of the vehicle.
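  • One simple, hypothetical "preset rule" for picking the best channel among several cabin microphones is to take the channel with the highest short-term RMS energy, a rough proxy for recorded voice quality:

```python
import numpy as np

def pick_best_channel(channels):
    """channels: list of 1-D numpy arrays, one per cabin microphone.
    Returns the index of the channel with the highest RMS energy."""
    rms = [np.sqrt(np.mean(np.square(c.astype(np.float64)))) for c in channels]
    return int(np.argmax(rms))
```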
  • In a possible implementation, the target feature matching model includes a first model, a second model, and a third model. Inputting the instruction information and the lip motion information of the vehicle members at the N positions into the target feature matching model to obtain the respective matching degrees includes: inputting the instruction information into the first model to obtain a voice feature, where the voice feature is a K-dimensional voice feature and K is an integer greater than 0; inputting the lip motion information of the members at the N positions into the second model to obtain N image sequence features, where each of the N image sequence features is a K-dimensional image sequence feature; and inputting the voice feature and the N image sequence features into the third model to obtain the matching degree between each of the N image sequence features and the voice feature.
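  • A minimal sketch of this first/second/third-model arrangement, written with PyTorch as an illustrative framework; the layer choices, the value of K, and the cosine-similarity matcher are assumptions rather than the patented design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

K = 128  # assumed shared feature dimension

class SpeechEncoder(nn.Module):        # "first model": instruction -> K-dim voice feature
    def __init__(self, in_dim=40):
        super().__init__()
        self.gru = nn.GRU(in_dim, K, batch_first=True)
    def forward(self, x):              # x: (batch, time, in_dim), e.g. MFCC frames
        _, h = self.gru(x)
        return h[-1]                   # (batch, K)

class LipEncoder(nn.Module):           # "second model": lip image sequence -> K-dim feature
    def __init__(self, in_dim=64 * 32):
        super().__init__()
        self.gru = nn.GRU(in_dim, K, batch_first=True)
    def forward(self, x):              # x: (batch, frames, H*W) flattened lip crops
        _, h = self.gru(x)
        return h[-1]

class Matcher(nn.Module):              # "third model": matching degree per position
    def forward(self, voice_feat, lip_feats):
        # voice_feat: (batch, K); lip_feats: (batch, N, K)
        sim = F.cosine_similarity(lip_feats, voice_feat.unsqueeze(1), dim=-1)
        return torch.sigmoid(sim)      # (batch, N) matching degrees in (0, 1)
```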
  • In a possible implementation, the target feature matching model includes a first model and a second model. Inputting the instruction information and the lip motion information of the vehicle members at the N positions into the target feature matching model to obtain the matching degrees includes: inputting the audio data into the first model to obtain the corresponding instruction information; inputting the lip motion information of the occupants at the N positions into the second model, where the lip motion information of each occupant corresponds to a set of image sequence features and the N image sequence features are input into the second model simultaneously or separately to obtain the instruction information corresponding to each piece of lip motion information; and determining, based on the recognition results of the two models, the target position member who issued the instruction.
  • The instruction information output by the first model may be an identifier and a matching degree corresponding to the instruction; the model selects and outputs the instruction identifier corresponding to the instruction with the highest matching degree, and the identifier may be an instruction code or the text feature corresponding to the instruction.
  • The judgment is made based on the output results of the first model and the second model, and the judgment rules can be various. For example, if the instruction information identified by the first model and the second model is the same, the matching degrees output by the second model are compared and the position with the highest matching degree is selected as the target position for executing the instruction; or the instruction information with the highest matching degree identified by the second model is compared with the instruction information identified by the first model, and if they are the same, the position corresponding to that instruction information in the second model is determined as the target position.
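  • A hedged sketch of one such judgment rule; the data layout is assumed (the audio model returns an instruction identifier, the lip-reading model returns, for each seat, its best instruction identifier and matching degree):

```python
def pick_target_seat(audio_instruction_id, lip_results: dict):
    """lip_results maps seat -> (instruction_id, matching_degree).
    Returns the seat whose lip reading agrees with the audio-recognized
    instruction and has the highest matching degree, or None if no seat agrees."""
    agreeing = {seat: score
                for seat, (inst, score) in lip_results.items()
                if inst == audio_instruction_id}
    if not agreeing:
        return None
    return max(agreeing, key=agreeing.get)
```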
  • In a possible implementation, a correspondence between the lip motion information of the in-vehicle members at the N positions and the N positions is generated. Taking the position of the member corresponding to the lip motion information with the highest matching degree as the target position includes: obtaining the target lip motion information with the highest matching degree, and determining, according to the correspondence, the position corresponding to the target lip motion information as the target position.
  • The correspondence between the lip motion information of the members at the N positions and the N positions may be generated by identifying the member at each position through in-vehicle image collection and then associating each member's lip motion information with that position. The image data used for this may be the same images from which the lip information is extracted, or may come from an independent acquisition process.
  • In a possible implementation, a correspondence between the lip motion information of the occupants at the N positions and the identities of the occupants at the N positions is generated. Taking the position of the in-vehicle member corresponding to the lip motion information with the highest matching degree as the target position includes: obtaining the target lip motion information with the highest matching degree; determining the target in-vehicle member according to the correspondence between the lip motion information and the identities of the in-vehicle members at the N positions; and determining the position information of the target in-vehicle member as the target position, where the position information of the target in-vehicle member is determined according to sensor data in the vehicle.
  • The first type of instruction mentioned in the embodiment of the present invention may be an in-cabin manipulation instruction, which is mainly applicable to an instruction interaction scenario in which position information needs to be identified in the vehicle cabin to determine the target area for which the instruction is executed. Since the user in the car will sometimes give clear location information when issuing a voice command, such as closing the window of the right rear seat, in a possible implementation this type of command with explicit location information can be regarded as not belonging to the first type of instruction; the first type of instruction may be an instruction whose execution requires distinguishing the location area but which does not itself contain location area information.
  • In a second aspect, an embodiment of the present invention further provides a voice control method in a vehicle cabin, including: acquiring a second type of instruction and the lip motion information of the vehicle member at a first position in the vehicle, where the second type of instruction is obtained according to the target audio data collected in the vehicle cabin, the lip motion information of the occupant at the first position is obtained when the second type of instruction is identified from the target audio data, and the target time period is the time period in the audio data corresponding to the second type of instruction; matching the second type of instruction with the lip motion information of the occupant at the first position to obtain a matching result; and when it is determined according to the matching result that the second type of instruction matches the lip motion information of the occupant at the first position, sending instruction information instructing to execute the second type of instruction.
  • The embodiment of the present invention can be applied to refined voice command control in the vehicle cabin.
  • Here the voice command is recognized as a command that can only be executed after the identity or authority of the command issuer is determined, such as a driving control command of the vehicle or a command that operates on private information, for example switching the driving mode, controlling the driving direction, or viewing historical call data. Based on the special environment of the vehicle, it is usually assumed that the member in the driving seat has the highest authority. Therefore, when the vehicle obtains such an instruction, the identity of the person issuing it needs to be confirmed by obtaining the lip motion information of the member at a specific position (such as the driving seat) together with the identified instruction information, so as to determine whether the instruction was issued by the member in the driving position and thus carry out operation control in a targeted manner.
  • Because the above voice control method involves processing video data and analyzing the degree of matching between its local features and the instruction information, it can be executed locally, that is, by the vehicle or by a smart device on the vehicle, or the above video data processing and matching can be executed in the cloud.
  • In addition to collecting the lip motion information of the member in the driving position, it can also be preset or manually configured that members at one or more positions have the control authority for certain types of instructions; when such instructions are identified, the lip motion information of the persons at the corresponding positions is obtained for matching analysis.
  • the location of a specific user (such as a vehicle owner) in the current in-vehicle environment may also be recognized through face recognition technology, and the location of the specific user is used as the first location.
  • the default is to get the lip motion information of the member in the driving seat.
  • The method of the embodiment of the present invention can be applied to a vehicle scene with any number of people, and is especially suitable for a scene where there are multiple members in the vehicle and multiple people are talking at the same time. In this case, matching the lip motion information with the instruction information and combining it with the in-vehicle position distribution information makes it possible to accurately determine whether the voice command was issued by the member at the specific position, and then to decide whether to execute the command.
  • The matching method may be to obtain the lip motion information of multiple members in the car (including the member at the first position) for matching and determine whether the member at the first position has the highest matching degree, or to obtain only the member at the first position for matching and determine a match when the matching degree reaches a preset threshold, in which case the instruction can be executed.
  • The situation in which a specific position usually needs to be judged is a scenario in which only members at a specific position have the authority to execute such an instruction. The second type of instruction is usually a vehicle manipulation instruction, and this type of instruction is usually set so that only members at a specific position, such as the driving seat, can control the vehicle by voice.
  • A common implementation is to obtain the lip motion information of the user in the driving seat and match it against the second type of instruction; when the result is a match, it is determined that the second type of instruction was issued by the user in the driving seat, and the second type of instruction is executed, because the user in the driving position is by default assumed to have the right to control the vehicle. Passengers at other positions can also be manually granted voice control permission, in which case the first position may also be another position in the car. Both matching strategies are sketched below.
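  • A hedged sketch of the two matching strategies for the second type of instruction described above; the threshold value and the function shapes are assumptions.

```python
MATCH_THRESHOLD = 0.8   # assumed preset threshold

def driver_may_execute(matcher, instruction, lip_motion_by_seat, driver_seat="driver"):
    """Strategy 1: match every captured seat and require the driver seat to have
    the highest matching degree. Strategy 2 (when only the driver's lips were
    captured): compare the driver-seat matching degree against a preset threshold."""
    scores = {seat: matcher(instruction, motion)
              for seat, motion in lip_motion_by_seat.items()}
    if len(scores) > 1:
        return max(scores, key=scores.get) == driver_seat
    return scores.get(driver_seat, 0.0) >= MATCH_THRESHOLD
```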
  • The acquiring of the second type of instruction and the lip motion information of the vehicle member at the first position in the vehicle may specifically be: acquiring the target audio data in the vehicle cabin; when it is identified that the target audio data includes a second type of instruction, acquiring in-vehicle image data; and extracting, from the in-vehicle image data, the lip motion information of the occupant at the first position.
  • Matching the second type of instruction with the lip motion information of the occupant at the first position to obtain a matching result is specifically: determining the matching result according to the matching degree between the second type of instruction and the lip motion information of the occupant at the first position and a preset threshold.
  • the second type of instruction in the embodiment of the present invention may be a speech waveform sequence extracted from the audio data or text instruction information identified according to the audio data.
  • the lip movement information of the occupants in the vehicle is an image sequence of lip movements of the occupants in the target time period.
  • The embodiment of the present invention further includes: when the audio data includes the second type of instruction, acquiring image data of the in-vehicle members at the other N positions in the car, and extracting, from that image data, the lip motion information of the members at the other N positions within the target time period. Matching the second type of instruction with the lip motion information of the member at the first position to obtain a matching result is specifically: matching the second type of instruction with the lip motion information of the member at the first position and the lip motion information of the members at the other N positions, obtaining the matching degrees between the lip motion information of the N+1 members and the second type of instruction, and obtaining the lip motion information with the highest matching degree. Determining according to the matching result that the second type of instruction matches the lip motion information of the member at the first position and sending the instruction information instructing to execute the second type of instruction is specifically: when the lip motion information with the highest matching degree is the lip motion information of the vehicle member at the first position, sending the instruction information instructing to execute the second type of instruction.
  • The lip motion information of the in-vehicle members is extracted from the video data of the in-vehicle members. The specific method is to identify a plurality of face regions in the video data based on a face recognition algorithm, extract the lip motion video in each of the multiple face regions, and determine the lip motion information corresponding to each face based on the lip motion video in each face region.
  • the audio data in the cabin is obtained according to data collected by multiple microphones in the cabin, or the audio data in the cabin is collected according to microphones in a designated location area in the cabin Audio data is obtained.
  • In another aspect, an embodiment of the present invention provides a voice command control device, including a processor. The processor is configured to: obtain a first type of instruction and the lip motion information of the in-vehicle members at N positions in the vehicle cabin within a target time period, where the first type of instruction is obtained according to the target audio data collected in the cabin, the lip motion information of the in-vehicle members is obtained when the first type of instruction is identified from the target audio data, and the target time period is the time period in the audio data corresponding to the first type of instruction; match the first type of instruction with the lip motion information of the members at the N positions in the cabin, and obtain a target position according to the matching results between the lip motion information of the occupants at the N positions and the first type of instruction, where the target position is the position of the in-vehicle member whose lip motion information matches the first type of instruction; and send instruction information indicating that the first type of instruction is to be executed for the target position.
  • The embodiment of the present invention can be applied to refined voice command control in the vehicle cabin.
  • Here the voice command is an instruction whose execution target needs to be further determined, such as an operation command for a device in the vehicle cabin, for example playing a video, adjusting a loudspeaker, adjusting the air conditioning, or adjusting a seat. For such commands, the specific location to which the instruction applies must be determined so that a targeted local operation can be performed. By obtaining the lip motion information of the members at the various positions in the car together with the identified instruction information, it is determined which member at which position issued the instruction, so that operation control can be carried out for the specific location area in a targeted manner.
  • The above voice control device needs to process video data and analyze the degree of matching between its local features and the instruction information, so the device can be a local device, such as an intelligent vehicle-mounted device or a vehicle-mounted processor chip, or a vehicle-mounted system including a microphone and a camera; it can also be a cloud server, which obtains the data of the in-vehicle camera and the in-vehicle microphone to perform the above video data processing and matching actions.
  • The processor is configured to acquire the target audio data in the vehicle cabin; when it is identified that the target audio data includes a first type of instruction, acquire image data in the vehicle cabin; and extract, from the in-cabin image data, the lip motion information of the vehicle occupants at the N positions.
  • When the processor extracts the lip motion information of the members at the N positions from the in-cabin image data, it may identify the N face regions in the video data based on a face recognition algorithm, extract the lip motion video or a sampled video frame sequence from each of the N face regions, and determine the lip motion information of the N members based on the lip motion video or video frame sequence of each face region. The image data is usually acquired by one or more cameras in the vehicle, and the cameras can be of various types.
  • The processor is further configured to extract, from the in-vehicle image data, the lip motion information of the occupants at the N positions when more than one occupant is identified.
  • In order to avoid wasting computing resources, after identifying the first type of instruction the processor determines the number of people in the vehicle cabin according to the acquired in-vehicle image data. When there is only one person in the cabin, there is no need to extract lip motion information; only when there is more than one person is the lip motion information of the persons at the occupied positions extracted.
  • The processor matches the first type of instruction with the lip motion information of the vehicle occupants at the N positions and obtains the target position according to the matching results, specifically: based on the first type of instruction and the lip motion information of the N members located in the cabin, the processor obtains the matching degree between the lip motion information of the occupant at each of the N positions and the instruction information, where N is an integer greater than 1, and takes the position of the in-vehicle member corresponding to the lip motion information with the highest matching degree as the target position.
  • The processor can obtain the matching degree through different target matching models. The target feature matching model is a feature matching model obtained by training with the lip motion information of a training user and one or more pieces of voice information as input (the voice information can be a voice waveform sequence or the text corresponding to the speech), with the matching degrees between the lip motion information of the training user and the M pieces of voice information serving as the M labels. The training of the model is usually carried out on the cloud side and is independent of the use of the model; that is, the model is trained on one device and, after training is completed, sent to the device that needs to match instructions, where it is run and used.
  • The inference method of the model differs according to the sample data used for training. For example, when multiple sets of lip motion information are used, such as 5 pieces of lip motion information, the 5 pieces of lip motion information and each of the M pieces of voice information are constructed into M groups of samples as input, the matching degrees between the 5 pieces of lip motion information and the voice information in each sample are output, and the label information is the target output of the training.
  • A model trained in this way also takes 5 pieces of lip motion information and the target instruction information as input during inference; the lip information of a vacant position can be filled in with a default value, such as an all-zero sequence, and the matching degree between each input and the instruction information is output.
  • The processor is further configured to generate a correspondence between the lip motion information of the occupants at the N positions and the N positions. Taking the position of the in-vehicle member corresponding to the lip motion information with the highest matching degree as the target position includes: the processor obtains the target lip motion information with the highest matching degree and, according to the correspondence between the lip motion information and the N positions, determines the position corresponding to the target lip motion information as the target position.
  • The processor can identify the member at each position through the images collected in the car and then associate each member's lip motion information with that position; the image data used for this may be the same images from which the lip information is extracted, or may come from an independent acquisition process.
  • In a possible implementation, the processor generates a correspondence between the lip motion information of the occupants at the N positions and the identities of the occupants at the N positions. Taking the position of the in-vehicle member corresponding to the lip motion information with the highest matching degree as the target position includes: the processor acquires the target lip motion information with the highest matching degree; determines the target in-vehicle member according to the correspondence between the lip motion information and the identities of the in-vehicle members at the N positions; and determines the position information of the target in-vehicle member as the target position, where the position information of the target in-vehicle member is determined based on sensor data in the car.
  • The first type of instruction mentioned in the embodiment of the present invention may be an in-cabin manipulation instruction, which is mainly applicable to an instruction interaction scenario in which position information needs to be identified in the vehicle cabin to determine the target area for which the instruction is executed. Since the user in the car will sometimes give clear location information when issuing a voice command, such as closing the window of the right rear seat, the processor can directly identify the target location for such a command. Therefore, in a possible implementation, when the processor recognizes that the acquired instruction is this type of instruction with explicit location information, it determines that the instruction does not belong to the first type of instruction; the first type of instruction may be an instruction whose execution requires distinguishing the location area but which does not itself contain location area information.
  • In another aspect, an embodiment of the present invention provides a voice command control device, including a processor. The processor is configured to: acquire a second type of instruction and the lip motion information of the vehicle member at the first position in the vehicle, where the second type of instruction is obtained according to the target audio data collected in the vehicle cabin, the lip motion information of the occupant at the first position is obtained when the second type of instruction is identified from the target audio data, and the target time period is the time period in the audio data corresponding to the second type of instruction; match the second type of instruction with the lip motion information of the occupant at the first position to obtain a matching result; and when it is determined according to the matching result that the second type of instruction matches the lip motion information of the occupant at the first position, send instruction information indicating that the second type of instruction is to be executed.
  • The embodiment of the present invention can be applied to refined voice command control in the vehicle cabin.
  • Here the voice command is recognized as a command that can only be executed after the identity or authority of the command issuer is determined, such as a driving control command of the vehicle or a command that operates on private information, for example switching the driving mode, controlling the driving direction, or viewing historical call data. Based on the special environment of the vehicle, it is usually assumed that the member in the driving seat has the highest authority. Therefore, when the vehicle obtains such an instruction, the identity of the person issuing it needs to be confirmed by obtaining the lip motion information of the member at a specific position (such as the driving seat) together with the identified instruction information, so as to determine whether the instruction was issued by the member in the driving position and thus carry out operation control in a targeted manner.
  • Because the above voice control device involves processing video data and analyzing the degree of matching between its local features and the instruction information, it can be a local device, such as a vehicle-mounted smart device, a vehicle-mounted smart chip, a vehicle-mounted system including a camera and a microphone, or a smart vehicle; it can also be a cloud processor on the cloud side. In addition to collecting the lip motion information of the member in the driving position, the processor can also, according to system presets or manual settings that grant members at one or more positions control authority, obtain the lip motion information of the persons at the corresponding positions in a targeted manner for matching analysis.
  • the first position is a driving position.
  • the second type of instruction is a vehicle driving manipulation instruction.
  • the device of the embodiment of the present invention can be used in a vehicle scene with any number of people, and is especially suitable for a scene where there are multiple members in the vehicle and multiple people are talking at the same time, to identify and execute instructions.
  • The processor combines the matching of the lip motion information with the instruction information and the in-vehicle position distribution information, so that it can accurately determine whether the voice instruction was issued by the member at the specific position, and then decide whether to execute the instruction.
  • The matching method may be to obtain the lip motion information of multiple members in the car (including the member at the first position) for matching and determine whether the member at the first position has the highest matching degree, or to obtain only the member at the first position for matching and determine a match when the matching degree reaches a preset threshold, in which case the instruction can be executed.
  • The situation in which the processor needs to judge a specific position is a scenario in which only members at a specific position have the authority to execute such an instruction. The second type of instruction is usually a vehicle manipulation instruction, and this kind of instruction is usually set so that only members at a specific position, such as the driving seat, can control the driving of the vehicle by voice.
  • The processor obtains the lip motion information of the user in the driving seat and matches it against the second type of instruction; when the result is a match, it is determined that the second type of instruction was issued by the user in the driving seat, and the second type of instruction is executed, because the user in the driving position is by default assumed to have the right to control the vehicle. Passengers at other positions can also be manually granted voice control permission, in which case the first position may also be another position in the car.
  • The processor acquires the second type of instruction and the lip motion information of the occupant located at the first position in the vehicle, specifically: the processor acquires the target audio data in the vehicle cabin; when it is identified that the target audio data includes a second type of instruction, obtains image data in the vehicle cabin; and extracts the lip motion information of the vehicle member at the first position from the in-cabin image data.
  • The processor matches the second type of instruction with the lip motion information of the occupant at the first position and obtains a matching result, specifically: the processor determines the matching result according to the matching degree between the second type of instruction and the lip motion information of the occupant at the first position and a preset threshold.
  • the second type of instruction in the embodiment of the present invention may be a speech waveform sequence extracted from the audio data or text instruction information identified according to the audio data.
  • the lip movement information of the occupants in the vehicle is an image sequence of lip movements of the occupants in the target time period.
  • The processor is further configured to, when the audio data includes the second type of instruction, acquire image data of the in-vehicle members at the other N positions in the vehicle and extract, from that image data, the lip motion information of the members at the other N positions within the target time period. The processor matches the second type of instruction with the lip motion information of the occupant at the first position to obtain a matching result, specifically: the processor matches the second type of instruction with the lip motion information of the occupant at the first position and the lip motion information of the members at the other N positions, obtains the matching degrees between the lip motion information of the N+1 members and the second type of instruction, and obtains the lip motion information with the highest matching degree. When the processor determines according to the matching result that the second type of instruction matches the lip motion information of the member at the first position, it sends the instruction information instructing to execute the second type of instruction, specifically: when the lip motion information with the highest matching degree is the lip motion information of the vehicle member located at the first position, the processor sends the instruction to execute the second type of instruction.
  • The lip motion information of the in-vehicle members is extracted from the video data of the in-vehicle members. The specific method is to identify a plurality of face regions in the video data based on a face recognition algorithm, extract the lip motion video in each of the multiple face regions, and determine the lip motion information corresponding to each face based on the lip motion video in each face region.
  • In another aspect, an embodiment of the present invention provides a chip system. The chip system includes at least one processor, a memory, and an interface circuit; the memory, the interface circuit, and the at least one processor are interconnected through lines, and instructions are stored in the memory. When the instructions are executed by the processor, the method of any one of the first and second aspects is implemented.
  • In another aspect, an embodiment of the present invention provides a computer-readable storage medium, where the computer-readable medium is used to store program code, and the program code includes instructions for executing the method of any one of the first and second aspects.
  • In another aspect, an embodiment of the present invention provides a computer program, where the computer program includes instructions, and when the computer program is executed, the method of any one of the first and second aspects is implemented.
  • FIG. 1 is a schematic diagram of a scene of multi-person interaction in a vehicle according to an embodiment of the present invention.
  • FIG. 2 is a schematic diagram of a scene of multi-person interaction in a vehicle according to an embodiment of the present invention.
  • FIG. 3 is a system architecture 100 provided by an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of a convolutional neural network according to an embodiment of the present invention.
  • FIG. 5 is a schematic flowchart of a training method for a neural network provided by an embodiment of the present invention.
  • FIG. 6 is an example diagram of a sound waveform provided by an embodiment of the present invention.
  • FIG. 7A is a voice instruction matching method provided by an embodiment of the present invention.
  • FIG. 7B is a voice instruction matching method provided by an embodiment of the present invention.
  • FIG. 8 is a schematic diagram of a cloud interaction scenario according to an embodiment of the present invention.
  • FIG. 9 is a flowchart of a method according to an embodiment of the present invention.
  • FIG. 10 is a flowchart of a method according to an embodiment of the present invention.
  • FIG. 11 is a flowchart of a method according to an embodiment of the present invention.
  • FIG. 12 is a flowchart of a method according to an embodiment of the present invention.
  • FIG. 13 is a schematic structural diagram of an instruction control apparatus provided by an embodiment of the present invention.
  • FIG. 14 is a schematic structural diagram of an apparatus for training a neural network according to an embodiment of the present invention.
  • FIG. 15 is another instruction control system provided by an embodiment of the present invention.
  • a component may be, but is not limited to, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer.
  • an application running on a computing device and the computing device may be components.
  • One or more components may reside within a process and/or thread of execution, and a component may be localized on one computer and/or distributed between 2 or more computers.
  • these components can execute from various computer readable media having various data structures stored thereon.
  • A component may, for example, communicate by way of local and/or remote processes based on a signal having one or more data packets (e.g., data from one component interacting with another component in a local system or a distributed system, and/or across a network such as the Internet interacting with other systems by way of the signal).
  • Bitmap: also known as a raster graphic or dot-matrix image, a bitmap is an image represented by a pixel array (pixel-array/dot-matrix). According to the bit depth, bitmaps can be divided into 1-, 4-, 8-, 16-, 24- and 32-bit images. The more bits of information each pixel uses, the more colors are available, the more realistic the color representation, and the larger the corresponding amount of data. For example, a bitmap with a bit depth of 1 has only two possible pixel values (black and white), so it is also called a binary bitmap; an image with a bit depth of 8 has 2^8 (i.e., 256) possible values, and a grayscale-mode image with a bit depth of 8 has 256 possible gray values. An RGB image consists of three color channels, and a bitmap represented by 24 bits of combined RGB data is usually called a true-color bitmap.
  • ASR (Automatic Speech Recognition), also called automatic speech recognition technology, aims to convert the lexical content of human speech into computer-readable input, such as keystrokes, binary codes, or character sequences.
  • Voiceprint is a sound wave spectrum that carries speech information displayed by electroacoustic instruments, and is a biological feature composed of more than 100 characteristic dimensions such as wavelength, frequency and intensity.
  • Voiceprint recognition is a technology to identify unknown voices by analyzing the characteristics of one or more voice signals. The identity of the speaker can be determined through the voiceprint, so as to make targeted answers.
  • Mel-Frequency Cepstral Coefficients (MFCCs) are the coefficients of a linear transform of the log energy spectrum based on the nonlinear Mel scale of sound frequency. MFCCs are widely used in speech recognition.
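  • For illustration, MFCC features of the kind commonly fed to speech models can be computed with the librosa library; the sample rate and number of coefficients below are arbitrary example choices.

```python
import librosa

def mfcc_features(wav_path: str, n_mfcc: int = 13):
    """Load an audio file and return its MFCC matrix of shape (n_mfcc, frames)."""
    signal, sr = librosa.load(wav_path, sr=16000)     # resample to 16 kHz
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
```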
  • Multi-way (categorical) cross-entropy loss describes the distance between two probability distributions: the smaller the cross-entropy, the closer the two distributions are.
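  • For reference, the cross-entropy between a target distribution p and a predicted distribution q over M classes is commonly written as:

```latex
H(p, q) = -\sum_{i=1}^{M} p_i \log q_i
```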
  • A neural network can be composed of neural units. A neural unit can refer to an operation unit that takes x_s and an intercept 1 as input, and the output of the operation unit can be written as f(W_1*x_1 + W_2*x_2 + ... + W_n*x_n + b), where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit. f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network and convert the input signal of the neural unit into an output signal; the output signal of the activation function can be used as the input of the next convolutional layer, and the activation function can be a sigmoid function.
  • a neural network is a network formed by connecting many of the above single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field can be an area composed of several neural units.
  • A deep neural network (DNN), also known as a multi-layer neural network, can be divided into three parts: an input layer, hidden layers, and an output layer.
  • the first layer is the input layer
  • the last layer is the output layer
  • the middle layers are all hidden layers.
  • the layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer.
  • The coefficient from the k-th neuron in the (L-1)-th layer to the j-th neuron in the L-th layer is defined as W_jk^L. It should be noted that the input layer does not have a W parameter.
  • more hidden layers allow the network to better capture the complexities of the real world.
  • a model with more parameters is more complex and has a larger "capacity", which means that it can complete more complex learning tasks.
  • Training the deep neural network is the process of learning the weight matrix, and its ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (the weight matrix formed by the vectors W of many layers).
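  • The layer-by-layer operation of such a fully connected network is commonly summarized by the standard formulation below (added here for clarity, not quoted from the document), where x is the input vector, W the weight matrix, b the bias vector, and α the activation function:

```latex
\vec{y} = \alpha\left(W \vec{x} + \vec{b}\right)
```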
  • Convolutional Neural Network is a deep neural network with a convolutional structure.
  • a convolutional neural network consists of a feature extractor consisting of convolutional and subsampling layers.
  • the feature extractor can be viewed as a filter, and the convolution process can be viewed as convolution with an input image or a convolutional feature map using a trainable filter.
  • the convolutional layer refers to the neuron layer in the convolutional neural network that convolves the input signal.
  • a neuron can only be connected to some of its neighbors.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some neural units arranged in a rectangle.
  • Neural units in the same feature plane share weights, and the shared weights here are convolution kernels.
  • Weight sharing can be understood as meaning that the way image information is extracted is independent of location. The underlying principle is that the statistics of one part of an image are the same as those of other parts, so image information learned in one part can also be used in another part; the same learned image information can therefore be used for all positions on the image.
  • multiple convolution kernels can be used to extract different image information. Generally, the more convolution kernels, the richer the image information reflected by the convolution operation.
  • the convolution kernel can be initialized in the form of a matrix of random size, and the convolution kernel can obtain reasonable weights by learning during the training process of the convolutional neural network.
  • the immediate benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
  • The convolutional neural network can use the error back propagation (BP) algorithm to correct the values of its parameters during the training process, so that the reconstruction error loss of the convolutional neural network becomes smaller and smaller. Specifically, the input signal is passed forward until the output produces an error loss, and the parameters in the convolutional neural network are updated by back-propagating the error loss information, so that the error loss converges.
  • the back-propagation algorithm is a back-propagation movement dominated by error loss, aiming to obtain the optimal parameters of the convolutional neural network, such as the weight matrix.
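  • The following is a hedged sketch of such a training loop, assuming PyTorch is used (the patent does not prescribe a framework); the network, data, and hyperparameters are placeholders.

      # Training a small CNN with error back propagation: the loss is computed after the
      # forward pass, loss.backward() back-propagates the error, and the optimizer updates
      # the weights so the error loss decreases.
      import torch
      import torch.nn as nn

      model = nn.Sequential(
          nn.Conv2d(1, 8, kernel_size=3, padding=1),   # convolutional layer
          nn.ReLU(),
          nn.MaxPool2d(2),                             # pooling layer
          nn.Flatten(),
          nn.Linear(8 * 14 * 14, 10),                  # output layer
      )
      criterion = nn.CrossEntropyLoss()
      optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

      images = torch.randn(16, 1, 28, 28)              # dummy batch of images
      labels = torch.randint(0, 10, (16,))             # dummy class labels

      for step in range(5):
          logits = model(images)                       # forward propagation
          loss = criterion(logits, labels)             # prediction error loss
          optimizer.zero_grad()
          loss.backward()                              # back-propagate the error loss
          optimizer.step()                             # update weights to reduce the loss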
  • the pixel value of the image can be a red-green-blue (RGB) color value, and the pixel value can be a long integer representing the color.
  • For example, the pixel value may be 256*Red + 100*Green + 76*Blue, where Blue represents the blue component, Green represents the green component, and Red represents the red component. In each color component, the smaller the value, the lower the brightness, and the larger the value, the higher the brightness.
  • the pixel values can be grayscale values.
  • A voiceprint is a sound wave spectrum, displayed by electroacoustic instruments, that carries speech information. It is a biological feature composed of more than a hundred characteristic dimensions such as wavelength, frequency, and intensity. Voiceprint recognition is a technology for identifying an unknown voice by analyzing the characteristics of one or more voice signals; the identity of the speaker can be determined through the voiceprint so that targeted responses can be given. It mainly involves two phases: a registration phase and a verification phase. In the registration phase, a corresponding voiceprint model is established according to the voiceprint features of the speaker's voice; in the verification phase, the speaker's voice is received, its voiceprint features are extracted and matched against the registered voiceprint model, and if the matching succeeds, the speaker is confirmed to be the originally registered speaker.
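  • As an illustration of the registration/verification idea only (not the patent's algorithm), the sketch below compares voiceprint feature vectors with cosine similarity; the feature vectors are assumed to come from some upstream extractor, and the threshold value is arbitrary.

      # Registration phase: store an enrolled voiceprint vector per speaker.
      # Verification phase: compare a new voiceprint vector with the enrolled one.
      import numpy as np

      def cosine_similarity(a, b):
          return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

      registered = {}   # speaker name -> enrolled voiceprint vector

      def register(name, voiceprint):
          registered[name] = voiceprint

      def verify(name, voiceprint, threshold=0.8):
          """Return True if the new voiceprint matches the enrolled one closely enough."""
          return name in registered and cosine_similarity(registered[name], voiceprint) >= threshold

      register("driver", np.array([0.2, 0.7, 0.1]))          # registration phase
      print(verify("driver", np.array([0.21, 0.69, 0.12])))  # verification phase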
  • Sound source localization technology is a technology that uses acoustic and electronic devices to receive target sound field information to determine the location of the target sound source.
  • Sound source localization with a microphone array refers to picking up the sound source signal with the microphone array and, by analyzing and processing the multi-channel sound signals, determining one or more sound source planes or spatial coordinates in the spatial domain, that is, obtaining the position of the sound source; the beams of the microphone array can then be further directed at the speaker.
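  • A rough sketch of one common way to estimate the time difference of arrival (TDOA) between two microphones via cross-correlation; this is an assumption for illustration, not the localization algorithm used by the patent.

      # Estimate the inter-microphone delay from the peak of the cross-correlation of the
      # two channels; the delay (TDOA) constrains the direction of the sound source.
      import numpy as np

      fs = 16000                        # sample rate in Hz
      delay_samples = 12                # simulated inter-microphone delay
      signal = np.random.randn(4096)
      mic1 = signal
      mic2 = np.concatenate([np.zeros(delay_samples), signal])[:len(signal)]

      corr = np.correlate(mic2, mic1, mode="full")
      lag = np.argmax(corr) - (len(mic1) - 1)   # lag in samples between the two channels
      tdoa = lag / fs                           # time difference of arrival in seconds
      print(f"estimated delay: {lag} samples ({tdoa * 1e3:.2f} ms)")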
  • For the application of voiceprint recognition, the voiceprint information of the passengers must first be stored in advance; a person whose voiceprint has not been recorded cannot be recognized. Moreover, the voice of the same person is variable, and in an in-vehicle environment with many people it is difficult to extract voiceprint features when several people speak at the same time; loud ambient noise also interferes with recognition.
  • The technical problems to be solved by this application include the following: when there are multiple users in the vehicle cabin and a specific type of instruction is collected, how to accurately determine the specific location of the speaker and execute the corresponding instruction in a targeted manner.
  • the voice matching method provided by the embodiment of the present application can be applied to the human-computer interaction scenario of the intelligent vehicle.
  • the following exemplarily enumerates the human-computer interaction scenarios to which the voice command control method in this application is applied, which may include the following two scenarios.
  • Multiple loudspeakers in the car are distributed at different positions in the cabin. The loudspeakers can provide music at different volumes for passengers in different areas of the car according to the needs of the passengers and the driver.
  • For example, passenger A wants to rest and needs a quiet environment, so the volume of the loudspeaker in his area can be turned down to the lowest level, while passenger B wants to listen to music normally, so the loudspeaker in his area can be set to a normal volume.
  • Different audio content can also be provided for users in different areas. For example, a child in the back row can have fairy tales played on the rear loudspeakers, while the driver and co-driver in the front row who want to listen to pop music can have pop music played on the loudspeakers in the front-row area.
  • The embodiment of the present invention can provide a method for a member in the car cabin to control the loudspeaker in the area where the voice commander is located. For example, as shown in Figure 1, there are four people A, B, C, and D in the car, sitting in the driver's seat, the co-driver's seat, the left rear seat, and the right rear seat, respectively. Member D says: "Turn down the volume". At this time, as shown in Figure 7A, the audio instruction information and the video information of the multiple members in the car can be obtained through the camera and the microphone in the cabin; by feature-matching the lip movement information of the members in the car with the instruction information, the speaking member is determined, and the loudspeaker at the position of the speaking member is controlled based on the instruction and position of that member. If the speaking member is identified as the passenger in the rear right, the loudspeakers in the rear-right area are turned down.
  • In another example, the audio instruction information and the video information of the members in the car are likewise obtained through the camera and the microphone in the car cabin. By analyzing the lip movement information and voice information of members A, B, C, and D, it is determined that the speaking member is member C, and the loudspeaker at the position of the speaking member is controlled based on the instruction and position of member C; if it is recognized that the speaking member C is the passenger in the rear left, the loudspeaker in the rear left is controlled to play the song "****".
  • Similarly, the air-conditioning outlets in the car are distributed at different positions in the cabin. Multiple air-conditioning outlets can provide different air volumes for passengers in different areas of the car according to the needs of the passengers and the driver, so that the temperature in a local area can be adjusted differentially. For example, passenger A wants a lower temperature, so the air volume in his area can be increased; passenger B feels cold, so a command can be used to adjust the air outlet in his area so that its direction does not blow directly at people, or the air volume is reduced.
  • passengers in different areas will also adjust various parameters of the seats according to their own needs.
  • the audio command information and video information of the members in the car can be obtained respectively through the camera and the microphone in the car cabin.
  • Based on the members' lip movement information and voice information, the speaking member is determined, and based on the instruction and position of the speaking member, the air outlet direction or air volume of the air-conditioning outlet where that member is located is controlled, or the seat back angle or seat height is adjusted.
  • In-cabin voice command control therefore requires that, for some in-vehicle facilities, the target area for executing a specific command be distinguished, so it is necessary to identify at which position the member who issued the voice command is located.
  • When the driver wants to control the driving of the vehicle, he can also choose to do so by voice.
  • The embodiment of the present invention can provide a voice command authority recognition method for the driving control of the vehicle. For example, when there are many people in the vehicle and a voice command related to the driving control of the vehicle is received, for example, "Switch…", the vehicle needs to obtain the lip motion information of the driver's-seat member as shown in Figure 2 and, as shown in Figure 7B, feature-match the acquired lip motion information of the driver's-seat member with the voice command information to obtain the degree of matching with the command information, thereby judging whether it is a voice command issued by the driver's-seat member and determining whether to execute the command.
  • The specific judgment of whether it is a voice command issued by the driver's-seat member can also be made by obtaining the lip motion information of multiple members in the car and analyzing the matching degree between each member's lip motion information and the command information, to check whether the lip motion information of the member in the driver's position has the highest matching degree, and then judging whether to execute the instruction.
  • the application scenarios in the vehicle cabin of FIG. 1 and FIG. 2 are only several exemplary implementations in the embodiments of the present invention, and the embodiments of the present invention may be implemented in various and flexible implementation manners.
  • For example, in scenario 1, it is not necessary to obtain the lip movement information of all the members in the car; the lip movement information of only some members may be obtained according to the specific instruction type, for example, obtaining only the lip movement information of the front-row members for an instruction concerning the front row.
  • In scenario 2, it is not necessarily the lip motion information of the driver's-seat member that is obtained. For example, if the vehicle defaults to the owner having the operating authority for the identified command, the location of the owner is obtained, and the lip motion information of the owner is extracted to determine whether the instruction was issued by the owner.
  • the model can be obtained by model training, and the corresponding matching degree can be output by inputting the lip motion information and instruction information.
  • the method provided by this application is described in the following:
  • Any neural network training method provided in this application involves the fusion processing of computer hearing and vision, and can be specifically applied to data processing methods such as data training, machine learning, and deep learning.
  • Symbolic and formalized intelligent information modeling, extraction, preprocessing, training, and so on are performed on the lip motion information and the M pieces of voice information, and a trained target feature matching model is finally obtained.
  • The voice matching method can use the above-mentioned trained target feature matching model: input data (such as the voice information to be recognized in this application and the lip motion information of N users) is fed into the trained target feature matching model to obtain output data (such as the degree of matching between the lip motion information of each of the N users and the voice information to be recognized).
  • a neural network training method and a speech matching method provided in the embodiments of this application are inventions based on the same concept, and can also be understood as two parts in a system, or as part of an overall process. Two stages: such as model training stage and model application stage.
  • FIG. 3 is a system architecture 100 provided by an embodiment of the present invention.
  • the data collection device 160 is used to collect training data, and in this application, the data collection device 160 may include a microphone and a camera.
  • the training data (that is, the input data on the model training side) may include: video sample data and voice sample data, that is, the lip motion information and M pieces of voice information of the training user in the embodiment of the present invention, respectively, wherein , the M pieces of voice information may include voice information matched with the lip motion information of the training user.
  • For example, the video sample data is the lip motion image sequence of a training user uttering the speech "The weather is very good today, where are we going to play?", while the voice sample data contains the speech waveform sequence of that training user uttering "The weather is very good today, where are we going to play?" (as the positive speech sample) and (M-1) other speech waveform sequences (as negative speech samples).
  • the above-mentioned video sample data and audio sample data may be collected by the data collection device 160 or downloaded from the cloud.
  • FIG. 3 is only an exemplary architecture, which is not limited thereto.
  • The data collection device 160 stores the training data in the database 130, and the training device 120 trains the target feature matching model/rule 101 based on the training data maintained in the database 130 (the target feature matching model 101 here is the target feature matching model in this embodiment of the present invention, for example a neural network model trained in the above-mentioned training phase, which can be used for feature matching between speech and lip motion trajectories).
  • The target feature matching model/rule 101 can be used to implement any speech matching method provided by the embodiments of the present invention, that is, the audio data and image data acquired by the data collection device 160 are input into the target feature matching model/rule 101 after relevant preprocessing, so as to obtain the matching degree/confidence between the image sequence features of the lip movements of multiple users and the speech features to be recognized. The target feature matching model/rule 101 in this embodiment of the present invention may specifically be a spatiotemporal convolutional network (STCNN).
  • STCNN spatiotemporal convolutional network
  • the spatiotemporal convolutional network may be obtained by training a convolutional neural network.
  • the training data maintained in the database 130 may not necessarily come from the collection of the data collection device 160, and may also be received from other devices.
  • The training device 120 does not necessarily train the target feature matching model/rule 101 entirely based on the training data maintained by the database 130, and may also obtain training data from the cloud or elsewhere for model training. The above description should not be construed as a limitation on the embodiments of the present invention.
  • a target feature matching model/rule 101 is obtained by training according to the training device 120, and the target feature matching model/rule 101 may be referred to as an audio-visual cross convolutional neural network (V&A Cross CNN)/ Spatiotemporal Convolutional Neural Networks.
  • The target feature matching model provided in this embodiment of the present invention may include a first model, a second model, and a third model, where the first model is used for extracting speech features, the second model is used for extracting the image sequence features of the lip movements of multiple users (N users in this application), and the third model is used for calculating the matching degree/confidence between the above-mentioned speech features and the image sequence features of the N users.
  • Optionally, the first model, the second model, and the third model may all be convolutional neural networks; that is, the target feature matching model/rule 101 itself can be regarded as one whole spatiotemporal convolutional neural network that includes multiple independent networks, such as the above-mentioned first model, second model, and third model.
  • the embodiments of the present invention may also be implemented through other model training and execution schemes.
  • The training data (that is, the input data on the model training side) in this embodiment of the present invention may include video sample data and voice sample data, which are respectively the training user's lip motion information and M pieces of voice information, where the lip motion information includes the lip motion information corresponding to various voice command sentences of different users, and the voice information includes voice command sentences issued by different users.
  • some negative samples may also be included, that is, lip motion information corresponding to sentences that are not voice commands, and voice information that is not voice commands.
  • the voice command here means that the on-board system can recognize and respond to the corresponding voice information, which can be a keyword or a complete sentence.
  • The above-mentioned video sample data and audio sample data may be collected by the data collection device 160, downloaded from the cloud, or provided by a third-party data holder. Further, the data collection device 160 stores the training data in the database 130, and the training device 120 trains the target feature matching model/rule 101 based on the training data maintained in the database 130 (the target feature matching model 101 here is the target feature matching model in this embodiment of the present invention, for example a neural network model trained in the above-mentioned training phase, which can be used for feature matching between speech and lip motion trajectories). The target feature matching model/rule 101 can be used to implement any speech matching method provided by the embodiments of the present invention, that is, the audio data and image data acquired by the data collection device 160 are input into the target feature matching model/rule 101 after relevant preprocessing, so as to obtain the matching degree/confidence between the image sequence features of the lip movements of multiple users and the speech features to be recognized.
  • The target feature matching model/rule 101 may specifically be a convolutional neural network (CNN) in the embodiments provided in this application.
  • CNN convolutional network
  • the training data maintained in the database 130 may not necessarily come from the collection of the data collection device 160, and may also be received from other devices.
  • The training device 120 does not necessarily train the target feature matching model/rule 101 entirely based on the training data maintained by the database 130, and may also obtain training data from the cloud or elsewhere for model training. The above description should not be construed as a limitation on the embodiments of the present invention.
  • the target feature matching model/rule 101 is obtained by training according to the training device 120 .
  • The target feature matching model provided in this embodiment of the present invention may include a first model and a second model. The first model is used to match the voice command and identify the command information corresponding to the voice command, specifically the command identifier or the text feature of the command. The second model is used to identify, based on the image sequence features of the N users, the voice command corresponding to each piece of lip motion information, for example, the identifier of the command that the lip motion information can be matched to and its matching degree. Finally, according to the instruction identifier corresponding to the voice instruction, the instruction identifier corresponding to each user's lip motion information, and their matching degrees, the target user who issued the voice instruction is output.
  • the first model and the second model may be CNN, RNN, DBN, DNN, and the like.
  • The training of the first model takes the voice command as the input and the identifier corresponding to the voice command (the identifier may be expressed in the form of a code) as the label.
  • the second model takes the user's lip motion information as the input (the lip motion information may specifically be a lip motion image sequence feature, such as the opening and closing amplitude of the lips according to time sampling as a vector sequence), and the instruction identifier corresponding to the lip motion information and the matching degree is the output, where the instruction identifier can be the code corresponding to the instruction, and the matching degree can be the output matching value, and the judgment of whether it matches is made according to the matching value.
  • The target feature matching model/rule 101 trained by the training device 120 can be applied to different systems or devices, such as the execution device 110 shown in FIG. 4. The execution device 110 can be a terminal, such as a mobile phone terminal, a tablet computer, a laptop, an augmented reality (AR)/virtual reality (VR) device, a smart wearable device, a smart robot, an in-vehicle terminal, or a smart cockpit environment, and can also be a server or the cloud.
  • AR Augmented Reality
  • VR Virtual Reality
  • The execution device 110 is configured with an I/O interface 112 for data interaction with external devices. The user can input data to the I/O interface 112 using the client device 140 (the client device in this application may also include data acquisition devices such as microphones and cameras). The input data (that is, the input data on the model application side) may include, in this embodiment of the present invention, the voice information to be recognized and the lip motion information of N users, that is, the speech waveform sequence in the target time period, and the lip motion information of each of the N users includes an image sequence of the corresponding user's lip motion in the target time period.
  • the input data here may be input by a user or provided by a related database, which varies according to different application scenarios, which is not specifically limited in this embodiment of the present invention.
  • the client device 140 may be on the same device as the execution device 110 , and the data collection device 160 , the database 130 , and the training device 120 may also be on the same device as the execution device 110 and the client device 140 .
  • For example, the robot extracts, from the audio data and image data collected by the client device 140 (including a microphone, a camera, and a processor), the voice information to be recognized and the lip motion information of the N users.
  • The execution device 110 inside the robot can then perform feature matching between the extracted voice information and the lip motion information, and finally output the result to the client device 140 for analysis by the processor in the client device 140.
  • the equipment on the model training side can be inside the robot or in the cloud.
  • When the model training side is inside the robot, the robot can perform model training or model update and optimization, that is, the robot has both the function of the model training side and the function of the model application side; when the model training side is in the cloud, the robot side can be considered to have only the function of the model application side.
  • the client device 140 and the execution device 110 may not be on the same device, that is, the collection of audio data and image data, and the extraction of the voice information to be recognized and the lip motion information of the N users can be performed by the client device 140 (for example, Smartphones, intelligent robots, etc.), and the process of feature matching between the speech information to be recognized and the lip motion information of N users can be performed by the execution device 110 (eg, cloud server, server, etc.).
  • Alternatively, the acquisition of audio data and image data is performed by the client device 140, while the extraction of the voice information to be recognized and the lip motion information of the N users, and the feature matching between the voice information to be recognized and the lip motion information of the N users, are all completed by the execution device 110.
  • the user can manually specify input data, which can be operated through the interface provided by the I/O interface 112 .
  • the client device 140 can automatically send the input data to the I/O interface 112 . If the user's authorization is required to request the client device 140 to automatically send the input data, the user can set the corresponding permission in the client device 140 .
  • the user can view the result output by the execution device 110 on the client device 140, and the specific presentation form can be a specific manner such as display, sound, and action.
  • The client device 140 can also serve as a data acquisition terminal (such as a microphone or a camera) to collect the input data of the I/O interface 112 and the output result of the I/O interface 112 as shown in the figure as new sample data and store them in the database 130.
  • Of course, the I/O interface 112 may also directly store, as new sample data, the input data input into the I/O interface 112 and the output result of the I/O interface 112 as shown in the figure into the database 130.
  • The preprocessing module 113 is configured to perform preprocessing according to the input data (such as the voice data) received by the I/O interface 112. In this embodiment of the present invention, the preprocessing module 113 may be used to preprocess the voice data, for example, to extract the voice information to be recognized from the voice data.
  • The preprocessing module 114 is configured to perform preprocessing according to the input data (such as the image data) received by the I/O interface 112. In this embodiment of the present invention, the preprocessing module 114 may be used to preprocess the image data, for example, to extract from the image data the lip motion information of the N users corresponding to the voice information to be recognized.
  • When the execution device 110 preprocesses the input data, or when the calculation module 111 of the execution device 110 performs calculations and other related processing, the execution device 110 can call the data, code, and so on in the data storage system 150 for the corresponding processing, and the data and instructions obtained by the corresponding processing may also be stored in the data storage system 150. Finally, the I/O interface 112 returns the output result, such as the matching degrees between the lip motion information of the N users and the voice information to be recognized in this embodiment of the present invention, or the target user ID with the highest matching degree, to the client device 140; the client device 140 determines the user information of the target user according to the matching degree and generates a control instruction matching the user information.
  • the training device 120 can generate corresponding target feature matching models/rules 101 based on different training data for different targets or tasks, and the corresponding target feature matching models/rules 101 can be used to implement The above goals or accomplish the above tasks to provide the user with the desired result.
  • FIG. 4 is only a schematic diagram of a system architecture provided by an embodiment of the present invention, and the positional relationship between the devices, devices, modules, etc. shown in the figure does not constitute any limitation.
  • the data storage system 150 is an external memory relative to the execution device 110 , and in other cases, the data storage system 150 may also be placed in the execution device 110 .
  • the convolutional neural network CNN is a deep neural network with a convolutional structure.
  • A deep learning architecture refers to performing multiple levels of learning at different levels of abstraction through machine learning algorithms.
  • a CNN is a feed-forward artificial neural network in which each neuron responds to overlapping regions in images fed into it.
  • a convolutional neural network (CNN) 200 may include an input layer 210 , a convolutional layer/pooling layer 220 , and a neural network layer 230, where the pooling layer is optional.
  • The convolutional layer/pooling layer 220 may include layers 221-226 as examples. In one implementation, layer 221 is a convolutional layer, layer 222 is a pooling layer, layer 223 is a convolutional layer, layer 224 is a pooling layer, layer 225 is a convolutional layer, and layer 226 is a pooling layer; in another implementation, layers 221 and 222 are convolutional layers, layer 223 is a pooling layer, layers 224 and 225 are convolutional layers, and layer 226 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
  • the convolution layer 221 may include many convolution operators, which are also called kernels, and their role in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • The convolution operator can essentially be a weight matrix, which is usually pre-defined. In the process of convolving an image, the weight matrix is usually applied pixel by pixel (or two pixels by two pixels, and so on, depending on the value of the stride) along the horizontal direction on the input image, to complete the work of extracting specific features from the image.
  • The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image; during the convolution operation, the weight matrix extends over the entire depth of the input image. Therefore, convolution with a single weight matrix produces a convolutional output with a single depth dimension, but in most cases a single weight matrix is not used; instead, multiple weight matrices of the same dimensions are applied.
  • the output of each weight matrix is stacked to form the depth dimension of the convolutional image.
  • Different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract image edge information, another weight matrix is used to extract specific colors of the image, and yet another weight matrix is used to blur unwanted noise in the image, and so on.
  • the dimensions of the multiple weight matrices are the same, and the dimension of the feature maps extracted from the weight matrices with the same dimensions are also the same, and then the multiple extracted feature maps with the same dimensions are combined to form the output of the convolution operation .
  • weight values in these weight matrices need to be obtained through a lot of training in practical applications, and each weight matrix formed by the weight values obtained by training can extract information from the input image, thereby helping the convolutional neural network 200 to make correct predictions.
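  • The point about multiple weight matrices stacking into the depth dimension can be illustrated with the following sketch (assuming PyTorch); the channel counts and image size are arbitrary.

      # Applying 16 convolution kernels of the same size to one RGB image; the 16 outputs
      # are stacked along the channel axis to form the depth dimension of the feature map.
      import torch
      import torch.nn as nn

      image = torch.randn(1, 3, 32, 32)          # one RGB image: depth 3, 32x32 pixels
      conv = nn.Conv2d(in_channels=3,            # kernel depth matches the input depth
                       out_channels=16,          # 16 different kernels / weight matrices
                       kernel_size=3,
                       stride=1,
                       padding=1)

      features = conv(image)
      print(features.shape)                      # torch.Size([1, 16, 32, 32]): 16 stacked feature maps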
  • Compared with the features extracted by the initial convolutional layer (e.g., 221), the features extracted by the later convolutional layers (e.g., 226) become more and more complex, such as features with high-level semantics.
  • features with higher semantics are more suitable for the problem to be solved.
  • A pooling layer can follow a convolutional layer: it can be one convolutional layer followed by one pooling layer, or multiple convolutional layers followed by one or more pooling layers.
  • the pooling layer may include an average pooling operator and/or a max pooling operator for sampling the input image to obtain a smaller size image.
  • the average pooling operator can calculate the average value of the pixel values in the image within a certain range.
  • the max pooling operator can take the pixel with the largest value within a specific range as the result of max pooling. Also, just as the size of the weight matrix used in the convolutional layer should be related to the size of the image, the operators in the pooling layer should also be related to the size of the image.
  • the size of the output image after processing by the pooling layer can be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
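  • A small sketch (again assuming PyTorch) of the max pooling and average pooling operators described above; the tensor shapes are illustrative.

      # Max pooling keeps the largest value in each 2x2 window, average pooling keeps the
      # mean; both halve the spatial size of the feature map.
      import torch
      import torch.nn as nn

      feature_map = torch.randn(1, 16, 32, 32)
      max_pool = nn.MaxPool2d(kernel_size=2)     # each output pixel = max of a 2x2 region
      avg_pool = nn.AvgPool2d(kernel_size=2)     # each output pixel = mean of a 2x2 region

      print(max_pool(feature_map).shape)         # torch.Size([1, 16, 16, 16])
      print(avg_pool(feature_map).shape)         # torch.Size([1, 16, 16, 16])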
  • After processing by the convolutional layer/pooling layer 220, the convolutional neural network 200 is not yet sufficient to output the required output information, because, as mentioned before, the convolutional layer/pooling layer 220 only extracts features and reduces the parameters brought by the input image. To generate the final output information (the required class information or other related information), the convolutional neural network 200 needs to use the neural network layer 230 to generate one output or a set of the required number of class outputs. Therefore, the neural network layer 230 may include multiple hidden layers (231, 232 to 23n as shown in FIG. 4) and an output layer 240; the parameters contained in the multiple hidden layers may be pre-trained based on training data relevant to a specific task type, for example, the task type can include image recognition, image classification, image super-resolution reconstruction, and so on.
  • After the multiple hidden layers in the neural network layer 230, the last layer of the entire convolutional neural network 200 is the output layer 240. The output layer 240 has a loss function similar to classification cross entropy, which is specifically used to calculate the prediction error. Once the forward propagation of the entire convolutional neural network 200 (the propagation from 210 to 240 in FIG. 4) is completed, the back propagation (the propagation from 240 to 210 in FIG. 4) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 200 and the error between the result output by the convolutional neural network 200 through the output layer and the ideal result.
  • the convolutional neural network 200 shown in FIG. 4 is only used as an example of a convolutional neural network. In a specific application, the convolutional neural network may also exist in the form of other network models.
  • For example, multiple convolutional layers/pooling layers may be arranged in parallel, and the separately extracted features are all input to the neural network layer 230 for processing.
  • the normalization layer in this application can in principle be performed after any layer of the above CNN, or before any layer, and the feature matrix output by the previous layer is used as input, and its output can also be as input to any functional layer in the CNN.
  • the normalization layer is generally performed after the convolution layer, and the feature matrix output by the previous convolution layer is used as the input matrix.
  • FIG. 5 is a schematic flowchart of a training method of a neural network provided by an embodiment of the present invention.
  • the method can be applied to the application scenarios and system architectures described in the above-mentioned FIGS. 1 and 2, and can be specifically applied to In the training device 120 of FIG. 3 described above.
  • the following description will be given in conjunction with FIG. 5 , taking the execution subject as the training device 120 in FIG. 3 or a device including the training device 120 as an example.
  • the method may include the following steps S701-S702.
  • S701 Acquire training samples, where the training samples include lip motion information of the training user and M pieces of instruction information.
  • For example, the M pieces of instruction information include the waveform sequence or text information of the above-mentioned instruction "turn the air-conditioning temperature up a little" as the positive instruction sample, and other instruction information, such as the voice information of "turn down the seat back angle a little", "open the window", and "turn off the music", as negative samples.
  • the M pieces of instruction information include instruction information that matches the lip motion information of the training user and (M-1) instruction information that does not match the lip motion information of the training user.
  • the above-mentioned lip motion information is the image sequence of the continuous lip motion corresponding to the instruction information of user A: "Turn up the temperature of the air conditioner a little" (that is, the video of the mouth shape), and the above-mentioned M pieces of instruction information include: The speech waveform sequence of the above-mentioned positive speech samples, and the speech waveform sequence of M-1 negative samples. It can be understood that the above-mentioned M pieces of instruction information may also include a plurality of positive samples and negative samples, that is, the number of positive samples and negative samples is not specifically limited, as long as both are included.
  • S702 Taking the lip motion information of the training user and the M pieces of voice information as training inputs, and taking the matching degrees between the lip motion information of the training user and the M pieces of voice information as M labels , train the initialized neural network to obtain the target feature matching model.
  • The initialized neural network model can be trained to obtain the target feature matching model required in this application, and the target feature matching model can be used to determine the matching relationship between the instruction information to be recognized and the lip motion information of each of multiple users, so as to implement any one of the speech matching methods in this application.
  • Specifically, taking the lip motion information of the training user and the M pieces of instruction information as training inputs, and taking the matching degrees between the lip motion information of the training user and the M pieces of instruction information respectively as M labels, training the initialized neural network to obtain the target feature matching model includes: inputting the lip motion information of the training user and the M pieces of instruction information into the initialized neural network, and calculating the degree of matching between each of the M pieces of instruction information and the lip motion information of the training user; comparing the calculated matching degrees with the M labels; and training the initialized neural network accordingly to obtain the target feature matching model.
  • The target feature matching model includes a first model, a second model, and a third model. Inputting the lip motion information of the training user and the M pieces of instruction information into the initialized neural network and calculating the degree of matching between the M pieces of instruction information and the lip motion information of the training user includes: inputting the M pieces of instruction information into the first model to obtain M voice features, where each of the M voice features is a K-dimensional voice feature and K is an integer greater than 0; inputting the lip motion information of the training user into the second model to obtain the image sequence feature of the training user, which is a K-dimensional image sequence feature; and inputting the M voice features and the image sequence feature of the training user into the third model to calculate the degree of matching between the M voice features and the image sequence feature of the training user.
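  • A non-authoritative sketch of this three-part structure, assuming PyTorch; the encoder architectures, input dimensions, and the value of K are illustrative assumptions rather than the patent's actual networks.

      # First model: K-dimensional speech feature; second model: K-dimensional lip-motion
      # image-sequence feature; third model: matching degree between the two in [0, 1].
      import torch
      import torch.nn as nn

      K = 128

      class SpeechEncoder(nn.Module):            # "first model"
          def __init__(self, in_dim=4000):
              super().__init__()
              self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, K))
          def forward(self, waveform):
              return self.net(waveform)

      class LipEncoder(nn.Module):               # "second model"
          def __init__(self, frames=25, lip_dim=64):
              super().__init__()
              self.net = nn.Sequential(nn.Flatten(), nn.Linear(frames * lip_dim, 256),
                                       nn.ReLU(), nn.Linear(256, K))
          def forward(self, lip_sequence):
              return self.net(lip_sequence)

      class Matcher(nn.Module):                  # "third model": matching degree in [0, 1]
          def __init__(self):
              super().__init__()
              self.net = nn.Sequential(nn.Linear(2 * K, 64), nn.ReLU(),
                                       nn.Linear(64, 1), nn.Sigmoid())
          def forward(self, speech_feat, lip_feat):
              return self.net(torch.cat([speech_feat, lip_feat], dim=-1))

      speech = SpeechEncoder()(torch.randn(1, 4000))   # one instruction waveform
      lips = LipEncoder()(torch.randn(1, 25, 64))      # one lip-motion image sequence
      print(Matcher()(speech, lips))                   # matching degree for this pair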
  • That is, the lip motion information of a training user, together with the matching instruction information and a plurality of pieces of unmatched instruction information, is used as the input of the initialized neural network, and the actual matching degrees between the above-mentioned M pieces of instruction information and the training user's lip motion information are used as the labels for training the initialized neural network model into the target feature matching model. For example, the matching degree corresponding to a complete match is labeled 1, and the matching degree corresponding to a mismatch is labeled 0; the closer the matching degrees between the training user's lip motion information and the M pieces of instruction information calculated by the trained network are to the M labels, the closer the trained network is to the target feature matching model.
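  • A minimal, self-contained training sketch of this labeling scheme, assuming PyTorch and pre-extracted features; the feature dimensions, the value of M, and the small matching network are placeholders.

      # Label 1 for the matched instruction, label 0 for the (M-1) unmatched instructions;
      # binary cross entropy pushes the predicted matching degrees toward the labels.
      import torch
      import torch.nn as nn

      matcher = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
      criterion = nn.BCELoss()
      optimizer = torch.optim.Adam(matcher.parameters(), lr=1e-3)

      M = 5
      lip_feature = torch.randn(128)                   # one training user's lip-motion feature
      instruction_features = torch.randn(M, 128)       # 1 matching + (M-1) non-matching instructions
      labels = torch.tensor([1.0, 0.0, 0.0, 0.0, 0.0]).unsqueeze(1)

      for epoch in range(10):
          pairs = torch.cat([instruction_features, lip_feature.expand(M, -1)], dim=1)
          match_degree = matcher(pairs)                # predicted matching degrees, shape (M, 1)
          loss = criterion(match_degree, labels)
          optimizer.zero_grad()
          loss.backward()
          optimizer.step()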
  • FIG. 9 is a schematic flowchart of another voice command control method provided by an embodiment of the present invention, which is mainly applicable to the scene of voice interaction control of in-vehicle equipment in the car by members of the car.
  • When the in-vehicle device receives a voice command for the control of in-vehicle equipment, this solution can be used to accurately identify at which position the member who issued the voice command is located.
  • The method can be applied to the application scenarios and system architectures in the vehicle cabin, and can be specifically applied to the client device 140 and the execution device 110 in the above-mentioned figures, which may be located inside the car.
  • the following description will be given by taking the execution subject as an intelligent vehicle as an example in conjunction with FIG. 9 .
  • the method may include the following steps S1601-S1605
  • Step S1601 Acquire in-vehicle audio data.
  • the in-vehicle audio data collected by the in-vehicle microphone is acquired.
  • The audio data includes ambient sounds inside the car, such as music from the loudspeakers and noise from the engine and air conditioner, ambient sounds from outside the car, and voice commands issued by the user.
  • step S1601 can be specifically:
  • S1601a Acquire audio data collected by multiple microphones in the vehicle.
  • The in-vehicle microphone array may be in the real-time audio data collection state after the vehicle is started, or the microphone array may enter the audio collection state after a vehicle member such as the owner performs a specific operation, for example, turning on the audio collection function.
  • the way in which the microphone array collects audio data is that multiple microphones respectively collect audio data at different positions in the vehicle cabin.
  • S1601b Acquire target audio data based on audio data collected by multiple microphones.
  • the in-vehicle microphone array is usually set up with multiple microphones at different positions in the vehicle, so the acquisition of audio data in the in-vehicle environment can have multiple audio sources to choose from.
  • For example, when music is playing from the loudspeaker near the front passenger seat and a rear passenger issues an instruction, the audio data collected by the microphone at the front passenger seat will contain louder music and a relatively weak instruction signal from the rear passenger, while the rear microphone will collect a relatively clear voice signal accompanied by faint music.
  • The audio data collected by each microphone is therefore usually preprocessed, and the target audio data is then selected through analysis and comparison. For example, because environmental noise, music, and voice commands occupy different frequency bands, the preprocessing may be to filter the audio data collected by the multiple microphones and select the filtered audio signal with the strongest voice signal as the target audio signal.
  • the audio signal may be the original audio signal collected by the microphone, or may be the pre-processed audio signal.
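  • One possible (assumed) way to select the target audio signal among several microphone channels is to compare the energy in a rough speech band, as sketched below; the band limits and sample rate are illustrative.

      # Pick the microphone channel whose spectrum carries the most energy in a rough
      # speech band (about 300-3400 Hz), computed from the FFT of each channel.
      import numpy as np

      def speech_band_energy(channel, fs=16000, low=300.0, high=3400.0):
          spectrum = np.abs(np.fft.rfft(channel))
          freqs = np.fft.rfftfreq(len(channel), d=1.0 / fs)
          band = (freqs >= low) & (freqs <= high)
          return float(np.sum(spectrum[band] ** 2))

      def select_target_audio(channels, fs=16000):
          """channels: list of 1-D arrays, one per in-vehicle microphone."""
          energies = [speech_band_energy(c, fs) for c in channels]
          return channels[int(np.argmax(energies))]    # channel with the strongest speech content

      mics = [np.random.randn(16000) for _ in range(4)]  # dummy audio from 4 cabin microphones
      target = select_target_audio(mics)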
  • Step S1602 When it is recognized that the audio data includes the first type of instruction information, acquire image data.
  • For example, semantic recognition of the audio information can be performed based on an RNN model, and instruction content recognition can then be performed based on the recognized text information: the type of instruction is determined according to the content of the instruction, or the type of instruction is determined directly according to feature information, such as keywords, in the text information. There are many prior-art solutions for voice-based instruction recognition, which are not enumerated here one by one.
  • the audio data used for model input may be audio data that has been preprocessed by performing environmental noise filtering on the collected audio data, or may be directly input based on the collected audio data. Other voice recognition methods in the prior art may also be used to determine whether instruction information is included.
  • The first type of command information in this embodiment refers to command information that the in-vehicle device can receive and recognize and that requires the position of the command initiator to be determined so that the response can be made in the corresponding location area, that is, usually control instructions for the interior of the vehicle cabin, such as instructions related to air-conditioning adjustment, sound volume adjustment, and audio content selection.
  • the instruction information may be a sequence of speech waveforms in the audio data corresponding to the instruction time period, or a text feature sequence of text information within the target time period extracted from the audio data.
  • The speech information to be recognized mentioned elsewhere in this application is likewise the speech waveform sequence in the time period corresponding to the voice command; therefore, when the command information in Figure 9 takes the form of a speech waveform sequence, it corresponds to the speech information to be recognized.
  • the acquisition of in-vehicle audio and image data is the same as the audio acquisition by the microphone. It can be automatically started to perform real-time acquisition after the vehicle is started, or the real-time acquisition function can be enabled according to the user's instructions, or the default audio acquisition starts at the same time. data collection. There are usually multiple cameras on the vehicle, and there are also different types of cameras, such as monocular cameras, binocular cameras, TOF cameras, infrared cameras, etc. In this solution, the number of cameras that collect in-vehicle image data is not limited. The deployment location and number and the type of cameras can be selected and deployed by those skilled in the art according to the needs of specific solution implementation.
  • the microphone in step S1601 may be an independently set microphone, or may be a microphone integrated in the camera.
  • the image data may be the image data obtained by the camera while the vehicle-mounted processing system obtains the voice data through the microphone. That is, the above-mentioned audio data and image data are the original audio data and image data in a certain period of time, that is, the audio data source and the image data source. Optionally, the audio data and the image data are collected in the same time period under the same scene.
  • audio data can be collected through microphones set at various positions in the car, and image data in the cabin can be collected through the global camera in the car. It is also possible to collect audio data through a designated microphone, and collect image data in the vehicle cabin through cameras at multiple locations in the vehicle.
  • Step S1603 Extract the lip motion information of the members at N positions in the vehicle from the image data.
  • the position distribution of the members in the vehicle is determined, and the lip motion information of the members at each position is extracted, and the lip motion information carries the corresponding position identification.
  • The lip motion information of each member among the lip motion information of the plurality of members in the vehicle includes an image sequence of the corresponding user's lip motion in the corresponding target time period, the target time period being the time period corresponding to the instruction information in the audio. That is, the lip video of each member extracted from the original image data, i.e. the image sequence of continuous lip motion, contains the continuous mouth-shape change features of the corresponding member.
  • For example, the format of each frame of image in the image data collected by the camera is a 24-bit BMP bitmap, where the BMP image file (Bitmap-File) format is the image file storage format adopted by Windows; a 24-bit image uses 3 bytes to save the color value of each pixel, with each byte representing one color, arranged as red (R), green (G), and blue (B). The RGB color image is then converted into a grayscale image.
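  • A small illustrative sketch of converting one RGB frame to grayscale; the luminance weights shown are a common standard choice and are an assumption here, since the patent does not specify them.

      # Convert one 24-bit RGB frame (3 bytes per pixel) to a grayscale image.
      import numpy as np

      frame = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)  # H x W x (R, G, B)
      r = frame[:, :, 0].astype(np.float32)
      g = frame[:, :, 1].astype(np.float32)
      b = frame[:, :, 2].astype(np.float32)

      gray = (0.299 * r + 0.587 * g + 0.114 * b).astype(np.uint8)            # grayscale image
      print(gray.shape)                                                      # (480, 640)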
  • The smart device obtains at least one face area from the image data collected by the above-mentioned camera based on a face recognition algorithm, and further assigns a face ID to each face area, taking each face area as a unit; the consecutive image frames of the lip region of each face then form the lip motion image sequence of the corresponding member.
  • Step S1604 Input the instruction information and the lip motion information of the N members into a target feature matching model to obtain the degree of matching between the lip motion information of the N members and the instruction information, respectively.
  • The instruction information and the image sequence of the lip movement of each of the N members in the target time period are input into the target feature matching model as the audio-feature input and the video-feature input, respectively, and the matching degree between the instruction information and each member's lip motion information is calculated.
  • the matching degree may specifically be a value greater than or equal to 0 and less than or equal to 1.
  • The instruction information here may be in the form of a sound waveform as shown in FIG. 6 (an example diagram of a sound waveform provided by an embodiment of the present invention), or in the form of identification information of the instruction, such as a serial number, or in the form of an instruction sentence, and so on.
  • the voice information is the above-mentioned voice information to be recognized.
  • the smart device extracts audio features from the audio data obtained by the microphone array in S801.
  • Specifically, Mel-frequency cepstral coefficients (MFCC) can be used to extract the speech features, processing the data with a frame length of 20 ms.
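  • A sketch of MFCC feature extraction under the assumption that the librosa library is used (the patent names no library); the sample rate and coefficient count are illustrative.

      # Extract MFCC speech features with roughly 20 ms analysis frames at 16 kHz
      # (320 samples per frame, 10 ms hop).
      import numpy as np
      import librosa

      sr = 16000
      audio = np.random.randn(sr * 2).astype(np.float32)   # 2 seconds of dummy audio

      mfcc = librosa.feature.mfcc(y=audio, sr=sr,
                                  n_mfcc=13,               # 13 cepstral coefficients per frame
                                  n_fft=320,               # 20 ms analysis window
                                  hop_length=160)          # 10 ms hop between frames
      print(mfcc.shape)                                    # (13, number_of_frames)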
  • Step S1605 Determine which location area to execute the instruction corresponding to the instruction information according to the matching degree.
  • The corresponding determination strategy in S1605 may be to determine the in-vehicle position of the member whose lip motion information has the highest matching degree as the target area for executing the instruction information. For example, the operation of lowering the temperature or air volume of the air outlet is performed only for the target area.
  • S1604-S1605 can also be:
  • S1604 Input the instruction information and the lip motion information of one of the members into a target feature matching model to obtain the matching degree between the lip motion information of the member and the instruction information.
  • Step S1605 When the matching degree is greater than a preset threshold, execute the instruction corresponding to the instruction information in the location area of that member.
  • If the matching degree is less than the preset threshold, continue to judge, according to certain rules, the matching degree between the instruction information and the lip motion information of the member at another position in the car, until lip motion information whose matching degree is greater than the preset threshold is found, or until all members in the car have been matched, at which point the matching process ends.
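  • The two decision strategies can be summarized in the following plain-Python sketch; the position names, matching values, and threshold are illustrative.

      # Strategy 1: execute the instruction in the area of the best-matching member.
      # Strategy 2: check members one by one and stop at the first match above the threshold.
      match_degrees = {"driver": 0.12, "co-driver": 0.08, "rear-left": 0.91, "rear-right": 0.22}
      THRESHOLD = 0.8

      target_area = max(match_degrees, key=match_degrees.get)                              # strategy 1
      target_area_2 = next((pos for pos, deg in match_degrees.items() if deg > THRESHOLD), None)  # strategy 2

      print(target_area, target_area_2)    # e.g. rear-left rear-left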
  • FIG. 10 is a schematic flowchart of another voice matching method provided by an embodiment of the present invention, which is mainly applicable to the scenario in which only the driver has the authority to control the driving of the vehicle by voice. In order to avoid misoperation and misrecognition, when the in-vehicle device receives a voice command for vehicle driving control, it needs to judge whether the command was issued by the driver, and then decide, based on the recognition result, whether to execute the vehicle driving command.
  • the method may include the following steps S1701-S1705.
  • Step S1701 Acquire in-vehicle audio data.
  • The specific implementation of S1701 is the same as that of S1601.
  • Step S1702 When it is recognized that the audio data includes the second type of instruction information, acquire image data.
  • The second type of instruction information in S1702 refers only to instruction information related to the driving control of the vehicle, such as vehicle turning, acceleration, starting, switching of driving modes, and the like. When this type of instruction information is identified, the image data of the driver's-seat member needs to be acquired.
  • Step S1703 Extract the lip motion information of the member at the first position from the image data. See S1603 for how to extract the lip motion information and how to identify it.
  • Step S1704 Input the instruction information and the lip motion information of the member at the first position into the target feature matching model to obtain the matching degree between that member's lip motion information and the instruction information.
  • Step S1705 Determine whether to execute the instruction corresponding to the instruction information according to the matching degree.
  • S1705 can use a variety of judgment methods. Since the matching degree is generally in the form of a numerical value, S1705 may determine whether to execute the instruction information according to whether the matching degree is higher than a preset threshold. That is, when the matching degree is greater than the preset threshold, the instruction is considered to have been issued by the member at the first position, and the vehicle driving control instruction is executed; otherwise, the instruction is not executed.
  • S1705 may also determine whether the matching degree of the lip motion information of the member at the first position is the highest among the matching degrees between the instruction information and the lip motion information of all the members in the vehicle. In this case, in S1703, in addition to the lip motion information of the member at the first position, the lip motion information of the other members in the vehicle should also be extracted; and in S1704, in addition to inputting the instruction information and the lip motion information of the member at the first position into the target feature matching model, the lip motion information of the other members is also input into the target feature matching model to obtain the corresponding matching degrees.
  • The first position in the above embodiment is usually the driver's seat.
  • For example, the initial setting of the in-vehicle control system may grant the driver's seat member the authority to control the driving of the vehicle by voice, or the setting may be changed manually by the user based on the seating distribution of a particular ride, for example by granting both the driver's seat and the front passenger seat the vehicle driving control authority. In that case, the first position is the driver's seat and the front passenger seat.
  • When the embodiment of the present invention is specifically implemented, the image information and permission information of the family members who will use the vehicle may be entered when the vehicle is initialized, either by following prompts on the vehicle or through active setup by the vehicle owner.
  • In this case, the in-vehicle camera can obtain the position of a registered member who has driving control authority before or after the vehicle starts; then, when a vehicle-control-related command is recognized, the lip motion information at that member's position is used to determine whether the voice command was issued by the member at that position.
  • The embodiment of the present invention can also be applied to judging whether other types of commands are executable. For example, a call function can be set, manually or by default, so that only the vehicle owner or the driver may control it by voice. The above embodiments are only specific examples and do not limit the specific instruction types or the specific fixed positions.
  • FIG. 8 is an architectural diagram of a voice command control system provided by an embodiment of the present invention.
  • The intelligent vehicle 800 serves as the collection device for the audio data and the image data, and may further serve as the extraction device for the instruction information to be recognized and the lip motion information of the N users;
  • the matching between the instruction information and the lip motion information of the N users can be performed on the server/cloud service device 801 where the execution device 110 is located.
  • The above-mentioned extraction of the instruction information to be recognized and the lip motion information of the N users may also be performed on the device side where the execution device 110 is located, which is not specifically limited in this embodiment of the present invention.
  • The following description takes the cloud service device 801 in FIG. 8 as an example. As shown in FIG. 11, the method may include the following steps S1001-S1003.
  • Step S1001: Acquire instruction information and the lip motion information of N vehicle members located in the cabin;
  • the instruction information is obtained according to the audio data collected in the cabin,
  • the lip motion information of the members in the vehicle is obtained when it is judged that the instruction corresponding to the instruction information is a first type instruction, and
  • the lip motion information includes an image sequence of the lip movements of the corresponding vehicle occupant within a target time period, the target time period being the time period corresponding to the instruction in the audio data.
  • Step S1002: Input the instruction information and the lip motion information of the N vehicle members located in the cabin into the target feature matching model to obtain the matching degree between the lip motion information of the vehicle members at each of the N positions and the instruction information;
  • Step S1003: Use the position of the member corresponding to the lip motion information with the highest matching degree as the target position for executing the instruction corresponding to the instruction information. A minimal sketch of this selection step is given after this item.
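  • The following is an illustrative sketch of steps S1001-S1003: score every occupied seat's lip motion against the recognized instruction and take the best-matching seat as the zone in which to execute the command. The feature_matching_model callable, the seat names, and the cabin-control call in the usage comment are assumptions for illustration only.

```python
from typing import Callable, Dict, Sequence

def locate_instruction_speaker(
    instruction_feature: Sequence,
    lip_motion_by_seat: Dict[str, Sequence],   # seat name -> lip image sequence
    feature_matching_model: Callable[[Sequence, Sequence], float],
) -> str:
    """Return the seat whose lip motion best matches the instruction (S1002-S1003)."""
    scores = {
        seat: feature_matching_model(instruction_feature, lip_seq)
        for seat, lip_seq in lip_motion_by_seat.items()
    }
    return max(scores, key=scores.get)

# Hypothetical usage: lower only the speaker volume in the matched zone.
# target_seat = locate_instruction_speaker(instr_feat, lips_by_seat, model)
# cabin_controller.set_speaker_volume(zone=target_seat, delta=-2)   # assumed API
```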
  • Step S1021: Acquire the instruction information and the lip motion information of the vehicle member located at the first position in the vehicle;
  • the instruction information in the above step is obtained according to the audio data collected in the vehicle cabin, and the lip motion information of the vehicle member at the first position is obtained when it is recognized that the instruction corresponding to the instruction information is a second type instruction,
  • where the lip motion information includes an image sequence of the lip movements of the vehicle occupant at the first position within a target time period, and the target time period is the time period corresponding to the instruction in the audio data.
  • Step S1022: Input the instruction information and the lip motion information of the vehicle member at the first position into the target feature matching model to obtain a first matching degree between the lip motion information of the vehicle member at the first position and the instruction information;
  • Step S1023: Determine whether to execute the instruction corresponding to the instruction information according to the first matching degree.
  • The target feature matching model includes a first model, a second model and a third model;
  • inputting the voice information to be recognized and the lip motion information of the N users into the target feature matching model to obtain the matching degrees between the lip motion information of the N users and the voice information to be recognized includes:
  • inputting the voice information to be recognized into the first model to obtain a voice feature, where the voice feature is a K-dimensional voice feature and K is an integer greater than 0; inputting the lip motion information of the N users into the second model to obtain N image sequence features, each of which is a K-dimensional image sequence feature; and
  • inputting the voice feature and the N image sequence features into the third model to obtain the matching degrees between the N image sequence features and the voice feature. A minimal sketch of this three-model structure is given below.
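  • The following is a minimal PyTorch sketch of the three-model structure described above: a first model that maps the instruction audio feature to a K-dimensional voice feature, a second model that maps each occupant's lip image sequence to a K-dimensional image sequence feature, and a third model that scores how well each image sequence feature matches the voice feature. The layer choices, the value of K, and the input shapes are illustrative assumptions, not the networks of this disclosure.

```python
import torch
import torch.nn as nn

K = 128  # shared embedding dimension (the "K-dimensional" features above)

class VoiceModel(nn.Module):
    """'First model': maps an instruction audio feature cube to a K-dim voice feature."""
    def __init__(self, in_dim=15 * 40 * 3, k=K):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, 256),
                                 nn.ReLU(), nn.Linear(256, k))

    def forward(self, x):            # x: (B, 15, 40, 3) audio feature cube
        return self.net(x)           # -> (B, K)

class LipModel(nn.Module):
    """'Second model': maps each occupant's lip image sequence to a K-dim feature."""
    def __init__(self, k=K):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv3d(1, 16, kernel_size=3, padding=1),
                                  nn.ReLU(), nn.AdaptiveAvgPool3d(1))
        self.fc = nn.Linear(16, k)

    def forward(self, x):            # x: (B, N, T, 60, 100) lip image cubes
        b, n = x.shape[:2]
        feats = self.conv(x.reshape(b * n, 1, *x.shape[2:])).flatten(1)
        return self.fc(feats).reshape(b, n, -1)   # -> (B, N, K)

class MatchModel(nn.Module):
    """'Third model': matching degree between the voice feature and each lip feature."""
    def forward(self, voice_feat, lip_feats):     # (B, K), (B, N, K)
        sim = torch.cosine_similarity(voice_feat.unsqueeze(1), lip_feats, dim=-1)
        return (sim + 1) / 2                      # map to [0, 1] -> (B, N)
```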
  • The target feature matching model is a feature matching model obtained by training with the lip motion information of a training user and M pieces of instruction information as inputs, and with the matching degrees between the lip motion information of the training user and the M pieces of instruction information as M labels.
  • The method further includes:
  • determining user information of the target user, where the user information includes one or more of person attribute information, facial expression information corresponding to the voice information to be recognized, and environmental information corresponding to the voice information to be recognized; and generating, based on the user information, a control instruction matching the user information.
  • The method further includes: extracting the lip motion information of the N users from the image data; further, extracting the lip motion information of the N users from the image data includes:
  • identifying N face regions in the image data based on a face recognition algorithm, extracting the lip motion video in each of the N face regions, and determining the lip motion information of the N users based on the lip motion video in each face region. A rough sketch of this extraction step is given below.
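  • The following is a rough sketch of the lip-region extraction step, assuming an OpenCV Haar cascade for face detection and taking the lower third of each detected face as the mouth region; the frame count, the 60x100 crop size, and the naive per-frame face ordering are simplifying assumptions rather than the method of this disclosure.

```python
import cv2
import numpy as np

# Haar cascade shipped with opencv-python; any face recognition algorithm that
# yields face regions, as described above, could be substituted here.
face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def lip_crops(frame_bgr: np.ndarray) -> list:
    """Return one 60x100 grayscale mouth crop per face detected in the frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    crops = []
    for (x, y, w, h) in face_detector.detectMultiScale(gray, 1.3, 5):
        mouth = gray[y + 2 * h // 3: y + h, x: x + w]   # lower third of the face
        crops.append(cv2.resize(mouth, (100, 60)))
    return crops

def lip_motion_cube(frames: list, face_index: int = 0) -> np.ndarray:
    """Stack one face's mouth crops over consecutive frames into a (T, 60, 100)
    cube; assumes the same face is detected, in the same order, in every frame."""
    return np.stack([lip_crops(f)[face_index] for f in frames])
```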
  • The method further includes: extracting the voice information to be recognized from the audio data; further, extracting the voice information to be recognized from the audio data includes:
  • identifying, based on a spectrum recognition algorithm, audio data of different spectra in the audio data, and recognizing the audio data of the target spectrum as the voice information to be recognized. A simplified sketch of this step is given below.
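  • The following is a simplified sketch of the "audio of the target spectrum" step: band-pass the cabin recording to a typical speech band so that low-frequency engine and wind noise are attenuated before instruction recognition. The 300-3400 Hz band, the filter order, and the sample rate are assumptions; a real system would follow this with voice activity detection and speech recognition.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def extract_speech_band(audio: np.ndarray, sample_rate: int = 16000,
                        low_hz: float = 300.0, high_hz: float = 3400.0) -> np.ndarray:
    """Band-pass the recording to the speech band and return the filtered
    waveform used as the voice information to be recognized."""
    sos = butter(4, [low_hz, high_hz], btype="bandpass", fs=sample_rate, output="sos")
    return sosfiltfilt(sos, audio)
```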
  • FIG. 13 is a schematic structural diagram and functional principle diagram of a smart device provided by an embodiment of the present invention.
  • the smart device can be an in-vehicle device, an in-vehicle system, or a smart vehicle.
  • the smart device 40 may include a processor 401, and a microphone 402 and a camera 403 coupled to the processor 401.
  • There are usually multiple microphones 402 and cameras 403, corresponding to the application scenario of FIG. 12, wherein,
  • the microphone 402 is used to collect audio data;
  • a camera 403 is used to collect image data, where the audio data and the image data are collected for the same scene;
  • the processor 401 is configured to: acquire the audio data in the vehicle cabin; when it is recognized that the audio data in the vehicle cabin includes a first type of instruction, acquire image data in the vehicle cabin and extract from it the lip motion information of the occupants at the N positions in the cabin; input the instruction information corresponding to the first type of instruction and the lip motion information of the occupants at the N positions into the target feature matching model to obtain the matching degrees between the lip motion information of the occupants at the N positions and the instruction information; and take the position of the occupant corresponding to the lip motion information with the highest matching degree as the target position for executing the instruction corresponding to the instruction information.
  • the microphone 402 is used to collect audio data
  • a camera 403 is used to collect image data, where the audio data and the image data are collected for the same scene;
  • the processor 401 is configured to: acquire the audio data in the vehicle cabin; when it is recognized that the audio data in the vehicle cabin includes a second type of instruction, acquire first image data from the image data in the vehicle cabin, the first image data being image data that includes the occupant located at the first position in the vehicle; extract the lip motion information of the occupant at the first position from the first image data; input the instruction information corresponding to the second type of instruction and the lip motion information of the occupant at the first position into the target feature matching model to obtain a first matching degree between the lip motion information of the occupant at the first position and the instruction information; and determine, according to the first matching degree, whether to execute the instruction corresponding to the instruction information.
  • the speech information to be recognized includes a sequence of speech waveforms in a target time period; the lip motion information of each of the N users includes an image sequence of the corresponding user's lip motion during the target time period.
  • the processor 401 is specifically configured to: input the voice information to be recognized and the lip motion information of the N users into a target feature matching model to obtain the N users The matching degree between the lip motion information of the user and the voice information to be recognized respectively; the user corresponding to the lip motion information of the user with the highest matching degree is determined as the target user to which the voice information to be recognized belongs.
  • The target feature matching model includes a first model, a second model and a third model; the processor 401 is specifically configured to: input the voice information to be recognized into the first model to obtain a voice feature, where the voice feature is a K-dimensional voice feature and K is an integer greater than 0; input the lip motion information of the N users into the second model to obtain N image sequence features,
  • where each of the N image sequence features is a K-dimensional image sequence feature; and input the voice feature and the N image sequence features into the third model to obtain the matching degrees between the N image sequence features and the voice feature.
  • The target feature matching model is a feature matching model obtained by training with the lip motion information of a training user and M pieces of voice information as inputs, and with the matching degrees between the lip motion information of the training user and the M pieces of voice information as M labels, wherein the M pieces of voice information include voice information that matches the lip motion information of the training user.
  • the processor 401 is further configured to: determine user information of the target user, where the user information includes one or more of person attribute information, facial expression information corresponding to the voice information to be recognized, and environmental information corresponding to the voice information to be recognized; and generate, based on the user information, a control instruction matching the user information.
  • the processor 401 is specifically configured to: identify N face regions in the image data based on a face recognition algorithm, extract the lip motion video in each of the N face regions, and determine the lip motion information of the N users based on the lip motion video in each face region.
  • the processor 401 is specifically configured to: identify, based on a spectrum recognition algorithm, audio data of different spectra in the audio data, and recognize the audio data of the target spectrum as the voice information to be recognized.
  • FIG. 14 is a schematic structural diagram and functional principle diagram of a neural network training apparatus provided by an embodiment of the present invention.
  • the model trained by the neural network training device can be used for in-vehicle equipment, in-vehicle systems, intelligent vehicles, cloud servers, and the like.
  • the training device 60 of the neural network may include an acquisition unit 601 and a training unit 602; wherein,
  • the acquiring unit 601 is configured to acquire training samples, where the training samples include the lip motion information of the training user and M pieces of instruction information; optionally, the M pieces of instruction information include instruction information that matches the lip motion information of the training user and (M-1) pieces of instruction information that do not match the lip motion information of the training user;
  • a training unit 602 is configured to use the lip motion information of the training user and the M pieces of instruction information as training inputs and the matching degrees between the lip motion information of the training user and the M pieces of instruction information as M labels, and to train the initialized neural network to obtain the target feature matching model.
  • The lip motion information of the training user includes a sequence of lip motion images of the training user,
  • and the M pieces of instruction information include instruction information that matches the lip motion image sequence of the training user.
  • The training unit 602 is specifically configured to:
  • input the lip motion information of the training user and the M pieces of instruction information into the initialized neural network to calculate the matching degrees between the M pieces of instruction information and the lip motion information of the training user; and compare the calculated matching degrees with the M labels to train the initialized neural network and obtain the target feature matching model.
  • The target feature matching model includes a first model, a second model and a third model; in this case the training unit 602 is specifically configured to:
  • input the M pieces of instruction information into the first model to obtain M voice features, where each of the M voice features is a K-dimensional voice feature and K is an integer greater than 0; input the lip motion information of the training user into the second model to obtain the image sequence features of the training user, which are K-dimensional image sequence features;
  • input the M voice features and the image sequence features of the training user into the third model to calculate the matching degrees between the M voice features and the image sequence features of the training user; and
  • compare the calculated matching degrees between the M voice features and the image sequence features of the training user with the M labels to train the initialized neural network and obtain the target feature matching model. A toy training sketch illustrating this procedure is given below.
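  • The following is a toy, self-contained training sketch of the procedure above: each sample pairs one training user's lip feature with M candidate instructions, and the M labels mark the matching degrees (1 for the matching instruction, 0 for the others). The tiny linear networks, tensor sizes, and random data are stand-ins for the real feature extractors and training set, not the networks described in this disclosure.

```python
import torch
import torch.nn as nn

K, M, BATCH = 64, 5, 8
voice_net = nn.Linear(40, K)    # stand-in "first model":  audio feature -> K dims
lip_net = nn.Linear(900, K)     # stand-in "second model": lip feature   -> K dims
optimizer = torch.optim.Adam(
    list(voice_net.parameters()) + list(lip_net.parameters()), lr=1e-3)
bce = nn.BCELoss()

for step in range(100):
    audio_feats = torch.randn(BATCH, M, 40)   # M candidate instructions per sample
    lip_feats = torch.randn(BATCH, 900)       # one training user's lip motion feature
    labels = torch.zeros(BATCH, M)            # the M labels: matching degrees
    labels[torch.arange(BATCH), torch.randint(0, M, (BATCH,))] = 1.0

    v = voice_net(audio_feats)                # (B, M, K)
    l = lip_net(lip_feats).unsqueeze(1)       # (B, 1, K)
    # stand-in "third model": cosine similarity mapped to [0, 1] as matching degree
    pred = (torch.cosine_similarity(v, l, dim=-1) + 1) / 2
    loss = bce(pred, labels)                  # compare predictions with the M labels

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```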
  • FIG. 15 is a system structure diagram provided by an embodiment of the present invention, including a schematic structural diagram of a smart device 70 and a server device 80 , and the smart device may be a smart vehicle.
  • the smart device 70 may include a processor 701, and a microphone 702 and a camera 703 coupled to the processor 701; wherein,
  • a camera 703 used for collecting image data
  • the speech information to be recognized includes a speech waveform sequence within a target time period
  • the lip motion information of N users is extracted from the image data, the lip motion information of each of the N users includes an image sequence of the corresponding user's lip motion within the target time period, and N is an integer greater than 1;
  • when applied to an intelligent vehicle or an in-vehicle voice interaction system, the processor 701 is configured to acquire audio data, acquire image data in the vehicle cabin when the audio data includes a target instruction, and extract from the image data the lip motion information of the occupants at N positions in the cabin.
  • the lip motion information of the members in the vehicle can be acquired and sent to the service device, or the collected in-vehicle image information can be sent to the service device, and the lip motion information can be extracted by the service device.
  • the processor 701 is configured to acquire audio data in the vehicle cabin; when it is recognized that the audio data includes a second type of instruction, acquire first image data, where the first image data includes a vehicle located at a first position in the vehicle image data of the occupant; and extracting the lip motion information of the occupant located at the first position in the car from the first image data.
  • FIG. 15 provides a schematic structural diagram of a service device, and the service device may be a server, a cloud server, or the like.
  • the service device 80 may include a processor; optionally, the processor may be composed of a neural network processor and a processor 802 coupled to the neural network processor, or directly composed of a processor; wherein,
  • the neural network processor 801 is configured to:
  • input the instruction information and the lip motion information of the occupants at the N positions into the target feature matching model to obtain the matching degrees between the lip motion information and the instruction information, and take the position of the occupant corresponding to the lip motion information with the highest matching degree as the target position for executing the instruction corresponding to the instruction information.
  • The target feature matching model includes a first model, a second model and a third model; the processor 802 is specifically configured to: input the voice information to be recognized or the instruction information into the first model to obtain a voice feature, where the voice feature is a K-dimensional voice feature and K is an integer greater than 0; input the lip motion information of the N users into the second model to obtain N image sequence features, where each of the N image sequence features is a K-dimensional image sequence feature; and input the voice feature and the N image sequence features into the third model to obtain the matching degrees between the N image sequence features and the voice feature.
  • The target feature matching model is a feature matching model obtained by training with the lip motion information of a training user and M pieces of voice information as inputs, and with the matching degrees between the lip motion information of the training user and the M pieces of voice information as M labels.
  • The server further includes a processor 802; the processor 802 is configured to: determine user information of the target user, where the user information includes one or more of person attribute information, facial expression information corresponding to the voice information to be recognized, and environmental information corresponding to the voice information to be recognized; and generate, based on the user information, a control instruction matching the user information.
  • The server further includes a processor 802; the processor 802 is further configured to: identify N face regions in the image data based on a face recognition algorithm, extract the lip motion video in each of the N face regions, and determine the lip motion information of the N users based on the lip motion video in each face region.
  • The server further includes a processor 802; the processor 802 is further configured to: identify audio data of different spectra in the audio data based on a spectrum recognition algorithm, and recognize the audio data of the target spectrum as the voice information to be recognized.
  • An embodiment of the present invention further provides a computer storage medium, wherein the computer storage medium may store a program, and when the program is executed, the program includes part or all of any of the steps described in the foregoing method embodiments.
  • the embodiments of the present invention also provide a computer program, the computer program includes instructions, when the computer program is executed by the computer, the computer can perform some or all of the steps of any one of the above method embodiments.
  • the disclosed apparatus may be implemented in other manners.
  • the device embodiments described above are only illustrative.
  • The division of the above-mentioned units is only a logical function division;
  • in actual implementation, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical or other forms.
  • the units described above as separate components may or may not be physically separated, and components shown as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
  • If the integrated units are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium.
  • The technical solutions of the present application essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc., or specifically a processor in the computer device) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • The aforementioned storage medium may be any medium that can store program code, for example a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a read-only memory (ROM), or a random access memory (RAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Image Analysis (AREA)
  • Fittings On The Vehicle Exterior For Carrying Loads, And Devices For Holding Or Mounting Articles (AREA)

Abstract

An embodiment of the present invention discloses a voice control method for a vehicle cabin, which can be applied in particular to the field of intelligent cockpits. The voice control method includes: acquiring audio data in the vehicle cabin; when it is recognized that the audio data includes instruction information of a target type, acquiring in-cabin image data corresponding to the time period related to the instruction information in the audio data; acquiring, based on the image data, image data of a vehicle occupant at a specific position in the cabin and extracting therefrom the lip motion information of the occupant at that position; inputting the instruction information and the lip motion information into a target feature matching model to obtain the matching degree between the lip motion information of the occupant at the specific position and the instruction information; and determining, according to the matching degree, whether to execute the instruction corresponding to the instruction information.

Description

一种车舱内语音指令控制方法及相关设备
本申请要求于2020年07月03日提交中国专利局、申请号为202010631879.5、申请名称为“一种车舱内语音指令控制方法及相关设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本发明涉及人机交互技术领域,尤其涉及一种车舱内语音指令控制方法及相关设备。
背景技术
随着无人驾驶技术的发展,车辆的智能性越来越高,车舱内进行智能的语音控制也成为了当前智能座舱的一个主流需求,越来越多的车辆开始具备语音交互功能,能够基于车内成员的语音指令来执行相应的功能。
由于车舱内同时也是一个多人的空间,不同的车内成员可能存在不同的操控需求,因此在执行语音指令时,对于有些特殊指令的执行车内智能设备需要获取指令发出者所在的具体位置,从而确定如何指令相应的指令。在车舱内的多人场景中,车辆有些指令需要针对车舱的具体位置进行操作,如调整某个具体出风口的风量,或者调整某个具体扬声器的音量,这个时候当接收到一个调小出风口风量的语音指令时,车辆智能设备如果不能识别出当前指令发出者所在的具体位置往往无法基于语音进行车舱内的精细化控制调整;有些车辆可以通过语音进行车辆的行驶控制,如进行自动驾驶,如自动泊车,但是当车内存在多个成员时,如车内存在儿童,此时如何判断何时可以执行驾驶相关的语音指令,何时不能执行,确认发出指令者的身份和权限,防止车辆的误操控也是自动驾驶领域车舱内人机交互控制需要解决的问题之一。
发明内容
本发明实施例提供一种语音指令控制方法及相关设备,以提升车内多人场景下的语音指令的执行准确度,降低误操作或者误识别的情况。
第一方面,本发明实施例提供了一种车舱内语音控制方法,包括:获取第一类型指令和位于车舱内N个位置上的车内成员在目标时间段内的唇部运动信息,所述第一类型指令为根据车舱内采集的目标音频数据获取,所述车内成员的唇部运动信息为当从所述目标音频数据中识别出所述第一类型指令时获取,所述目标时间段为所述第一类型指令在所述音频数据中对应的时间段;将所述第一类型指令和所述车舱内N个位置上的车内成员的唇部运动信息进行匹配,根据所述N个位置上的车内成员的唇部运动信息与所述第一类型指令之间的匹配结果获取目标位置,所述目标位置为所述匹配结果指示唇部运动信息与所述第一类型指令匹配的车内成员所处的位置;发送指示针对目标位置执行所述第一类指令的指示信息。
本发明实施例,可应用于车舱内的精细化的语音指令控制,当识别到语音指令为需要进一步进行位置确定后才能执行的指令,如车舱内装置的操作指令,如播放视频,扬声器调节,空调调节,座椅条件等需要确定指令具体针对的位置进行针对性进行局部操作的指令,通过获取车内各个位置上成员的唇部运动信息和所识别出的具体指令信息来判断是哪个位置的成员发出的指令,从而能够有针对性的针对特定位置区域进行操作控制。上述语音控制方法因为涉及视频数据的处理和分析其局部特征与指令信息的匹配的程度,因此可以发生在本地,即车辆或者车辆上的智能设备来执行上述方法,也可以是在云端执行上述视频数据的处理和匹配动作。
本发明实施例的方法可适用于任意人数的车辆场景,尤其适用于当车内有多个成员,且同时有多人在说话的场景,此时通过唇部运动信息与指令信息的匹配结合车内位置分布信息能够精准的定位发出语音指令的成员的位置信息,进而判断出指令执行所针对的位置区域。本发明实施例所涉及的方法中提到的N不一定代表车内的全部成员,可以是车内的全部成员,也可以是部分成员。
在一种可能的实现方式中,所述获取第一类型指令和位于车舱内N个位置上的车内成员的唇部运动信息,具体为:获取车舱内所述目标音频数据;当识别出所述目标音频数据中包括第一类型指令,获取车舱内图像数据;从所述车舱内图像数据中提取车舱内N个位置上的车内成员的唇部运动信息。本发明的实施例中提到的从车舱内的图像数据中提取所述车内N个位置上成员的唇部运动信息,具体可以是基于人脸识别算法,识别所述视频数据中的所述N个人脸区域,提取所述N个人脸区域中每个人脸区域中的唇部运动视频或者抽样提取视频帧序列;基于所述每个人脸区域中的唇部运动视频或者视频帧序列确定所述N个成员的唇部运动信息。所述图像数据通常是车内的一个或者多个摄像头采集获得,所述摄像头的类型可以有多种。
同时由于车舱内的环境通常会存在多个摄像头,有的摄像头可以通过变换拍摄角度获取不同视角的视屏数据,因此本发明实施例中提到的图像数据可以是一个摄像头从一个角度获取的图像数据,而由于从一些角度去拍摄时,车内不同位置上的成员之间可能存在遮挡的情况,因此本发明实施例中的图像数据也可以是来自同一摄像头的不同视角的图像数据,或者是来自多个摄像头的图像数据,以及其他上述组合情况。
在一种可能的实现方式中,所述从所述车舱内图像数据中提取车舱内N个位置上的车内成员的唇部运动信息,具体为:当识别车内成员大于1时,从所述车舱内图像数据中提取车舱内N个位置上的车内成员的唇部运动信息。本发明实施例在具体实现的过程中,为了避免计算资源的浪费,可以在识别第一指令类型后,根据获取的车内图像数据判断车舱内的人数,当车舱内仅有一人时,不用提取唇部运动信息,仅在人数多于1人时对有人员乘坐的位置上的人员的唇部运动信息的提取。
在一种可能的实现方式中,所述将所述第一类型指令和所述车舱内N个位置上的车内成员的唇部运动信息进行匹配,根据所述N个位置上的车内成员的唇部运动信息与所述第一类型指令之间的匹配结果获取目标位置,具体为:根据所述第一类型指令和所述位于车舱内N个车内成员的唇部运动信息,得到每个所述N个位置上的车内成员的唇部运动信息分别与所述指令信息之间的匹配度,N为大于1的整数;将匹配度最高的唇部运动信息对应的车内成员所处的位置作为目标位置。
其中匹配度的获取有多种方式,通常可以通过目标匹配模型进行获取,所述目标特征匹配模型为以训练用户的唇部运动信息以及一个或者多个语音信息(所述语音信息可以是语音波形序列也可以是语音所对应的文本信息)为输入、以所述训练用户的唇部运动信息分别与所述M个语音信息之间的匹配度为M个标签,训练得到的特征匹配模型。
本发明实施例中的训练模型的方式,根据训练的样本数据的不同,模型的推理方式也会有区别,当采用多组唇部运动信息进行输入时,如5个唇部运动信息,则将5个唇部运动信息分别和M个语音信息中的每一个,构建成M组样本作为输入,5个唇部运动信息分别和样本内语音信息匹配度为输出,标签信息为训练是的目标输出结果。由此训练出的模型在进行推理时也是以5个唇部运动信息和目标指令信息为输入,当车内人员不足5人时,可以将空置的位置的唇部信息用默认值作为输入,如全0序列,输出与指令信息之间的匹配度。也可以以单个唇部运动信息和语音信息为一组样本,匹配标签为目标训练结果,进行训练,这样训练得到的模型在进行推理时,需要以每个车内成员的唇部运动指令和指令信息作为输入,分别获取多个位置成员的匹配度。
在一种可能的实现方式中,所述第一类型指令为从所述音频数据中提取的语音波形序列或者根据所述音频数据识别出的文本指令信息。
在一种可能的实现方式中,所述车舱内N个位置上的车内成员的唇部运动信息为所述车舱内N个位置上的车内成员在所述目标时间段内的唇部运动的图像序列。
在本发明的实施例通过获取的语音指令和各个位置车内成员的唇部信息的匹配度来获知指令相关的语音是哪个位置上的成员发出的,指令信息和成员的唇部运动信息之间的匹配度通过匹配模型获知,取决于匹配模型的不同,需要获取的指令信息也不相同。根据模型训练的需要可以针对唇部运动的视频提取不同的唇部运动信息,例如可以是在所述目标时间段内的唇部运动的图像序列或者是表示上下唇距离的时序变化情况的向量参数。所述第一类型指令同样也可以有多种形式,可以是语音波形序列,也可以是指令对应的文本信息。
在一种可能的实现方式中,所述车舱内音频数据为根据车舱内指定位置区域的麦克风所采集的音频数据获得。
在一种可能的实现方式中,所述车舱内音频数据基于从车舱内多个麦克风采集的音频数据中选择的目标音频数据获得。
车舱内通常会存在多个麦克风,设置于车舱内的不同位置,因此车舱内音频数据的采集,为了获取最佳音频效果,本发明实施例中提到的车舱内的音频数据可以是多个音频数据进行综合处理之后获得的;也可以是综合比较车舱内多个麦克风采集的音频数据之后,根据预设规则选择出的参数最优的,即收录到的语音质量最优的音频数据;或者是指令位置的麦克风所采集的音频数据,如是设置在车内中央位置区域的麦克风所采集的音频数据。
在一种可能的实现方式中,所述目标特征匹配模型包括第一模型、第二模型和第三模型;所述将所述指令信息和所述车舱内N个位置上的车内成员的唇部运动信息输入到目标特征匹配模型中,得到所述车舱内N个位置上的车内成员的唇部运动信息分别与所述指令信息之间的匹配度,包括:将所述指令信息输入到所述第一模型中,得 到语音特征,所述语音特征为K维语音特征,K为大于0的整数;将所述车舱内N个位置上的车内成员的唇部运动信息输入到所述第二模型中,得到N个图像序列特征,所述N个图像序列特征中的每一个图像序列特征均为K维图像序列特征;将所述语音特征和所述N个图像序列特征输入到第三模型中,得到所述N个图像序列特征分别与所述语音特征之间的匹配度。
在一种可能的实现方式中,所述目标特征匹配模型包括第一模型、第二模型;所述将所述指令信息和所述车舱内N个位置上的车内成员的唇部运动信息输入到目标特征匹配模型中,得到所述车舱内N个位置上的车内成员的唇部运动信息分别与所述指令信息之间的匹配度,包括:将所述音频数据输入到所述第一模型中,得到对应的指令信息,将所述车舱内N个位置上的车内成员的唇部运动信息输入到所述第二模型中,所述每个车内成员的唇部运动信息对应一组图像序列特征,将所述N个图像序列特征同时或者分别输入到第二模型中,获得每个唇部运动信息对应的指令信息;基于两个模型的识别结果,判断出发出指令的目标位置成员。
在本发明实施例的模型的其中一种实现方式第一模型输出的指令信息可以是指令对应的标识及匹配度,模型选择输出匹配度最高的指令对应的指令标识,所述标识可以是指令编码,也可以是指令对应的文本特征。上述基于第一模型和第二模型的输出结果进行判断,其中判断规则可以有多种,如识别出与第一模型输出的指令信息相同的位置成员,若识别的有多个位置上的成员都与第一模型识别的指令信息相同,则比较第二模型输出的匹配度,选择匹配度高的目标位置执行所述指令,或者选择第二模型中识别的匹配度最高的指令信息和第一模型的指令信息进行比较,如果指令信息相同,则确定第二模型中国匹配度最高的指令信息对应的位置为目标位置。
在一种可能的实现方式中,生成所述N个位置上的车内成员的唇部运动信息与所述N个位置的对应关系;所述将匹配度最高的唇部运动信息对应的车内成员所处的位置作为目标位置,包括:获取匹配度最高的目标唇部运动信息;根据所述所述N个位置上的车内成员的唇部运动信息与所述N个位置的对应关系将所述目标唇部运动信息所对应的位置确定为目标位置。
所述生成N个位置上的车内成员的唇部运动信息与所述N个位置的对应关系,可以是通过车内的图像采集来获取各个位置上的成员关系,然后将各个成员的唇部运动信息和位置关系对应起来,所述图像数据可以是来自于唇部信息提取的相同图像也可以是独立的采集过程。
在一种可能的实现方式中,生成所述N个位置上的车内成员的唇部运动信息与所述N个位置上的车内成员的身份的对应关系;将匹配度最高的唇部运动信息对应的车内成员所处的位置作为目标位置,包括:获取匹配度最高的目标唇部运动信息;根据所述所述N个位置上的车内成员的唇部运动信息与所述N个位置上的车内成员的身份的对应关系确定目标车内成员;将所述目标车内成员的位置信息确定为目标位置,所述目标车内成员的位置信息为根据车内的传感器数据确定。
本发明实施例中所提到的第一类型指令可以为车舱内操控指令,主要适用于对于车舱内需要识别位置信息以用于确定执行指令所针对的目标区域的指令交互场景。由于有时候车内用户在执行语音指令时会给出明确的位置信息,如关闭右后座的车窗, 那么在一种可能的实现方式中这一类有明确的位置信息的指令可以被认为不属于第一类型指令,第一类型指令可以是需要区分位置区域来执行但是指令中不包含位置区域信息的指令。
第二方面,本发明实施例还提供了一种车舱内语音控制方法,包括,获取第一类型指令和位于车内第一位置的车内成员的唇部运动信息,所述第一类型指令为根据车舱内采集的目标音频数据获取,所述第一位置的车内成员的唇部运动信息为当当从所述目标音频数据中识别出所述第二类型指令时获取,所述目标时间段为所述第二类型指令在所述音频数据中对应的时间段;将所述第二类型指令和所述第一位置的车内成员的唇部运动信息进行匹配,得到匹配结果;当根据所述匹配结果确定所述第二类型指令和所述第一位置的车内成员的唇部运动信息为匹配,发送指示执行所述第二类指令的指示信息。
本发明实施例,可应用于车舱内的精细化的语音指令控制,当识别到语音指令为需要确定指令发出者的身份或者权限后才能执行的指令,如车辆的行驶操控指令,或者其他涉及到隐私信息操作的指令,如切换驾驶模式,控制行驶方向,查看历史通话数据等,通常基于车辆的特殊环境,我们往往会默认驾驶位的成员具备最高权限,因此当车辆在获取到相关需要确认发出者身份的指令时,通过获取特定位置上(如驾驶位)成员的唇部运动信息和所识别出的具体指令信息来判断是否是驾驶位成员发出的指令,从而能够有针对性的针对特定位置区域进行操作控制。上述语音控制方法因为涉及视频数据的处理和分析其局部特征与指令信息的匹配的程度,因此可以发生在本地,即车辆或者车辆上的智能设备来执行上述方法,也可以是在云端执行上述视频数据的处理和匹配动作。除了驾驶位的成员的唇部运动信息的采集,也可以是预设或人为设定一个或多个位置上的成员具备某类指令的控制权限,当识别出此类指令时,则获取相应位置的人员的唇部运动信息进行匹配分析。
在一种可能的实现方式中,还可以是通过人脸识别技术识别当前车内环境下特定用户(如车主)所在的位置,将特定用户所在的位置作为第一位置。通常默认为获取驾驶位的成员的唇部运动信息。
本发明实施例的方法可适用于任一人数的车辆场景,尤其适用于当车内有多个成员,且同时有多人在说话的场景,此时通过唇部运动信息与指令信息的匹配结合车内位置分布信息能够精准的判断是否是特定位置上的成员发出的语音指令,进而确定是否执行所述指令。所述匹配方式可以是获取车内多个成员(包括第一位置成员)进行匹配,判段是否第一位置的成员匹配度最高,也可以是仅获取第一位置的成员进行匹配,当匹配度达到阈值时确定为匹配,指令可执行。
在涉及对第二类型指令进行判断指令的实施例中,通常需要对特定位置进行判断的情况是,针对仅有特定位置的成员才具备进行此类指令的执行权限的场景,例如第二类型指令通常是车辆操控类的指令,此类指令为了防止误操作通常会设置为只有特定位置如驾驶位的成员才具备通过语音进行车辆行驶操控的能力。通常的实现方式是获取驾驶位的用户的唇部运动信息和第二指令信息进行匹配,当结构为匹配时判断是驾驶位的用户发出的第二类型指令,从而执行第二类型指令。因为通常是默认驾驶位 的用户具有对车辆的操控权限。也可以人工设置其他位置的乘客具备语音操控权限,那么第一位置还是是其他车内的位置。
也可以根据需要设置某些指令也只有特定位置的成员才能进行,那么这类指令的执行规则也可以参照第二类型指令的执行方式进行判断是否执行。
在本发明实施例的中,所述获取第二类型指令和位于车内第一位置的车内成员的唇部运动信息,具体可以为:获取车舱内所述目标音频数据;当识别出所述目标音频数据中包括第二类型指令,获取车舱内图像数据;从所述车舱内图像数据中提取第一位置的车内成员的唇部运动信息。
上述方案的实现过程中,所述将所述第二类型指令和所述第一位置的车内成员的唇部运动信息进行匹配,得到匹配结果,具体为:根据所述第二类型指令和所述第一位置的车内成员的唇部运动信息的匹配度和预设阈值确定所述第二类型指令和所述第一位置的车内成员的唇部运动信息的匹配结果。
本发明实施例中的所述第二类型指令可以为从所述音频数据中提取的语音波形序列或者根据所述音频数据识别出的文本指令信息。车内成员的唇部运动信息为车内成员在所述目标时间段内的唇部运动的图像序列。
在实现过程中,本发明实施例还包括,当所述音频数据中包括所述第二类型指令,获取车内其他N个位置车内成员的图像数据;从所述车内其他N个位置车内成员的图像数据的所述目标时间段内提取所述车内其他N个位置车内成员的唇部运动信息;所述将所述第二类型指令和所述位于车内第一位置的车内成员的唇部运动信息进行匹配,得到匹配结果,具体为:将所述第二类型指令与所述位于车内第一位置的车内成员的唇部运动信息和所述N个位置车内成员的唇部运动信息进行匹配,得到N+1个车内成员的唇部运动信息分别与所述第二类型指令之间的匹配度,获取匹配度最高的唇部运动信息;所述当根据所述匹配结果确定所述第二类型指令和所述位于车内第一位置的车内成员的唇部运动信息为匹配,发送指示执行所述第二类指令的指示信息,具体为:当所述匹配度最高的唇部运动信息为所述位于车内第一位置的车内成员的唇部运动信息,则发送指示执行所述第二类指令的指示信息。
本发明实施例中从所述车内位置上的成员的视频数据中提取车内位置上成员的唇部运动信息,具体方法为基于人脸识别算法,识别所述视频数据中的多个人脸区域,提取多个人脸区域中每个人脸区域中的唇部运动视频;基于所述每个人脸区域中的唇部运动视频确定每个人脸对应的唇部运动信息。
在一种可能的实现方式中,所述车舱内音频数据为根据车舱内多个麦克风采集的数据获得,或者所述车舱内音频数据为根据车舱内指定位置区域的麦克风所采集的音频数据获得。
第三方面,本发明实施例提供了一种语音指令控制设备,包括处理器;所述处理器用于:获取第一类型指令和位于车舱内N个位置上的车内成员在目标时间段内的唇部运动信息,所述第一类型指令为根据车舱内采集的目标音频数据获取,所述车内成员的唇部运动信息为当从所述目标音频数据中识别出所述第一类型指令时获取,所述目标时间段为所述第一类型指令在所述音频数据中对应的时间段;将所述第一类型指 令和所述车舱内N个位置上的车内成员的唇部运动信息进行匹配,根据所述N个位置上的车内成员的唇部运动信息与所述第一类型指令之间的匹配结果获取目标位置,所述目标位置为所述匹配结果指示唇部运动信息与所述第一类型指令匹配的车内成员所处的位置;发送指示针对目标位置执行所述第一类指令的指示信息。
本发明实施例,可应用于车舱内的精细化的语音指令控制,当识别到语音指令为需要进一步进行位置确定后才能执行的指令,如车舱内装置的操作指令,如播放视频,扬声器调节,空调调节,座椅条件等需要确定指令具体针对的位置进行针对性进行局部操作的指令,通过获取车内各个位置上成员的唇部运动信息和所识别出的具体指令信息来判断是哪个位置的成员发出的指令,从而能够有针对性的针对特定位置区域进行操作控制。上述语音控制设备需要进行视频数据的处理和分析其局部特征与指令信息的匹配的程度,因此上述设备可以是本地设备,如可以是智能车载设备,或者车载处理器芯片,也可以是包括了麦克风和摄像头的车载系统,或者是智能车辆,同时根据方案不同的实现方式也可以是在云端服务器,获取车载摄像头以及车内扬声器的数据后执行上述视频数据的处理和匹配动作。
在一种可能的实现方式中,所述处理器用于,获取车舱内所述目标音频数据;当识别出所述目标音频数据中包括第一类型指令,获取车舱内图像数据;从所述车舱内图像数据中提取车舱内N个位置上的车内成员的唇部运动信息。本发明的实施例中处理器从车舱内的图像数据中提取所述车内N个位置上成员的唇部运动信息,具体可以是处理器基于人脸识别算法,识别所述视频数据中的所述N个人脸区域,提取所述N个人脸区域中每个人脸区域中的唇部运动视频或者抽样提取视频帧序列;基于所述每个人脸区域中的唇部运动视频或者视频帧序列确定所述N个成员的唇部运动信息。所述图像数据通常是车内的一个或者多个摄像头采集获得,所述摄像头的类型可以有多种。
在一种可能的实现方式中,所述处理器还用于,当识别车内成员大于1时,从所述车舱内图像数据中提取车舱内N个位置上的车内成员的唇部运动信息。本发明实施例在具体实现的过程中,为了避免计算资源的浪费,处理器在识别第一指令类型后,会根据获取的车内图像数据判断车舱内的人数,当车舱内仅有一人时,不用提取唇部运动信息,仅在人数多于1人时对有人员乘坐的位置上的人员的唇部运动信息的提取。
在一种可能的实现方式中,所述处理器将所述第一类型指令和所述车舱内N个位置上的车内成员的唇部运动信息进行匹配,根据所述N个位置上的车内成员的唇部运动信息与所述第一类型指令之间的匹配结果获取目标位置,具体为:所述处理器根据所述第一类型指令和所述位于车舱内N个车内成员的唇部运动信息,得到每个所述N个位置上的车内成员的唇部运动信息分别与所述指令信息之间的匹配度,N为大于1的整数;将匹配度最高的唇部运动信息对应的车内成员所处的位置作为目标位置。
其中处理器关于匹配度的获取可以通过不同的目标匹配模型来获取,所述目标特征匹配模型为以训练用户的唇部运动信息以及一个或者多个语音信息(所述语音信息可以是语音波形序列也可以是语音所对应的文本信息)为输入、以所述训练用户的唇部运动信息分别与所述M个语音信息之间的匹配度为M个标签,训练得到的特征匹配模型,模型的训练通常是在云侧进行,并且与模型的使用是独立开来的,即在一个 设备上进行模型的训练,训练完成之后发给需要进行指令匹配的设备来运行使用。
本发明实施例中的训练模型的方式,根据训练的样本数据的不同,模型的推理方式也会有区别,当采用多组唇部运动信息进行输入时,如5个唇部运动信息,则将5个唇部运动信息分别和M个语音信息中的每一个,构建成M组样本作为输入,5个唇部运动信息分别和样本内语音信息匹配度为输出,标签信息为训练是的目标输出结果。由此训练出的模型在进行推理时也是以5个唇部运动信息和目标指令信息为输入,当车内人员不足5人时,可以将空置的位置的唇部信息用默认值作为输入,如全0序列,输出与指令信息之间的匹配度。也可以以单个唇部运动信息和语音信息为一组样本,匹配标签为目标训练结果,进行训练,这样训练得到的模型在进行推理时,需要以每个车内成员的唇部运动指令和指令信息作为输入,分别获取多个位置成员的匹配度。
在一种可能的实现方式中,所述处理器还用于生成所述N个位置上的车内成员的唇部运动信息与所述N个位置的对应关系;所述处理器将匹配度最高的唇部运动信息对应的车内成员所处的位置作为目标位置,包括:所述处理器获取匹配度最高的目标唇部运动信息;根据所述所述N个位置上的车内成员的唇部运动信息与所述N个位置的对应关系将所述目标唇部运动信息所对应的位置确定为目标位置。
处理器对于车内成员的唇部运动信息与位置的对应关系,可以是通过车内采集的图像来识别各个位置上的成员关系,然后将各个成员的唇部运动信息和位置关系对应起来,所述图像数据可以是来自于唇部信息提取的相同图像也可以是独立的采集过程。
在一种可能的实现方式中,所述处理器生成所述N个位置上的车内成员的唇部运动信息与所述N个位置上的车内成员的身份的对应关系;所述处理器将匹配度最高的唇部运动信息对应的车内成员所处的位置作为目标位置,包括:所述处理器获取匹配度最高的目标唇部运动信息;根据所述所述N个位置上的车内成员的唇部运动信息与所述N个位置上的车内成员的身份的对应关系确定目标车内成员;将所述目标车内成员的位置信息确定为目标位置,所述目标车内成员的位置信息为根据车内的传感器数据确定。
本发明实施例中所提到的第一类型指令可以为车舱内操控指令,主要适用于对于车舱内需要识别位置信息以用于确定执行指令所针对的目标区域的指令交互场景。由于有时候车内用户在执行语音指令时会给出明确的位置信息,如关闭右后座的车窗,那么这种情况下处理器可以直接识别出所述指令所针对的目标位置,因此在一种可能的实现方式中当处理器识别出获取的指令为这一类有明确的位置信息的指令,处理器会判断此指令不属于第一类型指令,第一类型指令可以是需要区分位置区域来执行但是指令中不包含位置区域信息的指令。
第四方面,本发明实施例提供了一种语音指令控制设备,包括处理器;所述处理器用于:获取第一类型指令和位于车内第一位置的车内成员的唇部运动信息,所述第一类型指令为根据车舱内采集的目标音频数据获取,所述第一位置的车内成员的唇部运动信息为当当从所述目标音频数据中识别出所述第二类型指令时获取,所述目标时间段为所述第二类型指令在所述音频数据中对应的时间段;将所述第二类型指令和所述第一位置的车内成员的唇部运动信息进行匹配,得到匹配结果;当根据所述匹配结 果确定所述第二类型指令和所述第一位置的车内成员的唇部运动信息为匹配,发送指示执行所述第二类指令的指示信息。
本发明实施例,可应用于车舱内的精细化的语音指令控制,当识别到语音指令为需要确定指令发出者的身份或者权限后才能执行的指令,如车辆的行驶操控指令,或者其他涉及到隐私信息操作的指令,如切换驾驶模式,控制行驶方向,查看历史通话数据等,通常基于车辆的特殊环境,我们往往会默认驾驶位的成员具备最高权限,因此当车辆在获取到相关需要确认发出者身份的指令时,通过获取特定位置上(如驾驶位)成员的唇部运动信息和所识别出的具体指令信息来判断是否是驾驶位成员发出的指令,从而能够有针对性的针对特定位置区域进行操作控制。上述语音控制设备因为涉及视频数据的处理和分析其局部特征与指令信息的匹配的程度,因此可以是本地设备,如车载智能设备,车载智能芯片,包括摄像头和麦克风的车载系统或者是智能车辆,也可以是云侧的云端处理器。处理器除了对驾驶位的成员的唇部运动信息的采集,也可以根据系统预设或人为设定一个或多个位置上的成员来进行针对性的采集,当识别出此类指令时,则获取相应位置的人员的唇部运动信息进行匹配分析。
在一种可能的实现方式中,第一位置为驾驶位。
在一种可能的实现方式中,第二类型指令为车辆行驶操控指令。
本发明实施例的设备可用于任一人数的车辆场景,尤其适用于当车内有多个成员,且同时有多人在说话的场景,进行指令的识别和执行判断,此时处理器通过唇部运动信息与指令信息的匹配情况结合获取车内位置分布信息能够精准的判断是否是特定位置上的成员发出的语音指令,进而确定是否执行所述指令。所述匹配方式可以是获取车内多个成员(包括第一位置成员)进行匹配,判段是否第一位置的成员匹配度最高,也可以是仅获取第一位置的成员进行匹配,当匹配度达到阈值时确定为匹配,指令可执行。
在涉及对第二类型指令进行判断指令的实施例中,处理器需要对特定位置进行判断的情况是,针对仅有特定位置的成员才具备进行此类指令的执行权限的场景,例如第二类型指令通常是车辆操控类的指令,此类指令为了防止误操作通常会设置为只有特定位置如驾驶位的成员才具备通过语音进行车辆行驶操控的能力。通常的实现方式是处理器获取驾驶位的用户的唇部运动信息和第二指令信息进行匹配,当结构为匹配时判断是驾驶位的用户发出的第二类型指令,从而执行第二类型指令。因为通常是默认驾驶位的用户具有对车辆的操控权限。也可以人工设置其他位置的乘客具备语音操控权限,那么第一位置还是是其他车内的位置。
在一种可能的实现方式中,所述处理器获取第二类型指令和位于车内第一位置的车内成员的唇部运动信息,具体为:所述处理器获取车舱内所述目标音频数据;当识别出所述目标音频数据中包括第二类型指令,获取车舱内图像数据;从所述车舱内图像数据中提取第一位置的车内成员的唇部运动信息。
上述方案的实现过程中,处理器将所述第二类型指令和所述第一位置的车内成员的唇部运动信息进行匹配,得到匹配结果,具体为:处理器根据所述第二类型指令和所述第一位置的车内成员的唇部运动信息的匹配度和预设阈值确定所述第二类型指令和所述第一位置的车内成员的唇部运动信息的匹配结果。
本发明实施例中的所述第二类型指令可以为从所述音频数据中提取的语音波形序列或者根据所述音频数据识别出的文本指令信息。车内成员的唇部运动信息为车内成员在所述目标时间段内的唇部运动的图像序列。
在一种可能的实现方式中,所述处理器还用于,当所述音频数据中包括所述第二类型指令,获取车内其他N个位置车内成员的图像数据;从所述车内其他N个位置车内成员的图像数据的所述目标时间段内提取所述车内其他N个位置车内成员的唇部运动信息;所述处理器将所述第二类型指令和所述位于车内第一位置的车内成员的唇部运动信息进行匹配,得到匹配结果,具体为所述处理器将所述第二类型指令与所述位于车内第一位置的车内成员的唇部运动信息和所述N个位置车内成员的唇部运动信息进行匹配,得到N+1个车内成员的唇部运动信息分别与所述第二类型指令之间的匹配度,获取匹配度最高的唇部运动信息;所述处理器当根据所述匹配结果确定所述第二类型指令和所述位于车内第一位置的车内成员的唇部运动信息为匹配,发送指示执行所述第二类指令的指示信息,具体为:所述处理器当所述匹配度最高的唇部运动信息为所述位于车内第一位置的车内成员的唇部运动信息,则发送指示执行所述第二类指令的指示信息。
本发明实施例中从所述车内位置上的成员的视频数据中提取车内位置上成员的唇部运动信息,具体方法为基于人脸识别算法,识别所述视频数据中的多个人脸区域,提取多个人脸区域中每个人脸区域中的唇部运动视频;基于所述每个人脸区域中的唇部运动视频确定每个人脸对应的唇部运动信息。
第五方面,本发明实施例提供了一种芯片系统,所述芯片系统包括至少一个处理器,存储器和接口电路,所述存储器、所述接口电路和所述至少一个处理器通过线路互联,所述至少一个存储器中存储有指令;所述指令被所述处理器执行用于实现第一、二方面中的任意一种的方法。
第六方面,本发明实施例提供了一种计算机可读存储介质,所述计算机可读介质用于存储程序代码,所述程序代码包括用于执行第一、二方面中的任意一种的方法。
第七方面,本发明实施例提供了一种计算机程序,所述计算机程序包括指令,当所述计算机程序被执行执行用于实现第一、二方面中的任意一种的方法。
附图说明
为了更清楚地说明本申请实施例或背景技术中的技术方案,下面将对本申请实施例或背景技术中所需要使用的附图进行说明。
图1为本发明实施例提供的一种车内多人交互的场景示意图。
图2为本发明实施例提供的一种车内多人交互的场景示意图。
图3为本发明实施例提供了一种系统架构100。
图4为本发明实施例提供的一种卷积神经网络示意图。
图5是本发明实施例提供的一种神经网络的训练方法的流程示意图。
图6为本发明实施例提供的一种声音波形示例图。
图7A为本发明实施例提供的一种语音指令匹配方法。
图7B为本发明实施例提供的一种语音指令匹配方法。
图8为本发明实施例提供的一种云端交互场景示意图。
图9为本发明实施例的一种方法流程图。
图10为本发明实施例的一种方法流程图。
图11为本发明实施例的一种方法流程图。
图12为本发明实施例的一种方法流程图。
图13是本发明实施例提供的一种指令控制装置的结构示意图。
图14是本发明实施例提供的一种神经网络的训练装置的结构示意图。
图15是本发明实施例提供的另一种指令控制系统。
具体实施方式
下面将结合本发明实施例中的附图,对本发明实施例进行描述。
本申请的说明书和权利要求书及所述附图中的术语“第一”、“第二”、“第三”和“第四”等是用于区别不同对象,而不是用于描述特定顺序。此外,术语“包括”和“具有”以及它们任何变形,意图在于覆盖不排他的包含。例如包含了一系列步骤或单元的过程、方法、系统、产品或设备没有限定于已列出的步骤或单元,而是可选地还包括没有列出的步骤或单元,或可选地还包括对于这些过程、方法、产品或设备固有的其它步骤或单元。
在本文中提及“实施例”意味着,结合实施例描述的特定特征、结构或特性可以包含在本申请的至少一个实施例中。在说明书中的各个位置出现该短语并不一定均是指相同的实施例,也不是与其它实施例互斥的独立的或备选的实施例。本领域技术人员显式地和隐式地理解的是,本文所描述的实施例可以与其它实施例相结合。
在本说明书中使用的术语“部件”、“模块”、“系统”等用于表示计算机相关的实体、硬件、固件、硬件和软件的组合、软件、或执行中的软件。例如,部件可以是但不限于,在处理器上运行的进程、处理器、对象、可执行文件、执行线程、程序和/或计算机。通过图示,在计算设备上运行的应用和计算设备都可以是部件。一个或多个部件可驻留在进程和/或执行线程中,部件可位于一个计算机上和/或分布在2个或更多个计算机之间。此外,这些部件可从在上面存储有各种数据结构的各种计算机可读介质执行。部件可例如根据具有一个或多个数据分组(例如来自与本地系统、分布式系统和/或网络间的另一部件交互的二个部件的数据,例如通过信号与其它系统交互的互联网)的信号通过本地和/或远程进程来通信。
首先,对本申请中的部分用语进行解释说明,以便于本领域技术人员理解。
(1)位图(Bitmap):又称栅格图(Raster graphics)或点阵图,是使用像素阵列(Pixel-array/Dot-matrix点阵)来表示的图像。根据位深度,可将位图分为1、4、8、16、24及32位图像等。每个像素使用的信息位数越多,可用的颜色就越多,颜色表现就越逼真,相应的数据量越大。例如,位深度为1的像素位图只有两个可能的值(黑色和白色),所以又称为二值位图。位深度为8的图像有28(即256)个可能的值。位深度为8的灰度模式图像有256个可能的灰色值。RGB图像由三个颜色通道组成。8位/通道的RGB图像中的每个通道有256个可能的值,这意味着该图像有1600 万个以上可能的颜色值。有时将带有8位/通道(bpc)的RGB图像称作24位图像(8位x 3通道=24位数据/像素)。[2]通常将使用24位RGB组合数据位表示的的位图称为真彩色位图。
(2)语音识别技术(AutomaticSpeech Recognition,ASR),也被称为自动语音识别,其目标是将人类的语音中的词汇内容转换为计算机可读的输入,例如按键、二进制编码或者字符序列。
(3)声纹(Voiceprint),是用电声学仪器显示的携带言语信息的声波频谱,是由波长、频率以及强度等百余种特征维度组成的生物特征。声纹识别是通过对一种或多种语音信号的特征分析来达到对未知声音辨别的目的,简单的说就是辨别某一句话是否是某一个人说的技术。通过声纹可以确定出说话人的身份,从而进行有针对性的回答。
(4)梅尔频率倒谱系数(Mel Frequency Cepstrum Coefficient,MFCC)在声音处理领域中,梅尔频率倒谱(Mel-Frequency Cepstrum)是基于声音频率的非线性梅尔刻度(mel scale)的对数能量频谱的线性变换。梅尔频率倒谱系数(MFCC)广泛被应用于语音识别的功能。
(5)多路交叉熵(Multi-way cross-Entropy Loss)交叉熵描述了两个概率分布之间的距离,当交叉熵越小说明二者之间越接近。
(6)神经网络
神经网络可以是由神经单元组成的,神经单元可以是指以x s和截距1为输入的运算单元,该运算单元的输出可以为:
Figure PCTCN2021091138-appb-000001
其中,s=1、2、……n,n为大于1的自然数,Ws为xs的权重,b为神经单元的偏置。f为神经单元的激活函数(activation functions),用于将非线性特性引入神经网络中,来将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层卷积层的输入。激活函数可以是sigmoid函数。神经网络是将许多个上述单一的神经单元联结在一起形成的网络,即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经单元组成的区域。
(7)深度神经网络
深度神经网络(deep neural network,DNN),也称多层神经网络,可以理解为具有很多层隐含层的神经网络,这里的“很多”并没有特别的度量标准。从DNN按不同层的位置划分,DNN内部的神经网络可以分为三类:输入层,隐含层,输出层。一般来说第一层是输入层,最后一层是输出层,中间的层数都是隐含层。层与层之间是全连接的,也就是说,第i层的任意一个神经元一定与第i+1层的任意一个神经元相连。虽然DNN看起来很复杂,但是就每一层的工作来说,其实并不复杂,简单来说就是如下线性关系表达式:
Figure PCTCN2021091138-appb-000002
其中,
Figure PCTCN2021091138-appb-000003
是输入向量,
Figure PCTCN2021091138-appb-000004
是输出向量,b是偏移向量,W是权重矩阵(也称系数),α()是激活函数。每一层仅仅是对输入向量
Figure PCTCN2021091138-appb-000005
经过如此简单的操作得到输出向量
Figure PCTCN2021091138-appb-000006
由于DNN层数多,则系数W和偏移向量b的数量也就很多了。这些参数在DNN中的定义如下所述:以系数W为例:假设在一个三层 的DNN中,第二层的第4个神经元到第三层的第2个神经元的线性系数定义为
Figure PCTCN2021091138-appb-000007
上标3代表系数W所在的层数,而下标对应的是输出的第三层索引2和输入的第二层索引4。总结就是:第L-1层的第k个神经元到第L层的第j个神经元的系数定义为
Figure PCTCN2021091138-appb-000008
需要注意的是,输入层是没有W参数的。在深度神经网络中,更多的隐含层让网络更能够刻画现实世界中的复杂情形。理论上而言,参数越多的模型复杂度越高,“容量”也就越大,也就意味着它能完成更复杂的学习任务。训练深度神经网络的也就是学习权重矩阵的过程,其最终目的是得到训练好的深度神经网络的所有层的权重矩阵(由很多层的向量W形成的权重矩阵)。
(8)卷积神经网络
卷积神经网络(CNN,convolutional neuron network)是一种带有卷积结构的深度神经网络。卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器。该特征抽取器可以看作是滤波器,卷积过程可以看作是使用一个可训练的滤波器与一个输入的图像或者卷积特征平面(feature map)做卷积。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层中,通常包含若干个特征平面,每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重,这里共享的权重就是卷积核。共享权重可以理解为提取图像信息的方式与位置无关。这其中隐含的原理是:图像的某一部分的统计信息与其他部分是一样的。即意味着在某一部分学习的图像信息也能用在另一部分上。所以对于图像上的所有位置,都能使用同样的学习得到的图像信息。在同一卷积层中,可以使用多个卷积核来提取不同的图像信息,一般地,卷积核数量越多,卷积操作反映的图像信息越丰富。
卷积核可以以随机大小的矩阵的形式初始化,在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。
(9)损失函数
在训练深度神经网络的过程中,因为希望深度神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有过程,即为深度神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断的调整,直到深度神经网络能够预测出真正想要的目标值或与真正想要的目标值非常接近的值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。
(10)反向传播算法
卷积神经网络可以采用误差反向传播(back propagation,BP)算法在训练过程中修正卷积神经网络中参数的大小,使得卷积神经网络的重建误差损失越来越小。具体 地,前向传递输入信号直至输出会产生误差损失,通过反向传播误差损失信息来更新卷积神经网络中参数,从而使误差损失收敛。反向传播算法是以误差损失为主导的反向传播运动,旨在得到最优的卷积神经网络的参数,例如权重矩阵。
(11)像素值
图像的像素值可以是一个红绿蓝(RGB)颜色值,像素值可以是表示颜色的长整数。例如,像素值为256*Red+100*Green+76Blue,其中,Blue代表蓝色分量,Green代表绿色分量,Red代表红色分量。各个颜色分量中,数值越小,亮度越低,数值越大,亮度越高。对于灰度图像来说,像素值可以是灰度值。
首先,为了便于理解本发明实施例,进一步分析并提出本申请所具体要解决的技术问题。在现有技术中,关于车内多人场景中对于检测到的语音指令进行发出者所在位置的的识别,可以有多种实现方式,例如可以通过车舱内的声纹识别和/或声源定位来实现,以下示例性的列举如下常用的两种方案。其中,
方案一:
声纹(Voiceprint),是用电声学仪器显示的携带言语信息的声波频谱,是由波长、频率以及强度等百余种特征维度组成的生物特征。声纹识别是通过对一种或多种语音信号的特征分析来达到对未知声音辨别的目的,简单的说就是辨别某一句话是否是某一个人说的技术。通过声纹可以确定出说话人的身份,从而进行有针对性的回答。主要分为两个阶段:注册阶段和验证阶段。其中,注册阶段:根据发音人语音的声纹特征,建立相应的声纹模型;验证阶段:接收发音人的语音,提取其声纹特征并与注册的声纹模型进行匹配,若匹配成功,则证明是原来注册的发音人。
方案二:
声源定位技术,是利用声学和电子装置接收目标声场信息以确定目标声源位置的一种技术。麦克风阵列的声源定位是指用麦克风阵列拾取声源信号,通过对多路声音信号进行分析与处理,在空间域中定取一个或者多个声源平面或空间坐标,即得到声源的位置。近一步控制麦克风阵列的波束对准说话人。
方案一和方案二应用于车舱内指令发出者位置分析的缺点:
对于声纹识别的应用,首先需要提前存储有乘坐人员的声纹信息,如果是没有进行过声纹识别和记录的人员则无法识别,同时同一个人的声音具有易变性、同时车内是一个多人环境,当多人同时说话时声纹特征不易提取,或者环境噪音很大时也会对识别有干扰。
对于声源定位技术,由于车内是一个相对狭小且拥挤的空间,尤其是后排乘客之间空间距离上非常的接近,成员在说话时也会存在晃动或者身体倾斜的情况,上诉因素都会导致声源定位的准确性的下降,同时车舱内常常会有多人同时说话的情况,此时也会影响声源定位的准确性。
综上,上述两种方案若应用于应用于车内的语音指令发出者位置的识别,尤其是应用于多人同时发声的车内场景中的指令发出者的位置识别,会存在无法准确识别所采集到的指令具体由哪个位置上的车内成员,因此也就无法实现更为精准、有效的人机交互。因此,本申请要解决的技术问题包括如下方面:在车舱内存在多个用户的情 况下,当采集到特定类型指令,如何准确地判断语音发出者的具体位置、并针对性的指令相应的指令。
本申请实施例提供的语音匹配方法能够应用在智能车辆的人机交互场景。以下示例性列举本申请中语音指令控制方法所应用的人机交互场景,可以包括如下两个场景。
车内交互场景一:
通常车内分布有多个扬声器,分别分布于车舱内的不同位置,多个扬声器可以根据乘客和驾驶员的需求为车内不同区域的乘客提供不同音量大小的音乐,例如乘客A想要休息则需要一个安静的环境,因此可以选择将其所在区域的扬声器音量调整到最低,而乘客B需要正常的听音乐,则可以将其所在区域的扬声器设置为正常的大小;又或者多个扬声器还可以为不同区域的用户提供不同的音频播放内容,例如后排做的是小朋友,则可以在后排为小朋友选择播放童话故事,而前排的驾驶员和副驾驶想要听流行音乐,则可以在前排区域的扬声器播放流行音乐。
而本发明的实施例则可以为车舱内的成员提供一种可以语音控制发出语音指令者所在区域扬声器的方法,例如如图1所示,当车内有A,B,C,D 4个人,分别乘坐于驾驶位,副驾驶位和左后排,右后排,这时成员D说:“把音量调低”,此时可以如图7A所示通过车舱内的摄像头和麦克风分别获取多个车内成员的音频指令信息和视频信息,通过车内成员的唇部运动信息和指令信息的特征匹配,确定说话的成员,并基于说话的成员所发出的指令和位置,控制说话成员所在位置的扬声器,如果识别出说话成员是右后方的乘客,则调低右后方区域的扬声器。
若成员C的语音指令是说:“给我播放一首《****》”,则同样可以通过车舱内的摄像头和麦克风分别获取车内成员的音频指令信息和视频信息,通过处理、分析车内成员A、B、C、D唇部运动信息和语音信息,确定说话的成员是成员C,并基于说话的成员C所发出的指令和所处的位置,控制说话成员所在位置的扬声器,如果识别出说话成员C是左后方的乘客,则控制左后方的扬声器播放歌曲《****》。
类似的应用场景还可以有,车内分布有多个空调出风口,分别分布于车舱内的不同位置,多个空调出风口可以根据乘客和驾驶员的需求为车内不同区域的乘客提供不同风量的大小,实现局部区域温度的差异化调整,例如乘客A温度低一点,因此可以选择将其所在区域的风量调大,而乘客B觉得冷,则可以通过指令将其所在区域的出风口的出风方向调整为不直接吹到人或者风量调小。或者当车内的座椅可以分别独立的调整角度和高低时,不同区域的乘客也会针对自己的需求对座椅的各个参数进行调整。上述场景,都可以通过本发明实施例的语音识别的方法来进行方便的控制,同样可以通过车舱内的摄像头和麦克风分别获取车内成员的音频指令信息和视频信息,通过处理、分析车内成员的唇部运动信息和语音信息,确定说话的成员,并基于说话的成员所发出的指令和所处的位置,控制说话成员所在位置的空调出风口的出风方向或者风量大小,或者控制座椅的椅背角度或者座椅的高低前后。
车内交互场景二:
车舱内的语音指令控制除了前述场景中提到的对车内设置的控制,因为有些车内设施的控制需要区分具体指令实施的目标区域,因此需要识别是哪个位置上的成员发 出的语音指令,除了以上场景外,当驾驶员要执行车辆的行驶控制时,也可以选择通过语音控制车辆的行驶,这种语音指令交互场景下,同样需要识别当前的车辆控制指令是否是驾驶位上的成员发出的。
由此本发明的实施例则可以为车辆的行驶控制提供一种语音指令的权限识别方法,例如当车内有多个人,此时若接受到一个车辆行驶控制相关的语音指令,如,“切换到自动驾驶模式”,而车辆系统默认是驾驶位的成员才具有这类指令的执行权限,此时车辆需要如图2所示获取驾驶位成员的唇部运动信息,并且如图7B所示将获取的驾驶位成员的唇部运动信息和语音指令信息进行特征匹配,获取其与指令信息的匹配度,由此来判断是否是驾驶位成员发出的语音指令,从而判断是否执行所述指令。
具体的判断是否是驾驶位成员发出的语音指令,也可以是获取车内多个成员的唇部运动信息,分析其与指令信息之间的匹配度,看是否是驾驶员位置上的成员的唇部运动信息的匹配度最高,进而判断是否执行所述指令。
可以理解的是,图1、图2车舱中的应用场景的只是本发明实施例中的几种示例性的实施方式,本发明实施例具体实现时可以有多样灵活的实现方式,例如对于场景一,并不需要获取车内全部成员的唇部运动信息,可能根据具体的指令类型,仅获取部分成员的唇部运动信息,例如只有前排座椅可调节时,当检测当座椅调节的指令,只获取前排成员的唇部运动信息。对于场景二,不一定是要获取驾驶位成员的唇部运动信息,当车辆默认是车主具有所识别指令的操作权限时,获取车主所在位置,并提取车主的唇部运动信息,判断是否是车主发出的指令。
由于进行指令信息和车内成员唇部运动信息的匹配,可以采用模型训练的方式获取模型,并通过输入唇部运动信息和指令信息来输出相应的匹配度,因此下面从模型训练侧和模型应用侧对本申请提供的方法进行描述:
本申请提供的任意一种神经网络的训练方法,涉及计算机听觉与视觉的融合处理,具体可以应用于数据训练、机器学习、深度学习等数据处理方法,对训练数据(如本申请中的训练用户的唇部运动信息以及M个语音信息)进行符号化和形式化的智能信息建模、抽取、预处理、训练等,最终得到训练好的目标特征匹配模型;并且,本申请提供的任意一种语音匹配方法可以运用上述训练好的目标特征匹配模型,将输入数据(如本申请中的待识别的语音信息以及N个用户的唇部运动信息)输入到所述训练好的目标特征匹配模型中,得到输出数据(如本申请中的N个用户的唇部运动信息分别与所述待识别的语音信息之间的匹配度)。需要说明的是,本申请实施例提供的一种神经网络的训练方法和一种语音匹配方法是基于同一个构思产生的发明,也可以理解为一个系统中的两个部分,或一个整体流程的两个阶段:如模型训练阶段和模型应用阶段。
参见附图3,图3是本发明实施例提供了一种系统架构100。如所述系统架构100所示,数据采集设备160用于采集训练数据,在本申请中,该数据采集设备160可以包括麦克风和摄像头。本发明实施例中训练数据(即模型训练侧的输入数据)可包括:视频样本数据和语音样本数据,即分别为本发明实施例中的训练用户的唇部运动信息以及M个语音信息,其中,所述M个语音信息可包括与所述训练用户的唇部运动信 息所匹配的语音信息。例如,视频样本数据为某个训练用户在发出语音为:“今天天气特别好,我们去哪里玩?”时的唇部运动图像序列,而语音样本数据则为包含上述训练用户发出“今天天气特别好,我们去哪里玩?”的语音波形序列(作为语音正样本)以及(M-1)个其它语音波形序列(作为语音负样本)。而上述视频样本数据和音频样本数据可以是由数据采集设备160采集的,也可以是从云端下载下来的,图3只是一种示例性的架构,并不对此进行限定。进一步地,数据采集设备160将训练数据存入数据库130,训练设备120基于数据库130中维护的训练数据训练得到目标特征匹配模型/规则101(此处的目标特征匹配模型101即为本发明实施例中的所述目标特征匹配模型,例如,为经过上述训练阶段训练得到的模型,可以用于语音和唇部运动轨迹之间的特征匹配的神经网络模型)。
下面将更详细地描述训练设备120如何基于训练数据得到目标特征匹配模型/规则101,该目标特征匹配模型/规则101能够用于实现本发明实施例提供任意一种语音匹配方法,即,将由数据采集设备160获取的音频数据和图像数据通过相关预处理后输入该目标特征匹配模型/规则101,即可得到多个用户的唇部运动的图像序列特征分别与待识别的语音特征之间的匹配度/置信度。本发明实施例中的目标特征匹配模型/规则101具体可以为时空卷积网络(STCNN),在本申请提供的实施例中,该时空卷积网络可以是通过训练卷积神经网络得到的。需要说明的是,在实际的应用中,所述数据库130中维护的训练数据不一定都来自于数据采集设备160的采集,也有可能是从其他设备接收得到的。另外需要说明的是,训练设备120也不一定完全基于数据库130维护的训练数据进行目标特征匹配模型/规则101的训练,也有可能从云端或其他地方获取训练数据进行模型训练,上述描述不应该作为对本发明实施例的限定。
如图3所示,根据训练设备120训练得到目标特征匹配模型/规则101,该目标特征匹配模型/规则101在本发明实施例中可以称之为视听交叉卷积神经网络(V&A Cross CNN)/时空卷积神经网络。具体的,本发明实施例提供的目标特征匹配模型可以包括:第一模型、第二模型和第三模型,其中第一模型用于进行语音特征的提取,第二模型用于多个用户(本申请中为N个用户)唇部运动的图像序列特征的提取,第三模型则用于上述语音特征和N个用户的图像序列特征之间的匹配度/置信度的计算。在本发明实施例提供的目标特征匹配模型中,所述第一模型、所述第二模型和所述第三模型都可以是卷积神经网络即可以理解为目标特征匹配模型/规则101自身可以看作是一个整体的时空卷积神经网络,而该时空卷积神经网络中又包含了多个独立网络,如上述第一模型、第二模型和第三模型。
除了上述的模型训练和执行方式,本发明实施例还可以通过其他的模型训练和执行方案来实现。
同上面已经介绍过的训练方法的样本数据采集来源一样,本发明实施例中训练数据(即模型训练侧的输入数据)可包括:视频样本数据和语音样本数据,即分别为本发明实施例中的训练用户的唇部运动信息以及M个语音信息,其中,唇部运动信息包括不同用户的各种语音指令语句对应的唇部运动信息,语音信息包括不同用户发出的语音指令语句。可选的也可以包括一些负样本,即不是语音指令的语句对应的唇部运动信息,以及不是语音指令的语音信息。这里的语音指令是指车载系统能够识别并作 出应道相应的语音信息,可以是关键词,也可以是完整的句子。而上述视频样本数据和音频样本数据可以是由数据采集设备160采集的,也可以是从云端下载下来的,或者是第三方数据持有者提供的。进一步地,数据采集设备160将训练数据存入数据库130,训练设备120基于数据库130中维护的训练数据训练得到目标特征匹配模型/规则101(此处的目标特征匹配模型101即为本发明实施例中的所述目标特征匹配模型,例如,为经过上述训练阶段训练得到的模型,可以用于语音和唇部运动轨迹之间的特征匹配的神经网络模型)。
下面将更详细地描述训练设备120如何基于训练数据得到目标特征匹配模型/规则101,该目标特征匹配模型/规则101能够用于实现本发明实施例提供任意一种语音匹配方法,即,将由数据采集设备160获取的音频数据和图像数据通过相关预处理后输入该目标特征匹配模型/规则101,即可得到多个用户的唇部运动的图像序列特征分别与待识别的语音特征之间的匹配度/置信度。本发明实施例中的目标特征匹配模型/规则101具体可以为卷积网络(CNN),在本申请提供的实施例中。需要说明的是,在实际的应用中,所述数据库130中维护的训练数据不一定都来自于数据采集设备160的采集,也有可能是从其他设备接收得到的。另外需要说明的是,训练设备120也不一定完全基于数据库130维护的训练数据进行目标特征匹配模型/规则101的训练,也有可能从云端或其他地方获取训练数据进行模型训练,上述描述不应该作为对本发明实施例的限定。
如图3所示,根据训练设备120训练得到目标特征匹配模型/规则101。具体的,本发明实施例提供的目标特征匹配模型可以包括:第一模型、第二模型,其中第一模型用于进行语音指令的匹配识别语音指令所对应的指令信息,具体的可以是指令标识,或者指令的文本特征,第二模型用于基于N个用户的图像序列特征分别识别于各个唇部运动信息所对应的语音指令的对应关系,例如能够匹配的输出对应的指令的标识及其匹配度,最终根据语音指令对应的指令标识和各个用户唇部运动信息对应的语音标识及其匹配度输出发出语音指令的目标用户。在本发明实施例提供的目标特征匹配模型中,其中所述第一模型,第二模型可以是CNN、RNN、DBN、DNN等。
所述第一模型的训练是以语音指令为输入语音指令所对应的指令标识(标识的表现形式可以为编码)为标签进行训练。第二模型是以用户的唇部运动信息为输入(唇部运动信息具体可以是唇部运动图像序列特征如唇部按照时间采样的开合幅度为向量序列),唇部运动信息对应的指令标识及其匹配度为输出,其中指令标识可以是指令对应的编码,匹配度可以为输出的匹配数值,根据匹配数值进行是否匹配的判断,例如,数值大于0.5是匹配,小于0.5则为不匹配。
根据训练设备120训练得到的目标特征匹配模型/规则101可以应用于不同的系统或设备中,如应用于图4所示的执行设备110,所述执行设备110可以是终端,如手机终端,平板电脑,笔记本电脑,增强现实(Augmented Reality,AR)/虚拟现实(Virtual Reality,VR),智能可穿戴设备、智能机器人、车载终端、智能座舱环境等,还可以是服务器或者云端等。在附图4中,执行设备110配置有I/O接口112,用于与外部设备进行数据交互,用户可以通过客户设备140(本申请中的客户设备也可以包括麦克风、摄像头等数据采集设备)向I/O接口112输入数据,所述输入数据(即模型应用 侧的输入数据)在本发明实施例中可以包括:待识别的语音信息和N个用户的唇部运动信息,即分别为本发明实施例中的目标时间段内的语音波形序列和N个用户的唇部运动信息中的每一个用户的唇部运动信息包括对应的用户在所述目标时间段内唇部运动的图像序列。例如,当前需要识别一群人中具体是哪个人讲了“明天天气怎么样,适合到哪里出游”的语音信息,则该“明天天气怎么样,适合到哪里出游”对应的语音波形序列,以及在场所有人对应的唇部运动的图像序列则作为输入数据。可以理解的是,此处的输入数据,可以是用户输入的,也可以是由相关数据库提供的,具体依据应用场景的不同而不同,本发明实施例对此不作具体限定。
在本发明实施例中,客户设备140可以和执行设备110在同一个设备上,数据采集设备160、数据库130和训练设备120也可以和执行设备110和客户设备140在同一个设备上。以本申请中的执行主体为机器人为例,机器人在通过客户设备140(包括麦克风和摄像头以及处理器)将采集的音频数据和图像数据,进行提取获得待识别的语音信息及N个用户的唇部运动信息之后,则可以通机器人内部的执行设备110,进一步对上述提取的语音信息和唇部运动信息之间进行特征匹配,最终输出结果至客户设备140,由客户设备140中的处理器分析得到所述待识别的语音信息在所述N个用户用所属的目标用户。并且,模型训练侧的设备(数据采集设备160、数据库130和训练设备120)可以在机器人内部,也可以在云端,当在机器人内部时,则可以认为机器人拥有可以实现模型训练或者模型更新优化的功能,此时,机器人既有模型训练侧的功能,又有模型应用侧的功能;当在云端,则可以认为机器人侧仅有模型应用侧的功能。可选的,客户设备140和执行设备110也可以不在同一个设备上,即采集音频数据和图像数据、以及提取待识别的语音信息和N个用户的唇部运动信息可以由客户设备140(例如智能手机、智能机器人等)来执行,而对待识别的语音信息和N个用户的唇部运动信息之间进行特征匹配的过程,则可以由执行设备110(例如云端服务器、服务器等)来执行。或者,可选的,采集音频数据和图像数据由客户设备140来执行,而提取待识别的语音信息和N个用户的唇部运动信息,以及对待识别的语音信息和N个用户的唇部运动信息之间进行特征匹配的过程均由执行设备110来完成。
在附图3中所示情况下,用户可以手动给定输入数据,该手动给定可以通过I/O接口112提供的界面进行操作。另一种情况下,客户设备140可以自动地向I/O接口112发送输入数据,如果要求客户设备140自动发送输入数据需要获得用户的授权,则用户可以在客户设备140中设置相应权限。用户可以在客户设备140查看执行设备110输出的结果,具体的呈现形式可以是显示、声音、动作等具体方式。客户设备140也可以作为数据采集端(例如为麦克风、摄像头),采集如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果作为新的样本数据,并存入数据库130。当然,也可以不经过客户设备140进行采集,而是由I/O接口112直接将如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果,作为新的样本数据存入数据库130。
预处理模块113用于根据I/O接口112接收到的输入数据(如所述语音数据)进行预处理,在本发明实施例中,预处理模块113可以用于对语音数据进行预处理,例 如从语音数据中提取待识别的语音信息。
预处理模块114用于根据I/O接口112接收到的输入数据,如(所述图像数据)进行预处理,在本发明实施例中,预处理模块114可以用于对图像数据进行预处理,例如从图像数据中提取与上述待识别的语音信息对应的N个用户的唇部运动信息。
在执行设备110对输入数据进行预处理,或者在执行设备110的计算模块111执行计算等相关的处理过程中,执行设备110可以调用数据存储系统150中的数据、代码等以用于相应的处理,也可以将相应处理得到的数据、指令等存入数据存储系统150中。最后,I/O接口112将输出结果,如本发明实施例中的N个用户的唇部运动信息分别与待识别的语音信息之间的匹配度,或者其中最高的一个匹配度的目标用户ID返回给客户设备140,客户设备140从而根据上述匹配度,确定目标用户的用户信息,从而基于该用户信息生成与该用户信息匹配的控制指令。
值得说明的是,训练设备120可以针对不同的目标或称不同的任务,基于不同的训练数据生成相应的目标特征匹配模型/规则101,该相应的目标特征匹配模型/规则101即可以用于实现上述目标或完成上述任务,从而为用户提供所需的结果。
值得注意的是,附图4仅是本发明实施例提供的一种系统架构的示意图,图中所示设备、器件、模块等之间的位置关系不构成任何限制,例如,在附图4中,数据存储系统150相对执行设备110是外部存储器,在其它情况下,也可以将数据存储系统150置于执行设备110中。
基于上述系统架构的介绍,以下描述本发明实施例中模型训练侧和模型应用侧所涉及的神经网络模型机即卷积神经网络,卷积神经网络CNN是一种带有卷积结构的深度神经网络,是一种深度学习(deep learning)架构,深度学习架构是指通过机器学习的算法,在不同的抽象层级上进行多个层次的学习。作为一种深度学习架构,CNN是一种前馈(feed-forward)人工神经网络,该前馈人工神经网络中的各个神经元对输入其中的图像中的重叠区域作出响应。
如图4所示,图4为本发明实施例提供的一种卷积神经网络示意图,卷积神经网络(CNN)200可以包括输入层210,卷积层/池化层220,以及神经网络层230,其中池化层为可选的。
卷积层/池化层220:
如图4所示卷积层/池化层120可以包括如示例221-226层,在一种实现中,221层为卷积层,222层为池化层,223层为卷积层,224层为池化层,225为卷积层,226为池化层;在另一种实现方式中,221、222为卷积层,223为池化层,224、225为卷积层,226为池化层。即卷积层的输出可以作为随后的池化层的输入,也可以作为另一个卷积层的输入以继续进行卷积操作。
卷积层:
以卷积层221为例,卷积层221可以包括很多个卷积算子,卷积算子也称为核,其在图像处理中的作用相当于一个从输入图像矩阵中提取特定信息的过滤器,卷积算子本质上可以是一个权重矩阵,这个权重矩阵通常被预先定义,在对图像进行卷积操作的过程中,权重矩阵通常在输入图像上沿着水平方向一个像素接着一个像素(或两 个像素接着两个像素……,这取决于步长stride的取值)的进行处理,从而完成从图像中提取特定特征的工作。该权重矩阵的大小应该与图像的大小相关,需要注意的是,权重矩阵的纵深维度(depth dimension)和输入图像的纵深维度是相同的,在进行卷积运算的过程中,权重矩阵会延伸到输入图像的整个深度。因此,和一个单一的权重矩阵进行卷积会产生一个单一纵深维度的卷积化输出,但是大多数情况下不使用单一权重矩阵,而是应用维度相同的多个权重矩阵。每个权重矩阵的输出被堆叠起来形成卷积图像的纵深维度。不同的权重矩阵可以用来提取图像中不同的特征,例如一个权重矩阵用来提取图像边缘信息,另一个权重矩阵用来提取图像的特定颜色,又一个权重矩阵用来对图像中不需要的噪点进行模糊化……该多个权重矩阵维度相同,经过该多个维度相同的权重矩阵提取后的特征图维度也相同,再将提取到的多个维度相同的特征图合并形成卷积运算的输出。
这些权重矩阵中的权重值在实际应用中需要经过大量的训练得到,通过训练得到的权重值形成的各个权重矩阵可以从输入图像中提取信息,从而帮助卷积神经网络200进行正确的预测。
当卷积神经网络200有多个卷积层的时候,初始的卷积层(例如221)往往提取较多的一般特征,该一般特征也可以被称为低级别的特征;随着卷积神经网络200深度的加深,越往后的卷积层(例如226)提取到的特征越来越复杂,比如高级别的语义之类的特征,语义越高的特征越适用于待解决的问题。
池化层:
由于常常需要减少训练参数的数量,因此卷积层之后常常需要周期性的引入池化层,即如图4中220所示例的221-226各层,可以是一层卷积层后面跟一层池化层,也可以是多层卷积层后面接一层或多层池化层。在图像处理过程中,池化层的唯一目的就是减少图像的空间大小。池化层可以包括平均池化算子和/或最大池化算子,以用于对输入图像进行采样得到较小尺寸的图像。平均池化算子可以在特定范围内对图像中的像素值进行计算产生平均值。最大池化算子可以在特定范围内取该范围内值最大的像素作为最大池化的结果。另外,就像卷积层中用权重矩阵的大小应该与图像大小相关一样,池化层中的运算符也应该与图像的大小相关。通过池化层处理后输出的图像尺寸可以小于输入池化层的图像的尺寸,池化层输出的图像中每个像素点表示输入池化层的图像的对应子区域的平均值或最大值。
神经网络层130:
在经过卷积层/池化层220的处理后,卷积神经网络100还不足以输出所需要的输出信息。因为如前所述,卷积层/池化层220只会提取特征,并减少输入图像带来的参数。然而为了生成最终的输出信息(所需要的类信息或别的相关信息),卷积神经网络200需要利用神经网络层230来生成一个或者一组所需要的类的输出。因此,在神经网络层230中可以包括多层隐含层(如图4所示的231、232至23n)以及输出层240,该多层隐含层中所包含的参数可以根据具体的任务类型的相关训练数据进行预先训练得到,例如该任务类型可以包括图像识别,图像分类,图像超分辨率重建等等……
在神经网络层230中的多层隐含层之后,也就是整个卷积神经网络200的最后层为输出层240,该输出层240具有类似分类交叉熵的损失函数,具体用于计算预测误 差,一旦整个卷积神经网络200的前向传播(图4中由210至240的传播为前向传播)完成,反向传播(图4中由240至210的传播为反向传播)就会开始更新前面提到的各层的权重值以及偏差,以减少卷积神经网络200的损失及卷积神经网络200通过输出层输出的结果和理想结果之间的误差。
需要说明的是,如图4所示的卷积神经网络200仅作为一种卷积神经网络的示例,在具体的应用中,卷积神经网络还可以以其他网络模型的形式存在,例如,多个卷积层/池化层并行,将分别提取的特征均输入给全神经网络层230进行处理。
本申请中的归一化层,作为CNN的功能层,原则上可以在上述CNN中的任何一层之后,或者任何一层之前进行,并以上一层输出的特征矩阵作为输入,其输出也可以作为CNN中任何一层功能层的输入。但在实际CNN应用中,归一化层一般在卷积层之后进行,并以前面卷积层输出的特征矩阵作为输入矩阵。
基于上述图3和图4中对系统架构100以及对卷积神经网络200的相关功能描述,下面结合上述应用场景、系统架构、卷积神经网络的结构、神经网络处理器的结构,从模型训练侧和模型应用侧对本申请提供的神经网络的训练方法、语音匹配方法的实施例进行描述,以及对本申请中提出的技术问题进行具体分析和解决。
参见图5,图5是本发明实施例提供的一种神经网络的训练方法的流程示意图,该方法可应用于上述图1、图2中所述的应用场景及系统架构中,具体可应用于上述图3的训练设备120中。下面结合附图5以执行主体为上述图3中的训练设备120或者包含训练设备120的设备为例进行描述。该方法可以包括以下步骤S701-步骤S702。
S701:获取训练样本,所述训练样本包括训练用户的唇部运动信息以及M个指令信息。
具体地,例如,训练用户的唇部运动信息为用户小方发出语音信息:“你好,我的名字叫小方,来自中国湖南,你呢?”所对应的唇部运动信息也即是唇部运动视频或唇部连续运动的图像序列,或者是可以体现唇部开合运动的上下唇之间的距离按照时序关系所组成的向量参数那么,所述M个指令信息则包括上述“空调温度调高一点”的指令信息的波形序列或者文本信息作为指令样本,以及其它的指令信息,如“座椅后背角度调低一点”、“打开车窗”“把音乐关掉”等语音信息作为负样本。可选的,所述M个指令信息包括与所述训练用户的唇部运动信息所匹配的指令信息以及(M-1)个与所述训练用户的唇部运动信息不匹配的指令信息。例如,上述唇部运动信息为用户A在发出指令信息:“空调温度调高一点”所对应的连续的唇部运动的图像序列(即发音口型的视频),而上述M个指令信息则包括上述语音正样本的语音波形序列,和M-1个负样本的语音波形序列。可以理解的是,上述M个指令信息中也可以包括多个正样本和负样本,即对正样本和负样本的数量不作具体限定,只要均包含即可。
S702:以所述训练用户的唇部运动信息以及所述M个语音信息为训练输入,以所述训练用户的唇部运动信息分别与所述M个语音信息之间的匹配度为M个标签,对初始化的神经网络进行训练,得到目标特征匹配模型。
具体地,例如,上述训练用户的唇部运动信息与正样本的指令信息“空调温度调高一点”之间的标签为“匹配度=1”,而上述训练用户的唇部运动信息与其他负样本 的指令信息“座椅后背角度调低一点”、“打开车窗”“把音乐关掉”之间的标签为“匹配度=0.2”、“匹配度=0”“匹配度=0”等,此处不再赘述。也即是通过上述训练输入和预先设置的标签,可以将初始化的神经网络模型训练得到本申请中所需要使用的目标特征匹配模型,该目标特征匹配模型可以用于匹配待识别的指令信息与多个用户的唇部运动信息之间的匹配关系,用于实现本申请中的任意一种语音匹配方法。
在一种可能的实现方式中,所述以所述训练用户的唇部运动信息以及所述M个指令信息为训练输入,以所述训练用户的唇部运动信息分别与所述M个指令信息之间的匹配度为M个标签,对初始化的神经网络进行训练,得到目标特征匹配模型,包括:将所述训练用户的唇部运动信息以及所述M个指令信息输入到所述初始化的神经网络中,计算得到所述M个指令信息分别与所述训练用户的唇部运动信息之间的匹配度;将计算得到的所述M个指令信息分别与所述训练用户的唇部运动信息之间的匹配度与所述M个标签进行比较,对初始化的神经网络进行训练,得到目标特征匹配模型。
在一种可能的实现方式中,所述目标特征匹配模型包括第一模型、第二模型和第三模型;所述将所述训练用户的唇部运动信息以及所述M个指令信息输入到所述初始化的神经网络中,计算得到所述M个指令信息分别与所述训练用户的唇部运动信息之间的匹配度,包括:将所述M个指令信息输入到所述第一模型中,得到M个语音特征,所述M个语音特征中的每一个语音特征均为K维语音特征,K为大于0的整数;将所述训练用户的唇部运动信息输入到所述第二模型中,得到所述训练用户的图像序列特征,所述训练用户的图像序列特征为K维图像序列特征;将所述M个语音特征和所述训练用户的图像序列特征输入到第三模型中,计算得到所述M个语音特征分别与所述训练用户的图像序列特征之间的匹配度。
关于上述具体如何从初始化的神经网络模型训练成为本申请中的目标特征匹配模型,在后续图7对应的模型应用侧的方法实施例中一并进行描述,此处不作详述。
本发明实施例,通过将某个训练用户的唇部运动信息,以及与之匹配的指令信息和多个不匹配的指令信息作为初始化的神经网络的输入,并基于上述M个指令信息与该训练用户的唇部运动信息的实际匹配度作为标签,对上述初始的神经网络模型进行训练得到的目标特征匹配模型,例如,完全匹配对应的匹配度即标签为1,不匹配对应的匹配度即标签为0,当通过训练后的初始化的神经网络计算得到的训练用户的唇部运动信息分别与M个指令信息之间的匹配度越接近所述M个标签,则该训练后的初始化的神经网络越接近所述目标特征匹配模型。
参见图9,图9是本发明实施例提供的又一种语音指令控制方法的流程示意图,主要适用车内成员车内车载设备的语音交互控制的场景,通常在车内存在多个车内成员的场景下,车载设备接受到用于车载设备控制的语音指令,当车载设备需要确定指令是基于哪个位置的车内成员发出,并针对特定位置区域进行响应控制的场景,可以基于本方案来准确识别出是哪个位置上的成员发出的语音指令。该方法可应用于车舱内的应用场景及系统架构中,以及具体可应用于上述图3的客户设备140以及执行设备110中,可以理解的是,客户设备140和执行设备110均可以设置在车内。下面结合附图9以执行主体为智能车辆为例进行描述。该方法可以包括以下步骤S1601-步骤S1605
步骤S1601:获取车内音频数据。
具体地,获取车载麦克风采集的车内音频数据。音频数据包括有车内的环境音,如扬声器的音乐,发动机空调等的噪声,车外的声音等等环境音以及用户发出的语音指令。
通常智能车辆的车舱内存在麦克风阵列,即车内存在多个麦克风分布于车舱内不同位置,因此当车内存在麦克风阵列时,此时步骤S1601具体可以为:
S1601a:获取车内多个麦克风采集的音频数据。
在进行人机交互的场景下麦克风阵列会进行音频数据的采集,或者车内麦克风阵列在车辆启动后便实时处于音频数据采集状态下,或者是通过车内成员例如车主进行特定操作后,例如开启音频采集功能后,麦克风阵列进入音频采集状态。麦克风阵列采集音频数据的方式为,多个麦克风分别在车舱内的不同位置采集音频数据。
S1601b:基于多个麦克风采集的音频数据获取目标音频数据。
车内麦克风阵列通常为多个麦克风设置于车内不的不同位置,因此在车内环境下的音频数据的获取可以有多个音频源供选择,由于不用位置采集到的音频数据的效果是不一样的,例如当发出语音指令的人是坐在车辆的后排,而前排副驾驶位的成员在听歌,副驾驶位的扬声器此时正在播放歌曲,那么这个时候副驾驶位采集到的音频数据会因为副驾驶位的扬声器而音乐声比较大,后排乘客的指令信息比较微小,而后排的扬声器则相对会采集到一个比较明确的语音信号仅伴随较小的音乐声,此时在进行音频数据的获取时,通常会对各个麦克风采集到的音频数据进行预处理后,通过分析比较,选择出目标音频数据。例如因为环境噪声,音乐声,和语音指令所处的频段不同,所述预处理可以是对多个麦克风采集到的音频数据进行滤波处理,选择滤波处理后语音信号最强的音频信号作为目标音频信号。
这里还可以是通过其他现有的预处理方式来判断哪个麦克风采集的音频数据中的语音指令相关信号的信号质量最佳,及选择此音频信号作为目标音频信号。所选择的目标音频信号可以是原始的麦克风采集的音频信号,也可以是经过预处理后的音频信号。
步骤S1602:当识别出所述音频数据中包括第一类型指令信息,获取图像数据。
所述识别出所述S1601步骤中获取的音频数据中是否包括第一类型指令信息的方式有多种,例如可以基于RNN模型来进行音频信息的语义识别,然后基于识别出的文本信息进行指令内容的识别,根据指令内容判断指令类型,或者直接根据文本信息中的特征信息,如关键词,判断指令类型,具体基于语音进行指令识别的方案在现有技术中已有多种,在此不一一列举。所述用于进行模型输入的音频数据可以是对采集的音频数据进行环境噪声滤除等预处理后的音频数据,也可以直接基于采集到的音频数据直接进行输入。也可以是现有技术中的其他的语音识别方式来判断是否包括有指令信息。
本实施例中的第一类型指令信息是指车载设备能够接收并识别,并需要通过判断指令发起者所处的位置对所述位置区域进行相应操作响应的指令信息,即通常为车舱内部设施的调控指令,如车舱内的空调调节指令,声音调节,音频内容选择调节相关的指令。
指令信息可以是音频数据中对应于指令时间段内的语音波形序列,或者为从所述音频数据中提取的在所述目标时间段内的文本信息的文本特征序列。在本文的模型介绍时,会提到待识别语音信息,其实质也是语音指令发出的对应时间段内的语音波形序列,因此图9中提到指令信息,当其为语音波形序列形式时,也是一种待识别语音信息。
获取车内音图像数据是同麦克风进行音频采集一样,可以是再车辆启动后就开始自动进行实时采集,也可以是根据用户的指示开启实时的采集功能,或者是默认音频采集启动时同时开启图像数据的采集。车辆上通常会设置有多个摄像头,也会设置有不同种类的摄像头,如单目相机,双目相机,TOF相机,红外相机等,在本方案中并不限定采集车内图像数据的摄像头的部署位置和个数以及摄像头的类型,本领域技术人员可以根据具体方案实现的需要进行相应的选择部署。其中S1601步骤中的麦克风可以是独立设置的麦克风,也可以是集成在摄像头中的麦克风。
所述图像数据可以是车载处理系统在通过麦克风获取音频数据的同时,通过摄像头采集得到的图像数据。即上述音频数据和图像数据是某个时间段内的原始音频数据及图像数据,即音频数据源和图像数据源。可选的,该音频数据和图像数据是针对同一个场景下的同一个时间段内所采集的。
由于车舱内通常是2排以上的座位,从一个摄像头获取图像数据时,往往容易存在成员之间的遮挡情况,因此为了清楚地采集到每一个成员的唇部运动信息,通常需要通过车舱内处于不同位置的多个摄像头进行图像数据的采集。但音频数据源的数量和图像数据源的数量并不要求一定要相匹配,比如,可以通过设置在车内各个位置的麦克风采集音频数据,通过车内的全局摄像头采集车舱内的图像数据,也可以是通过某个指定的麦克风采集音频数据,通过车内多个位置的摄像头采集车舱内的图像数据。
步骤S1603:从所述图像数据中提取车内N个位置上的成员的唇部运动信息。
具体地,根据所采集到的车内视频信息,判断车内成员的位置分布,并提取各个位置上的成员的唇部运动信息,所述唇部运动信息携带有对应的位置标识。
所述车内多个成员的唇部运动信息中的每一个成员的唇部运动信息包括对应的用户在对应目标时间段内唇部运动的图像序列,目标时间段为音频中所述指令信息对应的时间段。即从原始图像数据中提取的各个成员的唇部视频,即连续的唇部运动的图像序列,包含了对应成员的连续的口型变化特征。例如,通过摄像头采集的图像数据中的每一帧图像的格式为24位BMP位图,其中,BMP图像文件(Bitmap-File)格式是Windows采用的图像文件存储格式,而24位图像则是使用3字节保存颜色值,每一个字节代表一种颜色,按红(R)、绿(G)、蓝(B)排列,并将RGB彩色图像转换成灰度图。智能设备从上述摄像头采集的图像数据中,基于人脸识别算法,获取至少一个人脸区域,并进一步以每个人脸区域为单位,为每个人脸区域赋予一个人脸ID(和机器人或者智能音箱场景不同,所述人脸ID用于对应车内的位置),提取嘴部区域的视频序列流,其中,视频的帧速率为30f/s(帧率(Frame rate)=帧数(Frames)/时间(Time),单位为帧每秒(f/s,frames per second,fps))。9个连续的图像帧形成0.3秒的视频流。将9帧图像数据(视频速度30fps)拼接(concat)为一个尺寸为9×60×100的cube,其中9是表示时间信息的帧数(时序特征),每个通道都是口腔区域的60×100灰度图像(2d空间特征)。以此N个成员分别对应的0.3s内的唇部运动的图像序列作为视频特征的输入,其中0.3s则为目标时间段。
具体如何从所述图像数据中提取多个成员的唇部运动信息,参考本发明的在先实施例中的相应技术方案描述。
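作为一种非限定性的示意,下面给出提取上述9×60×100唇部运动cube的一段假设性Python(OpenCV)示例,其中所用的Haar人脸检测器、以人脸下三分之一作为唇部区域等处理方式均为假设,并非对具体人脸识别算法的限定:

```python
# 非限定性示意:定位人脸,取其下三分之一作为唇部区域,转灰度并缩放为 60×100,
# 9 帧堆叠成 9×60×100 的 cube。
import cv2
import numpy as np

face_det = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def lip_cube_from_frames(frames):
    """frames: 长度为 9 的 BGR 图像列表(约 0.3s,30fps);返回 9×60×100 的灰度 cube,检测失败返回 None"""
    crops = []
    for img in frames:
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        faces = face_det.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) == 0:
            return None
        x, y, w, h = faces[0]                       # 简化起见只取第一张人脸,实际应按位置区分 N 个成员
        mouth = gray[y + 2 * h // 3: y + h, x: x + w]
        crops.append(cv2.resize(mouth, (100, 60)))  # (宽, 高),得到 60×100 灰度图
    return np.stack(crops, axis=0)                  # 形状 (9, 60, 100)
```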
步骤S1604:将所述指令信息和所述N个成员的唇部运动信息输入到目标特征匹配模型中,得到所述N个成员的唇部运动信息分别与所述指令信息之间的匹配度。
具体地,将指令信息和N个成员各自在所述目标时间段内唇部运动的图像序列,分别作为音频特征的输入和视频特征的输入,输入到目标特征匹配模型中,分别计算该指令信息和N个成员的唇部运动特征之间的匹配度。匹配度具体可以是大于等于0且小于等于1的数值。
此处的指令信息可以是如图6所示的声音波形的形式(图6为本发明实施例提供的一种声音波形示例图),或者是指令的标识信息的形式,如序列编号的形式,或指令语句的形式等。在一种可能的实现方式中,假设音频数据中有多个用户在同时讲话,那么此时需要判断其中的某一段语音信息是由哪个用户发出的,则需要先识别、提取音频数据中的目标语音信息,即上述待识别的语音信息。或者,假设该音频数据中包括了某一个用户讲的多段语音信息,而智能设备只需要识别其中某一段语音信息,则该段语音信息为待识别的语音信息。例如,智能设备从S1601中的麦克风阵列获取的音频数据中提取音频特征,具体可以使用梅尔频率倒谱系数(MFCC)进行语音特征提取,对帧长为20ms的数据提取40维的特征,帧与帧之间没有重叠(non-overlapping),每15帧(对应0.3秒的音频片段)拼接(concat)为一个维数为15×40×3的cube(其中,15是时序特征,40×3是2d空间特征),以该0.3s内的语音波形序列作为音频特征的输入,其中0.3s则为目标时间段。除了上述方式以外,现有技术中也还有其他方式可以用于分离一段语音中的目标语句。
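作为一种非限定性的示意,下面给出提取上述15×40×3音频cube的一段假设性Python(librosa)示例;其中将“×3”解释为MFCC及其一阶、二阶差分属于本文的一种假设性理解,窗长等参数取值亦为假设:

```python
# 非限定性示意:按 20ms 不重叠帧移提取 40 维 MFCC,并以一阶/二阶差分构成第 3 维,
# 取 15 帧(约 0.3s)拼成 15×40×3 的 cube。假设输入波形不短于 0.3s。
import numpy as np
import librosa

def mfcc_cube(wave, sr):
    hop = int(0.02 * sr)                                  # 20ms 一帧,帧间不重叠
    mfcc = librosa.feature.mfcc(y=wave, sr=sr, n_mfcc=40, n_fft=512, hop_length=hop)
    d1 = librosa.feature.delta(mfcc, order=1)             # 一阶差分
    d2 = librosa.feature.delta(mfcc, order=2)             # 二阶差分
    feat = np.stack([mfcc, d1, d2], axis=-1)              # 形状 (40, 帧数, 3)
    feat = feat[:, :15, :]                                # 取目标时间段内的 15 帧
    return np.transpose(feat, (1, 0, 2))                  # 形状 (15, 40, 3)
```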
关于目标特征匹配模型的具体实现,会在下文中具体介绍,其中模型结构参见后续关于图7的描述,以及前文图3中对于模型训练和获取的描述。
步骤S1605:根据匹配度确定对哪个位置区域执行所述指令信息对应的指令。
由于匹配度通常来说是数值的形式,因此S1605相应的确定策略可以是:将匹配度最高的唇部运动信息所对应的成员所处的车内位置,确定为执行所述指令信息的目标区域,并针对该目标区域执行所述指令信息对应的指令。
如当指令为调低空调温度时,则仅对目标区域执行调低出风口温度或者风量的操作。
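作为一种非限定性的示意,下面给出按目标位置分区执行空调调节的一段假设性Python示例,其中座位到分区的映射表ZONES以及zone_ac控制接口均为假设的占位实现,实际应替换为具体车型的车载空调控制接口:

```python
# 非限定性示意:根据匹配结果得到的目标位置,仅对该位置对应的分区执行空调调节。
ZONES = {"driver": 0, "front_passenger": 1, "rear_left": 2, "rear_right": 3}   # 假设的分区编号

def execute_ac_command(target_seat: str, delta_temp: float, zone_ac):
    """target_seat: 目标位置标识;delta_temp: 温度调节量;zone_ac: 假设的分区空调控制接口"""
    zone = ZONES.get(target_seat)
    if zone is None:
        raise ValueError(f"未知座位: {target_seat}")
    zone_ac.adjust_temperature(zone, delta_temp)   # 仅调节目标位置出风口温度,其余分区不受影响
```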
此外S1604-S1605还可以是:
S1604:将所述指令信息和其中一个成员的唇部运动信息输入到目标特征匹配模型中,得到所述成员的唇部运动信息与所述指令信息之间的匹配度。
步骤S1605:当匹配度大于预设门限值时,则在该成员所在的位置区域执行所述指令信息对应的指令。
若匹配度小于预设门限值,则按照一定的规则,继续判断车内另一位置上的成员的唇部运动信息和所述指令信息的匹配度,直到得到匹配度大于预设门限值的唇部运动信息,或者匹配完所有的车内成员,则结束匹配过程。
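作为一种非限定性的示意,下面给出上述逐位置按门限匹配的一段假设性Python示例,其中match_degree代表上文目标特征匹配模型的推理接口,门限值0.5为假设取值:

```python
# 非限定性示意:逐个位置计算唇部运动信息与指令信息的匹配度,超过门限即确定目标位置。
def locate_speaker_by_threshold(instruction, lip_infos_by_seat, match_degree, threshold=0.5):
    """lip_infos_by_seat: {座位标识: 唇部运动信息};按一定顺序(如先驾驶位)依次匹配"""
    for seat, lip_info in lip_infos_by_seat.items():
        if match_degree(instruction, lip_info) > threshold:
            return seat          # 找到匹配度大于门限的成员,返回其位置
    return None                  # 匹配完所有成员仍未超过门限,结束匹配过程
```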
车舱内除了如上述实施例一样,需要识别指令发起者所在的位置,针对特定的位置执行相应的操作外,还存在需要判断指令的发起者的身份的场景,例如当识别到一个车辆控制相关的语音指令,此时需要判断是否是驾驶员发出的指令,从而判断是否能够执行该指令。对于这类场景,具体的实施方式如下:
参见图10,图10是本发明实施例提供的又一种语音匹配方法的流程示意图,主要适用于基于语音指令对车辆执行行驶方面的操作控制的场景。由于通常车内存在多个成员,且通常认为只有驾驶员具备对车辆行驶进行语音操作控制的权限,为了避免误操作、误识别,当车载设备接收到用于车辆行驶控制的语音指令时,需要判断是否是驾驶员发出的语音指令,然后基于识别结果判断是否执行所述车辆行驶指令。该方法可以包括以下步骤S1701-步骤S1705。
步骤S1701:获取车内音频数据。
其中S1701的具体实现和S1601相同。
步骤S1702:当识别出所述音频数据中包括第二类型指令信息,获取图像数据。
S1702中的第二类型指令信息主要是指车辆的行驶控制相关的指令信息,如车辆转弯、加速、启动,驾驶模式的切换等。当识别出是这一类的指令信息时,需要获取驾驶位成员的图像数据。
具体的指令识别方式和图像数据获取方式参见S1602。
步骤S1703:从所述图像数据中提取第一位置成员的唇部运动信息。其中如何提取唇部运动信息以及如何对唇部运动信息进行标识参见S1603。
步骤S1704:将所述指令信息和所述第一位置成员的唇部运动信息输入到目标特征匹配模型中,得到所述第一位置成员的唇部运动信息与所述指令信息之间的匹配度。
步骤S1705:根据匹配度确定是否执行所述指令信息对应的指令。
S1705有多种判断方式。由于匹配度通常来说是数值的形式,S1705可以是根据匹配度是否高于预设门限值来判断是否执行所述指令信息。即,当匹配度大于预设的门限值时,则认为所述指令是第一位置成员发出的,则执行所述车辆行驶控制指令;否则不执行所述指令。
S1705还可以是判断所述第一位置成员唇部运动信息的匹配度是不是所有车内成员唇部运动信息和指令信息的匹配度中最高的。若是这种情况,则需要在S1703中除了提取第一位置成员的唇部运动信息外,还要提取车内其他成员的唇部运动信息。同样,S1704中除了将所述指令信息和所述第一位置成员的唇部运动信息输入到目标特征匹配模型中,还要将其他成员的唇部运动信息也输入到目标特征匹配模型中,获取相应的匹配度。
在方案具体实现的时候,上述实施例中的第一位置通常是驾驶位。例如,可以由车内控制系统初始设定默认驾驶位成员具备语音控制车辆行驶操作的权限,也可以基于用户的人工设定,根据每次乘车时的具体位置分布情况进行更改,例如设置驾驶位和副驾驶位都具备车辆行驶控制权限,此时第一位置则是驾驶位和副驾驶位。
或者,本发明实施例在具体实现时也可以是在车辆初始化的时候,根据车辆的提示要求,或者由车主主动设置,在车辆上录入家庭中会使用车辆的成员的图像信息和权限信息。此时本发明实施例的方案在具体实现时,可以是在车辆行驶之前或者启动之后,由车内摄像头获取具有驾驶操控权限的登记成员的位置信息,然后当识别到车辆控制相关的指令时,基于具备操控权限的成员所处位置上的唇部运动信息判断所述语音指令是否是该位置上的成员发出的。
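作为一种非限定性的示意,下面给出“登记具备行驶操控权限的成员并定位其座位”的一段假设性Python示例,其中登记表REGISTERED、人脸比对接口face_match以及各字段名均为本文假设:

```python
# 非限定性示意:预先登记具备行驶操控权限的成员,行驶前由车内摄像头确认其所在座位,
# 收到行驶类指令时仅对该座位的唇部运动信息做匹配。
REGISTERED = {"owner_face_id": {"name": "车主", "can_drive_control": True}}   # 假设的登记信息

def permitted_seat(seat_faces, face_match):
    """seat_faces: {座位标识: 该座位成员的人脸特征};face_match: 假设的人脸比对接口;
    返回具备行驶操控权限的登记成员所在座位,未找到时返回 None"""
    for seat, face in seat_faces.items():
        for face_id, info in REGISTERED.items():
            if info["can_drive_control"] and face_match(face, face_id):
                return seat
    return None
```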
本发明的实施例除了判断车辆行驶控制类语音指令是不是驾驶位成员发出之外,还可以应用于其他类型的指令是否可执行的判断,例如对于呼叫功能,可以手动设置或者由车辆默认设置为只有车主或者驾驶员可执行语音控制。以上实施例仅为具体的示例,并不限定具体的指令类型或者具体的固定位置。
通过上述两个车内交互的实施例,可以通过判断是哪个车内座位上的成员发出的指令,来针对性地实施指令操作,为用户提供更精准的车舱内交互控制。
对于车辆行驶操控上的语音控制,能够很好地防止误操作、误识别,确保只有驾驶员才能够进行相应的车辆行驶控制,提高了车辆行驶控制的安全性。
本发明实施例还提供又一种语音指令控制方法,该方法可应用于上述图1、图2的车内的应用场景及系统架构中,以及具体可应用于上述图3的执行设备110中,可以理解的是,此时,客户设备140和执行设备110可以不在同一个物理设备上,如图8所示,图8为本发明实施例提供的一种语音指令控制系统架构图,在该系统中,例如,包括智能车辆800,作为音频数据和图像数据的采集设备,进一步地,还可以作为待识别指令信息以及N个用户的唇部信息的提取设备;而关于上述提取后的待识别指令信息以及N个用户的唇部信息之间的匹配则可以在执行设备110所在的服务器/服务设备/服务装置/云端服务设备801上执行。可选的,上述待识别指令信息以及N个用户的唇部信息的提取也可以在执行设备110所在的设备侧执行,本发明实施例对此不作具体限定。下面以包含图8中的云端服务设备801为例进行描述。该方法如图11所示,可以包括以下步骤S1001-步骤S1003。
步骤S1001:获取指令信息和位于车舱内N个车内成员的唇部运动信息;
上述步骤中,指令信息为根据车舱内采集的音频数据获取,车内成员的唇部运动信息为当判断所述指令信息对应的指令为第一类型指令时获取,唇部运动信息包括所述N个车内成员在目标时间段内的唇部运动的图像序列,所述目标时间段为所述指令在所述音频数据中对应的时间段。
步骤S1002:将所述指令信息和所述位于车舱内N个车内成员的唇部运动信息输入到目标特征匹配模型中,得到所述N个位置上的车内成员的唇部运动信息分别与所述指令信息之间的匹配度;
步骤S1003:将匹配度最高的唇部运动信息对应的成员所处的位置作为执行所述指令信息对应的指令的目标位置。
除此之外,还有如图12所示的、需要识别具体发出指令的成员的权限,从而判断是否执行指令的云端方案:
步骤S1021:获取指令信息和位于车内第一位置的车内成员的唇部运动信息;
上述步骤中的指令信息为根据车舱内采集的音频数据获取,位于车内第一位置的车内成员的唇部运动信息为当识别所述指令信息对应的指令为第二类型指令时获取,唇部运动信息包括所述位于车内第一位置的车内成员在目标时间段内的唇部运动的图像序列,目标时间段为所述指令在所述音频数据中对应的时间段。
步骤S1022:将所述指令信息和所述位于车内第一位置的车内成员的唇部运动信息输入到目标特征匹配模型中,得到所述位于车内第一位置的车内成员的唇部运动信息与所述指令信息之间的第一匹配度;
步骤S1023:根据所述第一匹配度确定是否执行所述指令信息对应的指令。
在一种可能的实现方式中,所述目标特征匹配模型包括第一模型、第二模型和第三模型;
所述将所述待识别的语音信息和所述N个用户的唇部运动信息输入到目标特征匹配模型中,得到所述N个用户的唇部运动信息分别与所述待识别的语音信息之间的匹配度,包括:
将所述待识别的语音信息输入到所述第一模型中,得到语音特征,所述语音特征为K维语音特征,K为大于0的整数;
将所述N个用户的唇部运动信息输入到所述第二模型中,得到N个图像序列特征,所述N个图像序列特征中的每一个图像序列特征均为K维图像序列特征;
将所述语音特征和所述N个图像序列特征输入到第三模型中,得到所述N个图像序列特征分别与所述语音特征之间的匹配度。
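作为一种非限定性的示意,下面给出上述三个模型在推理阶段的一段假设性Python(PyTorch)示例,沿用前文训练示例中假设的audio_enc、lip_enc、head三个子模型,对N个用户分别计算匹配度并取最大者:

```python
# 非限定性示意:推理阶段将一条待识别的语音特征与 N 个用户的图像序列特征送入第三模型,
# 得到 N 个匹配度并取最大者。audio_enc/lip_enc/head 为前文训练示例中假设的三个子模型。
import torch

@torch.no_grad()
def rank_members(instr_cube, lip_cubes, audio_enc, lip_enc, head):
    """instr_cube: [1,15,40,3];lip_cubes: [N,9,60,100];返回(各成员匹配度, 匹配度最高的成员下标)"""
    a = audio_enc(instr_cube).expand(lip_cubes.size(0), -1)   # K 维语音特征复制 N 份
    v = lip_enc(lip_cubes)                                    # N 个 K 维图像序列特征
    scores = head(a, v)                                       # N 个匹配度
    return scores, int(torch.argmax(scores).item())
```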
在一种可能的实现方式中,所述目标特征匹配模型为以训练用户的唇部运动信息以及M个指令信息为输入、以所述训练用户的唇部运动信息分别与所述M个指令信息之间的匹配度为M个标签,训练得到的特征匹配模型。
在一种可能的实现方式中,所述方法还包括:
确定所述目标用户的用户信息,所述用户信息包括人物属性信息、与所述待识别的语音信息对应的面部表情信息、与所述待识别的语音信息对应的环境信息中的一种或多种;
基于所述用户信息,生成与所述用户信息匹配的控制指令。
在一种可能的实现方式中,所述方法还包括:从图像数据中提取N个用户的唇部运动信息;进一步地,所述从所述图像数据中提取N个用户的唇部运动信息,包括:
基于人脸识别算法,识别所述图像数据中的N个人脸区域,提取所述N个人脸区域中每个人脸区域中的唇部运动视频;
基于所述每个人脸区域中的唇部运动视频确定所述N个用户的唇部运动信息。
在一种可能的实现方式中,所述方法还包括:从所述音频数据中提取待识别的语音信息;进一步地,所述从所述音频数据中提取待识别的语音信息,包括:
基于频谱识别算法,识别所述音频数据中的不同频谱的音频数据,并将目标频谱的音频数据识别为所述待识别的语音信息。
需要说明的是,本发明实施例中所描述的云端服务设备所执行的方法流程可参见上述图9-图12中所述的相关方法实施例,此处不再赘述。
请参见图13,图13是本发明实施例提供的一种智能设备的结构示意图,也是该智能设备的功能原理示意图。该智能设备可以为车载设备、车载系统、智能车辆。该智能设备40中可包括处理器401,以及耦合于该处理器401的麦克风402、摄像头403,当是智能车辆或者是车内的语音处理系统时,所述麦克风402、摄像头403通常为多个,如对应图11的应用场景,其中,
麦克风402,用于采集音频数据;
摄像头403,用于采集图像数据,所述音频数据与所述图像数据为针对同一场景下采集的;
处理器401,用于获取所述车舱内音频数据,当识别出所述车舱内音频数据中包括第一类型指令时,获取车舱内图像数据;从所述车舱内图像数据中提取车舱内N个位置上的车内成员的唇部运动信息;将所述第一类型指令对应的指令信息和所述车舱内N个位置上的车内成员的唇部运动信息输入到目标特征匹配模型中,得到所述N个位置上的车内成员的唇部运动信息分别与所述指令信息之间的匹配度;将匹配度最高的唇部运动信息对应的成员所处的位置作为执行所述指令信息对应的指令的目标位置。
如对应图12的应用场景,麦克风402,用于采集音频数据;
摄像头403,用于采集图像数据,所述音频数据与所述图像数据为针对同一场景下采集的;
处理器401,用于获取所述车舱内音频数据,当识别出所述车舱内音频数据中包括第二类型指令时,获取所述车舱内图像数据,从所述车舱内图像数据中获取第一图像数据,所述第一图像数据为包括位于车内第一位置的车内成员的图像数据,从所述第一图像数据中提取所述位于车内第一位置的车内成员的唇部运动信息;将所述第二类型指令对应的指令信息和所述位于车内第一位置的车内成员的唇部运动信息输入到目标特征匹配模型中,得到所述位于车内第一位置的车内成员的唇部运动信息与所述指令信息之间的第一匹配度,根据所述第一匹配度确定是否执行所述指令信息对应的指令。
在一种可能的实现方式中,所述待识别的语音信息包括目标时间段内的语音波形序列;所述N个用户的唇部运动信息中的每一个用户的唇部运动信息包括对应的用户在所述目标时间段内唇部运动的图像序列。
在一种可能的实现方式中,处理器401,具体用于:将所述待识别的语音信息和所述N个用户的唇部运动信息输入到目标特征匹配模型中,得到所述N个用户的唇部运动信息分别与所述待识别的语音信息之间的匹配度;将匹配度最高的用户的唇部运动信息对应的用户,确定为所述待识别的语音信息所属的目标用户。
在一种可能的实现方式中,所述目标特征匹配模型包括第一模型、第二模型和第三模型;处理器401,具体用于:将所述待识别的语音信息输入到所述第一模型中,得到语音特征,所述语音特征为K维语音特征,K为大于0的整数;将所述N个用户的唇部运动信息输入到所述第二模型中,得到N个图像序列特征,所述N个图像序列特征中的每一个图像序列特征均为K维图像序列特征;将所述语音特征和所述N个图像序列特征输入到第三模型中,得到所述N个图像序列特征分别与所述语音特征之间的匹配度。
在一种可能的实现方式中,所述目标特征匹配模型为以训练用户的唇部运动信息以及M个语音信息为输入、以所述训练用户的唇部运动信息分别与所述M个语音信息之间的匹配度为M个标签,训练得到的特征匹配模型,其中,所述M个语音信息包括与所述训练用户的唇部运动信息所匹配的语音信息。
在一种可能的实现方式中,处理器401还用于:确定所述目标用户的用户信息,所述用户信息包括人物属性信息、与所述待识别的语音信息对应的面部表情信息、与所述待识别的语音信息对应的环境信息中的一种或多种;基于所述用户信息,生成与所述用户信息匹配的控制指令。
在一种可能的实现方式中,处理器401,具体用于:基于人脸识别算法,识别所述图像数据中的N个人脸区域,提取所述N个人脸区域中每个人脸区域中的唇部运动视频;基于所述每个人脸区域中的唇部运动视频确定所述N个用户的唇部运动信息。
在一种可能的实现方式中,处理器401,具体用于:基于频谱识别算法,识别所述音频数据中的不同频谱的音频数据,并将目标频谱的音频数据识别为所述待识别的语音信息。
需要说明的是,本发明实施例中所描述的智能设备40中相关模块的功能可参见上述图9-图12中所述的相关方法实施例,此处不再赘述。
请参见图14,图14是本发明实施例提供的一种神经网络的训练装置的结构示意图,也是该训练装置的功能原理示意图。该神经网络的训练装置所训练的模型可以用于车载设备、车载系统、智能车辆、云端服务器等。该神经网络的训练装置60中可包括获取单元601和训练单元602;其中,
获取单元601,用于获取训练样本,所述训练样本包括训练用户的唇部运动信息以及M个指令信息;可选的,所述M个指令信息包括与所述训练用户的唇部运动信息所匹配的指令信息以及(M-1)个与所述训练用户的唇部运动信息不匹配的指令信息;
训练单元602,用于以所述训练用户的唇部运动信息以及所述M个指令信息为训练输入,以所述训练用户的唇部运动信息分别与所述M个指令信息之间的匹配度为M个标签,对初始化的神经网络进行训练,得到目标特征匹配模型。
在一种可能的实现方式中,所述训练用户的唇部运动信息包括所述训练用户的唇部运动图像序列,所述M个指令信息包括一个与所述训练用户的唇部运动图像序列匹配的语音波形序列以及(M-1)个与所述训练用户的唇部运动图像序列不匹配的语音波形序列。
在一种可能的实现方式中,训练单元602,具体用于:
将所述训练用户的唇部运动信息以及所述M个指令信息输入到所述初始化的神经网络中,计算得到所述M个指令信息分别与所述训练用户的唇部运动信息之间的匹配度;
将计算得到的所述M个指令信息分别与所述训练用户的唇部运动信息之间的匹配度与所述M个标签进行比较,对初始化的神经网络进行训练,得到目标特征匹配模型。
在一种可能的实现方式中,所述目标特征匹配模型包括第一模型、第二模型和第三模型;训练单元602,具体用于:
将所述M个指令信息输入到所述第一模型中,得到M个语音特征,所述M个语音特征中的每一个语音特征均为K维语音特征,K为大于0的整数;
将所述训练用户的唇部运动信息输入到所述第二模型中,得到所述训练用户的图像序列特征,所述训练用户的图像序列特征为K维图像序列特征;
将所述M个语音特征和所述训练用户的图像序列特征输入到第三模型中,计算得到所述M个语音特征分别与所述训练用户的图像序列特征之间的匹配度;
将计算得到的所述M个语音特征分别与所述训练用户的图像序列特征之间的匹配度与所述M个标签进行比较,对初始化的神经网络进行训练,得到目标特征匹配模型。
请参见图15,图15是本发明实施例提供的一种系统结构示意图,该系统包括智能设备70和服务装置80,该智能设备可以为智能车辆。该智能设备70中可包括处理器701,以及耦合于该处理器701的麦克风702、摄像头703;其中,
麦克风702,用于采集音频数据;
摄像头703,用于采集图像数据;
处理器701,用于获取音频数据以及图像数据;
从所述音频数据中提取待识别的语音信息,所述待识别的语音信息包括目标时间段内的语音波形序列;
从所述图像数据中提取N个用户的唇部运动信息,所述N个用户的唇部运动信息中的每一个用户的唇部运动信息包括对应的用户在所述目标时间段内唇部运动的图像序列,N为大于1的整数;
当应用于智能车辆或者车内语音交互系统时,处理器701用于:获取音频数据,当音频数据中包括目标指令时,获取车舱内图像数据;从所述车舱内图像数据中提取车舱内N个位置上的车内成员的唇部运动信息。此处可以是获取车内成员的唇部运动信息后发送给服务装置,也可以是发送采集的车内图像信息给服务装置,由服务装置来提取唇部运动信息。
或者处理器701,用于获取车舱内音频数据;当识别出所述音频数据中包括第二类型指令,获取第一图像数据,所述第一图像数据为包括位于车内第一位置的车内成员的图像数据;从所述第一图像数据中提取所述位于车内第一位置的车内成员的唇部运动信息。
需要说明的是,本发明实施例中所描述的智能设备70中相关模块的功能可参见上述图9-图12中所述的相关方法实施例,此处不再赘述。
图15提供一种服务装置的结构示意图,该服务装置可以为服务器、云端服务器等。该服务装置80中可包括处理器;可选的,该处理器可由神经网络处理器和与该神经网络处理器耦合的处理器802组成,或者直接由处理器组成;其中,
对应于车内实现场景,神经网络处理器801,用于:
将所述第一类型指令对应的指令信息和所述车舱内N个位置上的车内成员的唇部运动信息输入到目标特征匹配模型中,得到所述N个位置上的车内成员的唇部运动信息分别与所述指令信息之间的匹配度;将匹配度最高的用户的唇部运动信息对应的成员所处的位置作为执行所述指令信息对应的指令的目标位置。
或者是用于:将所述第二类型指令对应的指令信息和所述位于车内第一位置的车内成员的唇部运动信息输入到目标特征匹配模型中,得到所述位于车内第一位置的车内成员的唇部运动信息与所述指令信息之间的第一匹配度;根据所述第一匹配度确定是否执行所述指令信息对应的指令。
在一种可能的实现方式中,所述目标特征匹配模型包括第一模型、第二模型和第三模型;处理器802,具体用于:将所述待识别的语音信息或者指令信息输入到所述第一模型中,得到语音特征,所述语音特征为K维语音特征,K为大于0的整数;将所述N个用户的唇部运动信息输入到所述第二模型中,得到N个图像序列特征,所述N个图像序列特征中的每一个图像序列特征均为K维图像序列特征;将所述语音特征和所述N个图像序列特征输入到第三模型中,得到所述N个图像序列特征分别与所述语音特征之间的匹配度。
在一种可能的实现方式中,所述目标特征匹配模型为以训练用户的唇部运动信息以及M个语音信息为输入、以所述训练用户的唇部运动信息分别与所述M个语音信息之间的匹配度为M个标签,训练得到的特征匹配模型。
在一种可能的实现方式中,所述服务器还包括处理器802;处理器802用于:确定所述目标用户的用户信息,所述用户信息包括人物属性信息、与所述待识别的语音信息对应的面部表情信息、与所述待识别的语音信息对应的环境信息中的一种或多种;基于所述用户信息,生成与所述用户信息匹配的控制指令。
在一种可能的实现方式中,所述服务器还包括处理器802;处理器802,还用于:基于人脸识别算法,识别图像数据中的N个人脸区域,提取所述N个人脸区域中每个人脸区域中的唇部运动视频;基于所述每个人脸区域中的唇部运动视频确定所述N个用户的唇部运动信息。
在一种可能的实现方式中,所述服务器还包括处理器802;处理器802,还用于:基于频谱识别算法,识别所述音频数据中的不同频谱的音频数据,并将目标频谱的音频数据识别为所述待识别的语音信息。
本发明实施例还提供一种计算机存储介质,其中,该计算机存储介质可存储有程序,该程序执行时包括上述方法实施例中记载的任意一种的部分或全部步骤。
本发明实施例还提供一种计算机程序,该计算机程序包括指令,当该计算机程序被计算机执行时,使得计算机可以执行上述方法实施例中记载的任意一种的部分或全部步骤。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其它实施例的相关描述。
需要说明的是,对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本申请并不受所描述的动作顺序的限制,因为依据本申请,某些步骤可能可以采用其它顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作和模块并不一定是本申请所必须的。
在本申请所提供的几个实施例中,应该理解到,所揭露的装置,可通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如上述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性或其它的形式。
上述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
上述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以为个人计算机、服务器或者网络设备等,具体可以是计算机设备中的处理器)执行本申请各个实施例上述方法的全部或部分步骤。其中,前述的存储介质可包括:U盘、移动硬盘、磁碟、光盘、只读存储器(Read-Only Memory,缩写:ROM)或者随机存取存储器(Random Access Memory,缩写:RAM)等各种可以存储程序代码的介质。
以上所述,以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围。

Claims (36)

  1. 一种语音指令控制方法,其特征在于,包括:
    获取第一类型指令和位于车舱内N个位置上的车内成员在目标时间段内的唇部运动信息,所述第一类型指令为根据车舱内采集的目标音频数据获取,所述车内成员的唇部运动信息为当从所述目标音频数据中识别出所述第一类型指令时获取,所述目标时间段为所述第一类型指令在所述音频数据中对应的时间段;
    将所述第一类型指令和所述车舱内N个位置上的车内成员的唇部运动信息进行匹配,根据所述N个位置上的车内成员的唇部运动信息与所述第一类型指令之间的匹配结果获取目标位置,所述目标位置为所述匹配结果指示唇部运动信息与所述第一类型指令匹配的车内成员所处的位置;
    发送指示针对目标位置执行所述第一类型指令的指示信息。
  2. 根据权利要求1所述的方法,其特征在于:
    所述获取第一类型指令和位于车舱内N个位置上的车内成员的唇部运动信息,具体为:
    获取车舱内所述目标音频数据;
    当识别出所述目标音频数据中包括第一类型指令,获取车舱内图像数据;
    从所述车舱内图像数据中提取车舱内N个位置上的车内成员的唇部运动信息。
  3. 根据权利要求2所述的方法,其特征在于:
    所述从所述车舱内图像数据中提取车舱内N个位置上的车内成员的唇部运动信息,具体为:
    当识别车内成员大于1时,从所述车舱内图像数据中提取车舱内N个位置上的车内成员的唇部运动信息。
  4. 根据权利要求1-3任一所述的方法,其特征在于:
    所述将所述第一类型指令和所述车舱内N个位置上的车内成员的唇部运动信息进行匹配,根据所述N个位置上的车内成员的唇部运动信息与所述第一类型指令之间的匹配结果获取目标位置,具体为:
    根据所述第一类型指令和所述位于车舱内N个位置上的车内成员的唇部运动信息,得到所述N个位置上的车内成员的唇部运动信息分别与所述第一类型指令之间的匹配度,N为大于1的整数;
    将匹配度最高的唇部运动信息对应的车内成员所处的位置作为目标位置。
  5. 根据权利要求1-4任一所述的方法,其特征在于,所述第一类型指令为从所述音频数据中提取的语音波形序列或者根据所述音频数据识别出的文本指令信息。
  6. 根据权利要求1-4任一所述的方法,其特征在于,所述车舱内N个位置上的车内成员的唇部运动信息为所述车舱内N个位置上的车内成员在所述目标时间段内的唇部运动的图像序列。
  7. 根据权利要求5所述的方法,其特征在于,所述方法还包括:
    生成所述N个位置上的车内成员的唇部运动信息与所述N个位置的对应关系;
    所述将匹配度最高的唇部运动信息对应的车内成员所处的位置作为目标位置,包括:
    获取匹配度最高的目标唇部运动信息;
    根据所述N个位置上的车内成员的唇部运动信息与所述N个位置的对应关系将所述目标唇部运动信息所对应的位置确定为目标位置。
  8. 根据权利要求5所述的方法,其特征在于,所述方法还包括:
    生成所述N个位置上的车内成员的唇部运动信息与所述N个位置上的车内成员的身份的对应关系;
    所述将匹配度最高的唇部运动信息对应的车内成员所处的位置作为目标位置,包括:
    获取匹配度最高的目标唇部运动信息;
    根据所述N个位置上的车内成员的唇部运动信息与所述N个位置上的车内成员的身份的对应关系确定目标车内成员;
    将所述目标车内成员的位置信息确定为目标位置,所述目标车内成员的位置信息为根据车内的传感器数据确定。
  9. 根据权利要求1-6任一所述的方法,其特征在于,
    所述车舱内音频数据为根据车舱内多个麦克风采集的数据获得,或者
    所述车舱内音频数据为根据车舱内指定位置区域的麦克风所采集的音频数据获得。
  10. 根据权利要求1-9任一所述的方法,其特征在于,所述第一类型指令为车舱内操控指令。
  11. 一种语音指令控制方法,其特征在于,包括:
    获取第二类型指令和位于车内第一位置的车内成员在目标时间段内的唇部运动信息,所述第二类型指令为根据车舱内采集的目标音频数据获取,所述第一位置的车内成员的唇部运动信息为当从所述目标音频数据中识别出所述第二类型指令时获取,所述目标时间段为所述第二类型指令在所述音频数据中对应的时间段;
    将所述第二类型指令和所述第一位置的车内成员的唇部运动信息进行匹配,得到匹配结果;
    当根据所述匹配结果确定所述第二类型指令和所述第一位置的车内成员的唇部运动信息为匹配,发送指示执行所述第二类型指令的指示信息。
  12. 根据权利要求11所述的方法,其特征在于,所述第一位置为驾驶位。
  13. 根据权利要求11-12所述的方法,其特征在于:
    所述获取第二类型指令和位于车内第一位置的车内成员的唇部运动信息,具体为:
    获取车舱内所述目标音频数据;
    当识别出所述目标音频数据中包括第二类型指令,获取车舱内图像数据;
    从所述车舱内图像数据中提取第一位置的车内成员的唇部运动信息。
  14. 根据权利要求11-12所述的方法,其特征在于:
    所述将所述第二类型指令和所述第一位置的车内成员的唇部运动信息进行匹配,得到匹配结果,具体为:
    根据所述第二类型指令和所述第一位置的车内成员的唇部运动信息的匹配度和预设阈值确定所述第二类型指令和所述第一位置的车内成员的唇部运动信息的匹配结果。
  15. 根据权利要求11-12,14任一所述的方法,其特征在于,所述第二类型指令为从所述音频数据中提取的语音波形序列或者根据所述音频数据识别出的文本指令信息。
  16. 根据权利要求11-14任一所述的方法,其特征在于,所述第一位置的车内成员的唇部运动信息为所述第一位置的车内成员在所述目标时间段内的唇部运动的图像序列。
  17. 根据权利要求11-16任一所述的方法,其特征在于,所述方法还包括:
    当所述音频数据中包括所述第二类型指令,获取车内其他N个位置车内成员的图像数据;
    从所述车内其他N个位置车内成员的图像数据的所述目标时间段内提取所述车内其他N个位置车内成员的唇部运动信息;
    所述将所述第二类型指令和所述位于车内第一位置的车内成员的唇部运动信息进行匹配,得到匹配结果,具体为
    将所述第二类型指令与所述位于车内第一位置的车内成员的唇部运动信息和所述N个位置车内成员的唇部运动信息进行匹配,得到N+1个车内成员的唇部运动信息分别与所述第二类型指令之间的匹配度,获取匹配度最高的唇部运动信息;
    所述当根据所述匹配结果确定所述第二类型指令和所述位于车内第一位置的车内成员的唇部运动信息为匹配,发送指示执行所述第二类型指令的指示信息,具体为:
    当所述匹配度最高的唇部运动信息为所述位于车内第一位置的车内成员的唇部运动信息,则发送指示执行所述第二类型指令的指示信息。
  18. 根据权利要求11-17任一所述的方法,其特征在于,所述车舱内音频数据为根据车舱内多个麦克风采集的数据获得,或者
    所述车舱内音频数据为根据车舱内指定位置区域的麦克风所采集的音频数据获得。
  19. 根据权利要求11-18任一所述的方法,其特征在于,所述第二类型指令为车辆行驶操控指令。
  20. 一种语音指令控制设备,其特征在于,包括处理器;所述处理器用于:
    获取第一类型指令和位于车舱内N个位置上的车内成员在目标时间段内的唇部运动信息,所述第一类型指令为根据车舱内采集的目标音频数据获取,所述车内成员的唇部运动信息为当从所述目标音频数据中识别出所述第一类型指令时获取,所述目标时间段为所述第一类型指令在所述音频数据中对应的时间段;
    将所述第一类型指令和所述车舱内N个位置上的车内成员的唇部运动信息进行匹配,根据所述N个位置上的车内成员的唇部运动信息与所述第一类型指令之间的匹配结果获取目标位置,所述目标位置为所述匹配结果指示唇部运动信息与所述第一类型指令匹配的车内成员所处的位置;
    发送指示针对目标位置执行所述第一类型指令的指示信息。
  21. 根据权利要求20所述的设备,其特征在于,
    所述处理器用于,获取车舱内所述目标音频数据;当识别出所述目标音频数据中包括第一类型指令,获取车舱内图像数据;从所述车舱内图像数据中提取车舱内N个位置上的车内成员的唇部运动信息。
  22. 根据权利要求20所述的设备,其特征在于:
    所述处理器还用于,当识别车内成员大于1时,从所述车舱内图像数据中提取车舱内N个位置上的车内成员的唇部运动信息。
  23. 根据权利要求20-22任一所述的设备,其特征在于:
    所述处理器将所述第一类型指令和所述车舱内N个位置上的车内成员的唇部运动信息进行匹配,根据所述N个位置上的车内成员的唇部运动信息与所述第一类型指令之间的匹配结果获取目标位置,具体为:
    所述处理器根据所述第一类型指令和所述位于车舱内N个位置上的车内成员的唇部运动信息,得到所述N个位置上的车内成员的唇部运动信息分别与所述第一类型指令之间的匹配度,N为大于1的整数;将匹配度最高的唇部运动信息对应的车内成员所处的位置作为目标位置。
  24. 根据权利要求20-23所述的设备,其特征在于,
    所述处理器还用于生成所述N个位置上的车内成员的唇部运动信息与所述N个位置的对应关系;
    所述处理器将匹配度最高的唇部运动信息对应的车内成员所处的位置作为目标位置,包括:
    所述处理器获取匹配度最高的目标唇部运动信息;根据所述N个位置上的车内成员的唇部运动信息与所述N个位置的对应关系将所述目标唇部运动信息所对应的位置确定为目标位置。
  25. 根据权利要求20-24所述的设备,其特征在于,
    所述处理器生成所述N个位置上的车内成员的唇部运动信息与所述N个位置上的车内成员的身份的对应关系;
    所述处理器将匹配度最高的唇部运动信息对应的车内成员所处的位置作为目标位置,包括:
    所述处理器获取匹配度最高的目标唇部运动信息;根据所述N个位置上的车内成员的唇部运动信息与所述N个位置上的车内成员的身份的对应关系确定目标车内成员;将所述目标车内成员的位置信息确定为目标位置,所述目标车内成员的位置信息为根据车内的传感器数据确定。
  26. 一种语音指令控制设备,其特征在于,包括处理器;所述处理器用于:
    获取第二类型指令和位于车内第一位置的车内成员在目标时间段内的唇部运动信息,所述第二类型指令为根据车舱内采集的目标音频数据获取,所述第一位置的车内成员的唇部运动信息为当从所述目标音频数据中识别出所述第二类型指令时获取,所述目标时间段为所述第二类型指令在所述音频数据中对应的时间段;
    将所述第二类型指令和所述第一位置的车内成员的唇部运动信息进行匹配,得到匹配结果;
    当根据所述匹配结果确定所述第二类型指令和所述第一位置的车内成员的唇部运动信息为匹配,发送指示执行所述第二类型指令的指示信息。
  27. 根据权利要求26所述的设备,其特征在于,所述第一位置为驾驶位。
  28. 根据权利要求26-27所述的设备,其特征在于:
    所述处理器获取第二类型指令和位于车内第一位置的车内成员的唇部运动信息,具体为:
    所述处理器获取车舱内所述目标音频数据;当识别出所述目标音频数据中包括第二类型指令,获取车舱内图像数据;从所述车舱内图像数据中提取第一位置的车内成员的唇部运动信息。
  29. 根据权利要求26-28任一所述的设备,其特征在于:
    所述处理器将所述第二类型指令和所述第一位置的车内成员的唇部运动信息进行匹配,得到匹配结果,具体为:
    所述处理器根据所述第二类型指令和所述第一位置的车内成员的唇部运动信息的匹配度和预设阈值确定所述第二类型指令和所述第一位置的车内成员的唇部运动信息的匹配结果。
  30. 根据权利要求26-29任一所述的设备,其特征在于,所述第二类型指令为从所述音频数据中提取的语音波形序列或者根据所述音频数据识别出的文本指令信息。
  31. 根据权利要求26-29任一所述的设备,其特征在于,所述第一位置的车内成员的唇部运动信息为所述第一位置的车内成员在所述目标时间段内的唇部运动的图像序列。
  32. 根据权利要求26-31任一所述的设备,其特征在于,
    所述处理器还用于,当所述音频数据中包括所述第二类型指令,获取车内其他N个位置车内成员的图像数据;从所述车内其他N个位置车内成员的图像数据的所述目标时间段内提取所述车内其他N个位置车内成员的唇部运动信息;
    所述处理器将所述第二类型指令和所述位于车内第一位置的车内成员的唇部运动信息进行匹配,得到匹配结果,具体为
    所述处理器将所述第二类型指令与所述位于车内第一位置的车内成员的唇部运动信息和所述N个位置车内成员的唇部运动信息进行匹配,得到N+1个车内成员的唇部运动信息分别与所述第二类型指令之间的匹配度,获取匹配度最高的唇部运动信息;
    所述处理器当根据所述匹配结果确定所述第二类型指令和所述位于车内第一位置的车内成员的唇部运动信息为匹配,发送指示执行所述第二类型指令的指示信息,具体为:
    所述处理器当所述匹配度最高的唇部运动信息为所述位于车内第一位置的车内成员的唇部运动信息,则发送指示执行所述第二类型指令的指示信息。
  33. 根据权利要求26-32任一所述的设备,其特征在于,所述第二类型指令为车辆行驶操控指令。
  34. 一种芯片系统,其特征在于,所述芯片系统包括至少一个处理器,存储器和接口电路,所述存储器、所述接口电路和所述至少一个处理器通过线路互联,所述至少一个存储器中存储有指令;所述指令被所述处理器执行时,权利要求1-19中任意一项所述的方法得以实现。
  35. 一种计算机可读存储介质,其特征在于,所述计算机可读介质用于存储程序代码,所述程序代码包括用于执行如权利要求1-19任一项所述的方法。
  36. 一种计算机程序,其特征在于,所述计算机程序包括指令,当所述计算机程序被执行时,使得如权利要求1-19中的任意一项所述的方法得以实现。
PCT/CN2021/091138 2020-07-03 2021-04-29 一种车舱内语音指令控制方法及相关设备 WO2022001347A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP21833146.0A EP4163913A4 (en) 2020-07-03 2021-04-29 VOICE COMMAND CONTROL METHOD IN A VEHICLE AND ASSOCIATED DEVICE
KR1020237002403A KR20230027252A (ko) 2020-07-03 2021-04-29 차량 캐빈에서의 음성 명령 제어 방법 및 관련 디바이스
US18/146,662 US20230129816A1 (en) 2020-07-03 2022-12-27 Speech instruction control method in vehicle cabin and related device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010631879.5 2020-07-03
CN202010631879.5A CN113963692A (zh) 2020-07-03 2020-07-03 一种车舱内语音指令控制方法及相关设备

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/146,662 Continuation US20230129816A1 (en) 2020-07-03 2022-12-27 Speech instruction control method in vehicle cabin and related device

Publications (1)

Publication Number Publication Date
WO2022001347A1 true WO2022001347A1 (zh) 2022-01-06

Family

ID=79317362

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/091138 WO2022001347A1 (zh) 2020-07-03 2021-04-29 一种车舱内语音指令控制方法及相关设备

Country Status (5)

Country Link
US (1) US20230129816A1 (zh)
EP (1) EP4163913A4 (zh)
KR (1) KR20230027252A (zh)
CN (1) CN113963692A (zh)
WO (1) WO2022001347A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115520201A (zh) * 2022-10-26 2022-12-27 深圳曦华科技有限公司 车辆主驾驶位功能动态响应方法及相关装置

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114758654B (zh) * 2022-03-14 2024-04-12 重庆长安汽车股份有限公司 一种基于场景的汽车语音控制系统及控制方法
CN114940124A (zh) * 2022-05-17 2022-08-26 上海安亭地平线智能交通技术有限公司 感知系统的分时复用方法、装置、电子设备和存储介质
CN117221983A (zh) * 2022-06-02 2023-12-12 中兴通讯股份有限公司 座舱单元控制方法、系统及计算机存储介质
CN117636908B (zh) * 2024-01-26 2024-03-26 长春黄金设计院有限公司 数字化矿山生产管控系统
CN118042355B (zh) * 2024-04-11 2024-07-26 江西天创智能科技有限公司 一种舞台智能声控音响的自动化控制系统及方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104011735A (zh) * 2011-12-26 2014-08-27 英特尔公司 基于车辆的对乘员音频和可视输入的确定
CN108986806A (zh) * 2018-06-30 2018-12-11 上海爱优威软件开发有限公司 基于声源方向的语音控制方法及系统
US20190251970A1 (en) * 2018-02-15 2019-08-15 DMAI, Inc. System and method for disambiguating a source of sound based on detected lip movement
CN111091824A (zh) * 2019-11-30 2020-05-01 华为技术有限公司 一种语音匹配方法及相关设备
CN111223479A (zh) * 2019-10-11 2020-06-02 华为技术有限公司 一种操作权限控制方法及相关设备

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9747900B2 (en) * 2013-05-24 2017-08-29 Google Technology Holdings LLC Method and apparatus for using image data to aid voice recognition
JP2017090613A (ja) * 2015-11-09 2017-05-25 三菱自動車工業株式会社 音声認識制御システム
JP2017090612A (ja) * 2015-11-09 2017-05-25 三菱自動車工業株式会社 音声認識制御システム


Also Published As

Publication number Publication date
EP4163913A4 (en) 2023-11-01
EP4163913A1 (en) 2023-04-12
KR20230027252A (ko) 2023-02-27
US20230129816A1 (en) 2023-04-27
CN113963692A (zh) 2022-01-21

Similar Documents

Publication Publication Date Title
WO2022001347A1 (zh) 一种车舱内语音指令控制方法及相关设备
CN111091824B (zh) 一种语音匹配方法及相关设备
US20210035586A1 (en) System and method of correlating mouth images to input commands
US11854550B2 (en) Determining input for speech processing engine
CN109941231B (zh) 车载终端设备、车载交互系统和交互方法
US11164586B2 (en) Artificial intelligence apparatus and method for recognizing utterance voice of user
US11495214B2 (en) Artificial intelligence device for providing voice recognition service and method of operating the same
WO2021203880A1 (zh) 一种语音增强方法、训练神经网络的方法以及相关设备
US20190392851A1 (en) Artificial intelligence-based apparatus and method for controlling home theater speech
US20210173614A1 (en) Artificial intelligence device and method for operating the same
US11355101B2 (en) Artificial intelligence apparatus for training acoustic model
JP2022028772A (ja) オーディオデータおよび画像データに基づいて人の発声を解析する車載装置および発声処理方法、ならびにプログラム
Miao et al. Open-Domain Audio-Visual Speech Recognition: A Deep Learning Approach.
Biswas et al. Multiple cameras audio visual speech recognition using active appearance model visual features in car environment
US20230077245A1 (en) Artificial intelligence apparatus
CN108665907A (zh) 声音识别装置、声音识别方法、记录介质以及机器人
KR20190118539A (ko) 발화 스타일을 고려하여 음성을 인식하는 인공 지능 장치 및 그 방법
CN113611318A (zh) 一种音频数据增强方法及相关设备
KR20210048271A (ko) 복수 객체에 대한 자동 오디오 포커싱 방법 및 장치
US11211079B2 (en) Artificial intelligence device with a voice recognition
CN117083581A (zh) 人机交互方法、装置以及终端设备
CN115050375A (zh) 一种设备的语音操作方法、装置和电子设备
CN112860210A (zh) 人工智能设备及其操作方法
KR20210097346A (ko) 끝점 검출을 위한 인공 지능 장치
CN116610212A (zh) 一种多模态娱乐交互方法、装置、设备及介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21833146

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021833146

Country of ref document: EP

Effective date: 20230103

NENP Non-entry into the national phase

Ref country code: DE