WO2023116087A1 - Voice interaction instruction processing method, device and computer-readable storage medium - Google Patents

Voice interaction instruction processing method, device and computer-readable storage medium

Info

Publication number
WO2023116087A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice interaction
vehicle
responded
voice
interaction instruction
Prior art date
Application number
PCT/CN2022/119828
Other languages
English (en)
French (fr)
Inventor
朱长宝 (Zhu Changbao)
牛建伟 (Niu Jianwei)
余凯 (Yu Kai)
Original Assignee
北京地平线机器人技术研发有限公司 (Beijing Horizon Robotics Technology R&D Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京地平线机器人技术研发有限公司 (Beijing Horizon Robotics Technology R&D Co., Ltd.)
Publication of WO2023116087A1 publication Critical patent/WO2023116087A1/zh

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 — Speech recognition
    • G10L 15/02 — Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/06 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 — Training
    • G10L 17/00 — Speaker identification or verification techniques
    • G10L 17/22 — Interactive procedures; Man-machine interfaces
    • G10L 25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/48 — Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51 — Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L 2015/0631 — Creating reference templates; Clustering
    • G10L 2015/223 — Execution procedure of a spoken command

Definitions

  • the present disclosure relates to the technical field of driving, and in particular to a voice interaction instruction processing method, device and computer-readable storage medium.
  • the personnel inside the vehicle can control the vehicle by voice through voice interaction.
  • However, the interior of the vehicle cannot completely isolate sound from outside the vehicle, which allows people outside the vehicle to voice-control it as well. How to prevent voice control of the vehicle by people outside it is an urgent problem to be solved by those skilled in the art.
  • Embodiments of the present disclosure provide a voice interaction instruction processing method, device, computer-readable storage medium, and electronic equipment.
  • a method for processing voice interaction instructions, including: acquiring the voice interaction instruction to be responded to; determining the generation location of the voice interaction instruction to be responded to; and determining, based on the generation location, whether to respond to the voice interaction instruction to be responded to.
  • a voice interaction instruction processing device including:
  • the first acquisition module is configured to acquire the voice interaction instruction to be responded to;
  • the first determination module is configured to determine the generation location of the voice interaction instruction to be responded to that is acquired by the first acquisition module;
  • the second determination module is configured to determine, based on the generation location determined by the first determination module, whether to respond to the voice interaction instruction to be responded to that is acquired by the first acquisition module.
  • a computer-readable storage medium stores a computer program, and the computer program is used to execute the above-mentioned voice interaction instruction processing method.
  • an electronic device, including: a processor and a memory for storing instructions executable by the processor;
  • the processor is configured to read the executable instructions from the memory and execute them to implement the above-mentioned voice interaction instruction processing method.
  • With the voice interaction instruction processing method, device, computer-readable storage medium and electronic device provided by the above embodiments of the present disclosure, whether to respond to the voice interaction instruction to be responded to can be determined based on where it was generated. Voice interaction instructions generated inside the vehicle can thus be distinguished from those generated outside, so that instructions from people inside the vehicle are distinguished from those from people outside, and instructions from people outside the vehicle are not responded to, preventing voice control of the vehicle by persons outside it.
  • Fig. 1 is a schematic flowchart of a method for processing voice interaction instructions provided by an exemplary embodiment of the present disclosure.
  • Fig. 2 is a schematic flowchart of a method for processing voice interaction instructions provided by another exemplary embodiment of the present disclosure.
  • Fig. 3 is a schematic flowchart of a method for processing voice interaction instructions provided by yet another exemplary embodiment of the present disclosure.
  • Fig. 4 is a schematic flowchart of a method for processing a voice interaction instruction provided by another exemplary embodiment of the present disclosure.
  • Fig. 5 is a schematic flowchart of a method for processing a voice interaction instruction provided by another exemplary embodiment of the present disclosure.
  • Fig. 6 is a schematic flowchart of a method for processing a voice interaction instruction provided by another exemplary embodiment of the present disclosure.
  • Fig. 7 is a schematic structural diagram of an apparatus for processing voice interaction instructions provided by an exemplary embodiment of the present disclosure.
  • Fig. 8 is a schematic structural diagram of an apparatus for processing voice interaction instructions provided by another exemplary embodiment of the present disclosure.
  • Fig. 9 is a schematic structural diagram of an apparatus for processing voice interaction instructions provided by yet another exemplary embodiment of the present disclosure.
  • Fig. 10 is a schematic structural diagram of an apparatus for processing voice interaction instructions provided by another exemplary embodiment of the present disclosure.
  • Fig. 11 is a schematic structural diagram of an apparatus for processing voice interaction instructions provided by another exemplary embodiment of the present disclosure.
  • Fig. 12 is a schematic structural diagram of an apparatus for processing voice interaction instructions provided by yet another exemplary embodiment of the present disclosure.
  • Fig. 13 is a schematic structural diagram of an apparatus for processing voice interaction instructions provided by yet another exemplary embodiment of the present disclosure.
  • Fig. 14 is a structural diagram of an electronic device provided by an exemplary embodiment of the present disclosure.
  • "Plural" may refer to two or more, and "at least one" may refer to one, two or more.
  • Embodiments of the present disclosure may be applied to electronic devices such as terminal devices, computer systems, servers, etc., which may operate with numerous other general purpose or special purpose computing system environments or configurations.
  • Examples of well known terminal devices, computing systems, environments and/or configurations suitable for use with electronic devices such as terminal devices, computer systems, servers include, but are not limited to: personal computer systems, server computer systems, thin clients, thick client computers, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, minicomputer systems, mainframe computer systems, and distributed cloud computing technology environments including any of the foregoing, among others.
  • the personnel inside the vehicle can control the vehicle by voice through voice interaction.
  • the window can be opened by saying "open the window".
  • Fig. 1 is a schematic flowchart of a method for processing voice interaction instructions provided by an exemplary embodiment of the present disclosure.
  • the method shown in FIG. 1 includes step 110, step 120 and step 130, each step will be described below.
  • Step 110 acquire the voice interaction instruction to be responded to.
  • a voice collection device such as a microphone can be called to collect voice signals, and by detecting the collected voice signals, it can be determined whether the collected voice signals include voice signals carrying voice interaction instructions. If yes, the voice interaction instruction can be obtained based on the collected voice signal, and the voice interaction instruction can be used as the voice interaction instruction to be responded.
  • the voice collection device can be installed inside or outside the vehicle.
  • Step 120 determine the generation location of the voice interaction instruction to be responded to.
  • Step 130 determine, based on the generation location of the voice interaction instruction to be responded to, whether to respond to it.
  • If the voice interaction instruction to be responded to is generated inside the vehicle, it comes from a person inside the vehicle; it can then be determined to respond to it, meeting the voice control requirements of the personnel inside the vehicle.
  • If the voice interaction instruction to be responded to is generated outside the vehicle, it comes from a person outside the vehicle; it can then be determined not to respond to it, preventing voice control of the vehicle by personnel outside it.
  • With this voice interaction instruction processing method, whether to respond to the voice interaction instruction to be responded to can be determined based on where it was generated, so instructions generated inside the vehicle can be distinguished from those generated outside. Instructions from people inside the vehicle are thereby distinguished from those from people outside, and instructions from people outside the vehicle are not responded to, preventing voice control of the vehicle by persons outside it.
  • In some embodiments, the voice interaction instruction to be responded to includes a first voice interaction instruction obtained by detecting a voice signal collected inside the vehicle and a second voice interaction instruction obtained by detecting a voice signal collected outside the vehicle, where the interval between the acquisition moments of the first and second voice interaction instructions is less than a preset duration.
  • the second voice interaction instruction may be acquired earlier than the first voice interaction instruction, or the second voice interaction instruction may be acquired later than the first voice interaction instruction.
  • the preset duration may be 0.2 seconds, 0.3 seconds, 0.5 seconds or other shorter durations, which will not be listed here.
  • A first voice collection device may be provided inside the vehicle and a second voice collection device outside the vehicle; the first voice interaction instruction can be obtained by detecting the voice signal collected by the first voice collection device, and the second voice interaction instruction by detecting the voice signal collected by the second voice collection device.
  • voice enhancement may be performed on the voice signal collected by the first voice collection device first, and then the first voice interaction instruction may be obtained by detecting the enhanced voice signal.
  • Similarly, voice enhancement may first be performed on the voice signal collected by the second voice collection device, and the second voice interaction instruction then obtained by detecting the enhanced voice signal.
  • Since the acquisition interval is short, the second voice interaction instruction and the first voice interaction instruction are likely to be the same voice interaction instruction acquired by different voice collection devices. Of course, in some cases (for example, when the person inside the vehicle says "open the window" while a person outside the vehicle says "play music"), the second and first voice interaction instructions may also be different instructions acquired by different voice collection devices.
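The pairing of inside- and outside-vehicle detections by acquisition moment could be sketched as follows. This is a minimal illustration; the `DetectedInstruction` fields and the 0.3-second preset duration are assumptions, not taken from the patent:

```python
from dataclasses import dataclass

# Hypothetical representation of a detected instruction; the field names
# are illustrative only.
@dataclass
class DetectedInstruction:
    text: str         # recognized command text
    timestamp: float  # acquisition moment, in seconds
    source: str       # "inside" or "outside" microphone

def pair_instructions(inside, outside, preset_duration=0.3):
    """Pair each inside-vehicle instruction with any outside-vehicle one
    whose acquisition moment differs by less than the preset duration."""
    pairs = []
    for first in inside:
        for second in outside:
            if abs(first.timestamp - second.timestamp) < preset_duration:
                pairs.append((first, second))
    return pairs

inside = [DetectedInstruction("open the window", 10.00, "inside")]
outside = [DetectedInstruction("open the window", 10.12, "outside"),
           DetectedInstruction("play music", 14.50, "outside")]
pairs = pair_instructions(inside, outside)  # only the 0.12 s pair qualifies
```

A matched pair is then passed to the feature comparison of step 120; unmatched detections are handled on their own.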
  • step 120 includes step 1201, step 1202 and step 1203.
  • Step 1201 determine first voice feature information of a first voice interaction instruction.
  • The first voice feature information of the first voice interaction instruction includes, but is not limited to, voice signal energy, voice signal confidence, voiceprint information, etc. The voice signal energy can be characterized by sound pressure level, the voice signal confidence can be determined using a conventional acoustic model, and the voiceprint information can be determined using voiceprint recognition technology.
  • Step 1202 determine second voice feature information of the second voice interaction instruction.
  • Step 1203 based on the first voice feature information and the second voice feature information, determine whether the location where the voice interaction command to be responded is generated is inside the vehicle or outside the vehicle.
  • In some embodiments, step 1203 includes: based on the magnitude relationship between the energy of the first voice signal and the energy of the second voice signal, determining whether the generation location of the voice interaction instruction to be responded to is inside or outside the vehicle.
  • The voice signal energy can be extracted from the first voice feature information as the first voice signal energy, and from the second voice feature information as the second voice signal energy. The first and second voice signal energies are then compared in magnitude, and the generation location of the voice interaction instruction to be responded to is determined based on the resulting magnitude relationship.
  • If the magnitude relationship indicates that the sound pressure level of the first voice interaction instruction is lower than that of the second voice interaction instruction, the generation location of the voice interaction instruction to be responded to is determined to be outside the vehicle; if it indicates that the sound pressure level of the first voice interaction instruction is higher than that of the second, the generation location is determined to be inside the vehicle.
  • When the same voice is collected by the collection devices inside and outside the vehicle, the resulting voice signal energies differ; the voice signal energies of different voice interaction instructions also often differ. These characteristics can be used in this embodiment: by referring to the magnitude relationship between the voice signal energies of the first and second voice interaction instructions, it can be determined efficiently and reliably whether the voice interaction instruction to be responded to was generated inside or outside the vehicle.
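A minimal sketch of this energy comparison, assuming equally calibrated microphones and using RMS energy as a stand-in for sound pressure level:

```python
import math

def signal_energy(samples):
    """Root-mean-square energy of a voice signal frame."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def generation_location_by_energy(inside_samples, outside_samples):
    """Compare the energies of the same utterance picked up by the inside
    and outside microphones (assumed to have equal sensitivity)."""
    e_inside = signal_energy(inside_samples)
    e_outside = signal_energy(outside_samples)
    return "inside" if e_inside > e_outside else "outside"

# A voice produced inside the cabin is attenuated before reaching the
# outside microphone, so the inside channel carries more energy.
inside = [0.8, -0.7, 0.9, -0.8]   # illustrative sample values
outside = [0.2, -0.2, 0.3, -0.1]
loc = generation_location_by_energy(inside, outside)
```

The equal-sensitivity assumption is exactly what the calibration step described below (ADC gain matching) is meant to guarantee.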
  • In some embodiments, step 1203 includes: based on the difference between the second voice signal confidence and the first voice signal confidence, determining whether the generation location of the voice interaction instruction to be responded to is inside or outside the vehicle.
  • voice signal confidence level of any voice interaction instruction can be used to characterize the credibility of the voice interaction instruction itself.
  • The voice signal confidence can be extracted from the first voice feature information as the first voice signal confidence, and from the second voice feature information as the second voice signal confidence. The difference between the second voice signal confidence and the first voice signal confidence can then be calculated, and the generation location of the voice interaction instruction to be responded to determined based on the calculated difference.
  • The difference can be compared with a first preset threshold (such as 0.2 or 0.3); when the comparison indicates that the difference is greater than the first preset threshold, the generation location of the voice interaction instruction to be responded to is determined to be outside the vehicle, and otherwise inside the vehicle.
  • When the same voice is collected inside and outside the vehicle, the voice signal confidences of the results differ; in addition, the voice signal confidences of different voice interaction instructions often differ. These characteristics can be used in this embodiment: by referring to the difference between the voice signal confidences of the second and first voice interaction instructions, it can be determined efficiently and reliably whether the voice interaction instruction to be responded to was generated inside or outside the vehicle.
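The confidence-difference check might look like the sketch below. The decision direction (a large positive difference meaning the instruction originated outside the vehicle, since the outside microphone heard it more clearly) and the example threshold are completions of the surrounding text, not statements from the patent:

```python
FIRST_PRESET_THRESHOLD = 0.2  # example value mentioned in the text (0.2 or 0.3)

def generation_location_by_confidence(first_conf, second_conf,
                                      threshold=FIRST_PRESET_THRESHOLD):
    """first_conf: acoustic-model confidence from the inside-vehicle signal.
    second_conf: acoustic-model confidence from the outside-vehicle signal.
    If the outside confidence exceeds the inside confidence by more than the
    threshold, the instruction is taken to originate outside the vehicle."""
    diff = second_conf - first_conf
    return "outside" if diff > threshold else "inside"

# The outside microphone recognized the command much more confidently,
# suggesting the speaker stood outside the vehicle.
loc = generation_location_by_confidence(first_conf=0.55, second_conf=0.92)
```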
  • In this way, the first and second voice interaction instructions, whose acquisition moments are close together, can be obtained, and by comparing the voice feature information of the first voice interaction instruction with that of the second, it can be determined efficiently and reliably whether the voice interaction instruction to be responded to was generated inside or outside the vehicle.
  • In some embodiments, the first and second voice collection devices can be calibrated in advance to the same sensitivity, so that when they collect voice signals of the same sound pressure level, the vehicle's voice interaction system obtains sounds of the same energy after analog-to-digital conversion (ADC). This eliminates differences in sound pickup inside and outside the vehicle caused by different gains of the two devices, ensuring the accuracy and reliability of the result when the generation location of the voice interaction instruction to be responded to is determined from the magnitude relationship between the first and second voice signal energies.
  • step 120 includes step 1204, step 1205 and step 1206.
  • Step 1204 determine the sound source direction of the voice interaction instruction to be responded to.
  • Sound source localization technology may be used to determine the sound source direction of the voice interaction instruction to be responded to.
  • the sound source localization technology may be a microphone array technology.
  • Step 1205 determine the seat position of each occupant inside the vehicle.
  • Each seat of the vehicle may be provided with a seat sensor; in step 1205, it can be determined based on the detection signals of the seat sensors which seats in the vehicle have occupants and which do not, so as to determine the seat position of each occupant inside the vehicle accordingly.
  • the specific implementation of step 1205 is not limited thereto.
  • An image sensor located inside the vehicle can be called to collect an image (such as the first image hereinafter), and the face frames in the first image detected by an existing algorithm. Based on the detection results, it can be determined which seats in the vehicle have occupants and which do not, so as to determine the seat positions of the occupants in the vehicle.
  • Step 1206 based on the first degree of matching between the direction of the sound source and the seat position of each occupant, determine whether the generation location of the voice interaction command to be responded is inside or outside the vehicle.
  • a first degree of matching between the direction of the sound source to be responded to the voice interaction instruction and the seat positions of the occupants inside the vehicle may be determined.
  • The sound source direction can be characterized by a direction vector starting at the origin of a coordinate system (for convenience of description, referred to below as the first direction vector).
  • For the seat position of an occupant inside the vehicle, the point corresponding to that position in the coordinate system can be determined, together with the direction vector from the origin to that point (referred to below as the second direction vector). The included angle between the second direction vector and the first direction vector can then be calculated and mapped to the interval from 0 to 1; the resulting mapping value can be used as the first matching degree between the sound source direction and the seat position.
  • After the first matching degree between the sound source direction of the voice interaction instruction to be responded to and the seat position of each occupant inside the vehicle is determined, if any first matching degree is greater than a second preset threshold (for example, 0.7, 0.8 or 0.9), the voice interaction instruction to be responded to can be considered to come from an occupant inside the vehicle, and its generation location determined to be inside the vehicle; otherwise, it can be considered not to come from any occupant inside the vehicle, and its generation location determined to be outside the vehicle.
  • In this way, the generation location of the voice interaction instruction to be responded to can be determined efficiently and reliably by judging the match between its sound source direction and the seat position of each occupant inside the vehicle.
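The angle-based first matching degree could be sketched as below. The linear mapping of the included angle from [0, π] onto [1, 0] is one possible choice of the 0-to-1 mapping the text describes, and the 2-D seat coordinates are hypothetical:

```python
import math

def first_matching_degree(source_dir, seat_pos):
    """Included angle between the sound-source direction vector and the
    vector from the origin to the occupant's seat, mapped linearly so that
    1.0 means the same direction and 0.0 means opposite directions."""
    dot = sum(a * b for a, b in zip(source_dir, seat_pos))
    norm = math.hypot(*source_dir) * math.hypot(*seat_pos)
    angle = math.acos(max(-1.0, min(1.0, dot / norm)))  # clamp rounding noise
    return 1.0 - angle / math.pi

def generated_inside(source_dir, occupied_seats, second_threshold=0.8):
    """Inside the vehicle if the source direction matches any occupied seat
    better than the second preset threshold (0.7-0.9 in the text)."""
    return any(first_matching_degree(source_dir, seat) > second_threshold
               for seat in occupied_seats)

seats = [(1.0, 0.5), (1.0, -0.5)]  # hypothetical occupied-seat coordinates
came_from_inside = generated_inside((2.0, 1.0), seats)  # points at a seat
```

A direction pointing away from every occupied seat (for example toward the road shoulder) yields matching degrees well below the threshold, so the instruction is attributed to outside the vehicle.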
  • step 120 includes:
  • Step 1207 input the voice interaction instruction to be responded to into the pre-trained voice recognition model, and obtain the generation location classification information output by the voice recognition model.
  • the speech recognition model used to recognize the generation position of the input speech can be obtained through model training in advance.
  • Multiple sets of sample data can be obtained, each set including voice data (which can be collected by a voice collection device installed in the specific vehicle) and label data; the label data characterizes whether the voice data was actually generated inside or outside the vehicle. If the voice data was actually generated inside the vehicle, the label data can be 1; if it was actually generated outside the vehicle, the label data can be 0.
  • The voice data in the multiple sets of sample data can be used as input data and the label data as output data for training, so as to obtain the speech recognition model.
  • Since the voice interaction instruction to be responded to is in the form of voice data, it can be input into the pre-trained speech recognition model, which performs computation on it and outputs the generation location classification information of the voice interaction instruction to be responded to.
  • The generation location classification information of the voice interaction instruction to be responded to may be used to characterize the confidence that the generation location is inside the vehicle. When this confidence is greater than a preset threshold, the generation location can be considered to be inside the vehicle; otherwise, it can be considered to be outside the vehicle.
  • In this way, based on the speech recognition model, it can be determined efficiently and reliably whether the generation location of the voice interaction instruction to be responded to is inside or outside the vehicle.
  • Alternatively, the voice data generated inside the vehicle can be used as negative samples and the voice data generated outside the vehicle as positive samples; the generation location classification information then characterizes the confidence that the generation location of the voice interaction instruction to be responded to is outside the vehicle.
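As an illustration of this training setup, the sketch below fits a toy logistic-regression classifier on synthetic one-dimensional features with inside/outside labels. A real speech recognition model would operate on acoustic features of the voice data; every number here is fabricated for the demonstration:

```python
import math
import random

random.seed(0)

# Synthetic stand-in for the sample data: one scalar feature per voice clip
# (a real system would use spectral features); label 1 = the clip was
# recorded inside the vehicle, label 0 = outside.
samples = [(random.gauss(1.0, 0.2), 1) for _ in range(50)] + \
          [(random.gauss(-1.0, 0.2), 0) for _ in range(50)]

# Minimal logistic-regression "recognition model" trained by gradient
# descent on the (voice data feature, label) pairs.
w, b = 0.0, 0.0
for _ in range(500):
    gw = gb = 0.0
    for x, y in samples:
        p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # predicted inside-confidence
        gw += (p - y) * x
        gb += p - y
    w -= 0.5 * gw / len(samples)
    b -= 0.5 * gb / len(samples)

def generation_location_confidence(x):
    """Classification info: confidence the clip was generated inside."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))
```

Swapping the label convention (outside = positive) simply flips what the output confidence characterizes, as the bullet above notes.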
  • the method further includes step 111 and step 112.
  • Step 111 acquire a first image captured by an image sensor located inside the vehicle (for ease of description, referred to below as the first image sensor).
  • The first image sensor may be a charge-coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) sensor.
  • Step 112 based on the first image, acquire attribute information of the first person.
  • the face frame in the first image can be detected by an existing algorithm, and the attributes of the faces in the face frame can be judged to obtain the first person attribute information.
  • the first person attribute information includes but is not limited to gender information, age information, Identity Document (ID) information, voiceprint information, etc.
  • the voiceprint information can be acquired in the following two ways:
  • Method 1: occupant information can be recorded in advance and registered to obtain voiceprint information, which is entered into the system and associated with ID information; the ID information is then obtained through face recognition, and the voiceprint information corresponding to that ID information is retrieved as the voiceprint information in the first person attribute information.
  • Method 2: the lip movement information of the face in the face frame can be obtained through an existing algorithm. If the lip movement indicates that the occupant corresponding to the face is speaking, the voice spoken by the occupant can be collected during the lip movement and the occupant's voiceprint information extracted from it; this voiceprint information can be used as the voiceprint information in the first person attribute information.
  • Step 120 includes step 1208 and step 1209.
  • Step 1208 based on the voice interaction instruction to be responded to, acquire attribute information of the second person.
  • voice recognition technology may be used to analyze the voice interaction command to be responded to obtain second personal attribute information, which includes but not limited to gender information, age information, ID information, voiceprint information, etc.
  • Step 1209 determine, based on the second matching degree between the first person attribute information and the second person attribute information, whether the generation location of the voice interaction instruction to be responded to is inside or outside the vehicle.
  • The second matching degree between the first person attribute information and the second person attribute information may be determined as follows. Based on whether the gender information in the two sets of attribute information is the same, a matching score for the gender dimension (between 0 and 1) can be determined; based on the difference between the ages represented by the age information in the two sets, a matching score for the age dimension (between 0 and 1) can be determined; the matching scores of other dimensions can be deduced by analogy. The matching scores of all dimensions can then be weighted and averaged, and the weighted average used as the second matching degree between the first person attribute information and the second person attribute information.
  • Based on the second matching degree, it can then be determined whether the generation location of the voice interaction instruction to be responded to is inside or outside the vehicle.
  • the method further includes step 113 and step 114.
  • Step 113 acquire a second image collected by an image sensor located outside the vehicle (for ease of description, referred to below as the second image sensor).
  • the second image sensor can be CCD or CMOS.
  • Step 114 based on the second image, acquire attribute information of the third person.
  • For the specific implementation of step 114, reference may be made to the introduction of the specific implementation of step 112, which will not be repeated here.
  • In some embodiments, step 1209 includes:
  • Step 12091 determine, based on the second matching degree between the first person attribute information and the second person attribute information and the third matching degree between the third person attribute information and the second person attribute information, whether the generation location of the voice interaction instruction to be responded to is inside or outside the vehicle.
  • the third degree of matching between the attribute information of the third person and the attribute information of the second person may be determined.
  • based on the second matching degree and the third matching degree, the generation position of the voice interaction instruction to be responded to may be determined. Specifically, when the second matching degree is greater than a third preset threshold (for example, 0.7, 0.8, or 0.9) and the third matching degree is less than a fourth preset threshold (for example, 0.1, 0.2, or 0.3), it may be determined that the generation location of the voice interaction instruction to be responded to is inside the vehicle; otherwise, it is determined that the generation location is outside the vehicle.
  • through the matching judgment between the first person attribute information and the second person attribute information, and the matching judgment between the third person attribute information and the second person attribute information, the image inside the vehicle, the image outside the vehicle, and the voice interaction command to be responded to can be combined to accurately and reliably determine the generation position of the voice interaction command to be responded to.
  • alternatively, only one of the two matching degrees may be used. For example, when the second matching degree is greater than the third preset threshold, it can be determined that the generation location of the voice interaction instruction to be responded to is inside the vehicle, and otherwise outside the vehicle; for another example, when the third matching degree is less than the fourth preset threshold, it can be determined that the generation location of the voice interaction instruction to be responded to is inside the vehicle, and otherwise outside the vehicle.
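The threshold logic in the bullets above might be sketched as follows; the default thresholds 0.8 and 0.2 are drawn from the example ranges mentioned in the text, and treating them as defaults is an assumption:

```python
def locate_by_matching_degrees(second_match, third_match,
                               third_threshold=0.8, fourth_threshold=0.2):
    """Return "inside" when the in-cabin attribute match is high and the
    outside-cabin attribute match is low; otherwise "outside"."""
    if second_match > third_threshold and third_match < fourth_threshold:
        return "inside"
    return "outside"
```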
  • step 130 includes:
  • if the voice interaction command to be responded to is generated inside the vehicle, it is determined to respond to the command, and the vehicle and/or the vehicle-mounted equipment on the vehicle is controlled based on the command.
  • when the generation location of the voice interaction command to be responded to is inside the vehicle, the vehicle and/or the vehicle-mounted equipment on the vehicle may be controlled in response to the command. For example, when the command instructs to open the window, the window can be controlled to open; for another example, when the command instructs to play music, the on-board entertainment device on the vehicle can be called to play music. This can effectively meet the voice control needs of the people inside the vehicle.
  • image acquisition may be performed based on the first image sensor, so as to obtain the occupant situation in the vehicle based on the collected image (for example, based on whether a face frame can be detected in the image), or the occupant situation may be obtained based on the detection signals of the seat sensors in the vehicle. If the occupant situation indicates that there are no occupants in the vehicle, the voice control function of the vehicle can be turned off; if it indicates that there are occupants in the vehicle, any one of the following four methods may be used to determine where the voice interaction instruction to be responded to was produced:
  • Method 1: a judgment is made based on the occupant's voice attribute information to obtain the above-mentioned second person attribute information, and the first person attribute information is obtained based on the image collected by the first image sensor. Based on the second matching degree between the first person attribute information and the second person attribute information, it is determined whether the voice interaction instruction to be responded to is generated inside or outside the vehicle; if it is outside the vehicle, the instruction may not be responded to.
  • Method 2: microphone array technology is used to locate the sound source and determine the sound source direction of the voice interaction command to be responded to. If the sound source direction does not come from any occupant in the vehicle, it can be determined that the generation location of the command is outside the vehicle, and the command may not be responded to.
  • Method 3: microphones can be arranged both inside and outside the vehicle. If the same voice interaction command is collected inside and outside the vehicle within a certain period of time, and the sound pressure level of the command collected outside the vehicle is higher than that collected inside the vehicle, or its voice quality is higher than that collected inside the vehicle, it can be determined that the command is generated outside the vehicle, and the command may not be responded to.
  • Method 4: using a pre-trained speech recognition model, determine whether the voice interaction instruction to be responded to is generated inside or outside the vehicle; if it is outside the vehicle, the instruction may not be responded to.
  • the second image sensor can be used to collect images first, so as to determine whether there is a person outside the vehicle based on the collected image. If there is a person outside the vehicle, any one of the above four methods may then be used to determine the generation location of the voice interaction command to be responded to.
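Method 3 above can be sketched as a sound pressure level comparison between sample windows captured inside and outside the vehicle; the RMS-based level computation and the assumption that the two windows are already time-aligned and contain the same command are simplifications:

```python
import math

def sound_pressure_level_db(samples, ref=1.0):
    """Root-mean-square level of a sample window, in dB relative to `ref`."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms / ref)

def generated_outside(inside_samples, outside_samples):
    """Method 3: treat the command as generated outside the vehicle when the
    outside capture is louder than the inside capture."""
    return sound_pressure_level_db(outside_samples) > sound_pressure_level_db(inside_samples)
```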
  • Any voice interaction instruction processing method provided by the embodiments of the present disclosure may be executed by any appropriate device with data processing capabilities, including but not limited to: terminal devices and servers.
  • any voice interaction instruction processing method provided in the embodiments of the present disclosure may be executed by a processor; for example, the processor executes any voice interaction instruction processing method mentioned in the embodiments of the present disclosure by calling corresponding instructions stored in a memory. Details will not be repeated below.
  • Fig. 7 is a schematic structural diagram of an apparatus for processing voice interaction instructions provided by an exemplary embodiment of the present disclosure.
  • the apparatus shown in FIG. 7 includes a first acquiring module 710 , a first determining module 720 and a second determining module 730 .
  • the first acquiring module 710 is configured to acquire the voice interaction instruction to be responded to;
  • the first determination module 720 is configured to determine the generation position of the voice interaction instruction to be responded acquired by the first acquisition module 710;
  • the second determination module 730 is configured to determine whether to respond to the voice interaction instruction to be responded acquired by the first acquisition module 710 based on the generation position of the voice interaction instruction to be responded determined by the first determination module 720 .
  • the voice interaction instruction to be responded to acquired by the first acquisition module 710 includes a first voice interaction instruction obtained by detecting a voice signal collected inside the vehicle, and a second voice interaction instruction obtained by detecting a voice signal collected outside the vehicle, and the interval between the acquisition moments of the second voice interaction instruction and the first voice interaction instruction is less than a preset duration;
  • the first determining module 720 includes:
  • the first determination sub-module 7201 is configured to determine the first voice feature information of the first voice interaction instruction included in the voice interaction instruction to be responded to and acquired by the first acquisition module 710;
  • the second determination sub-module 7202 is configured to determine the second voice characteristic information of the second voice interaction instruction included in the voice interaction instruction to be responded to and acquired by the first acquisition module 710;
  • the third determination sub-module 7203 is configured to determine, based on the first voice feature information determined by the first determination sub-module 7201 and the second voice feature information determined by the second determination sub-module 7202, whether the generation location of the voice interaction instruction to be responded to acquired by the first acquisition module 710 is inside or outside the vehicle.
  • the third determination submodule 7203 includes:
  • the first acquiring unit is configured to acquire the energy of the first voice signal based on the first voice feature information determined by the first determining sub-module 7201, and to acquire the energy of the second voice signal based on the second voice feature information determined by the second determining sub-module 7202;
  • the first determination unit is configured to determine, based on the magnitude relationship between the energy of the first voice signal and the energy of the second voice signal acquired by the first acquisition unit, whether the generation position of the voice interaction instruction to be responded to acquired by the first acquisition module 710 is inside or outside the vehicle;
  • the third determination submodule 7203 includes:
  • the second acquiring unit is configured to acquire the confidence of the first voice signal based on the first voice feature information determined by the first determining sub-module 7201, and to acquire the confidence of the second voice signal based on the second voice feature information determined by the second determining sub-module 7202;
  • the second determining unit is configured to determine, based on the difference between the confidence of the second voice signal acquired by the second acquiring unit and the confidence of the first voice signal, whether the generation position of the voice interaction command to be responded to acquired by the first acquisition module 710 is inside or outside the vehicle.
  • the first determination module 720 includes:
  • the fourth determination sub-module 7204 is configured to determine the sound source direction of the voice interaction instruction to be responded to that is acquired by the first acquisition module 710;
  • the fifth determination sub-module 7205 is used to determine the seat position of each occupant inside the vehicle
  • the sixth determination sub-module 7206 is configured to determine, based on the first matching degree between the sound source direction determined by the fourth determination sub-module 7204 and the seat position of each occupant determined by the fifth determination sub-module 7205, whether the generation location of the voice interaction command to be responded to acquired by the first acquisition module 710 is inside or outside the vehicle.
  • the first determining module 720 includes:
  • the input sub-module 7207 is configured to input the voice interaction instruction to be responded acquired by the first acquisition module 710 into the pre-trained voice recognition model;
  • the first obtaining sub-module 7208 is used to obtain the generation location classification information output by the speech recognition model.
  • the device further includes:
  • the second acquiring module 711 is configured to acquire the first image collected by the image sensor inside the vehicle;
  • the third acquiring module 712 is configured to acquire the attribute information of the first person based on the first image acquired by the second acquiring module 711;
  • the first determination module 720 includes:
  • the second acquisition sub-module 7209 is configured to acquire second personnel attribute information based on the voice interaction instruction to be responded acquired by the first acquisition module 710;
  • the seventh determining sub-module 7210 is configured to determine, based on the second matching degree between the first person attribute information obtained by the third acquisition module 712 and the second person attribute information obtained by the second acquisition sub-module 7209, whether the generation location of the voice interaction instruction to be responded to acquired by the first acquisition module 710 is inside or outside the vehicle.
  • the device further includes:
  • a fourth acquiring module 713 configured to acquire a second image collected by an image sensor located outside the vehicle;
  • the fifth obtaining module 714 is configured to obtain third person attribute information based on the second image obtained by the fourth obtaining module 713;
  • the seventh determination sub-module 7210 is specifically used for:
  • based on the second matching degree between the first person attribute information obtained by the third acquisition module 712 and the second person attribute information obtained by the second acquisition sub-module 7209, and the third matching degree between the third person attribute information obtained by the fifth acquisition module 714 and the second person attribute information obtained by the second acquisition sub-module 7209, determine whether the generation position of the voice interaction command to be responded to is inside or outside the vehicle.
  • the second determining module 730 includes:
  • the eighth determination sub-module 7301 is configured to determine to respond to the voice interaction instruction to be responded to acquired by the first acquisition module 710 if the generation location of the instruction is inside the vehicle;
  • the control sub-module 7302 is configured to control the vehicle and/or the vehicle-mounted equipment on the vehicle based on the voice interaction instruction to be responded acquired by the first acquisition module 710 .
  • the electronic device may be either or both of the first device and the second device, or a stand-alone device independent of them; the stand-alone device may communicate with the first device and the second device to receive the collected input signals from them.
  • FIG. 14 illustrates a block diagram of an electronic device according to an embodiment of the present disclosure.
  • an electronic device 1400 includes one or more processors 1401 and a memory 1402 .
  • the processor 1401 may be a central processing unit (CPU) or other forms of processing units having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 1400 to perform desired functions.
  • Memory 1402 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory.
  • the volatile memory may include, for example, random access memory (RAM) and/or cache memory (cache).
  • the non-volatile memory may include, for example, a read-only memory (ROM), a hard disk, a flash memory, and the like.
  • one or more computer program instructions may be stored on the computer-readable storage medium, and the processor 1401 may execute the program instructions to implement the voice interaction instruction processing methods of the various embodiments of the present disclosure described above and/or other desired functionality.
  • Various contents such as input signal, signal component, noise component, etc. may also be stored in the computer-readable storage medium.
  • the electronic device 1400 may further include: an input device 1403 and an output device 1404, and these components are interconnected through a bus system and/or other forms of connection mechanisms (not shown).
  • the input device 1403 may be a microphone or a microphone array.
  • when the electronic device is a stand-alone device, the input device 1403 may be a communication network connector for receiving collected input signals from the first device and the second device.
  • the input device 1403 may also include, for example, a keyboard, a mouse, and the like.
  • the output device 1404 can output various information to the outside, including determined distance information, direction information, and the like.
  • the output device 1404 may include, for example, a display, a speaker, a printer, a communication network and remote output devices connected thereto, and the like.
  • the electronic device 1400 may further include any other appropriate components.
  • embodiments of the present disclosure may also be computer program products, which include computer program instructions that, when executed by a processor, cause the processor to perform the steps of the voice interaction instruction processing method according to the various embodiments of the present disclosure described in the "Exemplary Method" section of this specification.
  • the methods and apparatus of the present disclosure may be implemented in many ways.
  • the methods and apparatuses of the present disclosure may be implemented by software, hardware, firmware or any combination of software, hardware, and firmware.
  • the above sequence of steps for the method is for illustration only, and the steps of the method of the present disclosure are not limited to the sequence specifically described above unless specifically stated otherwise.
  • the present disclosure can also be implemented as programs recorded in recording media, the programs including machine-readable instructions for realizing the method according to the present disclosure.
  • the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
  • each component or each step can be decomposed and/or reassembled. These decompositions and/or recombinations should be considered equivalents of the present disclosure.


Abstract

一种语音交互指令处理方法、装置及计算机可读存储介质,该方法包括:获取待响应语音交互指令(110);确定待响应语音交互指令的产生位置(120);基于待响应语音交互指令的产生位置,确定是否响应待响应语音交互指令(130)。该方法能够避免车辆外部的人员对车辆的语音控制。

Description

语音交互指令处理方法、装置及计算机可读存储介质
本公开要求在2021年12月21日提交的、申请号为202111574733.2、发明名称为“语音交互指令处理方法、装置及计算机可读存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本公开中。
技术领域
本公开涉及驾驶技术领域,尤其涉及一种语音交互指令处理方法、装置及计算机可读存储介质。
背景技术
基于车辆的语音控制功能,车辆内部的人员可以通过语音交互的方式,对车辆进行语音控制。然而,车辆内部不能完全隔绝车辆外部的声音,这样会导致车辆外部的人员也能够对车辆进行语音控制,如何避免车辆外部的人员对车辆的语音控制对于本领域技术人员而言是一个亟待解决的问题。
发明内容
为了解决上述技术问题,提出了本公开。本公开的实施例提供了一种语音交互指令处理方法、装置、计算机可读存储介质及电子设备。
根据本公开实施例的一个方面,提供了一种语音交互指令处理方法,包括:
获取待响应语音交互指令;
确定所述待响应语音交互指令的产生位置;
基于所述待响应语音交互指令的产生位置,确定是否响应所述待响应语音交互指令。
根据本公开实施例的另一个方面,提供了一种语音交互指令处理装置,包括:
第一获取模块,用于获取待响应语音交互指令;
第一确定模块,用于确定所述第一获取模块获取的所述待响应语音交互指令的产生位置;
第二确定模块,用于基于所述第一确定模块确定的所述待响应语音交互指令的产生位置,确定是否响应所述第一获取模块获取的所述待响应语音交互指令。
根据本公开实施例的再一个方面,提供了一种计算机可读存储介质,所述存储介质存储有计算机程序,所述计算机程序用于执行上述语音交互指令处理方法。
根据本公开实施例的又一个方面,提供了一种电子设备,包括:
处理器;
用于存储所述处理器可执行指令的存储器;
所述处理器,用于从所述存储器中读取所述可执行指令,并执行所述指令以实现上述语音交互指令处理方法。
基于本公开上述实施例提供的一种语音交互指令处理方法、装置、计算机可读存储介质及电子设备,可以基于待响应语音交互指令的产生位置,确定是否响应待响应语音交互指令,这样可以区分语音交互指令产生于车辆内部和车辆外部的情况,从而区分语音交互指令来自于车辆内部的人员和车辆外部的人员的情况,对于来自于车辆外部的人员的语音交互指令可以不予响应,进而避免车辆外部的人员对车辆的语音控制。
下面通过附图和实施例,对本公开的技术方案做进一步的详细描述。
附图说明
通过结合附图对本公开实施例进行更详细的描述,本公开的上述以及其他目的、特征和优势将变得更加明显。附图用来提供对本公开实施例的进一步理解,并且构成说明书的一部分,与本公开实施例一起用于解释本公开,并不构成对本公开的限制。在附图中,相同的参考标号通常代表相同部件或步骤。
图1是本公开一示例性实施例提供的语音交互指令处理方法的流程示意图。
图2是本公开另一示例性实施例提供的语音交互指令处理方法的流程示意图。
图3是本公开再一示例性实施例提供的语音交互指令处理方法的流程示意图。
图4是本公开又一示例性实施例提供的语音交互指令处理方法的流程示意图。
图5是本公开又一示例性实施例提供的语音交互指令处理方法的流程示意图。
图6是本公开又一示例性实施例提供的语音交互指令处理方法的流程示意图。
图7是本公开一示例性实施例提供的语音交互指令处理装置的结构示意图。
图8是本公开另一示例性实施例提供的语音交互指令处理装置的结构示意图。
图9是本公开再一示例性实施例提供的语音交互指令处理装置的结构示意图。
图10是本公开又一示例性实施例提供的语音交互指令处理装置的结构示意图。
图11是本公开又一示例性实施例提供的语音交互指令处理装置的结构示意图。
图12是本公开又一示例性实施例提供的语音交互指令处理装置的结构示意图。
图13是本公开又一示例性实施例提供的语音交互指令处理装置的结构示意图。
图14是本公开一示例性实施例提供的电子设备的结构图。
具体实施方式
下面,将参考附图详细地描述根据本公开的示例实施例。显然,所描述的实施例仅仅是本公开的一部分实施例,而不是本公开的全部实施例,应理解,本公开不受这里描述的示例实施例的限制。
应注意到:除非另外具体说明,否则在这些实施例中阐述的部件和步骤的相对布置、数字表达式和数值不限制本公开的范围。
本领域技术人员可以理解,本公开实施例中的“第一”、“第二”等术语仅用于区别不同步骤、设备或模块等,既不代表任何特定技术含义,也不表示它们之间的必然逻辑顺序。
还应理解,在本公开实施例中,“多个”可以指两个或两个以上,“至少一个”可以指一个、两个或两个以上。
本公开实施例可以应用于终端设备、计算机系统、服务器等电子设备,其可与众多其它通用或专用计算系统环境或配置一起操作。适于与终端设备、计算机系统、服务器等电子设备一起使用的众所周知的终端设备、计算系统、环境和/或配置的例子包括但不限于:个人计算机系统、服务器计算机系统、瘦客户机、厚客户机、手持或膝上设备、基于微处理器的系统、机顶盒、可编程消费电子产品、网络个人电脑、小型计算机系统、大型计算机系统和包括上述任何系统的分布式云计算技术环境,等等。
公开概述
基于车辆的语音控制功能,车辆内部的人员可以通过语音交互的方式,对车辆进行语音控制,例如,车辆内部的人员可以通过说“播音乐”实现车载娱乐设备对音乐的播放,车辆内部的人员可以通过说“开车窗”实现车窗的开启。
由于车辆内部不能完全隔绝车辆外部的声音，如果车辆外部的人员说“播音乐”，车载娱乐设备也会播放音乐，而车辆内部的人员当前可能并不希望听音乐，如果车辆外部的人员说“开车窗”，车窗也会开启，这样会导致车辆内部的财物存在遗失风险，因此，有必要提供一种方法，避免车辆外部的人员对车辆的语音控制。
示例性方法
图1是本公开一示例性实施例提供的语音交互指令处理方法的流程示意图。图1所示的方法包括步骤110、步骤120和步骤130,下面分别对各步骤进行说明。
步骤110,获取待响应语音交互指令。
在步骤110中,可以调用麦克风等语音采集设备进行语音信号的采集,通过对采集到的语音信号进行检测,可以确定采集到的语音信号中是否包括承载有语音交互指令的语音信号,如果确定结果为是,则可以基于采集到的语音信号,获取语音交互指令,该语音交互指令即可作为待响应语音交互指令。可选地,语音采集设备既可以设置于车辆内部,也可以设置于车辆外部。
步骤120,确定待响应语音交互指令的产生位置。
需要说明的是,待响应语音交互指令的产生位置可以有两种情况,分别是车辆内部和车辆外部,并且,待响应语音交互指令的产生位置的确定方式多样,为了布局清楚,后续进行举例介绍。
步骤130,基于待响应语音交互指令的产生位置,确定是否响应待响应语音交互指令。
在待响应语音交互指令的产生位置为车辆内部的情况下,这说明待响应语音交互指令来自于车辆内部的人员,那么,可以确定响应待响应语音交互指令,以满足车辆内部的人员对车辆的语音控制需求;在待响应语音交互指令的产生位置为车辆外部的情况下,这说明待响应语音交互指令来自于车辆外部的人员,那么,可以确定不响应待响应语音交互指令,以避免车辆外部的人员对车辆的语音控制。
基于本公开上述实施例提供的一种语音交互指令处理方法,可以基于待响应语音交互指令的产生位置,确定是否响应待响应语音交互指令,这样可以区分语音交互指令产生于车辆内部和车辆外部的情况,从而区分语音交互指令来自于车辆内部的人员和车辆外部的人员的情况,对于来自于车辆外部的人员的语音交互指令可以不予响应,进而避免车辆外部的人员对车辆的语音控制。
在一个可选示例中,待响应语音交互指令包括对在车辆内部采集的语音信号进行检测获取的第一语音交互指令,以及对在车辆外部采集的语音信号进行检测获取的第二语音交互指令,且第二语音交互指令与第一语音交互指令的获取时刻之间的间隔时长小于预设时长。
可选地,第二语音交互指令的获取时刻可以早于第一语音交互指令,或者,第二语音交互指令的获取时刻可以晚于第一语音交互指令。
可选地,预设时长可以为0.2秒、0.3秒、0.5秒或者其它较短的时长,在此不再一一列举。
需要说明的是,车辆内部可以设置有第一语音采集设备,车辆外部可以设置有第二语音采集设备,第一语音交互指令可以通过对第一语音采集设备采集的语音信号进行检测获取,第二语音交互指令可以通过对第二语音采集设备采集的语音信号进行检测获取。可选地,对于第一语音采集设备采集的语音信号,可以先对其进行语音增强,然后通过对经增强后的语音信号进行检测来获取第一语音交互指令。类似地,对于第二语音采集设备采集的语音信号,也可以先对其进行语音增强,然后通过对经增强后的语音信号进行检测来获取第二语音交互指令。
需要指出的是，由于第二语音交互指令与第一语音交互指令的获取时刻之间的间隔非常短，第二语音交互指令与第一语音交互指令很可能是基于不同语音采集设备获取的相同语音交互指令，当然，在一些情况下（例如车辆内部的人员说“开车窗”的同时，车辆外部的人员说“播音乐”的情况下），第二语音交互指令与第一语音交互指令也可能是基于不同语音采集设备获取的不同语音交互指令。
在图1所示实施例的基础上,如图2所示,步骤120,包括步骤1201、步骤1202和步骤1203。
步骤1201,确定第一语音交互指令的第一语音特征信息。
可选地,第一语音交互指令的第一语音特征信息包括但不限于语音信号能量、语音信号置信度、声纹信息等;其中,语音信号能量可以利用声压级进行表征,语音信号置信度可以利用传统声学模型进行确定,声纹信息可以利用声纹识别技术确定。
步骤1202,确定第二语音交互指令的第二语音特征信息。
需要说明的是,第二语音交互指令的第二语音特征信息包含的信息和确定方式参照对第一语音特征信息的相关说明即可,在此不再赘述。
步骤1203,基于第一语音特征信息和第二语音特征信息,确定待响应语音交互指令的产生位置为车辆内部或车辆外部。
在一种具体实施方式中,步骤1203,包括:
基于第一语音特征信息,获取第一语音信号能量,以及基于第二语音特征信息,获取第二语音信号能量;
基于第一语音信号能量与第二语音信号能量之间的大小关系,确定待响应语音交互指令的产生位置为车辆内部或车辆外部。
本实施方式中,可以从第一语音特征信息中提取语音信号能量以作为第一语音信号能量,从第二语音特征信息中提取语音信号能量作为第二语音信号能量,将第一语音信号能量与第二语音信号能量进行大小比较,并基于通过大小比较得到的大小关系,确定待响应语音交互指令的产生位置。
假设第一语音信号能量利用第一语音交互指令的声压级表征,第二语音信号能量利用第二语音交互指令的声压级表征,则可以在大小关系表征第一语音交互指令的声压级低于第二语音交互指令的声压级的情况下,确定待响应语音交互指令的产生位置为车辆外部,并在大小关系表征第一语音交互指令的声压级高于第二语音交互指令的声压级的情况下,确定待响应语音交互指令的产生位置为车辆内部。
需要说明的是,由于声音在传播过程中是存在能量损失的,对于相同语音交互指令,基于对在车辆内部采集的语音信号进行检测和对车辆外部采集的语音信号进行检测来对其进行获取时,获取结果的语音信号能量是存在差异的,另外,不同语音交互指令的语音信号能量往往也是存在差异的,有鉴于此,本实施方式中可以利用上述特性,参考第一语音交互指令与第二语音交互指令两者的语音信号能量之间的大小关系,高效可靠地确定待响应语音交互的产生位置具体是车辆内部还是车辆外部。
在另一种具体实施方式中,步骤1203,包括:
基于第一语音特征信息,获取第一语音信号置信度,以及基于第二语音特征信息,获取第二语音信号置信度;
基于第二语音信号置信度和第一语音信号置信度的差值,确定待响应语音交互指令的产生位置为车辆内部或车辆外部。
需要说明的是,任一语音交互指令的语音信号置信度可以用于表征该语音交互指令本身的可信程度。
本实施方式中,可以从第一语音特征信息中提取语音信号置信度以作为第一语音信号置信度,从第二语音特征信息中提取语音信号置信度以作为第二语音信号置信度,计算第二语音信号置信度与第一语音信号置信度的差值,并基于通过计算得到的差值,确定待响应语音交互指令的产生位置。
可选地,可以将差值与第一预设阈值(例如0.2、0.3等)进行大小比较,在通过大小比较得到的大小关系表征差值大于预设阈值的情况下,确定待响应语音交互指令的产生位置为车辆外部,并在通过大小比较得到的大小关系为差值小于或等于预设阈值的情况下,确定待响应语音交互指令的产生位置为车辆内部。
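上述基于置信度差值的判决可以草拟如下（其中默认阈值0.2仅为示例假设，并非本公开限定的取值）：

```python
def locate_by_confidence(first_confidence, second_confidence, threshold=0.2):
    """基于第二语音信号置信度与第一语音信号置信度的差值确定产生位置。

    差值大于阈值时判定为车辆外部，否则判定为车辆内部。
    """
    diff = second_confidence - first_confidence
    return "车辆外部" if diff > threshold else "车辆内部"
```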
需要说明的是,理论上来说,对于相同语音交互指令,基于对在车辆内部采集的语音信号进行检测和对车辆外部采集的语音信号进行检测来对其进行获取时,获取结果的语音信号置信度是存在差异的,另外,不同语音交互指令的语音信号置信度往往也是存在差异的,有鉴于此,本实施例中可以利用上述特性,参考第二语音交互指令与第一语音交互指令两者的语音信号置信度的差值,高效可靠地确定待响应语音交互的产生位置具体是车辆内部还是车辆外部。
可见,本公开的实施例中,通过在车辆内部和车辆外部均进行语音信号的采集,可以得到获取时刻间隔较短的第一语音交互指令和第二语音交互指令,通过对第一语音交互指令和第二语音交互指令进行语音特征信息的比对,可以高效可靠地确定待响应语音交互的产生位置具体是车辆内部还是车辆外部。
需要说明的是,图2所示实施例中,第一语音采集设备和第二语音采集设备可以预先校准至相同的灵敏度,以保证第一语音采集设备和第二语音采集设备采集到同样声压级的语音信号,通过模拟数字转换器(Analog-to-digital converter,ADC)传入车辆的语音交互系统后能够获取到同样能量的声音,消除由于第一语音采集设备和第二语音采集设备的增益不同而导致车辆内外声音拾取的差异,从而保证后续基于第一语音信号能量与第二语音信号能量之间的大小关系,确定待响应语音交互指令的产生位置时,确定结果的准确性和可靠性。
在图1所示实施例的基础上,如图3所示,步骤120,包括步骤1204、步骤1205和步骤1206。
步骤1204,确定待响应语音交互指令的声源方向。
在步骤1204中,可以利用声源定位技术,确定待响应语音交互指令的声源方向。可选地,声源定位技术可以为传声器阵列技术。
步骤1205,确定车辆内部各乘员的座位位置。
可选地,车辆的每个座椅均可以设置有座椅传感器,在步骤1205中,可以基于各座椅传感器的检测信号,确定车辆中的哪些座椅上有乘员,哪些座椅上没有乘员,从而据此确定车辆内部各乘员的座位位置。当然,步骤1205的具体实施方式并不局限于此,例如,可以调用位于车辆内部图像传感器采集图像(例如下文中的第一图像),通过现有算法检测第一图像中的人脸框,基于检测结果,可以确定车辆中的哪些座椅上有乘员,哪些座椅上没有乘员,从而据此确定车辆内部各乘员的座位位置。
步骤1206,基于声源方向和各乘员的座位位置之间的第一匹配度,确定待响应语音交互指令的产生位置为车辆内部或车辆外部。
在步骤1206中,可以确定待响应语音交互指令的声源方向与车辆内部各乘员的座位位置之间的第一匹配度。可选地,声源方向可以利用一坐标系中起点为原点的一方向向量(为了便于描述,后续将其称为第一方向向量)进行表征,对于车辆内部某一乘员的座位位置,可以在该坐标系中确定该乘员位置的对应点,确定由原点指向该对应点的方向向量(为了便于描述,后续将其称为第二方向向量),之后可以计算第二方向向量与第一方向向量之间的夹角大小,并将夹角大小映射至0至1这个指定区间,得到的映射值可以作为声源方向与该座位位置之间的第一匹配度。
在待响应语音交互指令的声源方向与车辆内部各乘员的座位位置之间的第一匹配度中，存在大于第二预设阈值（例如0.7、0.8、0.9等）的第一匹配度的情况下，可以认为待响应语音交互指令来自于车辆内部的某一乘员，因此，可以确定待响应语音交互指令的产生位置为车辆内部，否则，可以认为待响应语音交互指令不来自于车辆内部的任一乘员，因此，可以确定待响应语音交互指令的产生位置为车辆外部。
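上述夹角计算与匹配度映射可以草拟如下（其中方向以二维向量表示，线性映射方式与默认阈值0.9均为示例假设）：

```python
import math

def first_matching_degree(sound_dir, seat_dir):
    """计算声源方向向量与座位方向向量的夹角，并线性映射到[0, 1]区间：
    夹角为0时匹配度为1，夹角为180度时匹配度为0。"""
    dot = sum(a * b for a, b in zip(sound_dir, seat_dir))
    norm = math.hypot(*sound_dir) * math.hypot(*seat_dir)
    angle = math.acos(max(-1.0, min(1.0, dot / norm)))
    return 1.0 - angle / math.pi

def generated_inside(sound_dir, seat_dirs, threshold=0.9):
    """若声源方向与任一乘员座位方向的第一匹配度大于第二预设阈值，
    则判定待响应语音交互指令的产生位置为车辆内部。"""
    return any(first_matching_degree(sound_dir, d) > threshold for d in seat_dirs)
```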
需要说明的是,在待响应语音交互指令来自于车辆内部的某一乘员的情况下,待响应语音交互指令的声源方向与该乘员的座位位置理论上是相匹配的,有鉴于此,本公开的实施例中,可以通过待响应语音交互指令的声源方向与车辆内部各乘员的座位位置之间的匹配性判断,高效可靠地确定待响应语音交互指令的产生位置。
在图1所示实施例的基础上,如图4所示,步骤120,包括:
步骤1207,将待响应语音交互指令输入预先训练好的语音识别模型,获取语音识别模型输出的产生位置分类信息。
需要说明的是，可以预先通过模型训练，得到用于识别输入语音的产生位置的语音识别模型。具体地，对于特定车辆，可以获取多组样本数据，每组样本数据中包括一语音数据（其可以由设置于特定车辆内部的语音采集设备采集）和一标签数据，该标签数据用于表征该语音数据实际产生于车辆内部还是车辆外部，在该语音数据实际产生于车辆内部的情况下，该标签数据可以为1，在该语音数据实际产生于车辆外部的情况下，该标签数据可以为0。实际训练时，可以以多组样本数据中的语音数据作为输入数据，以多组样本数据中的标签数据作为输出数据进行训练，从而得到语音识别模型。
一般而言,语音交互指令呈语音数据的形式,在步骤1207中,可以将待响应语音交互指令输入预先训练好的语音识别模型,语音识别模型可以基于待响应语音交互指令进行运算,从而输出待响应语音交互指令的产生位置分类信息。可选地,待响应语音交互指令的产生位置分类信息可以用于表征待响应语音交互指令的产生位置为车辆内部的置信度,在该置信度大于预设置信度(例如0.75、0.8、0.9等)的情况下,可以认为待响应语音交互指令的产生位置为车辆内部,否则,可以认为语音交互指令的产生位置为车辆外部。
可见,本公开的实施例中,通过以产生于车辆内部的语音数据作为正样本,以产生于车辆外部的语音数据作为负样本进行训练,能够生成用于识别输入的语音数据的产生位置的语音识别模型,这样,基于语音识别模型,能够高效可靠地确定待响应语音交互指令的产生位置为车辆内部还是车辆外部。
需要指出的是,在进行模型训练时,也可以将产生于车辆内部的语音数据作为负样本,以产生于车辆外部的语音数据作为正样本,这时,待响应语音交互指令的产生位置分类信息可以用于表征待响应语音交互指令的产生位置为车辆外部的置信度。
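基于语音识别模型输出置信度进行产生位置分类的过程可以草拟如下（其中模型以任意可调用对象表示，默认预设置信度0.8仅为示例假设）：

```python
def classify_generation_location(model, command_audio, confidence_threshold=0.8):
    """将待响应语音交互指令输入语音识别模型，模型输出“产生位置为车辆内部”
    的置信度；置信度大于预设置信度时判定为车辆内部，否则判定为车辆外部。"""
    confidence = model(command_audio)
    return "车辆内部" if confidence > confidence_threshold else "车辆外部"
```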
在图1所示实施例的基础上,如图5所示,该方法还包括步骤111和步骤112。
步骤111,获取位于车辆内部图像传感器(为了便于描述,后续将其称为第一图像传感器)采集的第一图像。
可选地,第一图像传感器可以为电荷耦合器件(Charge Coupled Device,CCD)或者互补金属氧化物半导体(Complementary Metal Oxide Semiconductor,CMOS)。
步骤112,基于第一图像,获取第一人员属性信息。
在步骤112中,可以通过现有算法检测第一图像中的人脸框,对人脸框中的人脸进行属性判决,以得到第一人员属性信息,第一人员属性信息包括但不限于性别信息、年龄信息、身份标识(Identity Document,ID)信息、声纹信息等。
需要说明的是，为了使第一人员属性信息包括声纹信息，可以通过以下两种方式进行声纹信息的获取：
方式一:可以预先录制乘员信息,进行注册以得到声纹信息,将得到的声纹信息录入系统,录入系统中的声纹信息可以与ID信息关联;在步骤112中进行属性判决时,可以先通过人脸识别获取ID信息,之后通过调取该ID信息对应的声纹信息作为第一人员属性信息中的声纹信息;
方式二:在检测出第一图像中的人脸框之后,可以通过现有算法获取人脸框中的人脸的唇动信息,如果唇动则表明该人脸对应的乘员在说话,那么可以在唇动期间采集该乘员说的语音信息,并根据语音信息提取该乘员的声纹信息,该声纹信息可以作为第一人员属性信息中的声纹信息。
步骤120,包括步骤1208和步骤1209。
步骤1208,基于待响应语音交互指令,获取第二人员属性信息。
在步骤1208中,可以采用语音识别技术,对待响应语音交互指令进行分析,以得到第二人员属性信息,第二人员属性信息包括但不限于性别信息、年龄信息、ID信息、声纹信息等。
步骤1209,基于第一人员属性信息与第二人员属性信息之间的第二匹配度,确定待响应语音交互指令的产生位置为车辆内部或车辆外部。
在步骤1209中,可以确定第一人员属性信息与第二人员属性信息之间的第二匹配度,例如,基于第一人员属性信息中的性别信息和第二人员属性信息中的性别信息是否相同,可以确定性别维度的匹配度得分(其可以位于0至1之间),基于第一人员属性信息中的年龄信息所表示的年龄和第二人员属性信息中的年龄信息所表示的年龄的差值,可以确定年龄维度的匹配度得分(其可以位于0至1之间),其它维度的匹配度得分依此类推,之后可以将所有维度的匹配度得分进行加权平均,加权平均结果即可作为第一人员属性信息与第二人员属性信息之间的第二匹配度。
之后,可以基于第一人员属性信息与第二人员属性信息之间的第二匹配度,确定待响应语音交互指令的产生位置为车辆内部或车辆外部。
在图5所示实施例的基础上,如图6所示,该方法还包括步骤113和步骤114。
步骤113,获取位于车辆外部图像传感器(为了便于描述,后续将其称为第二图像传感器)采集的第二图像。
可选地,第二图像传感器可以为CCD或者CMOS。
步骤114,基于第二图像,获取第三人员属性信息。
需要说明的是,步骤114的具体实施方式参照对步骤112的具体实施方式的介绍即可,在此不再赘述。
步骤1209,包括:
步骤12091,基于第一人员属性信息与第二人员属性信息之间的第二匹配度,以及第三人员属性信息与第二人员属性信息之间的第三匹配度,确定待响应语音交互指令的产生位置为车辆内部或车辆外部。
在步骤12091中,可以确定第三人员属性信息与第二人员属性信息之间的第三匹配度,具体确定方式参照对第二匹配度的确定方式的说明即可,在此不再赘述。之后,可以基于第二匹配度和第三匹配度,确定待响应语音交互指令的产生位置。具体地,可以在第二匹配度大于第三预设阈值(例如0.7、0.8、0.9等),第三匹配度小于第四预设阈值(例如0.1、0.2、0.3等)的情况下,确定待响应语音交互指令的产生位置为车辆内部,否则,确定待响应语音交互指令的产生位置为车辆外部。
需要说明的是，图6所示实施例中，通过第一人员属性信息与第二人员属性信息之间的匹配性判断，以及第三人员属性信息与第二人员属性信息之间的匹配性判断，能够结合车辆内部图像、车辆外部图像和待响应语音交互指令三者，准确可靠地确定待响应语音交互指令的产生位置。当然，具体实施时，也可以仅结合车辆内部图像和待响应语音交互指令两者，确定待响应语音交互指令的产生位置，或者仅结合车辆外部图像和待响应语音交互指令两者，确定待响应语音交互指令的产生位置，例如，可以在第二匹配度大于第三预设阈值的情况下，确定待响应语音交互指令的产生位置为车辆内部，否则，确定待响应语音交互指令的产生位置为车辆外部，再例如，可以在第三匹配度小于第四预设阈值的情况下，确定待响应语音交互指令的产生位置为车辆内部，否则，确定待响应语音交互指令的产生位置为车辆外部。
可见,本公开的实施例中,通过结合图像和语音,进行属性信息相关的匹配,能够准确可靠地确定待响应语音交互指令的产生位置。
在一个可选示例中,步骤130,包括:
若待响应语音交互指令的产生位置为车辆内部,确定响应待响应语音交互指令,并基于待响应语音交互指令,对车辆和/或车辆上的车载设备进行控制。
本公开的实施例中,对于待响应语音交互指令的产生位置为车辆内部的情况,可以响应于待响应语音交互指令,对车辆和/或车辆上的车载设备进行控制,例如,在待响应语音交互指令用于指示开启车窗的情况下,可以控制车窗开启,再例如,在待响应语音交互指令用于指示播放音乐的情况下,可以调用车辆上的车载娱乐设备进行音乐播放,这样能够有效地满足车辆内部的人员对车辆的语音控制需求。
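上述响应与控制逻辑可以草拟如下（其中指令文本到控制动作的映射仅为示例假设）：

```python
def handle_voice_command(generation_location, command):
    """仅当待响应语音交互指令的产生位置为车辆内部时才执行对应控制动作，
    产生位置为车辆外部时不予响应。"""
    if generation_location != "车辆内部":
        return "不响应"
    actions = {
        "开车窗": "控制车窗开启",
        "播音乐": "调用车载娱乐设备播放音乐",
    }
    return actions.get(command, "未识别指令")
```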
在一个可选示例中,为了避免车辆外部的人员对车辆的语音控制,可以先基于第一图像传感器进行图像采集,以基于采集的图像获取车内乘员情况(例如基于该图像中是否能够检测到人脸框来获取车内乘员情况),或者,可以基于车辆内的各座椅传感器的检测信号,获取车内乘员情况。如果车内乘员情况表征车辆内没有乘员,则可以关闭车辆的语音控制功能,如果车内乘员情况表征车辆内有乘员,可以采用以下四种方式中的任一种,确定待响应语音交互指令的产生位置:
方式一:根据乘员声音属性信息进行判决,以得到上文中的第二人员属性信息,另外可以基于第一图像传感器采集的图像得到第一人员属性信息,基于第一人员属性信息与第二人员属性信息之间的第二匹配度,确定待响应语音交互指令的产生位置为车辆内部还是车辆外部,如果是车辆外部,则可以不响应待响应语音交互指令。
方式二:通过传声器阵列技术进行声源定位,确定待响应语音交互指令的声源方向,如果声源方向不来自于车辆内的任何一个乘员,则可以确定待响应语音交互指令的产生位置为车辆外部,那么可以不响应待响应语音交互指令。
方式三:可以同时在车辆内外布置麦克风,如果在某个时间段在车辆内部和车辆外部采集到了相同的语音交互指令,且在车辆外部采集的语音交互指令的声压级高于车辆内部,或者语音质量高于车辆内部,则可以确定待响应语音交互指令的产生位置为车辆外部,那么可以不响应待响应语音交互指令。
方式四:通过预先训练得到的语音识别模型,确定待响应语音交互指令的产生位置为车辆内部还是车辆外部,如果是车辆外部,则可以不响应待响应语音交互指令。
在一个可选示例中,可以先通过第二图像传感器进行图像采集,以基于采集的图像确定车辆外部是否有人员,如果车辆外部有人员,再采用上述四种方式中的任一种,确定待响应语音交互指令的产生位置。
需要强调的是,本公开的实施例中,所涉及的人员属性信息的获取、存储和应用等,均符合相关法律法规的规定,且不违背公序良俗。
本公开实施例提供的任一种语音交互指令处理方法可以由任意适当的具有数据处理能力的设备执行，包括但不限于：终端设备和服务器等。或者，本公开实施例提供的任一种语音交互指令处理方法可以由处理器执行，如处理器通过调用存储器存储的相应指令来执行本公开实施例提及的任一种语音交互指令处理方法。下文不再赘述。
示例性装置
图7是本公开一示例性实施例提供的语音交互指令处理装置的结构示意图。图7所示的装置包括第一获取模块710、第一确定模块720和第二确定模块730。
第一获取模块710,用于获取待响应语音交互指令;
第一确定模块720,用于确定第一获取模块710获取的待响应语音交互指令的产生位置;
第二确定模块730,用于基于第一确定模块720确定的待响应语音交互指令的产生位置,确定是否响应第一获取模块710获取的待响应语音交互指令。
在一个可选示例中,第一获取模块710获取的待响应语音交互指令包括对在车辆内部采集的语音信号进行检测获取的第一语音交互指令,以及对在车辆外部采集的语音信号进行检测获取的第二语音交互指令,且第二语音交互指令与第一语音交互指令的获取时刻之间的间隔时长小于预设时长;
如图8所示,第一确定模块720,包括:
第一确定子模块7201,用于确定第一获取模块710获取的待响应语音交互指令包括的第一语音交互指令的第一语音特征信息;
第二确定子模块7202,用于确定第一获取模块710获取的待响应语音交互指令包括的第二语音交互指令的第二语音特征信息;
第三确定子模块7203,用于基于第一确定子模块7201确定的第一语音特征信息和第二确定子模块7202确定的第二语音特征信息,确定第一获取模块710获取的待响应语音交互指令的产生位置为车辆内部或车辆外部。
在一个可选示例中,
第三确定子模块7203,包括:
第一获取单元,用于基于第一确定子模块7201确定的第一语音特征信息,获取第一语音信号能量,以及基于第二确定子模块7202确定的第二语音特征信息,获取第二语音信号能量;
第一确定单元,用于基于第一获取单元获取的第一语音信号能量与第二语音信号能量之间的大小关系,确定第一获取模块710获取的待响应语音交互指令的产生位置为车辆内部或车辆外部;
或者,
第三确定子模块7203,包括:
第二获取单元,用于基于第一确定子模块7201确定的第一语音特征信息,获取第一语音信号置信度,以及基于第二确定子模块7202确定的第二语音特征信息,获取第二语音信号置信度;
第二确定单元,用于基于第二获取单元获取的第二语音信号置信度和第一语音信号置信度的差值,确定第一获取模块710获取的待响应语音交互指令的产生位置为车辆内部或车辆外部。
在一个可选示例中,如图9所示,第一确定模块720,包括:
第四确定子模块7204,用于确定第一获取模块710获取的待响应语音交互指令的声源方向;
第五确定子模块7205,用于确定车辆内部各乘员的座位位置;
第六确定子模块7206,用于基于第四确定子模块7204确定的声源方向和第五确定子模块7205确定的各乘员的座位位置之间的第一匹配度,确定第一获取模块710获取的待响应语音交互指令的产生位置为车辆内部或车辆外部。
在一个可选示例中,如图10所示,第一确定模块720,包括:
输入子模块7207,用于将第一获取模块710获取的待响应语音交互指令输入预先训练好的语音识别模型;
第一获取子模块7208,用于获取语音识别模型输出的产生位置分类信息。
在一个可选示例中,如图11所示,该装置还包括:
第二获取模块711,用于获取位于车辆内部图像传感器采集的第一图像;
第三获取模块712,用于基于第二获取模块711获取的第一图像,获取第一人员属性信息;
第一确定模块720,包括:
第二获取子模块7209,用于基于第一获取模块710获取的待响应语音交互指令,获取第二人员属性信息;
第七确定子模块7210,用于基于第三获取模块712获取的第一人员属性信息与第二获取子模块7209获取的第二人员属性信息之间的第二匹配度,确定第一获取模块710获取的待响应语音交互指令的产生位置为车辆内部或车辆外部。
在一个可选示例中,如图12所示,该装置还包括:
第四获取模块713,用于获取位于车辆外部图像传感器采集的第二图像;
第五获取模块714,用于基于第四获取模块713获取的第二图像,获取第三人员属性信息;
第七确定子模块7210,具体用于:
基于第三获取模块712获取的第一人员属性信息与第二获取子模块7209获取的第二人员属性信息之间的第二匹配度,以及第五获取模块714获取的第三人员属性信息与第二获取子模块7209获取的第二人员属性信息之间的第三匹配度,确定待响应语音交互指令的产生位置为车辆内部或车辆外部。
在一个可选示例中,如图13所示,第二确定模块730,包括:
第八确定子模块7301,用于若第一获取模块710获取的待响应语音交互指令的产生位置为车辆内部,确定响应第一获取模块710获取的待响应语音交互指令;
控制子模块7302,用于基于第一获取模块710获取的待响应语音交互指令,对车辆和/或车辆上的车载设备进行控制。
示例性电子设备
下面,参考图14来描述根据本公开实施例的电子设备。该电子设备可以是第一设备和第二设备中的任一个或两者、或与它们独立的单机设备,该单机设备可以与第一设备和第二设备进行通信,以从它们接收所采集到的输入信号。
图14图示了根据本公开实施例的电子设备的框图。
如图14所示,电子设备1400包括一个或多个处理器1401和存储器1402。
处理器1401可以是中央处理单元(CPU)或者具有数据处理能力和/或指令执行能力的其他形式的处理单元,并且可以控制电子设备1400中的其他组件以执行期望的功能。
存储器1402可以包括一个或多个计算机程序产品，所述计算机程序产品可以包括各种形式的计算机可读存储介质，例如易失性存储器和/或非易失性存储器。所述易失性存储器例如可以包括随机存取存储器（RAM）和/或高速缓冲存储器（cache）等。所述非易失性存储器例如可以包括只读存储器（ROM）、硬盘、闪存等。在所述计算机可读存储介质上可以存储一个或多个计算机程序指令，处理器1401可以运行所述程序指令，以实现上文所述的本公开的各个实施例的语音交互指令处理方法以及/或者其他期望的功能。在所述计算机可读存储介质中还可以存储诸如输入信号、信号分量、噪声分量等各种内容。
在一个示例中,电子设备1400还可以包括:输入装置1403和输出装置1404,这些组件通过总线系统和/或其他形式的连接机构(未示出)互连。
例如，在该电子设备是第一设备或第二设备时，该输入装置1403可以是麦克风或麦克风阵列。在该电子设备是单机设备时，该输入装置1403可以是通信网络连接器，用于从第一设备和第二设备接收所采集的输入信号。
此外,该输入装置1403还可以包括例如键盘、鼠标等等。
该输出装置1404可以向外部输出各种信息,包括确定出的距离信息、方向信息等。该输出装置1404可以包括例如显示器、扬声器、打印机、以及通信网络及其所连接的远程输出设备等等。
当然,为了简化,图14中仅示出了该电子设备1400中与本公开有关的组件中的一些,省略了诸如总线、输入/输出接口等等的组件。除此之外,根据具体应用情况,电子设备1400还可以包括任何其他适当的组件。
Exemplary Computer Program Product and Computer-Readable Storage Medium
In addition to the above methods and devices, an embodiment of the present disclosure may also be a computer program product, which includes computer program instructions that, when executed by a processor, cause the processor to perform the steps of the voice interaction instruction processing method according to the various embodiments of the present disclosure described in the "Exemplary Method" section of this specification.
The basic principles of the present disclosure have been described above with reference to specific embodiments. However, it should be noted that the merits, advantages, effects, and the like mentioned in the present disclosure are merely examples rather than limitations, and these merits, advantages, and effects shall not be regarded as necessarily possessed by every embodiment of the present disclosure. In addition, the specific details disclosed above are provided only for illustration and ease of understanding, not for limitation; the present disclosure is not required to be implemented with the above specific details.
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for the same or similar parts among the embodiments, reference may be made to one another. Since the system embodiments substantially correspond to the method embodiments, their description is relatively brief, and for related parts, reference may be made to the corresponding description of the method embodiments.
The block diagrams of the devices, apparatuses, equipment, and systems involved in the present disclosure are merely illustrative examples and are not intended to require or imply that they must be connected, arranged, and configured in the manner shown in the block diagrams. As those skilled in the art will appreciate, these devices, apparatuses, equipment, and systems may be connected, arranged, and configured in any manner.
The methods and apparatuses of the present disclosure may be implemented in many ways, for example, by software, hardware, firmware, or any combination of software, hardware, and firmware. The above order of the steps of the methods is for illustration only, and the steps of the methods of the present disclosure are not limited to the order specifically described above unless otherwise specified. Furthermore, in some embodiments, the present disclosure may also be implemented as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing programs for executing the methods according to the present disclosure.
It should also be noted that in the apparatuses, devices, and methods of the present disclosure, the components or steps may be decomposed and/or recombined. Such decompositions and/or recombinations shall be regarded as equivalent solutions of the present disclosure.
The above description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other aspects without departing from the scope of the present disclosure. Therefore, the present disclosure is not intended to be limited to the aspects shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (11)

  1. A voice interaction instruction processing method, comprising:
    acquiring a to-be-responded voice interaction instruction;
    determining a generation position of the to-be-responded voice interaction instruction;
    determining, based on the generation position of the to-be-responded voice interaction instruction, whether to respond to the to-be-responded voice interaction instruction.
  2. The method according to claim 1, wherein the to-be-responded voice interaction instruction comprises a first voice interaction instruction obtained by detecting a speech signal collected inside a vehicle, and a second voice interaction instruction obtained by detecting a speech signal collected outside the vehicle, and an interval between acquisition moments of the second voice interaction instruction and the first voice interaction instruction is shorter than a preset duration;
    the determining the generation position of the to-be-responded voice interaction instruction comprises:
    determining first speech feature information of the first voice interaction instruction;
    determining second speech feature information of the second voice interaction instruction;
    determining, based on the first speech feature information and the second speech feature information, that the generation position of the to-be-responded voice interaction instruction is inside or outside the vehicle.
  3. The method according to claim 2, wherein
    the determining, based on the first speech feature information and the second speech feature information, that the generation position of the to-be-responded voice interaction instruction is inside or outside the vehicle comprises:
    acquiring first speech signal energy based on the first speech feature information, and acquiring second speech signal energy based on the second speech feature information;
    determining, based on a magnitude relationship between the first speech signal energy and the second speech signal energy, that the generation position of the to-be-responded voice interaction instruction is inside or outside the vehicle;
    or,
    the determining, based on the first speech feature information and the second speech feature information, that the generation position of the to-be-responded voice interaction instruction is inside or outside the vehicle comprises:
    acquiring a first speech signal confidence based on the first speech feature information, and acquiring a second speech signal confidence based on the second speech feature information;
    determining, based on a difference between the second speech signal confidence and the first speech signal confidence, that the generation position of the to-be-responded voice interaction instruction is inside or outside the vehicle.
  4. The method according to claim 1, wherein the determining the generation position of the to-be-responded voice interaction instruction comprises:
    determining a sound source direction of the to-be-responded voice interaction instruction;
    determining the seat position of each occupant inside a vehicle;
    determining, based on a first matching degree between the sound source direction and the seat positions of the occupants, that the generation position of the to-be-responded voice interaction instruction is inside or outside the vehicle.
  5. The method according to claim 1, wherein the determining the generation position of the to-be-responded voice interaction instruction comprises:
    inputting the to-be-responded voice interaction instruction into a pre-trained speech recognition model, and acquiring generation-position classification information output by the speech recognition model.
  6. The method according to claim 1, further comprising:
    acquiring a first image captured by an image sensor located inside a vehicle;
    acquiring first person attribute information based on the first image;
    the determining the generation position of the to-be-responded voice interaction instruction comprises:
    acquiring second person attribute information based on the to-be-responded voice interaction instruction;
    determining, based on a second matching degree between the first person attribute information and the second person attribute information, that the generation position of the to-be-responded voice interaction instruction is inside or outside the vehicle.
  7. The method according to claim 6, further comprising:
    acquiring a second image captured by an image sensor located outside the vehicle;
    acquiring third person attribute information based on the second image;
    the determining, based on the second matching degree between the first person attribute information and the second person attribute information, that the generation position of the to-be-responded voice interaction instruction is inside or outside the vehicle comprises:
    determining, based on the second matching degree between the first person attribute information and the second person attribute information, and a third matching degree between the third person attribute information and the second person attribute information, that the generation position of the to-be-responded voice interaction instruction is inside or outside the vehicle.
  8. The method according to any one of claims 1-7, wherein the determining, based on the generation position of the to-be-responded voice interaction instruction, whether to respond to the to-be-responded voice interaction instruction comprises:
    if the generation position of the to-be-responded voice interaction instruction is inside a vehicle, determining to respond to the to-be-responded voice interaction instruction, and controlling the vehicle and/or an on-board device on the vehicle based on the to-be-responded voice interaction instruction.
  9. A voice interaction instruction processing apparatus, comprising:
    a first acquiring module, configured to acquire a to-be-responded voice interaction instruction;
    a first determining module, configured to determine a generation position of the to-be-responded voice interaction instruction acquired by the first acquiring module;
    a second determining module, configured to determine, based on the generation position of the to-be-responded voice interaction instruction determined by the first determining module, whether to respond to the to-be-responded voice interaction instruction acquired by the first acquiring module.
  10. A computer-readable storage medium storing a computer program, the computer program being used for executing the voice interaction instruction processing method according to any one of claims 1-8.
  11. An electronic device, comprising:
    a processor;
    a memory for storing instructions executable by the processor;
    wherein the processor is configured to read the executable instructions from the memory and execute the instructions to implement the voice interaction instruction processing method according to any one of claims 1-8.
PCT/CN2022/119828 2021-12-21 2022-09-20 Voice interaction instruction processing method and apparatus, and computer-readable storage medium WO2023116087A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111574733.2A CN114255753A (zh) 2021-12-21 2021-12-21 Voice interaction instruction processing method and apparatus, and computer-readable storage medium
CN202111574733.2 2021-12-21

Publications (1)

Publication Number Publication Date
WO2023116087A1 true WO2023116087A1 (zh) 2023-06-29

Family

ID=80793824

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/119828 WO2023116087A1 (zh) 2021-12-21 2022-09-20 Voice interaction instruction processing method and apparatus, and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN114255753A (zh)
WO (1) WO2023116087A1 (zh)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108986806A (zh) * 2018-06-30 2018-12-11 上海爱优威软件开发有限公司 Voice control method and system based on sound source direction
CN109545219A (zh) * 2019-01-09 2019-03-29 北京新能源汽车股份有限公司 In-vehicle voice interaction method, system, device, and computer-readable storage medium
CN109781134A (zh) * 2018-12-29 2019-05-21 百度在线网络技术(北京)有限公司 Navigation control method, apparatus, in-vehicle terminal, and storage medium
CN111653277A (zh) * 2020-06-10 2020-09-11 北京百度网讯科技有限公司 Vehicle voice control method, apparatus, device, vehicle, and storage medium
CN112164395A (zh) * 2020-09-18 2021-01-01 北京百度网讯科技有限公司 In-vehicle voice activation method, apparatus, electronic device, and storage medium
CN112581981A (zh) * 2020-11-04 2021-03-30 北京百度网讯科技有限公司 Human-machine interaction method, apparatus, computer device, and storage medium
CN112655000A (zh) * 2020-04-30 2021-04-13 华为技术有限公司 In-vehicle user positioning method, in-vehicle interaction method, in-vehicle apparatus, and vehicle
CN112802468A (zh) * 2020-12-24 2021-05-14 广汽蔚来新能源汽车科技有限公司 Interaction method and apparatus for vehicle smart terminal, computer device, and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8214219B2 (en) * 2006-09-15 2012-07-03 Volkswagen Of America, Inc. Speech communications system for a vehicle and method of operating a speech communications system for a vehicle
JP2017193207A (ja) * 2016-04-18 2017-10-26 アイシン精機株式会社 In-vehicle conversation support device
CN109941231B (zh) * 2019-02-21 2021-02-02 初速度(苏州)科技有限公司 In-vehicle terminal device, in-vehicle interaction system, and interaction method
CN112201259B (zh) * 2020-09-23 2022-11-25 北京百度网讯科技有限公司 Sound source localization method, apparatus, device, and computer storage medium
CN112735381B (zh) * 2020-12-29 2022-09-27 四川虹微技术有限公司 Model updating method and apparatus


Also Published As

Publication number Publication date
CN114255753A (zh) 2022-03-29


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22909405

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE