WO2022188560A1 - Methods for distance relationship determination, device control and model training, and related apparatuses - Google Patents
Methods for distance relationship determination, device control and model training, and related apparatuses
- Publication number
- WO2022188560A1 PCT/CN2022/072703 CN2022072703W
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- distance
- data
- target
- voice
- devices
- Prior art date
Links
- 238000000034 method Methods 0.000 title claims abstract description 98
- 238000012549 training Methods 0.000 title claims abstract description 84
- 230000004044 response Effects 0.000 claims description 84
- 239000013598 vector Substances 0.000 claims description 65
- 230000006870 function Effects 0.000 claims description 61
- 238000012545 processing Methods 0.000 claims description 39
- 238000000605 extraction Methods 0.000 claims description 32
- 230000015654 memory Effects 0.000 claims description 27
- 238000004590 computer program Methods 0.000 claims description 19
- 230000004927 fusion Effects 0.000 claims description 14
- 238000007781 pre-processing Methods 0.000 claims description 10
- 238000005457 optimization Methods 0.000 claims description 6
- 238000012360 testing method Methods 0.000 claims description 6
- 230000001629 suppression Effects 0.000 claims description 5
- 230000011218 segmentation Effects 0.000 claims description 3
- 230000009467 reduction Effects 0.000 claims description 2
- 238000004891 communication Methods 0.000 description 21
- 238000010586 diagram Methods 0.000 description 17
- 238000004422 calculation algorithm Methods 0.000 description 14
- 230000008569 process Effects 0.000 description 14
- 230000009286 beneficial effect Effects 0.000 description 8
- 238000004364 calculation method Methods 0.000 description 8
- 230000003993 interaction Effects 0.000 description 8
- 230000004807 localization Effects 0.000 description 6
- 230000009471 action Effects 0.000 description 5
- 230000001360 synchronised effect Effects 0.000 description 5
- 238000013528 artificial neural network Methods 0.000 description 4
- 238000012546 transfer Methods 0.000 description 4
- 230000008878 coupling Effects 0.000 description 3
- 238000010168 coupling process Methods 0.000 description 3
- 238000005859 coupling reaction Methods 0.000 description 3
- 238000012216 screening Methods 0.000 description 3
- 230000005540 biological transmission Effects 0.000 description 2
- 238000009432 framing Methods 0.000 description 2
- 239000004065 semiconductor Substances 0.000 description 2
- 230000003595 spectral effect Effects 0.000 description 2
- 238000001228 spectrum Methods 0.000 description 2
- 230000003068 static effect Effects 0.000 description 2
- 238000013459 approach Methods 0.000 description 1
- 238000013473 artificial intelligence Methods 0.000 description 1
- 238000013527 convolutional neural network Methods 0.000 description 1
- 238000013500 data storage Methods 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000001514 detection method Methods 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 238000001914 filtration Methods 0.000 description 1
- 230000037433 frameshift Effects 0.000 description 1
- 238000007499 fusion processing Methods 0.000 description 1
- 238000002955 isolation Methods 0.000 description 1
- 238000010801 machine learning Methods 0.000 description 1
- 239000000203 mixture Substances 0.000 description 1
- 238000010295 mobile communication Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 238000012367 process mapping Methods 0.000 description 1
- 238000010187 selection method Methods 0.000 description 1
- 230000011664 signaling Effects 0.000 description 1
- 239000007787 solid Substances 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/225—Feedback of the input speech
Definitions
- the present application belongs to the technical field of voice assistants, and in particular relates to methods for distance relationship determination, device control, and model training, and related apparatuses.
- As artificial intelligence technology ushers in its third wave, voice assistants have gradually entered all aspects of life and are built into smart devices such as mobile phones, watches, speakers, and TVs. Because of the rich variety of devices, multiple devices with voice assistant functions may be present in the same space.
- in a conventional scheme, a microphone array is used to measure the distance to the sound source; the device closest to the sound source is then determined by distance comparison and woken up to execute the user's instructions.
- the present application provides methods and related apparatuses for distance relationship determination, device control, and model training, in order to improve the comprehensiveness, efficiency, and convenience with which an arbitration device in a nearest-wake-up product solution determines the distance relationship between the sound source target and each device.
- the present application provides a method for determining a distance relationship, which is applied to an arbitration device, and the method includes:
- each relative distance identifier in the at least one relative distance identifier is used to indicate the position of the corresponding device in the device distance relationship sequence
- the device distance relationship sequence is a sequence formed by sorting the plurality of devices according to a preset sorting strategy for distances, The distance refers to the distance between the device and the sound source target.
- the arbitration device first obtains multiple pieces of sound collection data from multiple devices, and then, according to the reference voice data and device identifiers in the multiple pieces of sound collection data and a pre-trained distance comparison model, determines at least one relative distance identifier corresponding one-to-one to at least one of the multiple devices. Each relative distance identifier indicates the position of the corresponding device in the device distance relationship sequence, which is formed by sorting the multiple devices by distance according to a preset sorting strategy, the distance being the distance between the device and the sound source target.
- the model can predict the relative position of each device, such as device 1, device 2, and device 3, in the distance sequence ordered by distance from the sound source target.
- the device distance relationship sequence is device 3 ⁇ device 2 ⁇ device 1
- the prediction result can indicate that device 3 is closest to the sound source target by assigning device 3 the relative distance identifier 1; this contrasts with existing models, which predict each device's absolute distance in isolation.
- the distance comparison model is used to predict the relative distance relationship including global information in the multi-device voice interaction scene.
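As a concrete illustration of the relative-distance prediction described above, the sketch below derives 1-based relative distance identifiers from per-device distance scores. The scores, device names, and the convention that a smaller score means a closer device are hypothetical; the application does not specify the model's raw output format.

```python
# Hypothetical sketch: convert per-device distance scores into relative
# distance identifiers (1 = closest). A real distance comparison model
# would produce the scores from each device's reference voice data.

def relative_distance_ids(scores):
    """Map each device to its 1-based position in the device distance
    relationship sequence, sorting by predicted score (smaller = closer)."""
    ordered = sorted(scores, key=scores.get)
    return {device: rank for rank, device in enumerate(ordered, start=1)}

# Example matching the text: device 3 < device 2 < device 1 by distance.
scores = {"device1": 0.9, "device2": 0.6, "device3": 0.2}
ids = relative_distance_ids(scores)
# ids["device3"] == 1 indicates device 3 is closest to the sound source.
```

The identifier for a device is thus its rank in the global ordering, not an absolute distance, which matches the "relative distance relationship including global information" described above.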
- the present application provides a device control method, which is applied to a target device, and the method includes:
- the indication information is generated when the arbitration device determines, according to the at least one relative distance identifier, the target device among the multiple devices for executing the voice command associated with the sound of the sound source target. The at least one relative distance identifier is obtained by the arbitration device performing the following operations: acquiring multiple pieces of sound collection data corresponding one-to-one to the multiple devices, each piece of sound collection data including the reference voice data obtained by the corresponding device collecting the sound of the sound source target and the device identifier of that device; and determining the at least one relative distance identifier according to the reference voice data and device identifiers in the multiple pieces of sound collection data and the pre-trained distance comparison model, where each relative distance identifier indicates the position of the corresponding device in the device distance relationship sequence, the device distance relationship sequence being a sequence formed by sorting the multiple devices by distance according to a preset sorting strategy.
- the operation indicated by the voice instruction associated with the sound of the sound source target is performed according to the indication information.
- the target device first obtains the indication information from the arbitration device, and then, according to the indication information, executes the operation indicated by the voice instruction associated with the sound of the sound source target.
- the indication information is generated by the arbitration device when it determines, according to the at least one relative distance identifier corresponding to at least one of the multiple devices, the target device among the multiple devices for executing the voice command associated with the sound of the sound source target
- the relative distance identifier is used to indicate the position of the distance between the corresponding device and the sound source target in the distance sequence
- the distance sequence is a sequence formed by sorting the multiple distances according to the preset sorting strategy.
- compared with the existing scheme of determining the nearest wake-up device based on the absolute distance between each device and the sound source target, the present application uses the distance comparison model to predict a relative distance relationship that contains the global information of the multi-device voice interaction scene, and determines the target device to be woken up according to that relative distance relationship.
- since each piece of sound collection data includes the reference voice data obtained by the corresponding device collecting the sound of the sound source target and the device identifier of that device, and the reference voice data imposes no limitation on the number of channels, the scheme overcomes the high hardware requirements and algorithmic complexity of conventional microphone-array sound source localization, which is beneficial to the efficiency and applicability of relative distance relationship prediction.
- the present application provides a method for training a distance comparison model, including:
- acquire training data, where the training data includes multiple voice data sets; each voice data set includes multiple pieces of reference voice data corresponding one-to-one to multiple devices, each piece of reference voice data being the voice data obtained by the corresponding device collecting the sound of the sound source target; the multiple voice data sets are collected under different sound collection environments, and each sound collection environment includes at least the location of the sound source target;
- a preset distance comparison model is trained according to the reference voice data of the multiple voice data sets and a preset loss function to obtain a trained distance comparison model. The loss function represents the loss of the distance comparison model in terms of the accuracy of predicting the relative distance relationship between the two devices of a device pairing group and the sound source target in the same sound collection environment, where a device pairing group consists of any two of the multiple devices.
- the device first obtains training data, and then trains a preset distance comparison model according to the reference voice data of the multiple voice data sets and a preset loss function, obtaining a trained distance comparison model.
- the training data includes multiple voice data sets
- each voice data set includes multiple reference voice data corresponding to multiple devices one-to-one
- each reference voice data is the voice data obtained by the corresponding device collecting the sound of the sound source target
- the multiple voice data sets correspond to the voice data sets collected in different sound collection environments
- the loss function represents the loss of the distance comparison model in terms of the accuracy of predicting the relative distance relationship between the two devices of a device pairing group and the sound source target in the same sound collection environment, and a device pairing group consists of any two of the multiple devices. The trained model therefore has the ability to predict the relative distance between a device and the sound source target, and since the reference voice data imposes no limitation on the number of channels, the scheme overcomes the high hardware requirements and algorithmic complexity of conventional microphone-array sound source localization, which is beneficial to the efficiency and applicability of relative distance relationship prediction.
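The pairwise loss over device pairing groups can be sketched as follows. This RankNet-style logistic loss on the score difference is an assumption for illustration; the application only states that the loss measures the accuracy of the predicted closer/farther relation for each device pair in the same sound collection environment.

```python
# Hypothetical pairwise ranking loss for a device pairing group (i, j):
# the model emits distance scores s_i, s_j (smaller = closer), and the
# loss is small when their ordering matches the ground-truth relation.
import math

def pairwise_rank_loss(s_i, s_j, i_is_closer):
    """Logistic loss on the score difference for one device pair."""
    diff = s_i - s_j               # positive: model says i is farther
    y = -1.0 if i_is_closer else 1.0  # desired sign of diff
    return math.log1p(math.exp(-y * diff))

# Correct ordering gives a small loss, wrong ordering a large one.
good = pairwise_rank_loss(0.2, 0.8, i_is_closer=True)
bad = pairwise_rank_loss(0.8, 0.2, i_is_closer=True)
# good < bad
```

In training, such a term would be summed over all device pairing groups drawn from each voice data set, so the model only ever has to learn relative orderings rather than absolute distances.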
- the present application provides an apparatus for determining a distance relationship, which is applied to an arbitration device, and the apparatus includes:
- an acquisition unit configured to acquire a plurality of pieces of sound collection data corresponding one-to-one to a plurality of devices, each piece of sound collection data including the reference voice data obtained by the corresponding device collecting the sound of the sound source target and the device identifier of that device;
- a determination unit configured to determine, according to the reference voice data and device identifiers in the plurality of pieces of sound collection data and the pre-trained distance comparison model, at least one relative distance identifier corresponding one-to-one to at least one of the plurality of devices, where each relative distance identifier indicates the position of the corresponding device in the device distance relationship sequence, the device distance relationship sequence being a sequence formed by sorting the plurality of devices by distance according to a preset sorting strategy, the distance being the distance between the device and the sound source target.
- the present application provides a device control device, which is applied to an electronic device, and the device includes:
- an acquiring unit configured to acquire indication information from the arbitration device, where the indication information is generated when the arbitration device determines, according to at least one relative distance identifier corresponding to at least one of the plurality of devices, the target device among the plurality of devices for executing the voice command associated with the sound of the sound source target. The at least one relative distance identifier is obtained by the arbitration device performing the following operations: acquiring multiple pieces of sound collection data corresponding one-to-one to the multiple devices, each piece of sound collection data including the reference voice data obtained by the corresponding device collecting the sound of the sound source target and the device identifier of that device; and determining the at least one relative distance identifier according to the reference voice data and device identifiers in the multiple pieces of sound collection data and the pre-trained distance comparison model, where each relative distance identifier indicates the position of the corresponding device in the device distance relationship sequence, the device distance relationship sequence being a sequence formed by sorting the multiple devices by distance according to a preset sorting strategy;
- An execution unit configured to execute the operation indicated by the voice instruction associated with the sound of the sound source target according to the instruction information.
- the present application provides a training device for a distance comparison model, including:
- an acquiring unit configured to acquire training data, where the training data includes multiple voice data sets, each voice data set in the multiple voice data sets includes multiple reference voice data corresponding to multiple devices one-to-one, the Each reference voice data in the multiple reference voice data is voice data obtained by the corresponding device collecting the sound of the sound source target, and the multiple voice data sets correspond to voice data sets collected under different sound collection environments, the The sound collection environment at least includes the location where the sound source target is located;
- a training unit configured to train a preset distance comparison model according to the reference voice data of the plurality of voice data sets and a preset loss function to obtain a trained distance comparison model, the loss function representing the loss of the distance comparison model in terms of the accuracy of predicting the relative distance relationship between the two devices of a device pairing group and the sound source target in the same sound collection environment;
- the device pairing group is composed of any two devices in the plurality of devices.
- the present application provides an electronic device including one or more processors, one or more memories, and one or more programs, where the one or more programs are stored in the one or more memories and are configured to be executed by the one or more processors, and include instructions for performing the steps of any method in the first, second, or third aspect of the embodiments of the present application.
- the present application provides a chip, including a processor for calling and running a computer program from a memory, so that a device installed with the chip executes any method in the first, second, or third aspect of the embodiments of the present application.
- the present application provides a computer-readable storage medium storing a computer program for electronic data exchange, where the computer program causes a computer to execute part or all of the steps of any method described in the embodiments of the present application.
- the present application provides a computer program operable to cause a computer to execute part or all of the steps of the methods described in the first, second, or third aspect of the embodiments of the present application.
- the computer program may be a software installation package.
- 1a is a schematic diagram of user control in a multi-device scenario provided by an embodiment of the present application
- FIG. 1b is an architecture diagram of a device control system 10 provided by an embodiment of the present application.
- 1c is a schematic diagram of a functional interface of an intelligent voice assistant provided by an embodiment of the present application.
- 1d is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
- FIG. 2 is a schematic flowchart of a method for determining a distance relationship provided by an embodiment of the present application
- FIG. 3 is a schematic flowchart of a device control method provided by an embodiment of the present application.
- FIG. 4 is a schematic flowchart of a training method for a distance comparison model provided by an embodiment of the present application
- FIG. 5 is a block diagram of functional units of a device for determining a distance relationship provided by an embodiment of the present application
- FIG. 6 is a block diagram of functional units of another distance relationship determination device provided by an embodiment of the present application.
- FIG. 7 is a block diagram of functional units of a device control device provided by an embodiment of the present application.
- FIG. 8 is a block diagram of functional units of another device control apparatus provided by an embodiment of the present application.
- FIG. 9 is a block diagram of functional units of a training device for a distance comparison model provided by an embodiment of the present application.
- FIG. 10 is a block diagram of functional units of another distance comparison model training device provided by an embodiment of the present application.
- as shown in FIG. 1a, the space where the user is located contains a smart speaker (0.5 m from the user), smart TV 1 (0.6 m from the user), a computer (1.2 m from the user), and smart TV 2 (0.55 m from the user).
- a current intelligent voice assistant can only measure the distance between a device containing a microphone array and the sound source; if the computer does not contain a microphone array, the intelligent voice assistant cannot calculate its distance and cannot accurately realize nearest-device wake-up control.
- the embodiments of the present application provide a method and related apparatus for distance relationship determination, device control, and model training, which are described in detail below with reference to the accompanying drawings.
- FIG. 1b is a device control system 10 provided by an embodiment of the present application.
- the device control system 10 includes an electronic device 100 (such as a smart TV, a smart speaker, a smart phone, etc.) with a sound collection capability, an arbitration device 200 installed with an intelligent voice assistant, and a server 300.
- the arbitration device 200 may be one of the electronic devices 100; it may also be a mobile device such as the user's mobile phone, a dedicated control box in the smart home scene, a server in the cloud, or a device group composed of multiple devices that jointly complete data processing. The arbitration device 200 is communicatively connected to the electronic devices 100 and the server 300 to form a device control network in a smart home scenario.
- the intelligent voice assistant can be installed on various devices such as mobile phones to support the device control method of the present application. The specific function names and interface interaction methods presented by the intelligent voice assistant can vary and are not uniquely limited here; for example, when installed on a mobile phone it may present the settings interface of the "Breeno" smart assistant as shown in FIG. 1c.
- the illustrated interface includes one-key command function settings, including a navigate-home function, a nearby function, an arriving-home reminder function, a framed-screenshot function, and a multi-device control function.
- digital tags can be used to identify the proximity of each device to the sound source target, i.e., the user.
- the arbitration device 200 can exchange data and signaling with other devices (for example, the electronic devices 100 and the server 300) in various ways, which are not uniquely limited here.
- the arbitration device 200 may be directly connected with the electronic device 100 to obtain corresponding information, and the arbitration device 200 may be connected to the server 300 through a mobile communication network to realize corresponding information exchange and the like.
- FIG. 1d is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
- the electronic device is applied to the above-mentioned device control system 10.
- the electronic device includes an application processor 120, a memory 130, a communication module 140, and one or more programs 131.
- the application processor 120 is communicatively connected to the memory 130 and the communication module 140 through an internal communication bus.
- the one or more programs 131 are stored in the above-mentioned memory 130 and configured to be executed by the above-mentioned application processor 120, and include instructions for executing any step in the above-mentioned method embodiments.
- the application processor 120 may be, for example, a central processing unit (CPU), a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and may implement or execute the various exemplary logical blocks, units, and circuits described in connection with this disclosure.
- the processor may also be a combination that implements computing functions, such as a combination of one or more microprocessors, a combination of a DSP and a microprocessor, and the like.
- the communication unit may be a communication module 140 , a transceiver, a transceiver circuit, etc., and the storage unit may be the memory 130 .
- the memory 130 may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory.
- the non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory.
- Volatile memory may be random access memory (RAM), which acts as an external cache.
- static random access memory (SRAM)
- dynamic random access memory (DRAM)
- synchronous dynamic random access memory (SDRAM)
- double data rate synchronous dynamic random access memory (DDR SDRAM)
- enhanced synchronous dynamic random access memory (ESDRAM)
- synchlink dynamic random access memory (SLDRAM)
- direct rambus random access memory (DR RAM)
- the application processor 120 is configured to perform any step performed by the arbitration device, the target device, or the model training device in the method embodiment of the present application.
- FIG. 2 is a schematic flowchart of a method for determining a distance relationship provided by an embodiment of the present application, which is applied to the arbitration device 200 in the device control system 10. As shown in the figure, the device control method includes the following operations.
- Step 201: Acquire a plurality of pieces of sound collection data corresponding one-to-one to a plurality of devices, each piece of sound collection data including reference voice data obtained by the corresponding device collecting the sound of the sound source target and the device identifier of that device.
- the plurality of devices may be the electronic devices 100 in the device control system 10 described above, and neither the number nor the device type is uniquely limited.
- the sound source target includes a user or a pronunciation device, which is not uniquely limited here.
- the sound of the sound source target may be a wake-up voice, such as "Hello Xiaoou" and the like.
- each device can wait for the wake-up voice at the same time.
- the device obtains a monophonic segment of the respective wake-up speech.
- the reference voice data may first be subjected to 4 kHz low-pass filtering to suppress non-speech audio components.
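A minimal sketch of such a 4 kHz low-pass preprocessing step, using SciPy's Butterworth design; the filter order and 16 kHz sample rate are illustrative choices, not values taken from the application.

```python
# Hypothetical sketch: attenuate content above 4 kHz in the reference
# voice data to suppress non-speech audio before further processing.
import numpy as np
from scipy.signal import butter, sosfilt

def lowpass_4khz(samples, sample_rate=16000):
    """Apply an 8th-order Butterworth low-pass filter at 4 kHz."""
    sos = butter(8, 4000, btype="low", fs=sample_rate, output="sos")
    return sosfilt(sos, samples)

# Example: a 6 kHz tone (above the cutoff) is strongly attenuated.
t = np.arange(16000) / 16000.0
tone = np.sin(2 * np.pi * 6000 * t)
filtered = lowpass_4khz(tone)
```

Second-order sections (`output="sos"`) are used here because they are numerically better behaved than transfer-function coefficients at higher filter orders.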
- the reference voice data may be voice data before frequency response capability alignment, feature extraction, and feature fusion processing.
- the arbitration device uniformly performs relevant preprocessing on the reference voice data of multiple devices.
- the reference voice data can also be voice data already processed by each device through frequency response capability alignment, feature extraction, and feature fusion. After the reference voice data and device identifiers of the multiple devices are obtained, the pre-trained distance comparison model is invoked to predict at least one relative distance identifier for at least one device.
- Step 202: Determine, according to the reference voice data and device identifiers in the plurality of pieces of sound collection data and the pre-trained distance comparison model, at least one relative distance identifier corresponding one-to-one to at least one of the multiple devices. Each relative distance identifier indicates the position of the corresponding device in the device distance relationship sequence, which is formed by sorting the plurality of devices by distance according to a preset sorting strategy, the distance being the distance between the device and the sound source target.
- the preset sorting strategy may include from large to small or from small to large, etc., which is not uniquely limited here.
- the reference voice data includes monophonic voice data or multi-channel voice data, that is, there is no need to make higher requirements on the voice collection capability of the device, and the application is more convenient.
- the relative distance identifiers may be numbers (such as 1/2/3/4), graphics (such as line segments with different lengths), etc., which are not uniquely limited here.
- the prediction result can be: the relative distance of device A is identified as 1 (closest), the relative distance of device B is identified as 3, the relative distance of device C is identified as 4 (farthest), and the relative distance of device D is identified as 2.
- the arbitration device can directly select a target device with the closest relative distance to the sound source target as the wake-up device, and execute the user's voice instruction through the target device.
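A minimal sketch of this selection logic, using the hypothetical prediction result above (device names and identifier values are illustrative):

```python
# Hypothetical prediction result: relative distance identifiers per device,
# where 1 marks the device nearest to the sound source target.
relative_ids = {"A": 1, "B": 3, "C": 4, "D": 2}

# Sorting by identifier recovers the device distance relationship sequence.
sequence = sorted(relative_ids, key=relative_ids.get)
print(sequence)               # ['A', 'D', 'B', 'C']  (nearest -> farthest)

target_device = sequence[0]   # nearest device is chosen as the wake-up device
print(target_device)          # 'A'
```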
- the arbitration device may further query the preset device wake-up priority set to determine the device to be woken up first. In this way, the accuracy and success rate of device control can be improved.
- the output of the distance comparison model is the relative distance identifier of each voice data; the arbitration device can then determine the relative distance identifier of the corresponding device according to the correspondence between the voice data and the device identifiers.
- in this case, the model input data does not include the device identifier; the voice data corresponds to the device identifier, the prediction result corresponds one-to-one with the voice data, and the prediction result is therefore associated with the device identifier indirectly through the voice data.
- the input of the distance comparison model contains both audio data and device identifiers. For example, for mobile phone 1, mobile phone 2, and mobile phone 3, where mobile phone 1 and mobile phone 2 are of device type 1 and mobile phone 3 is of device type 2, the identifier of mobile phone 1 can be type 1 + name of mobile phone 1, the identifier of mobile phone 2 is type 1 + name of mobile phone 2, and the identifier of mobile phone 3 is type 2 + name of mobile phone 3. The output of the distance comparison model is then directly the relative distance identifier of each device.
- that is, the relative distance identifier (the prediction result) can directly correspond to the device identifier.
- determining at least one relative distance identifier comprises: aligning the frequency response capability of each reference voice data among the multiple reference voice data in the multiple sound collection data to obtain multiple target voice data with aligned frequency response capabilities in one-to-one correspondence with the multiple sound collection data; and determining, according to the multiple target voice data, the device identifiers in the multiple sound collection data, and the pre-trained distance comparison model, at least one relative distance identifier in one-to-one correspondence with at least one of the multiple devices.
- the voice signals obtained by heterogeneous devices are adapted to the same standard through the frequency response capability alignment process, which provides a unified data basis for distance comparison. Compared with the existing solution that simply normalizes only the speech energy ratio, it is beneficial to improve the accuracy of the model prediction result.
- performing frequency response capability alignment on the multiple reference voice data in the multiple sound collection data to obtain multiple target voice data after frequency response capability alignment includes: performing the following operations on each reference voice data in the multiple sound collection data: obtaining, according to the device identifier of the currently processed reference voice data, the preset frequency response unit impulse response of the current device relative to the reference device; and performing a convolution operation on the reference voice data and the frequency response unit impulse response and adjusting the gain, to obtain the target voice data after frequency response capability alignment.
- the speech content of the reference voice data remains unchanged after the adjustment.
- the reference device can be a device specified in multiple devices.
- the frequency response unit impulse response of the current acquisition device relative to the reference device is determined through the following steps a1 to a5:
- Step a2 obtain the statistical average of all frequency response curves of each device as the final frequency response curve of that device:
- K is the number of points in the frequency response curve, which is a positive integer.
- Step a3 select a device as the reference device, with its type denoted as b ∈ [1,2,...,L]; the selection may be, for example, the device most frequently used by the user. Calculate the ratio of the frequency response curve of each device to that of the reference device to obtain the frequency response transfer function between devices:
- Step a4 perform an inverse discrete Fourier transform (IDFT) on the frequency response transfer function to obtain the frequency response unit impulse response corresponding to the transfer function:
- IDFT: inverse discrete Fourier transform
- Step a5 save the frequency response unit impulse response t l (n) in the corresponding device.
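The ratio/IDFT/convolution chain of steps a3 to a5 can be sketched as follows; the 8-point frequency-response curves are illustrative assumptions:

```python
import numpy as np

# Hypothetical averaged frequency-response curves (K = 8 points).
H_dev = np.array([1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3])  # current device
H_ref = np.array([1.0, 1.0, 0.9, 0.8, 0.8, 0.7, 0.6, 0.5])  # reference device

T = H_dev / H_ref                 # step a3: transfer function between devices
t_n = np.fft.ifft(T).real         # step a4: IDFT -> frequency-response unit impulse response

# Step a5 would save t_n on the device; alignment then convolves the
# device's speech with t_n and adjusts the gain.
speech = np.random.default_rng(0).normal(size=256)
aligned = np.convolve(speech, t_n)[:len(speech)]
aligned *= np.linalg.norm(speech) / np.linalg.norm(aligned)  # gain adjustment
print(aligned.shape)              # (256,)
```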
- determining, according to the multiple target voice data, the device identifiers in the plurality of sound collection data, and the pre-trained distance comparison model, at least one relative distance identifier in one-to-one correspondence with at least one device among the plurality of devices includes: performing multi-dimensional feature extraction on each target voice data among the multiple target voice data to obtain multiple voice feature sets in one-to-one correspondence with the multiple target voice data, each voice feature set including multi-dimensional feature extraction results; performing feature fusion on each voice feature set among the multiple voice feature sets to obtain multiple fused voice features; and determining, according to the multiple fused voice features, the device identifiers in the plurality of sound collection data, and the pre-trained distance comparison model, at least one relative distance identifier corresponding to at least one device among the plurality of devices.
- performing multi-dimensional feature extraction on each of the multiple target voice data to obtain multiple voice feature sets in one-to-one correspondence with the multiple target voice data includes: performing the following operations on each target voice data among the multiple target voice data to obtain multiple voice feature sets: extracting the scalar voice features and vector voice features of the currently processed target voice data; and performing dimensionality reduction and secondary feature extraction on the vector voice features to obtain vector-derived voice features.
- extracting the scalar speech features and vector speech features of the currently processed target speech data includes: preprocessing the currently processed target speech data to obtain preprocessed target speech data; and extracting the scalar voice features and vector voice features of the preprocessed target voice data; wherein the preprocessing includes at least one of the following: silence suppression processing, pre-emphasis processing through a high-frequency filter, frame segmentation processing, and windowing processing.
- the purpose of the silence suppression (Voice Activity Detection, VAD) processing is to identify and eliminate long silence periods therefrom, so as to extract the most effective voice segments for use in subsequent steps, for example, the WebRTC VAD algorithm can be used.
- the purpose of the pre-emphasis processing is to compensate for the high-frequency components of the speech signal lost due to the influence of the pronunciation system, and to highlight the high-frequency formants. Pre-emphasis is achieved by a high-frequency filter whose transfer function is H(z) = 1 − μz⁻¹, where:
- z is the complex variable of the Z-transform of the speech signal
- μ is the pre-emphasis coefficient, generally between 0.9 and 1.0; in this application it may be, for example, 0.97.
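A minimal sketch of the pre-emphasis filter y[n] = x[n] − μ·x[n−1] (the coefficient 0.97 follows the text; the function name is illustrative):

```python
import numpy as np

def pre_emphasis(x, mu=0.97):
    """y[n] = x[n] - mu * x[n-1], i.e. the filter H(z) = 1 - mu * z^-1."""
    return np.append(x[0], x[1:] - mu * x[:-1])

x = np.array([1.0, 1.0, 1.0, 1.0])
y = pre_emphasis(x)
print(y)   # approximately [1., 0.03, 0.03, 0.03]
```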
- the frame length is set to 25ms
- the frame shift is set to 10ms, that is, there is an overlap area of 15ms between two adjacent frames, which can avoid the influence of excessive changes in the speech signals of the two adjacent frames.
- s_i[n] represents the data of the i-th frame.
- N is the length of s_i[n].
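The 25 ms / 10 ms framing can be sketched as follows; the 16 kHz sampling rate and Hamming window are assumptions not stated in this passage:

```python
import numpy as np

def frame_signal(x, fs=16000, frame_ms=25, shift_ms=10):
    frame_len = fs * frame_ms // 1000            # 400 samples at 16 kHz
    shift = fs * shift_ms // 1000                # 160 samples -> 15 ms overlap
    n_frames = 1 + (len(x) - frame_len) // shift
    idx = np.arange(frame_len) + shift * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)        # windowing (Hamming assumed)

x = np.zeros(16000)                              # 1 s of dummy audio
frames = frame_signal(x)                         # frames[i] is s_i[n]
print(frames.shape)                              # (98, 400)
```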
- for the scalar speech features, 5 kinds can be extracted, as shown in Table 1.
- the scalar speech feature is extracted according to its definition, and the specific calculation process will not be repeated.
- the above scalar speech features can all be feature values of shape (1,), which mainly describe the characteristics of the user's current sound field environment, such as reverberation, reflection, sound transmission gain, noise, and the like.
- the form of (1,) represents the data dimension
- (1,) represents a 1×1-dimensional vector
- (245,) represents a 245×1-dimensional vector, of which the 245×1 vector is spliced from the preceding features.
- N is a positive integer
- when the features are batched, the dimension of each feature changes accordingly: (1,) → (1,N), that is, 1×N; similarly, (245,) → (245,N), that is, 245×N.
- the above-mentioned vector speech features can extract 8 kinds of vector speech features, as shown in Table 2.
- the vector speech features are extracted according to existing methods, and the specific calculation process will not be repeated.
- the above vector speech features are all feature vectors of shape (12, N_f), where 12 is the feature dimension, indicating that the vector speech features have 12-dimensional feature components, and N_f represents the number of speech frames.
- the vector speech feature can describe the pickup characteristics of different pickup devices, such as spectrum, pitch, formant, and the like.
- Vector derived feature extraction includes three functional modules: feature component screening, differential feature calculation, and vector feature scalarization, which will be introduced separately below.
- r_Fisher is the Fisher ratio of a feature component; the larger the value, the stronger the distinguishing ability of the feature component in that dimension;
- σ_b represents the inter-class variance of the feature components, that is, the variance of the means of the speech feature components of different distance types;
- σ_w represents the intra-class variance of the feature components, that is, the variance of the speech feature components within the same distance type; the calculation formulas of σ_b and σ_w are as follows:
- M represents the number of samples
- m_k represents the mean value of the k-th dimension component of a certain feature vector over all samples
- n_i represents the number of frames of a certain speech sample
- the feature component screening module uses the Fisher criterion to select the 3 feature components with the largest Fisher ratios from the 8 vector speech features, so that each vector speech feature is converted from (12, N_f) to (3, N_f), which greatly reduces the number of feature parameters.
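A sketch of the Fisher-criterion screening, with σ_b taken as the variance of the per-class means and σ_w as the mean of the per-class variances (the data and class labels are synthetic stand-ins for distance types):

```python
import numpy as np

def fisher_ratio(features, labels):
    """Per-dimension Fisher ratio sigma_b / sigma_w.
    features: (num_samples, dim); labels: distance-type label per sample."""
    classes = np.unique(labels)
    class_means = np.array([features[labels == c].mean(axis=0) for c in classes])
    sigma_b = class_means.var(axis=0)                                   # inter-class variance
    sigma_w = np.mean([features[labels == c].var(axis=0) for c in classes], axis=0)
    return sigma_b / sigma_w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))             # 12-dimensional vector speech feature
y = rng.integers(0, 4, size=200)           # 4 distance types (synthetic labels)
X[:, 2] += y                               # make dimension 2 distance-dependent
r = fisher_ratio(X, y)
top3 = np.argsort(r)[-3:]                  # keep the 3 most discriminative components
print(2 in top3)                           # the informative dimension survives
```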
- the above 8 kinds of vector speech features can only reflect the static characteristics of speech, and their dynamic characteristics can be described by the differential parameters of the vector speech features.
- the formula for calculating the differential parameters is as follows:
- c_l represents the l-th dimension feature component of a certain vector speech feature
- the first-order differences of the eight vector speech features can be obtained, which are represented by ⁇ MFCC, ⁇ LPCC, ⁇ MHEC, ⁇ BFCC, ⁇ LFCC, ⁇ GFCC, ⁇ NGCC, and ⁇ MSRCC, respectively.
- These vector difference features are of shape (3, N_f).
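The first-order difference can be sketched with the common regression formula over ±N neighbouring frames (N = 2 here is an assumption; the application does not fix the window):

```python
import numpy as np

def delta(c, N=2):
    """First-order difference of a (dims, N_f) feature matrix using the
    regression formula d_t = sum_n n*(c_{t+n} - c_{t-n}) / (2*sum_n n^2)."""
    T = c.shape[1]
    padded = np.pad(c, ((0, 0), (N, N)), mode="edge")
    num = sum(n * (padded[:, N + n:T + N + n] - padded[:, N - n:T + N - n])
              for n in range(1, N + 1))
    return num / (2 * sum(n * n for n in range(1, N + 1)))

mfcc = np.random.default_rng(0).normal(size=(3, 50))   # screened (3, N_f) feature
d_mfcc = delta(mfcc)                                   # e.g. the ΔMFCC feature
print(d_mfcc.shape)    # (3, 50)
```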
- this scheme uses the vector feature scalarization module to perform scalarization processing on the above-mentioned vector features.
- Step a use a GMM and its EM algorithm to cluster each dimension's feature component of the MFCC, with the number of clusters set to 4; four cluster center values of each dimension's feature component are obtained, giving a feature of shape (4,), where l ∈ (1,2,3) represents the label of the feature component;
- Step b calculate the maximum value, minimum value, and end value of each dimension's feature component of the MFCC to obtain a feature of shape (3,);
- Step c calculate the maximum value, minimum value, and sum of squares of each dimension's feature component of the vector difference feature ΔMFCC of the MFCC to obtain a feature of shape (3,);
- Step d splice the three feature vectors obtained in steps a, b, and c to obtain a feature of shape (10,), denoted as F_l;
- Step e splice the features F_l of the three feature components to obtain a new feature of shape (30,), denoted as F_MFCC, which characterizes the original MFCC feature;
- Step f apply the above steps a to e to the other 7 kinds of vector speech features to obtain corresponding new features, denoted as F_LPCC, F_MHEC, F_BFCC, F_LFCC, F_GFCC, F_NGCC, and F_MSRCC;
- Step g splice and merge the above 8 new features of shape (30,) to obtain a vector-derived speech feature of shape (240,), denoted as F_D; each feature value in F_D has a specific physical meaning.
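The dimension bookkeeping of steps a to g can be sketched as follows; for brevity the 4 "cluster centres" are quantile stand-ins for a full GMM/EM fit, so only the (4,)+(3,)+(3,) → (10,) → (30,) → (240,) layout is faithful:

```python
import numpy as np

def scalarize_component(comp, dcomp):
    """Steps a-d for one screened feature component (N_f,) and its difference.
    The 4 'cluster centres' are quantile stand-ins for a GMM/EM fit."""
    centers = np.quantile(comp, [0.125, 0.375, 0.625, 0.875])          # step a: (4,)
    stats = np.array([comp.max(), comp.min(), comp[-1]])               # step b: (3,)
    dstats = np.array([dcomp.max(), dcomp.min(), np.sum(dcomp ** 2)])  # step c: (3,)
    return np.concatenate([centers, stats, dstats])                    # step d: (10,)

rng = np.random.default_rng(0)
feat = rng.normal(size=(3, 50))                     # screened (3, N_f) feature, e.g. MFCC
dfeat = np.diff(feat, axis=1, append=feat[:, -1:])  # stand-in difference feature
F_MFCC = np.concatenate([scalarize_component(feat[l], dfeat[l]) for l in range(3)])  # step e
print(F_MFCC.shape)                 # (30,)
F_D = np.concatenate([F_MFCC] * 8)  # step g, with 8 such features
print(F_D.shape)                    # (240,)
```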
- the above feature fusion step is used to fuse scalar speech features and vector-derived speech features.
- the 5 kinds of scalar speech features are spliced to obtain a feature vector of shape (5,), denoted as F_S; then F_S and the vector-derived feature F_D are spliced and fused to finally obtain a fusion feature of shape (245,), denoted as F_Fusion. That is, each speech segment is finally extracted into a feature vector of shape (245,), which is used for training the distance relationship model between the speaker and the heterogeneous devices.
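The splicing described here amounts to a single concatenation (placeholder zero vectors stand in for real features):

```python
import numpy as np

F_S = np.zeros(5)                      # 5 spliced scalar speech features, shape (5,)
F_D = np.zeros(240)                    # vector-derived speech feature, shape (240,)
F_Fusion = np.concatenate([F_S, F_D])  # fusion feature
print(F_Fusion.shape)                  # (245,)
```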
- the fusion speech features are less affected by ambient sound field characteristics and random noise, are suitable for a variety of scenarios and heterogeneous distributed devices, and have strong representational power.
- the fusion speech feature can be used for training the distance relationship model between speakers and distributed devices, so as to judge the distance between different smart devices and the user; in this way, "human-machine distance" can become an important decision-making dimension in multi-device wake-up and improve the user experience.
- the user has multiple distributed devices that support the same wake-up word; after the user speaks the wake-up word, the device closest to the user responds, achieving nearby wake-up. Alternatively, a comprehensive judgment can be made across device service capability, user intent, and other dimensions, and the most suitable device is selected to respond to the user's request.
- determining at least one relative distance identifier comprises: performing multi-dimensional feature extraction on each of the plurality of sound collection data to obtain a plurality of voice feature sets in one-to-one correspondence with the plurality of sound collection data, each voice feature set including multi-dimensional feature extraction results; performing feature fusion on each voice feature set among the multiple voice feature sets to obtain multiple fused voice features; and determining, according to the fused voice features, the device identifiers in the plurality of sound collection data, and the pre-trained distance comparison model, at least one relative distance identifier that corresponds one-to-one with at least one device among the plurality of devices.
- the feature extraction can also be used for other distance relationship determination methods, and the calculation results can be used for other application scenarios.
- performing multi-dimensional feature extraction on each of the plurality of sound collection data to obtain a plurality of voice feature sets in one-to-one correspondence with the plurality of sound collection data includes: performing frequency response capability alignment on each reference voice data among the multiple reference voice data in the multiple sound collection data to obtain multiple target voice data with aligned frequency response capabilities in one-to-one correspondence with the multiple sound collection data; and performing multi-dimensional feature extraction on each of the multiple target voice data to obtain multiple voice feature sets in one-to-one correspondence with the multiple target voice data.
- performing frequency response capability alignment on each of the multiple reference voice data in the multiple sound collection data to obtain multiple target voice data with aligned frequency response capabilities in one-to-one correspondence with the multiple sound collection data includes: performing the following operations on each reference voice data in the multiple sound collection data: obtaining the preset frequency response unit impulse response of the current device relative to the reference device according to the device identifier associated with the currently processed reference voice data; and performing a convolution operation on the reference voice data and the frequency response unit impulse response and adjusting the gain to obtain the target voice data after frequency response capability alignment.
- performing multi-dimensional feature extraction on each of the multiple target voice data to obtain multiple voice feature sets in one-to-one correspondence with the multiple target voice data includes: performing the following operations on each target voice data among the multiple target voice data to obtain multiple voice feature sets: extracting the scalar voice features and vector voice features of the currently processed target voice data; and performing dimensionality reduction and secondary feature extraction on the vector voice features to obtain vector-derived voice features.
- extracting the scalar speech features and vector speech features of the currently processed target speech data includes: preprocessing the currently processed target speech data to obtain preprocessed target speech data; and extracting the scalar voice features and vector voice features of the preprocessed target voice data; wherein the preprocessing includes at least one of the following: silence suppression processing, pre-emphasis processing through a high-frequency filter, frame segmentation processing, and windowing processing.
- the arbitration device can first perform feature extraction and feature fusion on the voice data, and optionally further perform frequency response capability alignment on the processed voice data to improve the flexibility of voice data preprocessing.
- the method further includes: determining, according to the at least one relative distance identifier, a target device among the plurality of devices for executing the voice instruction associated with the sound source target; if it is detected that the target device is a device other than the arbitration device, sending indication information to the target device, where the indication information is used to instruct the target device to perform the operation indicated by the voice instruction; and if it is detected that the target device is the arbitration device, performing the operation indicated by the voice instruction.
- the voice command associated with the sound of the sound source target may be various user commands such as "play music", which is not uniquely limited here.
- the arbitration device preferentially selects the device with the closest distance to the sound source target as the target device. That is to say, in this application scenario, the at least one relative distance identifier should at least include the relative distance identifier of the device that is closest to the sound source target.
- the specific representation form of the at least one relative distance identifier may take various forms, which is not uniquely limited here.
- the arbitration device can intelligently determine the target device for executing the voice command associated with the sound source target according to at least one relative distance identifier, which improves the convenience and intelligence of device control.
- the arbitration device first obtains multiple sound collection data of multiple devices, and then determines, according to the reference voice data and device identifiers in the multiple sound collection data and the pre-trained distance comparison model, at least one relative distance identifier in one-to-one correspondence with at least one device among the plurality of devices. Each relative distance identifier is used to indicate the position of the corresponding device in the device distance relationship sequence, which is a sequence formed by sorting the multiple devices according to the preset distance sorting strategy, where the distance refers to the distance between the device and the sound source target.
- it can be seen that the model can predict the relative position of each device in the device distance relationship sequence. For example, if the device distance relationship sequence formed by device 1, device 2, and device 3, sorted by distance from the sound source target from near to far, is device 3 → device 2 → device 1, then the prediction result can indicate that device 3 is closest to the sound source target by marking the relative distance of device 3 as 1.
- the distance comparison model is used to predict the relative distance relationship containing global information in the multi-device voice interaction scene.
- each sound collection data includes the reference voice data obtained by the corresponding device to collect the sound of the sound source target and the device identification of the device .
- FIG. 3 is a schematic flowchart of a device control method provided by an embodiment of the present application, which is applied to a target device in the device control system 10. As shown in the figure, the device control method includes the following operations.
- Step 301 Acquire indication information from the arbitration device, where the indication information indicates that the arbitration device has determined the target device among the plurality of devices according to at least one relative distance identifier corresponding to at least one device among the plurality of devices.
- the at least one relative distance identifier is obtained by the arbitration device performing the following operations: acquiring a plurality of sound collection data in one-to-one correspondence with the plurality of devices, each sound collection data in the plurality of sound collection data including reference speech data obtained by the corresponding device collecting the sound of the sound source target and the device identifier of that device; and determining the at least one relative distance identifier according to the reference voice data in the plurality of sound collection data, the device identifiers, and the pre-trained distance comparison model, where each relative distance identifier in the at least one relative distance identifier is used to indicate the position of the corresponding device in the device distance relationship sequence, and the device distance relationship sequence is a sequence formed by sorting the plurality of devices according to a preset distance sorting strategy.
- Step 302 Execute the operation indicated by the voice instruction associated with the sound of the sound source target according to the instruction information.
- the target device is the device closest to the sound source target among the multiple devices. This situation applies to the nearest wake-up product scheme.
- the target device is the arbitration device; or, the target device is a device other than the arbitration device among the multiple devices.
- if the target device is the arbitration device itself, the arbitration device directly generates the indication information and executes, according to the indication information, the operation indicated by the voice instruction associated with the sound of the sound source target.
- if the target device is a device other than the arbitration device among the multiple devices, the arbitration device generates the indication information and sends it to the target device, and the target device executes, according to the indication information, the operation indicated by the voice instruction associated with the sound of the sound source target.
- the method further includes: if it is detected that the distance between the target device and the sound source target is greater than a preset distance, outputting a prompt message to prompt the user to approach the target device; and/or, if it is detected that the If the distance between the target device and the sound source target is greater than the preset distance, the output volume of the target device is increased.
- correspondingly, if it is detected that the distance between the target device and the sound source target is less than the preset distance, the output volume of the target device is lowered. In this way, the intelligence of device control and the user experience can be improved.
- the preset distance may be, for example, 5 meters, 10 meters, or the like.
- the target device first obtains the indication information of the arbitration device, and secondly, according to the indication information, executes the operation indicated by the voice command associated with the sound of the sound source target.
- the indication information is generated by the arbitration device when it determines, according to the at least one relative distance identifier corresponding to at least one device among the plurality of devices, the target device among the plurality of devices for executing the voice instruction associated with the sound of the sound source target; each relative distance identifier is used to indicate the position of the distance between the corresponding device and the sound source target in the distance sequence, and the distance sequence is a sequence formed by sorting the multiple distances according to the preset sorting strategy.
- compared with the existing scheme of determining the nearest wake-up device based on the absolute distance between each device and the sound source target, the present application uses the distance comparison model to predict the relative distance relationship containing the global information of the multi-device voice interaction scene, and determines the target device to be woken up according to that relative distance relationship. In addition, since each sound collection data includes the reference voice data obtained by the corresponding device collecting the sound of the sound source target and the device identifier of that device, the collected reference speech data has no limitation on the number of channels; this overcomes the high hardware requirements and complex algorithms of conventional microphone-array sound source localization, which is beneficial to improving the efficiency and applicability of relative distance relationship prediction.
- FIG. 4 is a schematic flowchart of a training method for a distance comparison model provided by an embodiment of the present application, which is applied to a model training device.
- the model training method includes the following operations.
- Step 401 Acquire training data, where the training data includes multiple voice data sets, and each voice data set in the multiple voice data sets includes multiple reference voice data corresponding to multiple devices one-to-one.
- each reference speech data is speech data obtained by the corresponding device collecting the sound of the sound source target, the multiple speech data sets correspond to speech data sets collected in different sound collection environments, and the sound collection environment includes at least the location of the sound source target.
- the sound collection environment refers to the acoustic environment in which sound data is collected, and the sound collection environment can be diversified.
- different sound collection environments can further be constructed by varying at least one of the following characteristics: room size, noise level, etc.
- the wake-up voice in quiet environment and noisy environment is collected.
- Noise is artificially added, and Gaussian white noise, electrical noise (such as fans, air conditioners), and traffic noise in the noise database can be selected.
- the signal-to-noise ratio can be set to -15dB, -10dB, -5dB, 0dB, 5dB, 10dB, 15dB, etc.
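Mixing noise into clean speech at one of the listed signal-to-noise ratios can be sketched as follows (the scaling formula is standard; the 200 Hz test tone is illustrative):

```python
import numpy as np

def add_noise(speech, noise, snr_db):
    """Scale the noise so that the mixture has the requested SNR in dB."""
    noise = noise[: len(speech)]
    p_s = np.mean(speech ** 2)
    p_n = np.mean(noise ** 2)
    scale = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000
speech = np.sin(2 * np.pi * 200 * t)    # illustrative 200 Hz tone
noise = rng.normal(size=16000)          # Gaussian white noise
mixtures = {snr: add_noise(speech, noise, snr)
            for snr in (-15, -10, -5, 0, 5, 10, 15)}   # SNR settings from the text
```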
- Step 402 Train a preset distance comparison model according to the reference speech data of the plurality of speech data sets and a preset loss function to obtain a trained distance comparison model. The loss function represents the loss of the distance comparison model in terms of the prediction accuracy of the relative distance relationship, in the same sound collection environment, between the two devices of a device pairing group and the sound source target, where a device pairing group is composed of any two devices among the plurality of devices.
- the distance comparison model may specifically be a deep neural network, and the deep neural network may be, for example, a convolutional neural network or a deep residual network, which is not uniquely limited here.
- the relative distance relationship is characterized by defining a score for the event that the first device of the two devices is closer to the sound source target than the second device, and the value of the score is associated with the distance difference, where the distance difference is the difference between the first distance and the second distance, the first distance being the distance between the first device and the sound source target and the second distance being the distance between the second device and the sound source target.
- the score can be calculated and expressed in the form of data similar to probability, and the value range of the score falls within the interval (0, 1).
- for example, assume the first device is closer to the sound source target than the second device, and the second device is closer than the third device. Then the score for the event that the first device is closer to the sound source target than the second device may be 0.8, the score for the event that the first device is closer than the third device may be 0.9, the score for the event that the second device is closer than the third device may be 0.7, the score for the event that the second device is closer than the first device may be 0.08, the score for the event that the third device is closer than the first device may be 0.05, and the score for the event that the third device is closer than the second device may be 0.09, etc.
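One way to realize such scores, consistent with the (0, 1) range and the dependence on the distance difference but otherwise an assumption, is a sigmoid of the distance difference:

```python
import numpy as np

def closeness_score(d_i, d_j, scale=1.0):
    """Score for the event 'device i is closer to the sound source target
    than device j', modelled as a sigmoid of the distance difference."""
    return 1.0 / (1.0 + np.exp(-(d_j - d_i) / scale))

# Illustrative distances of three devices from the sound source target.
d1, d2, d3 = 0.5, 2.0, 3.5
s12 = closeness_score(d1, d2)   # first closer than second
s21 = closeness_score(d2, d1)   # second closer than first
print(s12 > 0.5, s21 < 0.5)     # the closer device scores above 0.5
```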
- the score is calculated from at least one score of at least one group of adjacent devices that constitute a direct or indirect adjacency relationship between the two devices; the score of a group of adjacent devices is calculated from the two relative distance identifiers of its two devices. The relative distance identifier corresponds to the prediction result of the distance comparison model and is used to indicate the position of the corresponding device in the device distance relationship sequence, which is a sequence formed by sorting the plurality of devices according to a preset distance sorting strategy, where the distance refers to the distance between the device and the sound source target.
- the device distance relationship sequence is a sequence formed by sorting the multiple devices according to the preset distance-based sorting strategy, where the distance refers to the distance between each device and the sound source target; in this way, the relative distance identifier can represent a model prediction result containing global information, improving the accuracy with which the model prediction result is represented.
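To illustrate how pairwise prediction results can yield a device distance relationship sequence carrying global information, here is a sketch; the nearest-first ordering and the dictionary of pairwise scores are illustrative assumptions, not a specification from the patent:

```python
import functools

def relative_distance_identifiers(devices, score):
    # score[(a, b)] > 0.5 is read as "device a is closer to the sound source
    # target than device b"; a missing pair falls back to the reverse entry.
    def cmp(a, b):
        s = score.get((a, b), 1.0 - score.get((b, a), 0.5))
        return -1 if s > 0.5 else (1 if s < 0.5 else 0)

    sequence = sorted(devices, key=functools.cmp_to_key(cmp))  # nearest first
    # The relative distance identifier indicates each device's position in
    # the device distance relationship sequence.
    return {device: position for position, device in enumerate(sequence)}

ids = relative_distance_identifiers(
    ["A", "B", "C"],
    {("A", "B"): 0.8, ("A", "C"): 0.9, ("B", "C"): 0.7},
)
```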
- the training of the preset distance comparison model according to the reference speech data of the multiple speech data sets and the preset loss function, to obtain a trained distance comparison model, includes: dividing the training data into a training set and a test set, where the training set includes part of the voice data sets among the multiple voice data sets; and training the preset distance comparison model at least once using the training set, until the accuracy of the trained distance comparison model in predicting the distance comparison result of the test set is greater than a preset accuracy.
- the preset accuracy may be, for example, 98%, 99%, etc., which is not limited here.
- In this way, the model has prediction ability that meets the preset accuracy requirement.
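The split-and-train-until-accurate procedure can be sketched as follows; the 80/20 split fraction and the toy accuracy trace are illustrative assumptions (the patent only gives example thresholds such as 98% or 99%):

```python
import random

def split_train_test(voice_data_sets, train_fraction=0.8, seed=0):
    # Divide the training data into a training set and a test set.
    sets = list(voice_data_sets)
    random.Random(seed).shuffle(sets)
    cut = int(len(sets) * train_fraction)
    return sets[:cut], sets[cut:]

def train_until_accurate(train_one_epoch, test_accuracy, preset_accuracy=0.98,
                         max_epochs=100):
    # Train at least once, then keep training until the model's accuracy in
    # predicting the test set exceeds the preset accuracy.
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        if test_accuracy() > preset_accuracy:
            return epoch
    return max_epochs

train_set, test_set = split_train_test(range(100))

# Toy run: accuracy improves each epoch and crosses the threshold at epoch 3.
trace = [0.90, 0.95, 0.99]
state = {"epoch": 0}
def step():
    state["epoch"] += 1
def accuracy():
    return trace[state["epoch"] - 1]
epochs_used = train_until_accurate(step, accuracy)
```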
- the training includes forward propagation and back-propagation optimization; in forward propagation, the predicted relative distance identifier is obtained by computation on the speech features of the speech data set; in back-propagation optimization, the predicted score and the real score are calculated using the predicted relative distance identifier and the real relative distance identifier, the loss of the distance comparison model is calculated using the loss function, the predicted score and the real score, and the parameters of the distance comparison model are adjusted according to the loss of the distance comparison model.
- the design of the loss function is realized through the following steps a to e:
- Step a: any two of the devices can form a pairing.
- Step b: denote the feature vector extracted for each device as x_a, x_b, x_c, x_d, x_e, and denote the feedforward mapping of the deep neural network as f; the output-layer value corresponding to a feature vector (taking x_a as an example) is then f(x_a).
- Step c: for each pairing of two devices, a score can be obtained from their reference voice data.
- Step d: for the real labels, the score between any two device pairings is likewise calculated. Since the distances between the multiple devices stand in a relative relationship, the scores between pairs of adjacent devices are calculated first, and the scores between non-adjacent devices are derived from them: for two adjacent devices the score is computed directly, and for two non-adjacent devices whose common adjacent device is b, the score is computed by combining the scores of the two adjacent pairings.
- Step e: after the input data is fed forward through the deep neural network, back-propagation is performed according to the loss function between the actual output and the real label, and the network parameters are iteratively adjusted to improve network performance.
- the loss function is calculated from the predicted score and the real score.
- the loss function can quantitatively measure the difference between the estimated label and the real label of the speech data of any two devices in the current device set after passing through the distance comparison model, and the parameters of the distance comparison model are adjusted through this difference until the model prediction accuracy meets the requirements.
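The concrete formulas for steps c–e are not reproduced in this text (they appear as images in the original filing). One consistent reading, sketched below under the assumption of a RankNet-style pairwise formulation: the score that device a is closer than device b is a logistic function of the network-output difference f(x_a) − f(x_b); the score for two non-adjacent devices with a common adjacent device is composed from the two adjacent-pair scores; and the loss is the cross-entropy between real and predicted scores. All three formulas are assumptions, not quotations from the patent:

```python
import math

def pair_score(fa, fb):
    # Step c (assumed form): logistic score that device a is closer than b,
    # computed from the network outputs f(x_a), f(x_b); lies in (0, 1).
    return 1.0 / (1.0 + math.exp(-(fa - fb)))

def compose_score(p_ab, p_bc):
    # Step d (assumed form): score for non-adjacent devices a and c whose
    # common adjacent device is b, composed from the two adjacent scores.
    return p_ab * p_bc / (1.0 + 2.0 * p_ab * p_bc - p_ab - p_bc)

def pairwise_loss(p_pred, p_true, eps=1e-12):
    # Step e (assumed form): cross-entropy between the real score (label)
    # and the predicted score; back-propagation minimises this loss.
    p = min(max(p_pred, eps), 1.0 - eps)
    return -(p_true * math.log(p) + (1.0 - p_true) * math.log(1.0 - p))

p_ab = pair_score(2.0, 1.0)          # a predicted closer than b
p_ac = compose_score(p_ab, pair_score(1.0, 0.5))
loss = pairwise_loss(p_ab, 1.0)      # real label: a truly is closer than b
```

The composition rule has the sanity property that two neutral beliefs (0.5, 0.5) compose to 0.5, while two confident "closer" beliefs compose to something even more confident.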
- the device first obtains training data, and then trains a preset distance comparison model according to the reference speech data of the multiple speech data sets and a preset loss function, to obtain a trained distance comparison model.
- Since the training data includes multiple voice data sets, each voice data set includes multiple reference voice data corresponding one-to-one to multiple devices, each reference voice data is the voice data obtained by the corresponding device collecting the sound of the sound source target, and the multiple voice data sets correspond to voice data sets collected in different sound collection environments; at the same time, the loss function represents the loss of the distance comparison model in terms of the accuracy of predicting the relative distance relationship between the two devices of a device pairing group and the sound source target in the same sound collection environment.
- the device pairing group consists of any two devices among the multiple devices, so the distance comparison model has the ability to predict the relative distance between a device and the sound source target, and the reference speech data imposes no requirement on the number of channels; therefore, the high hardware requirements and complex algorithms of conventional microphone-array sound source localization can be overcome, which is beneficial to improving the efficiency and applicability of relative distance relationship prediction.
- An embodiment of the present application provides an apparatus for determining a distance relationship
- the device for determining a distance relationship may be an arbitration device.
- the apparatus for determining a distance relationship is configured to perform the steps performed by the arbitration device in the above method for determining a distance relationship.
- the apparatus for determining a distance relationship provided in this embodiment of the present application may include modules corresponding to corresponding steps.
- the distance relationship determining apparatus may be divided into functional modules according to the above method examples.
- each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module.
- the above-mentioned integrated modules can be implemented in the form of hardware, and can also be implemented in the form of software function modules.
- the division of modules in the embodiments of the present application is schematic, and is only a logical function division, and there may be other division manners in actual implementation.
- FIG. 5 shows a possible schematic structural diagram of the apparatus for determining a distance relationship involved in the above embodiment.
- the distance relationship determination device 5 is applied to the arbitration device 200 in the device control system 10; the device includes:
- the acquisition unit 50 is configured to acquire a plurality of sound collection data corresponding one-to-one to a plurality of devices, where each sound collection data among the plurality of sound collection data includes the reference speech data obtained by the corresponding device collecting the sound of the sound source target and the device identification of that device;
- the determining unit 51 is configured to determine, according to the reference voice data and device identifiers in the multiple sound collection data and the pre-trained distance comparison model, at least one relative distance identifier corresponding one-to-one to at least one device among the multiple devices, where each relative distance identifier among the at least one relative distance identifier is used to indicate the position of the corresponding device in the device distance relationship sequence, the device distance relationship sequence is a sequence formed by sorting the plurality of devices according to the preset distance-based sorting strategy, and the distance refers to the distance between the device and the sound source target.
- the determining unit 51 is specifically configured to align the frequency response capability of each reference voice data among the plurality of reference voice data in the plurality of sound collection data, and obtain a plurality of target voice data after frequency response capability alignment in one-to-one correspondence with the plurality of sound collection data.
- the determining unit 51 is specifically configured to: perform the following operations for each reference voice data in the plurality of sound collection data, to obtain a plurality of target voice data after frequency response capability alignment: obtain, according to the device identifier of the currently processed reference voice data, the preset frequency response unit impulse response of the current device relative to the reference device; perform a convolution operation on the reference speech data and the frequency response unit impulse response, and perform gain adjustment to obtain the target speech data after frequency response capability alignment.
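The convolve-then-adjust-gain operation can be sketched as follows; the RMS-based gain rule and the toy impulse response are assumptions, since the patent only says that convolution with the preset frequency response unit impulse response is followed by gain adjustment:

```python
import math

def convolve(x, h):
    # Full linear convolution of signal x with impulse response h.
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def align_frequency_response(reference_speech, unit_impulse_response,
                             target_rms=0.1):
    # Convolve the reference speech with the preset frequency response unit
    # impulse response of the current device relative to the reference
    # device, then apply gain adjustment. Normalising to a target RMS is an
    # assumed gain rule; the patent does not spell out how the gain is set.
    aligned = convolve(reference_speech, unit_impulse_response)
    rms = math.sqrt(sum(v * v for v in aligned) / len(aligned))
    if rms > 0:
        gain = target_rms / rms
        aligned = [v * gain for v in aligned]
    return aligned

speech = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(160)]
aligned = align_frequency_response(speech, [1.0, 0.25])  # toy impulse response
```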
- the determining unit 51 is specifically configured to: perform multi-dimensional feature extraction on each target voice data among the plurality of target voice data, to obtain multiple voice feature sets corresponding one-to-one to the plurality of target voice data, where each voice feature set includes multi-dimensional feature extraction results; perform feature fusion on each voice feature set among the multiple voice feature sets to obtain multiple fused voice features; and determine, according to the multiple fused voice features, the device identifiers in the plurality of sound collection data, and the pre-trained distance comparison model, at least one relative distance identifier corresponding to at least one device among the plurality of devices.
- the determining unit 51 is specifically configured to: perform multi-dimensional feature extraction on each sound collection data among the plurality of sound collection data, to obtain a plurality of voice feature sets in one-to-one correspondence with the plurality of sound collection data, where each voice feature set among the multiple voice feature sets includes multi-dimensional feature extraction results; perform feature fusion on each voice feature set among the multiple voice feature sets to obtain multiple fused voice features; and determine, according to the multiple fused voice features, the device identifiers in the plurality of sound collection data, and the pre-trained distance comparison model, at least one relative distance identifier corresponding one-to-one to at least one device among the plurality of devices.
- the determining unit 51 is specifically configured to: align the frequency response capability of each reference voice data among the multiple reference voice data in the multiple sound collection data, and obtain a plurality of target speech data after frequency response capability alignment in one-to-one correspondence with the multiple sound collection data; and perform multi-dimensional feature extraction on each target speech data among the plurality of target speech data, to obtain a plurality of speech feature sets in one-to-one correspondence with the plurality of target speech data.
- the determining unit 51 is specifically configured to: perform the following operations for each reference voice data in the multiple sound collection data, to obtain a plurality of target voice data after frequency response capability alignment in one-to-one correspondence with the data: obtain, according to the device identifier associated with the currently processed reference voice data, the preset frequency response unit impulse response of the current device relative to the reference device; and perform a convolution operation between the reference speech data and the frequency response unit impulse response, and perform gain adjustment to obtain the target speech data after frequency response capability alignment.
- the determining unit 51 is specifically configured to: perform the following operations for each target voice data among the multiple target voice data to obtain the multiple voice feature sets: extract the scalar speech features and vector speech features of the currently processed target voice data, and perform dimension reduction and secondary feature extraction on the vector speech features to obtain vector-derived speech features.
- the determining unit 51 is specifically configured to: preprocess the currently processed target speech data to obtain preprocessed target speech data; and extract scalar speech features and vector speech features of the preprocessed target speech data, where the preprocessing includes at least one of the following: silence suppression, pre-emphasis through a high-frequency filter, framing, and windowing.
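A sketch of two of the listed preprocessing steps, pre-emphasis and framing-plus-windowing; the 0.97 pre-emphasis coefficient, 25 ms frames with a 10 ms hop at 16 kHz, and the Hamming window are common speech-processing conventions assumed here rather than values taken from the patent (silence suppression is omitted):

```python
import math

def pre_emphasis(signal, coeff=0.97):
    # High-frequency pre-emphasis filter: y[n] = x[n] - coeff * x[n-1].
    return [signal[0]] + [signal[n] - coeff * signal[n - 1]
                          for n in range(1, len(signal))]

def frame_and_window(signal, frame_len=400, hop=160):
    # Split into overlapping frames and apply a Hamming window to each.
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1)))
                    for n, s in enumerate(frame)]
        frames.append(windowed)
    return frames

# 0.1 s of a 300 Hz tone sampled at 16 kHz.
signal = [math.sin(2 * math.pi * 300 * n / 16000) for n in range(1600)]
emphasized = pre_emphasis(signal)
frames = frame_and_window(emphasized)
```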
- after determining, according to the reference voice data and device identifiers in the plurality of sound collection data and the pre-trained distance comparison model, the at least one relative distance identifier corresponding to at least one device among the plurality of devices, the determining unit 51 is further configured to: determine, according to the at least one relative distance identifier, a target device among the plurality of devices for executing the voice command associated with the sound source target; if the target device is a device other than the arbitration device, send indication information to the target device, where the indication information is used to instruct the target device to perform the operation indicated by the voice instruction; and if the target device is the arbitration device, execute the operation indicated by the voice instruction.
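The determine-then-dispatch behaviour can be sketched as follows; the device names, the smallest-identifier-is-closest convention, and the returned action strings are illustrative placeholders:

```python
def choose_and_dispatch(relative_distance_ids, arbitration_device_id):
    # Pick the device closest to the sound source target (smallest position
    # in the device distance relationship sequence) as the target device,
    # then either execute locally or indicate the remote target device.
    target = min(relative_distance_ids, key=relative_distance_ids.get)
    if target == arbitration_device_id:
        return target, "execute voice instruction locally"
    return target, "send indication information to target device"

# The speaker (position 0) is nearest; the arbitration device is the TV.
target, action = choose_and_dispatch({"speaker": 0, "tv": 1, "phone": 2}, "tv")
local = choose_and_dispatch({"tv": 0, "phone": 1}, "tv")
```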
- the distance relationship determining device 6 includes: a processing module 60 and a communication module 61 .
- the processing module 60 is used to control and manage the actions of the device control apparatus, for example, the steps performed by the acquisition unit 50, the determination unit 51, and/or other processes used to perform the techniques described herein.
- the communication module 61 is used to support the interaction between the device control apparatus and other devices.
- the distance relationship determining apparatus may further include a storage module 62, and the storage module 62 is configured to store program codes and data of the distance relationship determining apparatus.
- the processing module 60 may be a processor or a controller, such as a central processing unit (Central Processing Unit, CPU), a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an ASIC, an FPGA or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various exemplary logical blocks, modules and circuits described in connection with this disclosure.
- the processor may also be a combination that implements computing functions, such as a combination of one or more microprocessors, a combination of a DSP and a microprocessor, and the like.
- the communication module 61 may be a transceiver, an RF circuit, a communication interface, or the like.
- the storage module 62 may be a memory.
- Both the distance relationship determining device 5 and the distance relationship determining device 6 can perform the steps performed by the arbitration device in the distance relationship determining method shown in FIG. 2 .
- An embodiment of the present application provides a device control device, where the device control device may be an arbitration device.
- the device control apparatus is configured to execute the steps performed by the target device in the above device control method.
- the device control apparatus provided in this embodiment of the present application may include modules corresponding to corresponding steps.
- the device control apparatus may be divided into functional modules according to the foregoing method examples.
- each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module.
- the above-mentioned integrated modules can be implemented in the form of hardware, and can also be implemented in the form of software function modules.
- the division of modules in the embodiments of the present application is schematic, and is only a logical function division, and there may be other division manners in actual implementation.
- FIG. 7 shows a possible schematic structural diagram of the device control apparatus involved in the foregoing embodiment. As shown in Figure 7, the device control device 7 is applied to the target device; the device includes:
- the obtaining unit 70 is configured to obtain the indication information of the arbitration device, where the indication information is generated when the arbitration device determines, according to at least one relative distance identifier corresponding to at least one device among the plurality of devices, the target device among the plurality of devices for executing the voice command associated with the sound source target, and the at least one relative distance identifier is obtained by the arbitration device performing the following operations: acquiring a plurality of sound collection data in one-to-one correspondence with the plurality of devices, where each sound collection data among the plurality of sound collection data includes the reference speech data obtained by the corresponding device collecting the sound of the sound source target and the device identification of that device; and determining the at least one relative distance identifier according to the reference voice data and device identifiers in the plurality of sound collection data and the pre-trained distance comparison model, where each relative distance identifier among the at least one relative distance identifier is used to indicate the position of the corresponding device in the device distance relationship sequence;
- the device distance relationship sequence is a sequence formed by sorting the plurality of devices according to the preset distance-based sorting strategy, and the distance refers to the distance between the device and the sound source target.
- the execution unit 71 is configured to execute the operation indicated by the voice instruction associated with the sound of the sound source target according to the instruction information.
- the target device is the device closest to the sound source target among the plurality of devices.
- the target device is the arbitration device; or, the target device is a device other than the arbitration device among the multiple devices.
- the device control apparatus 8 includes: a processing module 80 and a communication module 81 .
- the processing module 80 is used to control and manage the actions of the device control apparatus, for example, the steps performed by the acquisition unit 70, the execution unit 71, and/or other processes used to perform the techniques described herein.
- the communication module 81 is used to support the interaction between the device control apparatus and other devices.
- the device control apparatus may further include a storage module 82, and the storage module 82 is used for storing program codes and data of the device control apparatus.
- the processing module 80 may be a processor or a controller, such as a central processing unit (Central Processing Unit, CPU), a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an ASIC, an FPGA or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various exemplary logical blocks, modules and circuits described in connection with this disclosure.
- the processor may also be a combination that implements computing functions, such as a combination of one or more microprocessors, a combination of a DSP and a microprocessor, and the like.
- the communication module 81 may be a transceiver, an RF circuit, a communication interface, or the like.
- the storage module 82 may be a memory.
- Both the device control device 7 and the device control device 8 can execute the steps performed by the target device in the device control method shown in FIG. 3 .
- An embodiment of the present application provides a training device for a distance comparison model
- the training device for a distance comparison model may be a model training device for training a model.
- the distance comparison model training apparatus is configured to perform the steps performed by the model training device in the above distance comparison model training method.
- the apparatus for training the distance comparison model provided by the embodiment of the present application may include modules corresponding to the corresponding steps.
- the training device of the distance comparison model can be divided into functional modules according to the above method examples.
- each functional module can be divided corresponding to each function, or two or more functions can be integrated into one processing module.
- the above-mentioned integrated modules can be implemented in the form of hardware, and can also be implemented in the form of software function modules.
- the division of modules in the embodiments of the present application is schematic, and is only a logical function division, and there may be other division manners in actual implementation.
- FIG. 9 shows a possible schematic structural diagram of the training device for the distance comparison model involved in the above embodiment.
- the training device 9 of the distance comparison model is applied to the model training equipment; the device includes:
- the obtaining unit 90 is configured to obtain training data, where the training data includes a plurality of voice data sets, and each voice data set in the plurality of voice data sets includes a plurality of reference voice data corresponding to a plurality of devices one-to-one.
- Each reference voice data among the plurality of reference voice data is the voice data obtained by the corresponding device collecting the sound of the sound source target, the multiple voice data sets correspond to voice data sets collected in different sound collection environments, and the sound collection environment at least includes the location where the sound source target is located;
- a training unit 91 configured to train a preset distance comparison model according to the reference speech data of the multiple speech data sets and a preset loss function, to obtain a trained distance comparison model, where the loss function represents the loss of the distance comparison model in terms of the accuracy of predicting the relative distance relationship between the two devices of a device pairing group and the sound source target in the same sound collection environment, and the device pairing group consists of any two devices among the plurality of devices.
- the relative distance relationship is characterized by defining a score of an event in which the first device among the two devices is closer to the sound source target than the second device, and the value of the score is associated with the distance difference; the distance difference is the difference between the first distance and the second distance, the first distance is the distance between the first device and the sound source target, and the second distance is the distance between the second device and the sound source target.
- the score is calculated from at least one score of at least one group of adjacent devices forming a direct or indirect adjacency relationship between the two devices; the score of a group of adjacent devices is calculated from the two relative distance identifiers of its two devices, the relative distance identifier corresponds to the prediction result of the distance comparison model, and the relative distance identifier is used to indicate the position of the corresponding device in the device distance relationship sequence, where the device distance relationship sequence is a sequence formed by sorting the multiple devices according to the preset distance-based sorting strategy, and the distance refers to the distance between the device and the sound source target.
- the training unit 91 is specifically configured to: divide the training data into a training set and a test set, where the training set includes part of the voice data sets among the multiple voice data sets; and train the preset distance comparison model at least once using the training set, until the accuracy of the trained distance comparison model in predicting the distance comparison result of the test set is greater than a preset accuracy.
- the training includes forward propagation and back-propagation optimization; the predicted relative distance identifier and the real relative distance identifier are used to calculate the predicted score and the real score; the loss function, the predicted score and the real score are used to calculate the loss of the distance comparison model; and the parameters of the distance comparison model are adjusted according to the loss of the distance comparison model.
- the training device 10 of the distance comparison model includes: a processing module 100 and a communication module 101 .
- the processing module 100 is used to control and manage the actions of the training device of the distance comparison model, eg, the steps performed by the acquisition unit 90, the training unit 91, and/or other processes for performing the techniques described herein.
- the communication module 101 is used to support the interaction between the training device of the distance comparison model and other devices.
- the training device for the distance comparison model may further include a storage module 102, and the storage module 102 is used for storing program codes and data of the training device for the distance comparison model.
- the processing module 100 may be a processor or a controller, such as a central processing unit (Central Processing Unit, CPU), a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an ASIC, an FPGA or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various exemplary logical blocks, modules and circuits described in connection with this disclosure.
- the processor may also be a combination that implements computing functions, such as a combination of one or more microprocessors, a combination of a DSP and a microprocessor, and the like.
- the communication module 101 may be a transceiver, an RF circuit, a communication interface, or the like.
- the storage module 102 may be a memory.
- Both the distance comparison model training device 9 and the distance comparison model training device 10 can perform the steps performed by the model training device in the distance comparison model training method shown in FIG. 4 .
- the above embodiments may be implemented in whole or in part by software, hardware, firmware or any other combination.
- the above-described embodiments may be implemented in whole or in part in the form of a computer program product.
- the computer program product includes one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, all or part of the processes or functions described in the embodiments of the present application are generated.
- the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
- the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from a website, computer, server or data center to another website, computer, server or data center by wire or wirelessly.
- the computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device such as a server, a data center, or the like that contains one or more sets of available media.
- the usable media may be magnetic media (eg, floppy disks, hard disks, magnetic tapes), optical media (eg, DVDs), or semiconductor media.
- the semiconductor medium may be a solid state drive.
- Embodiments of the present application further provide a computer storage medium, where the computer storage medium stores a computer program for electronic data exchange, and the computer program causes a computer to execute part or all of the steps of any method described in the above method embodiments; the above computer includes an electronic device.
- Embodiments of the present application further provide a computer program product, where the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute any one of the method embodiments described above. some or all of the steps of the method.
- the computer program product may be a software installation package, and the computer includes an electronic device.
- the sequence numbers of the above processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and shall not constitute any limitation on the implementation of the embodiments of the present application.
- the disclosed method, apparatus and system may be implemented in other manners.
- the device embodiments described above are only illustrative; for example, the division of the units is only a logical function division, and there may be other division methods in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
- the mutual coupling or direct coupling or communication connection shown or discussed may be implemented through some interfaces, as indirect coupling or communication connections between devices or units, and may be in electrical, mechanical or other forms.
- the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
- each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may be physically included individually, or two or more units may be integrated into one unit.
- the above-mentioned integrated unit may be implemented in the form of hardware, or may be implemented in the form of hardware plus software functional units.
- the above-mentioned integrated units implemented in the form of software functional units can be stored in a computer-readable storage medium.
- the above-mentioned software functional unit is stored in a storage medium, and includes several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute some steps of the methods described in the various embodiments of the present invention.
- the aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disc, or other media that can store program code.
Abstract
A method and apparatus for distance relationship determination, a method and apparatus for device control, a method and apparatus for training a distance comparison model, an electronic device, and a computer-readable storage medium. The method for distance relationship determination comprises: acquiring a plurality of pieces of sound collection data in one-to-one correspondence with a plurality of devices, each piece of sound collection data comprising reference voice data obtained by the corresponding device collecting the sound of a sound source target, and the device identifier of that device (201); and determining, according to the plurality of pieces of sound collection data and a pre-trained distance comparison model, at least one relative distance identifier in one-to-one correspondence with at least one of the plurality of devices, wherein each relative distance identifier indicates the position of the corresponding device in a device distance relationship sequence, the device distance relationship sequence being formed by sorting the plurality of devices by distance according to a preset sorting strategy, the distance being the distance between a device and the sound source target (202). By predicting a relative distance relationship that contains global information by means of the distance comparison model, the method helps improve the efficiency and applicability of relative distance relationship prediction.
Description
The present application relates to the technical field of voice assistants, and in particular to methods for distance relationship determination, device control, and model training, and to related apparatuses.
As artificial intelligence technology enters its third wave, voice assistants have gradually reached many aspects of daily life and are built into smart devices such as mobile phones, watches, speakers, and televisions. Because of this variety of devices, multiple devices with voice assistant functions may exist in the same space.
At present, most nearest-wakeup product solutions use a microphone array to measure the distance to the sound source, determine the device closest to the sound source by comparing distances, and wake up that device to execute the user's instruction.
SUMMARY OF THE INVENTION
The present application provides methods for distance relationship determination, device control, and model training, and related apparatuses, with a view to improving the comprehensiveness, efficiency, and convenience with which the arbitration device of a nearest-wakeup product solution computes the distance between a sound source target and each device.
In a first aspect, the present application provides a method for determining a distance relationship, applied to an arbitration device, the method including:
acquiring a plurality of pieces of sound collection data in one-to-one correspondence with a plurality of devices, each piece of sound collection data including reference voice data obtained by the corresponding device collecting the sound of a sound source target, and the device identifier of that device;
determining, according to the reference voice data and device identifiers in the plurality of pieces of sound collection data and a pre-trained distance comparison model, at least one relative distance identifier in one-to-one correspondence with at least one of the plurality of devices, each relative distance identifier indicating the position of the corresponding device in a device distance relationship sequence, the device distance relationship sequence being formed by sorting the plurality of devices by distance according to a preset sorting strategy, the distance being the distance between a device and the sound source target.
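The determination step above can be sketched as follows. This is a minimal illustration, not the claimed model: `score_nearness` is a hypothetical stand-in for the trained distance comparison model, and each device's relative distance identifier is derived as its 1-based position in the distance relationship sequence (nearest first).

```python
# Minimal sketch of deriving relative distance identifiers from sound
# collection data. `score_nearness` is a hypothetical stand-in for the
# trained distance comparison model: higher score = predicted closer.

def relative_distance_ids(sound_data, score_nearness):
    """sound_data: list of (device_id, reference_voice_data) pairs.

    Returns {device_id: identifier}, where the identifier is the 1-based
    position in the device distance relationship sequence (1 = nearest).
    """
    scored = [(device_id, score_nearness(voice)) for device_id, voice in sound_data]
    # Sort by descending nearness score to form the distance relationship sequence.
    ordered = sorted(scored, key=lambda item: item[1], reverse=True)
    return {device_id: rank for rank, (device_id, _) in enumerate(ordered, start=1)}

# Toy usage: fixed scores stand in for model outputs on each device's recording.
fake_model = {"speaker": 0.9, "tv1": 0.7, "pc": 0.2}.get
ids = relative_distance_ids(
    [("speaker", "speaker"), ("tv1", "tv1"), ("pc", "pc")],
    fake_model,
)
# ids == {"speaker": 1, "tv1": 2, "pc": 3}
```

Note that only an ordering is produced: no absolute distance is computed for any device, which is the point of the relative distance identifier.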
It can be seen that, in this example, the arbitration device first acquires a plurality of pieces of sound collection data from the plurality of devices, and then, according to the reference voice data and device identifiers in that data together with a pre-trained distance comparison model, determines at least one relative distance identifier in one-to-one correspondence with at least one of the devices. Each relative distance identifier indicates the position of the corresponding device in the device distance relationship sequence, which is formed by sorting the devices by their distance to the sound source target according to a preset sorting strategy; the model therefore predicts each device's relative position in the distance sequence. For example, if device 1, device 2, and device 3 are sorted from nearest to farthest from the sound source target to form the sequence device 3 → device 2 → device 1, the prediction result can indicate that device 3 is closest to the sound source target by assigning device 3 the relative distance identifier 1. In contrast to existing models that predict absolute distances in isolation, the present application uses the distance comparison model to predict a relative distance relationship containing global information about the multi-device voice interaction scene. Moreover, since each piece of sound collection data includes the reference voice data obtained by the corresponding device collecting the sound of the sound source target together with the device identifier, collecting the reference voice data imposes no restriction on the number of audio channels. This overcomes the high hardware requirements and algorithmic complexity of conventional microphone-array sound source localization, and helps improve the efficiency and applicability of relative distance relationship prediction.
In a second aspect, the present application provides a device control method, applied to a target device, the method including:
acquiring indication information from an arbitration device, the indication information being generated when the arbitration device determines, according to at least one relative distance identifier in one-to-one correspondence with at least one of a plurality of devices, that the target device among the plurality of devices is to execute the voice instruction associated with the sound of a sound source target, the at least one relative distance identifier being obtained by the arbitration device performing the following operations: acquiring a plurality of pieces of sound collection data in one-to-one correspondence with the plurality of devices, each piece of sound collection data including reference voice data obtained by the corresponding device collecting the sound of the sound source target and the device identifier of that device; and determining the at least one relative distance identifier according to the reference voice data and device identifiers in the plurality of pieces of sound collection data and a pre-trained distance comparison model, each relative distance identifier indicating the position of the corresponding device in a device distance relationship sequence, the device distance relationship sequence being formed by sorting the plurality of devices by distance according to a preset sorting strategy, the distance being the distance between a device and the sound source target;
performing, according to the indication information, the operation indicated by the voice instruction associated with the sound of the sound source target.
It can be seen that, in this example, the target device first acquires the indication information from the arbitration device and then performs the operation indicated by the voice instruction associated with the sound of the sound source target. The indication information is generated when the arbitration device determines, from the relative distance identifiers, that the target device is to execute that voice instruction, and each relative distance identifier indicates the position, within the distance sequence, of the distance between the corresponding device and the sound source target, the distance sequence being formed by sorting the distances between each of the plurality of devices and the sound source target according to a preset sorting strategy. Compared with existing solutions that determine the nearest device to wake from absolute device-to-source distances, the present application predicts, via the distance comparison model, a relative distance relationship containing global information about the multi-device voice interaction scene, and determines the target device to be woken according to that relative distance relationship. At the same time, since each piece of sound collection data includes the reference voice data obtained by the corresponding device collecting the sound of the sound source target together with the device identifier, collecting the reference voice data imposes no restriction on the number of audio channels. This overcomes the high hardware requirements and algorithmic complexity of conventional microphone-array sound source localization, and helps improve the efficiency and applicability of relative distance relationship prediction.
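The arbitration and notification flow of this aspect can be sketched minimally as follows. The `send_indication` callback is a hypothetical transport stub (not part of the source); the sketch only shows selecting the device whose relative distance identifier marks it as nearest and notifying it.

```python
# Hedged sketch of the arbitration step: pick the device whose relative
# distance identifier places it first (nearest) in the device distance
# relationship sequence, then notify it so it executes the voice instruction.
# `send_indication` is a hypothetical stub standing in for the real transport.

def pick_target(relative_ids):
    """relative_ids: {device_id: 1-based position in the distance sequence}."""
    return min(relative_ids, key=relative_ids.get)

def arbitrate(relative_ids, send_indication):
    target = pick_target(relative_ids)
    send_indication(target)  # the target device then executes the voice instruction
    return target

# Toy usage: "speaker" holds identifier 1, so it is selected and notified.
notified = []
target = arbitrate({"tv1": 2, "speaker": 1, "pc": 3}, notified.append)
# target == "speaker"; notified == ["speaker"]
```

A nearest-first policy is assumed here; the preset sorting strategy in the claims could equally select by a different position in the sequence.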
In a third aspect, the present application provides a method for training a distance comparison model, including:
acquiring training data, the training data including a plurality of voice data sets, each voice data set containing a plurality of pieces of reference voice data in one-to-one correspondence with a plurality of devices, each piece of reference voice data being voice data obtained by the corresponding device collecting the sound of a sound source target, the plurality of voice data sets corresponding to voice data sets collected in different sound collection environments, and a sound collection environment including at least the location of the sound source target;
training a preset distance comparison model according to the reference voice data of the plurality of voice data sets and a preset loss function to obtain a trained distance comparison model, the loss function characterizing the loss of the distance comparison model in terms of the accuracy with which it predicts, for a device pairing group, the relative distance relationship between the two devices of the group and the sound source target in the same sound collection environment, a device pairing group consisting of any two of the plurality of devices.
It can be seen that, in this example, the device first acquires training data and then trains a preset distance comparison model according to the reference voice data of the plurality of voice data sets and a preset loss function, obtaining a trained distance comparison model. Because the training data includes multiple voice data sets, each containing reference voice data in one-to-one correspondence with the devices and collected in different sound collection environments, and because the loss function characterizes the model's loss in terms of how accurately it predicts the relative distance relationship between the two devices of each pairing group and the sound source target in the same environment, the trained model acquires the ability to predict the relative nearness of devices to the sound source target. The reference voice data imposes no restriction on the number of audio channels, so the approach overcomes the high hardware requirements and algorithmic complexity of conventional microphone-array sound source localization, and helps improve the efficiency and applicability of relative distance relationship prediction.
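A pairwise loss of the kind described above can be sketched as follows. The patent does not fix the exact form of the loss function, so a logistic (RankNet-style) pairwise term is assumed here purely for illustration: for every device pairing group in the same environment, the model's nearness scores should rank the truly closer device higher, and the loss is small exactly when they do.

```python
import math
from itertools import combinations

# Illustrative pairwise comparison loss (an assumed logistic form, not the
# patent's exact function). For each device pairing group in one sound
# collection environment, the loss penalizes scoring the truly farther
# device as nearer than the truly closer one.

def pairwise_loss(scores, true_distances):
    """scores: {device_id: model nearness score, higher = predicted closer}.
    true_distances: {device_id: ground-truth distance to the sound source}.
    Returns the mean pairwise loss over all device pairing groups.
    """
    total, pairs = 0.0, 0
    for a, b in combinations(scores, 2):
        if true_distances[a] == true_distances[b]:
            continue  # no relative distance relationship to learn from this pair
        # Order the pair so `near` is the device that is truly closer.
        near, far = (a, b) if true_distances[a] < true_distances[b] else (b, a)
        margin = scores[near] - scores[far]
        total += math.log1p(math.exp(-margin))  # small when the ranking is correct
        pairs += 1
    return total / max(pairs, 1)

# A model that ranks the devices correctly incurs a lower loss than one
# that ranks them in reverse.
truth = {"speaker": 0.5, "tv1": 0.6, "pc": 1.2}
good = pairwise_loss({"speaker": 2.0, "tv1": 1.0, "pc": -1.0}, truth)
bad = pairwise_loss({"speaker": -1.0, "tv1": 1.0, "pc": 2.0}, truth)
# good < bad
```

Minimizing such a loss requires only relative ordering supervision per environment, which matches the claim's framing of accuracy over device pairing groups rather than absolute distance regression.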
In a fourth aspect, the present application provides an apparatus for determining a distance relationship, applied to an arbitration device, the apparatus including:
an acquiring unit, configured to acquire a plurality of pieces of sound collection data in one-to-one correspondence with a plurality of devices, each piece of sound collection data including reference voice data obtained by the corresponding device collecting the sound of a sound source target, and the device identifier of that device;
a determining unit, configured to determine, according to the reference voice data and device identifiers in the plurality of pieces of sound collection data and a pre-trained distance comparison model, at least one relative distance identifier in one-to-one correspondence with at least one of the plurality of devices, each relative distance identifier indicating the position of the corresponding device in a device distance relationship sequence, the device distance relationship sequence being formed by sorting the plurality of devices by distance according to a preset sorting strategy, the distance being the distance between a device and the sound source target.
In a fifth aspect, the present application provides a device control apparatus, applied to an electronic device, the apparatus including:
an acquiring unit, configured to acquire indication information from an arbitration device, the indication information being generated when the arbitration device determines, according to at least one relative distance identifier in one-to-one correspondence with at least one of a plurality of devices, that the target device among the plurality of devices is to execute the voice instruction associated with the sound of a sound source target, the at least one relative distance identifier being obtained by the arbitration device performing the following operations: acquiring a plurality of pieces of sound collection data in one-to-one correspondence with the plurality of devices, each piece of sound collection data including reference voice data obtained by the corresponding device collecting the sound of the sound source target and the device identifier of that device; and determining the at least one relative distance identifier according to the reference voice data and device identifiers in the plurality of pieces of sound collection data and a pre-trained distance comparison model, each relative distance identifier indicating the position of the corresponding device in a device distance relationship sequence, the device distance relationship sequence being formed by sorting the plurality of devices by distance according to a preset sorting strategy, the distance being the distance between a device and the sound source target;
an executing unit, configured to perform, according to the indication information, the operation indicated by the voice instruction associated with the sound of the sound source target.
In a sixth aspect, the present application provides an apparatus for training a distance comparison model, including:
an acquiring unit, configured to acquire training data, the training data including a plurality of voice data sets, each voice data set containing a plurality of pieces of reference voice data in one-to-one correspondence with a plurality of devices, each piece of reference voice data being voice data obtained by the corresponding device collecting the sound of a sound source target, the plurality of voice data sets corresponding to voice data sets collected in different sound collection environments, and a sound collection environment including at least the location of the sound source target;
a training unit, configured to train a preset distance comparison model according to the reference voice data of the plurality of voice data sets and a preset loss function to obtain a trained distance comparison model, the loss function characterizing the loss of the distance comparison model in terms of the accuracy with which it predicts, for a device pairing group, the relative distance relationship between the two devices of the group and the sound source target in the same sound collection environment, a device pairing group consisting of any two of the plurality of devices.
In a seventh aspect, the present application provides an electronic device, including: one or more processors;
one or more memories for storing programs,
the one or more memories and the programs being configured such that the one or more processors control the electronic device to execute instructions for the steps of any method of the first, second, or third aspect of the embodiments of the present application.
In an eighth aspect, the present application provides a chip, including: a processor configured to call and run a computer program from a memory, so that a device on which the chip is installed executes some or all of the steps described in any method of the first, second, or third aspect of the embodiments of the present application.
In a ninth aspect, the present application provides a computer-readable storage medium storing a computer program for electronic data exchange, the computer program causing a computer to execute some or all of the steps described in any method of the first, second, or third aspect of the embodiments of the present application.
In a tenth aspect, the present application provides a computer program operable to cause a computer to execute some or all of the steps described in any method of the first, second, or third aspect of the embodiments of the present application. The computer program may be a software installation package.
In order to explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1a is a schematic diagram of user control in a multi-device scenario provided by an embodiment of the present application;
FIG. 1b is an architecture diagram of a device control system 10 provided by an embodiment of the present application;
FIG. 1c is a schematic diagram of a functional interface of an intelligent voice assistant provided by an embodiment of the present application;
FIG. 1d is a schematic structural diagram of an electronic device provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of a method for determining a distance relationship provided by an embodiment of the present application;
FIG. 3 is a schematic flowchart of a device control method provided by an embodiment of the present application;
FIG. 4 is a schematic flowchart of a method for training a distance comparison model provided by an embodiment of the present application;
FIG. 5 is a block diagram of the functional units of an apparatus for determining a distance relationship provided by an embodiment of the present application;
FIG. 6 is a block diagram of the functional units of another apparatus for determining a distance relationship provided by an embodiment of the present application;
FIG. 7 is a block diagram of the functional units of a device control apparatus provided by an embodiment of the present application;
FIG. 8 is a block diagram of the functional units of another device control apparatus provided by an embodiment of the present application;
FIG. 9 is a block diagram of the functional units of an apparatus for training a distance comparison model provided by an embodiment of the present application;
FIG. 10 is a block diagram of the functional units of another apparatus for training a distance comparison model provided by an embodiment of the present application.
To enable those skilled in the art to better understand the solutions of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present application.
The terms "first", "second", and the like in the specification, claims, and drawings of the present application are used to distinguish different objects, not to describe a specific order. Furthermore, the terms "including" and "having", and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device comprising a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or other steps or units inherent to the process, method, product, or device.
Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. The appearance of this phrase in various places in the specification does not necessarily refer to the same embodiment, nor to a separate or alternative embodiment mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
At present, as shown in FIG. 1a, the space in which the user is located contains a smart speaker (0.5 m from the user), smart TV 1 (0.6 m from the user), a computer (1.2 m from the user), and smart TV 2 (0.55 m from the user). When the user wants to listen to music and issues a "play music" instruction, the current intelligent voice assistant can only measure the distance between a device that contains a microphone array and the sound source; if the computer does not contain a microphone array, the assistant cannot compute its distance and therefore cannot accurately implement nearest-device wakeup control.
In response to the above problems, the embodiments of the present application provide methods for distance relationship determination, device control, and model training, and related apparatuses, which are described in detail below with reference to the accompanying drawings.
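The desired behavior in the FIG. 1a scenario can be made concrete with a small sketch: given the distances stated in the figure description, the device distance relationship sequence (nearest first) and hence the device that should be woken follow directly by sorting.

```python
# Ground-truth ordering for the FIG. 1a scenario: sort the devices by their
# stated distance to the user (the sound source target), nearest first.
distances = {
    "smart speaker": 0.5,
    "smart TV 1": 0.6,
    "computer": 1.2,
    "smart TV 2": 0.55,
}

sequence = sorted(distances, key=distances.get)  # device distance relationship sequence
nearest = sequence[0]                            # device that should be woken

# sequence == ["smart speaker", "smart TV 2", "smart TV 1", "computer"]
# nearest == "smart speaker"
```

The point of the embodiments below is to recover this ordering from the devices' own audio recordings when absolute distances like these are unavailable, e.g. for the computer without a microphone array.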
Please refer to FIG. 1b, which shows a device control system 10 provided by an embodiment of the present application. The device control system 10 includes electronic devices 100 with sound collection capability (for example, a smart TV, a smart speaker, or a smartphone), an arbitration device 200 on which an intelligent voice assistant is installed, and a server 300. The arbitration device may be any one of the electronic devices 100; it may also be any mobile device such as the user's mobile phone, a dedicated control box in a smart home scenario, a server in the cloud, or a device group consisting of multiple devices that jointly perform data processing. The arbitration device 200 is communicatively connected to the electronic devices 100 and the server 300, forming a device control network in a smart home scenario.
The intelligent voice assistant can be installed on various devices, such as mobile phones, to support the device control method of the present application; the specific function names and interface interaction modes it presents may vary and are not uniquely limited here. For example, it may be installed on a mobile phone and present the settings interface of the "Breeno" smart assistant shown in FIG. 1c. The illustrated interface includes function settings for one-key instructions, including a navigate-home function, a nearby function, an arrival-at-home reminder function, a screenshot-with-frame function, and a multi-device control function. In the graphic label of the multi-device control function, the numeric tag on each device can be used to indicate how near that device is to the sound source target, i.e., the user.
It should be noted that the arbitration device 200, as the policy-executing device of the embodiments of the present application, may exchange data and signaling with other devices (such as the electronic devices 100 and the server 300) in a variety of ways, which are not uniquely limited here. For example, the arbitration device 200 may connect directly to an electronic device 100 to obtain corresponding information, or may connect to the server 300 through a mobile communication network to implement the corresponding information exchange.
Referring to FIG. 1d, FIG. 1d is a schematic structural diagram of an electronic device provided by an embodiment of the present application. The electronic device is applied to the above device control system 10 and includes an application processor 120, a memory 130, a communication module 140, and one or more programs 131. The application processor 120 is communicatively connected to the memory 130 and the communication module 140 through an internal communication bus.
The one or more programs 131 are stored in the memory 130 and are configured to be executed by the application processor 120. The one or more programs 131 include instructions for performing any step in the above method embodiments.
The application processor 120 may be, for example, a central processing unit (CPU), a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It can implement or execute the various exemplary logical blocks, units, and circuits described in connection with this disclosure. The processor may also be a combination that implements computing functions, such as a combination of one or more microprocessors or a combination of a DSP and a microprocessor. The communication unit may be the communication module 140, a transceiver, a transceiver circuit, or the like, and the storage unit may be the memory 130.
The memory 130 may be a volatile memory or a non-volatile memory, or may include both. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
In specific implementation, the application processor 120 is configured to perform any step performed by the arbitration device, the target device, or the model training device in the method embodiments of the present application.
Referring to FIG. 2, FIG. 2 is a schematic flowchart of a distance relationship determination method provided by an embodiment of the present application, applied to the arbitration device 200 in the above device control system 10. As shown in the figure, the method includes the following operations.
Step 201: Acquire multiple pieces of sound collection data in one-to-one correspondence with multiple devices, where each piece of sound collection data includes reference voice data obtained by the corresponding device collecting the sound of a sound source target, and the device identifier of that device.
The multiple devices may be the electronic devices 100 in the above device control system 10; neither their number nor their type is uniquely limited.
The sound source target includes a user or a sound-producing device, which is not uniquely limited here. The sound of the sound source target may be a wake-up utterance, for example "Hello Xiaoou". In addition, in a heterogeneous distributed scenario, the devices can wait for the wake-up utterance simultaneously. When the user speaks the wake-up utterance, each device obtains its own monophonic segment of that utterance.
The reference voice data may first be passed through a 4 kHz low-pass filter to suppress the non-human-voice portion of the audio.
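As an illustration of such a low-pass step, the following sketch builds a windowed-sinc FIR low-pass filter at 4 kHz. The patent does not specify a filter design, so the particulars here (101 taps, Hamming window, 16 kHz sampling rate) are assumptions made only for illustration.

```python
import numpy as np

def lowpass_fir(cutoff_hz, fs_hz, num_taps=101):
    """Windowed-sinc FIR low-pass filter coefficients (Hamming window)."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = 2 * cutoff_hz / fs_hz * np.sinc(2 * cutoff_hz / fs_hz * n)
    h *= np.hamming(num_taps)
    return h / h.sum()  # normalize to unity gain at DC

fs = 16000
h = lowpass_fir(4000, fs)

t = np.arange(fs) / fs
speech_band = np.sin(2 * np.pi * 1000 * t)  # 1 kHz tone: inside the passband
noise_band = np.sin(2 * np.pi * 6000 * t)   # 6 kHz tone: inside the stopband
kept = np.convolve(speech_band, h, mode="same")
cut = np.convolve(noise_band, h, mode="same")
```

Applying the same filter to a real reference voice signal would attenuate energy above roughly 4 kHz while leaving the voice band largely intact.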
In specific implementation, the reference voice data may be voice data that has not yet undergone frequency response capability alignment and/or feature extraction and feature fusion. In this case, the arbitration device uniformly performs the relevant preprocessing on the reference voice data of the multiple devices.
Alternatively, the reference voice data may be voice data that each device has already processed through frequency response capability alignment and/or feature extraction and feature fusion. In this case, the arbitration device no longer performs unified preprocessing; instead, after obtaining the reference voice data and device identifiers of the multiple devices, it invokes the pre-trained distance comparison model to predict at least one relative distance identifier for at least one device.
Step 202: Determine, according to the reference voice data and device identifiers in the multiple pieces of sound collection data and a pre-trained distance comparison model, at least one relative distance identifier in one-to-one correspondence with at least one of the multiple devices. Each relative distance identifier indicates the position of the corresponding device in a device distance relationship sequence, which is the sequence formed by sorting the multiple devices according to a preset sorting strategy for distance, the distance being that between a device and the sound source target.
The preset sorting strategy may include sorting from largest to smallest or from smallest to largest, which is not uniquely limited here.
The reference voice data includes monophonic voice data or multi-channel voice data; that is, no high requirement is imposed on the voice collection capability of the devices, making the method more convenient to apply.
The relative distance identifier may be a number (such as 1/2/3/4), a graphic (such as line segments of different lengths), or the like, which is not uniquely limited here.
For example, suppose the multiple devices include device A, device B, device C, and device D, the device distance relationship sequence is device A→device D→device B→device C, and the current sorting relationship of the distances is from near to far, i.e., device A is closest to the sound source target and device C is farthest. The prediction result may then be: the relative distance identifier of device A is 1 (closest), that of device D is 2, that of device B is 3, and that of device C is 4 (farthest).
In specific implementation, the arbitration device can directly select the target device closest to the sound source target as the wake-up device and execute the user's voice instruction through that target device. In addition, for the special case where two devices receive the same prediction result in the distance comparison, the arbitration device can further query a preset device wake-up priority set to determine which device to wake up first. This can improve the accuracy and success rate of device control.
In specific implementation, if the input of the distance comparison model contains only audio data, the output of the model is a relative distance identifier for each piece of voice data, and the arbitration device can further determine the relative distance identifier of the corresponding device according to the correspondence between the voice data and the device identifiers. In this case the device identifier is needed only to resolve which device the voice data in the prediction result belongs to. That is, the voice data corresponds to the device identifier, the model input does not include the device identifier, the prediction results correspond one-to-one with the voice data, and the prediction results correspond to the device identifiers indirectly through the voice data.
If the input of the distance comparison model contains both audio data and device identifiers, for example mobile phone 1, mobile phone 2, and mobile phone 3, where mobile phones 1 and 2 are of device type 1 and mobile phone 3 is of device type 2, then the identifier of mobile phone 1 may be type 1 + name of mobile phone 1, that of mobile phone 2 may be type 1 + name of mobile phone 2, and that of mobile phone 3 may be type 2 + name of mobile phone 3. The output of the distance comparison model is then directly the relative distance identifier of each device, i.e., the prediction results correspond directly to the device identifiers.
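The wake-up selection logic described above, i.e., picking the closest device and falling back to a preset priority on ties, can be sketched as follows. The device names, rank values, and priority table are hypothetical and serve only to illustrate the rule.

```python
# Hypothetical relative distance identifiers predicted by the model
# (lower value = closer to the sound source target).
predicted_rank = {"speaker": 1, "tv": 2, "phone": 2, "watch": 4}

# Hypothetical preset wake-up priorities, used only to break ties
# (higher value = preferred device).
wake_priority = {"phone": 3, "speaker": 2, "tv": 1, "watch": 0}

def pick_wakeup_device(ranks, priority):
    """Pick the closest device; resolve equal ranks by wake-up priority."""
    return min(ranks, key=lambda d: (ranks[d], -priority[d]))

closest = pick_wakeup_device(predicted_rank, wake_priority)
```

With the values above, `closest` is "speaker"; if "speaker" were absent, the tie between "tv" and "phone" (both rank 2) would be resolved in favor of "phone" by its higher priority.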
In a possible example, determining, according to the reference voice data and device identifiers in the multiple pieces of sound collection data and the pre-trained distance comparison model, at least one relative distance identifier in one-to-one correspondence with at least one of the multiple devices includes: performing frequency response capability alignment on each piece of reference voice data in the multiple pieces of sound collection data to obtain multiple pieces of aligned target voice data in one-to-one correspondence with the multiple pieces of sound collection data; and determining, according to the multiple pieces of target voice data, the device identifiers in the multiple pieces of sound collection data, and the pre-trained distance comparison model, at least one relative distance identifier in one-to-one correspondence with at least one of the multiple devices.
It can be seen that in this example, because sensor performance differs across device types and models, the activation voice signals collected at the same distance differ significantly. Frequency response capability alignment therefore adapts the voice signals obtained by heterogeneous devices to the same standard, providing a unified data basis for distance comparison. Compared with existing solutions that simply normalize by voice energy ratio, this is beneficial for improving the accuracy of the model prediction results.
In this possible example, performing frequency response capability alignment on the multiple pieces of reference voice data to obtain multiple pieces of aligned target voice data includes performing the following operations for each piece of reference voice data: obtaining, according to the device identifier of the currently processed reference voice data, the preset frequency response unit impulse response of the current device relative to a reference device; and convolving the reference voice data with that frequency response unit impulse response to perform gain adjustment, obtaining target voice data with aligned frequency response capability.
In specific implementation, if the current device is the reference device itself, the reference voice data remains unchanged after adjustment. The reference device may be a designated device among the multiple devices.
The frequency response unit impulse response of the current collection device relative to the reference device is determined through the following steps a1 to a5:

Step a1: Suppose the total number of device types to be aligned is L, where L is a positive integer. Place the devices at the same distance from a sound source, play a 0-8 kHz swept-frequency signal, and record each device's frequency response curve. N recordings are made at each distance, yielding N frequency response curves per device, and M batches are collected under different sound source distance conditions (e.g., 0.5 m, 0.8 m, 1 m, 1.2 m, 1.5 m, 1.8 m, 2 m, 2.2 m, 2.5 m, 2.8 m, 3 m). Denote by F_{i,j}^l(k) the frequency response curve of the j-th recording in the i-th batch for the l-th device type, where l = 1, 2, ..., L; i = 1, 2, ..., M; j = 1, 2, ..., N; and L, M, N are positive integers.

Step a2: Compute the statistical average of all frequency response curves of each device as its final frequency response curve:

F^l(k) = (1/(M·N)) Σ_{i=1}^{M} Σ_{j=1}^{N} F_{i,j}^l(k), k = 1, 2, ..., K,

where K, a positive integer, is the number of points in the frequency response curve.

Step a3: Select one device type as the reference device, denoted b ∈ [1, 2, ..., L]; it may, for example, be the device the user uses most frequently. Compute the ratio between the reference device's frequency response curve and each device's curve to obtain the inter-device frequency response transfer function:

T^l(k) = F^b(k) / F^l(k).

If l = b, the reference device does not need to align its capability with itself.

Step a4: Apply the inverse discrete Fourier transform (IDFT) to the frequency response transfer function to obtain the corresponding frequency response unit impulse response:

t_l(n) = IDFT{T^l(k)}.

Step a5: Store the frequency response unit impulse response t_l(n) on the corresponding device.
It can be seen that in this example, because sensor performance differs across device types and models, the activation voice signals collected at the same distance differ significantly. The frequency response unit impulse response is therefore obtained from the frequency response curves, adapting the signals obtained by heterogeneous devices to the same standard and providing a unified data basis for distance comparison. Compared with existing solutions that simply normalize by voice energy ratio, this is beneficial for improving the accuracy of the model prediction results.
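Steps a2-a5 and the subsequent alignment convolution can be sketched as follows. The frequency response curves here are synthetic stand-ins (an assumption, since real curves come from the sweep recordings), with the device-under-test curve set to exactly half the reference curve so that the effect of the alignment is easy to verify.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 256  # number of points per frequency response curve

# Stand-ins for the averaged curves of step a2 (magnitudes > 0).
F_ref = 1.0 + 0.3 * rng.random(K)  # reference device b
F_dev = 0.5 * F_ref                # device l: uniformly half the response

# Step a3: inter-device frequency response transfer function.
H = F_ref / F_dev

# Step a4: inverse DFT gives the unit impulse response t_l(n).
t_l = np.fft.ifft(H).real

# Alignment: convolve the captured speech with t_l(n).
speech = rng.standard_normal(1024)
aligned = np.convolve(speech, t_l, mode="full")
```

Because the synthetic device curve is exactly half the reference curve, the transfer function is a constant 2, its impulse response is a single scaled impulse, and the aligned signal is the captured signal with a 2x gain, which is the intended behavior of the alignment step.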
In a possible example, determining, according to the target voice data, the device identifiers in the multiple pieces of sound collection data, and the pre-trained distance comparison model, at least one relative distance identifier in one-to-one correspondence with at least one of the multiple devices includes: performing multi-dimensional feature extraction on each piece of target voice data to obtain multiple voice feature sets in one-to-one correspondence with the multiple pieces of target voice data, where each voice feature set includes multi-dimensional feature extraction results; performing feature fusion on each voice feature set to obtain multiple fused voice features; and determining, according to the multiple fused voice features, the device identifiers in the multiple pieces of sound collection data, and the pre-trained distance comparison model, at least one relative distance identifier in one-to-one correspondence with at least one of the multiple devices.
In this possible example, performing multi-dimensional feature extraction on each piece of target voice data to obtain multiple voice feature sets includes performing the following operations for each piece of target voice data: extracting scalar voice features and vector voice features of the currently processed target voice data; and performing dimensionality reduction and secondary feature extraction on the vector voice features to obtain vector-derived voice features.
In this possible example, extracting the scalar voice features and vector voice features of the currently processed target voice data includes: preprocessing the currently processed target voice data to obtain preprocessed target voice data, and extracting the scalar voice features and vector voice features of the preprocessed target voice data, where the preprocessing includes at least one of the following: silence suppression processing, pre-emphasis processing through a high-frequency filter, framing processing, and windowing processing.
The purpose of the silence suppression (voice activity detection, VAD) processing is to identify and eliminate long silent periods so as to extract the most effective voice segments for use in subsequent steps; for example, the WebRTC VAD algorithm can be used. The purpose of the pre-emphasis processing is to compensate for the high-frequency components of the voice signal lost due to the influence of the vocal system and to highlight the high-frequency formants. Pre-emphasis is implemented through a high-frequency filter whose transfer function is:

H(z) = 1 - μz^{-1},

where z is the Z-transform variable of the voice signal and μ is the pre-emphasis coefficient, generally between 0.9 and 1.0; in this application it may be, for example, 0.97.

In addition, to facilitate processing, the voice signal needs to be divided into frames according to its short-time stationarity. In this solution, the frame length is set to 25 ms and the frame shift to 10 ms, i.e., two adjacent frames overlap by 15 ms, which avoids the effect of excessive change between the voice signals of adjacent frames. For convenience, s_i[n] denotes the data of the i-th frame.

After framing, to eliminate possible signal discontinuities at both ends of each frame and prevent spectral leakage, windowing is applied to obtain s_{i,w}[n] = s_i[n] × w[n], where w[n] is the window function. The embodiments of this application use the Hamming window, whose formula is:

w[n] = 0.54 - 0.46 cos(2πn / (N - 1)), 0 ≤ n ≤ N - 1,

where N is the length of s_i[n].
In specific implementation, the scalar voice features listed in Table 1 can be extracted. The scalar voice features are extracted according to their definitions, and the specific calculation processes are not repeated here.
Table 1 Scalar voice features

Feature type | English name
LP | Linear Prediction
LPRR | LP Residual Ratio (peak-to-RMS)
LPRK | LP Residual Kurtosis
LPRHP | LP Residual Histogram Peak
SPSK | Spectrogram Skewness
SHPP | Spectrogram Histogram Peak Position
The above scalar voice features are all feature values of shape (1,), mainly characterizing the properties of the user's current sound field, such as reverberation, reflection, sound transmission gain, and noise. Here the notation (1,) denotes the data dimension: (1,) denotes a 1x1 vector and (245,) a 245x1 vector, where the 245x1 vector is formed by concatenating several preceding features. For example, given features of shapes (a,), (b,), ..., the final concatenation is (a+b+...,) = (245,). This notation is for a single piece of voice data, which yields a (245,), i.e., 245x1, feature vector. In actual calculation, multiple voice recordings are processed in batches; if the batch size is N (a positive integer), each feature correspondingly becomes a batch of features, and the dimensions change: (1,) → (1, N), i.e., 1xN, and likewise (245,) → (245, N), i.e., 245xN.
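The shape bookkeeping described above can be sketched as follows; the feature values are hypothetical placeholders.

```python
import numpy as np

# Hypothetical per-utterance scalar features, each of shape (1,).
lp = np.array([0.4])
lprr = np.array([2.1])
lprk = np.array([3.7])

# Concatenation follows the (a,), (b,), ... -> (a+b+...,) pattern.
utterance_vec = np.concatenate([lp, lprr, lprk])

# Processing a batch of N utterances turns each (d,) vector into (d, N).
N = 5
batch = np.stack([utterance_vec] * N, axis=1)
```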
In specific implementation, the 8 vector voice features shown in Table 2 can be extracted. The vector voice features are extracted using existing methods, and the specific calculation processes are not repeated here.
Table 2 Vector voice features

Feature type | English name
MFCC | Mel-Frequency Cepstral Coefficients
LPCC | Linear Predictive Cepstral Coefficients
MHEC | Mean Hilbert Envelope Coefficients
BFCC | Bark-Frequency Cepstral Coefficients
LFCC | Linear-Frequency Cepstral Coefficients
GFCC | Gammatone-Frequency Cepstral Coefficients
NGCC | Normalized Gammachirp Cepstral Coefficients
MSRCC | Magnitude-based Spectral Root Cepstral Coefficients
The above vector voice features are all feature matrices of shape (12, N_f), where 12 is the feature dimension, i.e., each vector voice feature has 12-dimensional feature components, and N_f is the number of voice frames. The vector voice features can characterize the pickup properties of different sound collection devices, such as spectrum, pitch, and formants.
In addition, the above 8 vector voice features have a large number of parameters, which makes them inconvenient to combine directly with the scalar features; they can be further processed through vector-derived feature extraction. Vector-derived feature extraction includes three functional modules: feature component screening, differential feature calculation, and vector feature scalarization, which are introduced below.
(1) Feature component screening module
The 8 vector voice features are all feature matrices of shape (12, N_f), whose parameter count is too large for direct use. Moreover, the feature components of each dimension contribute differently to model training: some carry little information and some may carry redundant information, so the components of the vector voice features need to be screened. This solution uses the Fisher criterion to evaluate the discriminative ability of each feature component; the Fisher criterion is:

r_Fisher = σ_b / σ_w,

where r_Fisher is the Fisher ratio of a feature component (the larger the value, the stronger the discriminative ability of that component); σ_b is the between-class variance of the feature component, i.e., the variance of the means of the voice feature component across different distance types; and σ_w is the within-class variance of the feature component, i.e., the variance of the voice feature component around its mean within the same distance type. σ_b and σ_w are computed as:

σ_b = (1/M) Σ_{i=1}^{M} (m̄_k^{(i)} - m_k)²,

σ_w = (1/M) Σ_{i=1}^{M} (1/n_i) Σ_{c=1}^{n_i} (x_{k,c}^{(i)} - m̄_k^{(i)})²,

where M is the number of samples, m̄_k^{(i)} is the mean of the k-th dimension component of a feature vector of the i-th voice sample, m_k is the mean of the k-th dimension component over all samples, n_i is the number of frames of the i-th voice sample, and x_{k,c}^{(i)} is the value of the k-th dimension component at frame c of the i-th voice sample.
As described above, the feature component screening module uses the Fisher criterion to select, from each of the 8 vector voice features, the 3 feature components with the largest Fisher ratios, so that each vector voice feature is reduced from (12, N_f) to (3, N_f), greatly reducing the number of feature parameters.
(2) Differential feature calculation module
The above 8 vector voice features can only reflect the static characteristics of the voice; their dynamic characteristics can be described by the differential parameters of the vector voice features, computed as:

Δc_l^t = Σ_{θ=1}^{Θ} θ (c_l^{t+θ} - c_l^{t-θ}) / (2 Σ_{θ=1}^{Θ} θ²),

where c_l denotes the l-th dimension feature component of a vector voice feature, Δc_l^t denotes the first-order difference value of that component at frame t, and Θ is a constant indicating that the difference window size is 2Θ + 1; in this solution Θ = 2. Through this formula, the first-order differences of the 8 vector voice features are obtained, denoted ΔMFCC, ΔLPCC, ΔMHEC, ΔBFCC, ΔLFCC, ΔGFCC, ΔNGCC, and ΔMSRCC respectively; these vector differential features are feature matrices of shape (3, N_f).
(3) Vector feature scalarization module
For the 8 vector voice features such as MFCC and the 8 vector differential features such as ΔMFCC, the number of frames N_f of a long voice recording is often very large, i.e., the feature dimension in the "frame" direction remains high; moreover, recordings of different durations have different frame counts N_f, which is inconvenient for training a machine learning model. To solve these two problems, this solution uses the vector feature scalarization module to scalarize the above vector features. Taking MFCC as an example, the specific scalarization process is described in detail below:
Step a. Use a GMM and its EM algorithm to cluster each dimension feature component of the MFCC with 4 clusters, obtaining the 4 cluster-center values of each feature component, i.e. a feature of shape (4,), where l ∈ {1, 2, 3} denotes the index of the feature component;
Step b. Compute the maximum value, minimum value and final value of each dimension feature component of the MFCC, obtaining a feature of shape (3,);
Step c. Compute the maximum value, minimum value and sum of squares of each dimension feature component of the vector difference feature ΔMFCC, obtaining a feature of shape (3,);
Step d. Concatenate the three feature vectors obtained in steps a, b and c to obtain a feature of shape (10,), denoted F_l;
Step e. Concatenate the features F_l of the individual feature components to obtain a new feature of shape (30,), denoted F_MFCC, which characterizes the original MFCC feature;
Step f. Applying steps a to e to the other 7 vector speech features yields their corresponding new features, denoted F_LPCC, F_MHEC, F_BFCC, F_LFCC, F_GFCC, F_NGCC and F_MSRCC respectively;
Step g. Concatenate and fuse the above 8 new features of shape (30,) to obtain a vector-derived speech feature of shape (240,), denoted F_D; each feature value in F_D has a specific physical meaning.
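Steps a to e for one feature can be sketched as below. A small hand-rolled one-dimensional EM fit stands in for "GMM and its EM algorithm" (a production system would typically use a library implementation); initialization and iteration counts are illustrative assumptions:

```python
import numpy as np

def gmm_centers_1d(x, k=4, iters=50, seed=0):
    """Fit a 1-D Gaussian mixture with EM and return its k sorted means
    (step a: the cluster-center values of one feature component)."""
    rng = np.random.default_rng(seed)
    mu = np.sort(rng.choice(x, k, replace=False))
    sigma = np.full(k, x.std() + 1e-6)
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibilities of the k Gaussians for each point
        d = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma * pi
        r = d / (d.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate weights, means and variances
        nk = r.sum(axis=0) + 1e-12
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
        pi = nk / nk.sum()
    return np.sort(mu)

def scalarize(feat, dfeat, k=4):
    """feat, dfeat: (3, N_f) screened feature and its delta.
    Returns the (30,) scalarized feature of steps a-e."""
    parts = []
    for l in range(feat.shape[0]):
        centers = gmm_centers_1d(feat[l], k)                          # step a: 4 centers
        stats = [feat[l].max(), feat[l].min(), feat[l][-1]]           # step b: max, min, final
        dstats = [dfeat[l].max(), dfeat[l].min(), (dfeat[l] ** 2).sum()]  # step c
        parts.append(np.concatenate([centers, stats, dstats]))        # step d: (10,)
    return np.concatenate(parts)                                      # step e: (30,)
```

Repeating this for all 8 features and concatenating (step g) yields the (240,) vector-derived feature.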
In a specific implementation, the above feature fusion step fuses the scalar speech features and the vector-derived speech features. First, the 5 scalar speech features are concatenated into a feature vector of shape (5,), denoted F_S; then F_S and the vector-derived feature F_D are concatenated and fused, finally yielding a fused feature of shape (245,), denoted F_Fusion. That is, a feature vector of shape (245,) is ultimately extracted from each speech utterance and used to train the speaker-heterogeneous-device distance relationship model.
It can be seen that, in this example, the fused speech feature is little affected by ambient sound-field characteristics and random noise, is applicable to a variety of scenarios and heterogeneous distributed devices, and is more robust and generalizes better than single features such as energy or signal-to-noise ratio. The fused speech feature can be used to train the speaker-distributed-device distance relationship model, so as to judge how near or far different smart devices are from the user, making "human-machine distance" an important decision dimension in multi-device wake-up and improving the user experience. Specifically, in the same space a user may own multiple distributed devices supporting the same wake-up word; after the user speaks the wake-up word, the device closest to the user responds, achieving nearest-device wake-up. Alternatively, human-machine distance can be combined with other dimensions such as device state, device service capability and user intent to make a comprehensive decision and select the most suitable device to respond to the user's request.
In a possible example, determining the at least one relative distance identifier corresponding one-to-one to at least one device among the multiple devices, according to the reference voice data and device identifiers in the multiple sound collection data and the pre-trained distance comparison model, includes: performing multi-dimensional feature extraction on each of the multiple sound collection data to obtain multiple voice feature sets corresponding one-to-one to the multiple sound collection data, where each voice feature set includes multi-dimensional feature extraction results; performing feature fusion on each of the multiple voice feature sets to obtain multiple fused voice features; and determining, according to the multiple fused voice features, the device identifiers in the multiple sound collection data, and the pre-trained distance comparison model, the at least one relative distance identifier corresponding one-to-one to at least one device among the multiple devices.
The feature extraction can also be used in other distance-relationship determination methods, and the computation results can be used in other application scenarios, etc.
In this possible example, performing multi-dimensional feature extraction on each of the multiple sound collection data to obtain multiple voice feature sets corresponding one-to-one to the multiple sound collection data includes: performing frequency response capability alignment on each of the multiple reference voice data in the multiple sound collection data to obtain multiple frequency-response-aligned target voice data corresponding one-to-one to the multiple sound collection data; and performing multi-dimensional feature extraction on each of the multiple target voice data to obtain multiple voice feature sets corresponding one-to-one to the multiple target voice data.
In this possible example, performing frequency response capability alignment on each of the multiple reference voice data in the multiple sound collection data to obtain multiple frequency-response-aligned target voice data corresponding one-to-one to the multiple sound collection data includes performing the following operations for each reference voice data in the multiple sound collection data: obtaining, according to the device identifier associated with the currently processed reference voice data, the preset frequency response unit impulse response of the current device relative to the reference device; and convolving the reference voice data with the frequency response unit impulse response and performing gain adjustment to obtain the frequency-response-aligned target voice data.
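The convolution-plus-gain alignment can be sketched as below. The patent does not fix the exact gain rule, so matching the output's RMS level to the input's is an assumed, reasonable choice:

```python
import numpy as np

def align_frequency_response(ref_speech, unit_impulse_response):
    """Align a device's recording to the reference device's frequency
    response: convolve with the preset relative unit impulse response,
    then apply a gain so the output keeps the input's RMS level
    (the gain rule is an assumption; the patent only says 'gain adjustment')."""
    y = np.convolve(ref_speech, unit_impulse_response, mode="full")[: len(ref_speech)]
    rms_in = np.sqrt(np.mean(ref_speech ** 2)) + 1e-12
    rms_out = np.sqrt(np.mean(y ** 2)) + 1e-12
    return y * (rms_in / rms_out)
```

With an identity impulse response the signal passes through unchanged, so devices already matching the reference are unaffected.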
In this possible example, performing multi-dimensional feature extraction on each of the multiple target voice data to obtain multiple voice feature sets corresponding one-to-one to the multiple target voice data includes performing the following operations for each target voice data to obtain the multiple voice feature sets: extracting the scalar voice features and vector voice features of the currently processed target voice data; and performing dimensionality reduction and secondary feature extraction on the vector voice features to obtain vector-derived voice features.
In this possible example, extracting the scalar voice features and vector voice features of the currently processed target voice data includes: preprocessing the currently processed target voice data to obtain preprocessed target voice data; and extracting the scalar voice features and vector voice features of the preprocessed target voice data, where the preprocessing includes at least one of the following: silence suppression, pre-emphasis through a high-frequency filter, framing, and windowing.
It should be noted that the implementation principles of the feature extraction, feature fusion and frequency response capability alignment involved in this branch embodiment are similar to the corresponding content in the foregoing embodiments and are not repeated here.
It can be seen that, in this example, the arbitration device can first perform feature extraction and feature fusion on the voice data, and optionally further perform frequency response capability alignment on the processed voice data, improving the flexibility of voice data preprocessing.
In a possible example, after determining the at least one relative distance identifier corresponding one-to-one to at least one device among the multiple devices according to the reference voice data and device identifiers in the multiple sound collection data and the pre-trained distance comparison model, the method further includes: determining, according to the at least one relative distance identifier, the target device among the multiple devices for executing the voice instruction associated with the sound of the sound source target; if it is detected that the target device is a device other than the arbitration device, sending indication information to the target device, the indication information being used to instruct the target device to perform the operation indicated by the voice instruction; and if it is detected that the target device is the arbitration device, performing the operation indicated by the voice instruction.
The voice instruction associated with the sound of the sound source target may be any of various user instructions such as "play music", which is not uniquely limited here.
In a specific implementation, in the nearest-device wake-up scheme, the arbitration device preferentially selects the device closest to the sound source target as the target device. That is, in this application scenario, the at least one relative distance identifier should at least include the relative distance identifier of the device closest to the sound source target. In addition, depending on the application scenario, the specific form of the at least one relative distance identifier may vary, which is not uniquely limited here.
It can be seen that, in this example, the arbitration device can intelligently determine, according to at least one relative distance identifier, the target device for executing the voice instruction associated with the sound of the sound source target, improving the convenience and intelligence of device control.
It can be seen that, in this embodiment of the present application, the arbitration device first obtains multiple sound collection data of multiple devices, and then determines, according to the reference voice data and device identifiers in the multiple sound collection data and the pre-trained distance comparison model, at least one relative distance identifier corresponding one-to-one to at least one device among the multiple devices. Each relative distance identifier indicates the position of the corresponding device in the device distance relationship sequence, which is a sequence formed by sorting the multiple devices according to a preset distance sorting strategy, the distance being that between a device and the sound source target; the model can thus predict a device's relative position in the sequence. For example, if device 1, device 2 and device 3, ordered from nearest to farthest from the sound source target, form the device distance relationship sequence device 3 → device 2 → device 1, the prediction result can indicate that device 3 is closest to the sound source target by setting its relative distance identifier to 1. Compared with existing schemes that predict absolute distances in isolation, the present application uses the distance comparison model to predict relative distance relationships containing global information in a multi-device voice interaction scenario. Meanwhile, since each sound collection data includes the reference voice data obtained by the corresponding device collecting the sound of the sound source target and the device identifier of that device, there is no limitation on the number of channels for collecting the reference voice data, so the high hardware requirements and algorithmic complexity of conventional microphone-array sound source localization can be avoided, which is beneficial to improving the efficiency and applicability of relative distance relationship prediction.
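Turning the model's pairwise comparisons into relative distance identifiers can be sketched as below. How the patent aggregates pairwise outputs into a full ordering is not specified here, so counting pairwise "wins" (score above 0.5) is an assumed simple aggregation:

```python
def relative_distance_ids(devices, closer_score):
    """devices: list of device identifiers.
    closer_score(a, b): model's score in (0, 1) that device a is closer
    to the sound source target than device b (assumed interface).
    Returns {device: relative distance identifier}, 1 = nearest."""
    wins = {d: sum(closer_score(d, o) > 0.5 for o in devices if o != d)
            for d in devices}
    order = sorted(devices, key=lambda d: wins[d], reverse=True)  # most wins = nearest
    return {d: i + 1 for i, d in enumerate(order)}
```

With the three-device example above (device 3 nearest), device 3 wins both of its comparisons and receives identifier 1, while device 1 wins none and receives identifier 3.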
Please refer to FIG. 3, which is a schematic flowchart of a device control method provided by an embodiment of the present application, applied to a target device in the device control system 10. As shown in the figure, the device control method includes the following operations.
Step 301: Obtain indication information from the arbitration device, the indication information being generated when the arbitration device determines, according to at least one relative distance identifier corresponding one-to-one to at least one device among multiple devices, that the target device among the multiple devices is to execute the voice instruction associated with the sound of the sound source target. The at least one relative distance identifier is obtained by the arbitration device performing the following operations: obtaining multiple sound collection data corresponding one-to-one to the multiple devices, each sound collection data including reference voice data obtained by the corresponding device collecting the sound of the sound source target and the device identifier of that device; and determining the at least one relative distance identifier according to the reference voice data and device identifiers in the multiple sound collection data and the pre-trained distance comparison model, each relative distance identifier indicating the position of the corresponding device in the device distance relationship sequence, which is a sequence formed by sorting the multiple devices according to a preset distance sorting strategy, the distance being that between a device and the sound source target;
Step 302: Perform, according to the indication information, the operation indicated by the voice instruction associated with the sound of the sound source target.
The target device is the device closest to the sound source target among the multiple devices. This case applies to nearest-device wake-up product schemes.
The target device is the arbitration device; or, the target device is a device other than the arbitration device among the multiple devices.
In a specific implementation, if the target device is the arbitration device, the arbitration device directly generates the indication information and, according to it, performs the operation indicated by the voice instruction associated with the sound of the sound source target.
If the target device is a device other than the arbitration device among the multiple devices, the arbitration device generates the indication information and sends it to the target device, and the target device performs, according to the indication information, the operation indicated by the voice instruction associated with the sound of the sound source target.
In addition, the method further includes: if it is detected that the distance between the target device and the sound source target is greater than a preset distance, outputting a prompt message to prompt the user to move closer to the target device; and/or, if it is detected that the distance between the target device and the sound source target is greater than the preset distance, increasing the output volume of the target device.
If it is detected that the distance between the target device and the sound source target is less than or equal to the preset distance, the output volume of the target device is lowered. This improves the intelligence of device control and the user experience.
The preset distance may be, for example, 5 meters, 10 meters, or the like.
It can be seen that, in this embodiment of the present application, the target device first obtains the indication information from the arbitration device and then performs, according to it, the operation indicated by the voice instruction associated with the sound of the sound source target. The indication information is generated when the arbitration device determines, according to at least one relative distance identifier corresponding one-to-one to at least one device among the multiple devices, that the target device is to execute the voice instruction associated with the sound of the sound source target; each relative distance identifier indicates the position, in the distance sequence, of the distance between the corresponding device and the sound source target, the distance sequence being formed by sorting the distances between each device and the sound source target according to a preset sorting strategy. Compared with existing schemes that determine the nearest wake-up device from the absolute distance between a device and the sound source target, the present application uses the distance comparison model to predict relative distance relationships containing global information in a multi-device voice interaction scenario and determines the target device to be woken up from that relative distance relationship. Meanwhile, since each sound collection data includes the reference voice data obtained by the corresponding device collecting the sound of the sound source target and the device identifier of that device, there is no limitation on the number of channels for collecting the reference voice data, so the high hardware requirements and algorithmic complexity of conventional microphone-array sound source localization can be avoided, which is beneficial to improving the efficiency and applicability of relative distance relationship prediction.
Please refer to FIG. 4, which is a schematic flowchart of a training method for a distance comparison model provided by an embodiment of the present application, applied to a model training device. As shown in FIG. 4, the model training method includes the following operations.
Step 401: Obtain training data, the training data including multiple voice data sets, each of which contains multiple reference voice data corresponding one-to-one to multiple devices, each reference voice data being voice data obtained by the corresponding device collecting the sound of the sound source target. The multiple voice data sets correspond to voice data sets collected in different sound collection environments, a sound collection environment including at least the position of the sound source target.
The sound collection environment refers to the acoustic environment in which the sound data is collected, and it can be diversified: besides the position of the sound source target, differentiated sound collection environments can further be constructed through differences in at least one of the following characteristics: room area, noise level, and so on.
For example, wake-up speech in quiet and noisy environments is collected in rooms with areas of 10 m², 20 m², 30 m², and 40 m². Noise is added artificially; Gaussian white noise, appliance noise (e.g. fans, air conditioners), traffic noise, etc. from a noise database can be selected. Based on the sound pressure level of the clean wake-up speech, the signal-to-noise ratio can be set to -15 dB, -10 dB, -5 dB, 0 dB, 5 dB, 10 dB, 15 dB, and so on.
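Mixing noise into clean wake-up speech at a target SNR, as in the training-data construction above, can be sketched as follows (looping or trimming the noise to the speech length is an assumed detail):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so the mixture clean + noise has the requested
    signal-to-noise ratio (in dB) relative to the clean wake-up speech."""
    noise = np.resize(noise, clean.shape)           # loop/trim noise to length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise
```

Sweeping `snr_db` over -15 dB to 15 dB produces the range of noisy conditions listed above from a single clean recording.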
Step 402: Train a preset distance comparison model according to the reference voice data of the multiple voice data sets and a preset loss function to obtain a trained distance comparison model. The loss function characterizes the loss of the distance comparison model in terms of the accuracy of predicting the relative distance relationship, with respect to the sound source target, of the two devices of a device pair under the same sound collection environment, a device pair consisting of any two of the multiple devices.
The distance comparison model may specifically be a deep neural network, which may be, for example, a convolutional neural network or a deep residual network; this is not uniquely limited here.
In a possible example, the relative distance relationship is characterized by defining a score for the event that the first of the two devices is closer to the sound source target than the second, and the value of the score is associated with a distance difference, the distance difference being the difference between a first distance and a second distance, where the first distance is the distance between the first device and the sound source target, and the second distance is the distance between the second device and the sound source target.
The score can be calculated and expressed in a probability-like data form, with values falling in the interval (0, 1).
For example, if the first distance is 50 cm, the second distance is 80 cm, and the third distance (the distance between a third device and the sound source target) is 90 cm, then the score for the event that the first device is closer to the sound source target than the second device may be 0.8; the score for the first device being closer than the third device may be 0.9; the score for the second device being closer than the third device may be 0.7; the score for the second device being closer than the first device may be 0.08; the score for the third device being closer than the first device may be 0.05; and the score for the third device being closer than the second device may be 0.09, and so on.
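A score tied to the distance difference in this way can be sketched with a logistic function; the patent does not give the exact mapping, so both the logistic form and the steepness constant are assumptions chosen to reproduce the qualitative behavior of the example (larger gaps give scores closer to 1, and reversing the pair roughly complements the score):

```python
import math

def closeness_score(d1, d2, scale=0.1):
    """Score of the event 'device 1 is closer to the sound source target
    than device 2', as a function of the distance difference d1 - d2 (cm).
    A negative difference (device 1 closer) yields a score above 0.5;
    'scale' is an assumed illustrative steepness."""
    return 1.0 / (1.0 + math.exp(scale * (d1 - d2)))
```

With the distances of the example (50 cm, 80 cm, 90 cm), the 50-vs-90 pair scores higher than the 50-vs-80 pair, matching the graded relationship the model is meant to learn.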
It can be seen that, in this example, since the devices stand in relative distance relationships to the sound source that may be strong or weak, associating the score of the event that the first device is closer to the sound source target than the second device with the distance difference between the first and second distances allows the distance comparison model to learn this gradation rather than only the coarse-grained near/far relationship.
In a possible example, the score is calculated from at least one score of at least one group of adjacent devices that form a direct or indirect adjacency between the two devices; the score of a group of adjacent devices is calculated from the two relative distance identifiers of its two devices, the relative distance identifiers corresponding to the prediction result of the distance comparison model, each relative distance identifier indicating the position of the corresponding device in the device distance relationship sequence, which is a sequence formed by sorting the multiple devices according to a preset distance sorting strategy, the distance being that between a device and the sound source target.
It can be seen that, in this example, since a relative distance identifier can indicate the position of the corresponding device in the device distance relationship sequence, which is formed by sorting the multiple devices according to a preset sorting strategy on their distances to the sound source target, the relative distance identifier can characterize a model prediction result containing global information, improving the accuracy of that characterization.
In this possible example, training the preset distance comparison model according to the reference voice data of the multiple voice data sets and the preset loss function to obtain the trained distance comparison model includes: dividing the training data into a training set and a test set, the training set including part of the multiple voice data sets; and training the preset distance comparison model at least once using the training set, until the accuracy with which the trained distance comparison model predicts the distance comparison results on the test set is greater than a preset accuracy.
The preset accuracy may be, for example, 98%, 99%, etc., which is not uniquely limited here.
It can be seen that, in this example, training the distance comparison model gives the model a prediction capability that meets the preset accuracy requirement.
In this possible example, the training includes forward propagation and back-propagation optimization. In forward propagation, predicted relative distance identifiers are computed from the voice features of a voice data set. In back-propagation optimization, the predicted score and the true score are computed from the predicted relative distance identifiers and the true relative distance identifiers; the loss of the distance comparison model is computed from the loss function, the predicted score and the true score; and the parameters of the distance comparison model are adjusted according to that loss.
In a specific implementation, the loss function is designed through the following steps a to e:
Step a: for the data collected for the same group of wake-up actions in the training data, any two devices can form a pair. Without loss of generality, assume the current group contains data from 5 devices a, b, c, d and e, ordered by relative distance such that device a is closer to the sound source target than device b, and so on. The relative distance identifiers of the devices are then L_a = 1, L_b = 2, L_c = 3, L_d = 4 and L_e = 5.
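The score equations referenced in steps c and d are given in the source only as figures, so the sketch below is a loudly hypothetical reconstruction: it assumes the score of an adjacent pair is simply the difference of their relative distance identifiers, and that the score of a non-adjacent pair is the sum of adjacent-pair scores along the chain between them (consistent with step d's "compute adjacent scores first, then non-adjacent scores from them"). The actual formulas in the patent may differ.

```python
ORDER = ["a", "b", "c", "d", "e"]                      # nearest to farthest
LABELS = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}      # relative distance identifiers

def adjacent_score(near_id: int, far_id: int) -> int:
    """Assumed score of an adjacent device pair: identifier difference."""
    return far_id - near_id

def pair_score(a: str, b: str) -> int:
    """Assumed score of any pair: sum of adjacent scores along the chain between them."""
    i, j = sorted((ORDER.index(a), ORDER.index(b)))
    return sum(adjacent_score(LABELS[ORDER[k]], LABELS[ORDER[k + 1]])
               for k in range(i, j))

print(pair_score("b", "c"))  # 1  (adjacent pair)
print(pair_score("a", "c"))  # 2  (= pair_score("a","b") + pair_score("b","c"))
print(pair_score("a", "e"))  # 4
```

Under this additive assumption the chain sum telescopes, so the non-adjacent score equals the identifier difference of the endpoints.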
Step b: denote the feature vector extracted for each device as x_a, x_b, x_c, x_d and x_e, and denote the feedforward mapping of the deep neural network as f. The output-layer result corresponding to a feature vector (taking x_a as an example) is then o_a = f(x_a).
Step c: for each pair of devices, a score can be obtained from the reference voice data:
Step d: for the true labels, scores are still computed between any two paired devices. Since the distances of multiple devices to the target are relative to one another, the scores between pairs of adjacent devices are computed first, and the scores between non-adjacent devices are then computed from them.
For adjacent devices (taking b and c as an example), the score is:
For non-adjacent devices (taking a and c as an example, with b as their common adjacent device), the score is:
Further, for the device pair (a, e):
Step e: after the input data has been fed forward through the deep neural network, back-propagation is performed according to the loss function between the actual output and the true labels, iteratively adjusting the network parameters to improve network performance.
Taking an arbitrary device pair (i, j) as an example, the loss function is computed as follows:
It can be seen that in this example, the loss function quantitatively measures, for the voice data of any two devices in the current device set, the difference between the label estimated by the distance comparison model and the true label, and that difference is used to adjust the parameters of the distance comparison model until the model's prediction accuracy meets the requirement.
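The loss equation itself is given in the source only as a figure. A plausible sketch, assuming a RankNet-style formulation in which predicted and true pair scores are mapped to probabilities by a sigmoid and compared with cross-entropy, is shown below; the exact equation in the patent may differ.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def pair_loss(predicted_score: float, true_score: float) -> float:
    """Cross-entropy between sigmoid-normalized predicted and true pair scores
    (an assumed RankNet-style pairwise loss, not the patent's exact equation)."""
    p = sigmoid(predicted_score)  # model's belief about the pair's distance order
    t = sigmoid(true_score)       # target probability derived from the true labels
    eps = 1e-12                   # numerical guard against log(0)
    return -(t * math.log(p + eps) + (1.0 - t) * math.log(1.0 - p + eps))

# A prediction matching the true score incurs less loss than a contradicting one:
print(pair_loss(2.0, 2.0) < pair_loss(-2.0, 2.0))  # True
```

Minimizing this loss over all device pairs pushes the network outputs toward reproducing the true relative distance ordering.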
It can be seen that in this embodiment of the present application, the device first obtains training data and then trains a preset distance comparison model according to the reference voice data of the multiple voice data sets and a preset loss function, obtaining a trained distance comparison model. The training data includes multiple voice data sets, each containing multiple reference voice data in one-to-one correspondence with multiple devices, where each reference voice datum is the voice data obtained by the corresponding device collecting the sound of the sound source target, and the multiple voice data sets correspond to voice data collected in different sound collection environments. Meanwhile, the loss function characterizes the loss of the distance comparison model in terms of how accurately it predicts the relative distance relationship, in the same sound collection environment, between the two devices of a device pair and the sound source target, a device pair consisting of any two of the multiple devices. The distance comparison model thus acquires the ability to predict the relative nearness of devices to the sound source target, and since the reference voice data are not restricted in the number of channels, the high hardware requirements and algorithmic complexity of conventional microphone-array sound source localization algorithms can be overcome, which is beneficial to the efficiency and applicability of relative distance relationship prediction.
An embodiment of the present application provides an apparatus for determining a distance relationship, which may be an arbitration device. Specifically, the apparatus is configured to perform the steps performed by the arbitration device in the above method for determining a distance relationship. The apparatus provided in this embodiment may include modules corresponding to the corresponding steps.
In this embodiment of the present application, the apparatus for determining a distance relationship may be divided into functional modules according to the above method examples. For example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. The division of modules in the embodiments of the present application is schematic and is only a logical functional division; other division manners are possible in actual implementation.
In the case where each functional module is divided according to each function, FIG. 5 shows a possible schematic structural diagram of the apparatus for determining a distance relationship involved in the above embodiment. As shown in FIG. 5, the distance relationship determining apparatus 5 is applied to the arbitration device 200 in the device control system 10, and includes:
an acquisition unit 50, configured to acquire multiple sound collection data in one-to-one correspondence with multiple devices, where each of the multiple sound collection data includes reference voice data obtained by the corresponding device collecting the sound of a sound source target, and the device identifier of that device; and
a determining unit 51, configured to determine, according to the reference voice data and device identifiers in the multiple sound collection data and a pre-trained distance comparison model, at least one relative distance identifier in one-to-one correspondence with at least one of the multiple devices, where each of the at least one relative distance identifier indicates the position of the corresponding device in a device distance relationship sequence, the device distance relationship sequence being a sequence formed by sorting the multiple devices according to a preset distance-based sorting strategy, and the distance being the distance between a device and the sound source target.
In a possible example, in determining the at least one relative distance identifier in one-to-one correspondence with at least one of the multiple devices according to the reference voice data and device identifiers in the multiple sound collection data and the pre-trained distance comparison model, the determining unit 51 is specifically configured to: align the frequency response capability of each of the multiple reference voice data in the multiple sound collection data, obtaining multiple frequency-response-aligned target voice data in one-to-one correspondence with the multiple sound collection data; and determine, according to the multiple target voice data, the device identifiers in the multiple sound collection data and the pre-trained distance comparison model, the at least one relative distance identifier in one-to-one correspondence with at least one of the multiple devices.
In a possible example, in aligning the frequency response capability of the multiple reference voice data in the multiple sound collection data to obtain multiple frequency-response-aligned target voice data, the determining unit 51 is specifically configured to perform the following operations for each reference voice datum in the multiple sound collection data, obtaining the multiple frequency-response-aligned target voice data: obtain, according to the device identifier of the currently processed reference voice datum, the preset frequency-response unit impulse response of the current device relative to a reference device; and convolve the reference voice datum with that unit impulse response and perform gain adjustment, obtaining a frequency-response-aligned target voice datum.
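The convolve-then-adjust-gain step above can be sketched as follows. The impulse response values and the RMS-based gain rule are illustrative assumptions; the patent does not specify a particular gain strategy.

```python
import numpy as np

def align_frequency_response(signal: np.ndarray,
                             impulse_response: np.ndarray,
                             target_rms: float = 0.1) -> np.ndarray:
    """Convolve the reference voice signal with the device's unit impulse response
    (measured relative to a reference device), then rescale to a target RMS level."""
    aligned = np.convolve(signal, impulse_response, mode="full")[: len(signal)]
    rms = np.sqrt(np.mean(aligned ** 2))
    if rms > 0:
        aligned = aligned * (target_rms / rms)  # gain adjustment
    return aligned
```

With an identity impulse response (`[1.0]`) the convolution leaves the waveform unchanged and only the gain adjustment applies, which makes the behavior easy to verify.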
In a possible example, in determining the at least one relative distance identifier in one-to-one correspondence with at least one of the multiple devices according to the target voice data, the device identifiers in the multiple sound collection data and the pre-trained distance comparison model, the determining unit 51 is specifically configured to: perform multi-dimensional feature extraction on each of the multiple target voice data, obtaining multiple voice feature sets in one-to-one correspondence with the multiple target voice data, each voice feature set including multi-dimensional feature extraction results; perform feature fusion on each of the multiple voice feature sets, obtaining multiple fused voice features; and determine, according to the multiple fused voice features, the device identifiers in the multiple sound collection data and the pre-trained distance comparison model, the at least one relative distance identifier in one-to-one correspondence with at least one of the multiple devices.
In a possible example, in determining the at least one relative distance identifier in one-to-one correspondence with at least one of the multiple devices according to the reference voice data and device identifiers in the multiple sound collection data and the pre-trained distance comparison model, the determining unit 51 is specifically configured to: perform multi-dimensional feature extraction on each of the multiple sound collection data, obtaining multiple voice feature sets in one-to-one correspondence with the multiple sound collection data, each of the multiple voice feature sets including multi-dimensional feature extraction results; perform feature fusion on each of the multiple voice feature sets, obtaining multiple fused voice features; and determine, according to the multiple fused voice features, the device identifiers in the multiple sound collection data and the pre-trained distance comparison model, the at least one relative distance identifier in one-to-one correspondence with at least one of the multiple devices.
In a possible example, in performing multi-dimensional feature extraction on each of the multiple sound collection data to obtain multiple voice feature sets in one-to-one correspondence with the multiple sound collection data, the determining unit 51 is specifically configured to: align the frequency response capability of each of the multiple reference voice data in the multiple sound collection data, obtaining multiple frequency-response-aligned target voice data in one-to-one correspondence with the multiple sound collection data; and perform multi-dimensional feature extraction on each of the multiple target voice data, obtaining multiple voice feature sets in one-to-one correspondence with the multiple target voice data.
In a possible example, in aligning the frequency response capability of each of the multiple reference voice data in the multiple sound collection data to obtain multiple frequency-response-aligned target voice data in one-to-one correspondence with the multiple sound collection data, the determining unit 51 is specifically configured to perform the following operations for each reference voice datum in the multiple sound collection data, obtaining the multiple frequency-response-aligned target voice data: obtain, according to the device identifier associated with the currently processed reference voice datum, the preset frequency-response unit impulse response of the current device relative to a reference device; and convolve the reference voice datum with that unit impulse response and perform gain adjustment, obtaining a frequency-response-aligned target voice datum.
In a possible example, in performing multi-dimensional feature extraction on each of the multiple target voice data to obtain multiple voice feature sets in one-to-one correspondence with the multiple target voice data, the determining unit 51 is specifically configured to perform the following operations for each of the multiple target voice data, obtaining the multiple voice feature sets: extract scalar voice features and vector voice features of the currently processed target voice datum; and perform dimensionality reduction and secondary feature extraction on the vector voice features, obtaining vector-derived voice features.
In a possible example, in extracting the scalar voice features and vector voice features of the currently processed target voice datum, the determining unit 51 is specifically configured to: preprocess the currently processed target voice datum, obtaining preprocessed target voice data; and extract scalar voice features and vector voice features of the preprocessed target voice data, where the preprocessing includes at least one of the following: silence suppression, pre-emphasis through a high-frequency filter, framing, and windowing.
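The pre-emphasis, framing, and windowing steps named above can be sketched as follows, using conventional speech-processing defaults (0.97 pre-emphasis coefficient, 25 ms frames with a 10 ms hop at 16 kHz, Hamming window); these values are illustrative assumptions, not values specified in the patent.

```python
import numpy as np

def preprocess(signal: np.ndarray, frame_len: int = 400, hop: int = 160,
               pre_emphasis: float = 0.97) -> np.ndarray:
    """Return an array of windowed frames from a raw mono voice signal."""
    # Pre-emphasis (a simple high-pass-like filter): y[n] = x[n] - a * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
    # Framing: split into overlapping fixed-length frames
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    frames = np.stack([emphasized[i * hop: i * hop + frame_len]
                       for i in range(n_frames)])
    # Windowing: apply a Hamming window to each frame to reduce spectral leakage
    return frames * np.hamming(frame_len)
```

Silence suppression (voice activity detection) would typically run before this chain and is omitted here for brevity.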
In a possible example, after determining the at least one relative distance identifier in one-to-one correspondence with at least one of the multiple devices according to the reference voice data and device identifiers in the multiple sound collection data and the pre-trained distance comparison model, the determining unit is further configured to: determine, according to the at least one relative distance identifier, the target device among the multiple devices for executing the voice instruction associated with the sound of the sound source target; if it is detected that the target device is a device other than the arbitration device, send indication information to the target device, the indication information instructing the target device to perform the operation indicated by the voice instruction; and if it is detected that the target device is the arbitration device, perform the operation indicated by the voice instruction.
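The arbitration decision above reduces to selecting a device from the relative distance identifiers. A minimal sketch, assuming (as elsewhere in the document) that identifier 1 marks the device nearest the sound source target, and using hypothetical device names:

```python
def choose_target(relative_ids: dict) -> str:
    """Return the device whose relative distance identifier is smallest,
    i.e. (by assumption) the device nearest the sound source target."""
    return min(relative_ids, key=relative_ids.get)

# Hypothetical identifiers produced by the distance comparison model:
ids = {"speaker": 2, "tv": 1, "phone": 3}
print(choose_target(ids))  # tv
```

The arbitration device would then either execute the voice instruction itself (if it is the chosen device) or forward the indication information to the chosen device.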
In the case where an integrated unit is used, FIG. 6 shows a schematic structural diagram of another apparatus for determining a distance relationship provided by an embodiment of the present application. In FIG. 6, the distance relationship determining apparatus 6 includes a processing module 60 and a communication module 61. The processing module 60 controls and manages the actions of the apparatus, for example the steps performed by the acquisition unit 50 and the determining unit 51, and/or other processes of the techniques described herein. The communication module 61 supports interaction between the apparatus and other devices. As shown in FIG. 6, the distance relationship determining apparatus may further include a storage module 62, configured to store the program code and data of the apparatus.
The processing module 60 may be a processor or controller, for example a central processing unit (CPU), a general-purpose processor, a digital signal processor (DSP), an ASIC, an FPGA or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various exemplary logical blocks, modules and circuits described in connection with the present disclosure. The processor may also be a combination that implements a computing function, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor. The communication module 61 may be a transceiver, an RF circuit, a communication interface, or the like. The storage module 62 may be a memory.
All relevant content of the scenarios involved in the above method embodiments can be cited in the functional descriptions of the corresponding functional modules and is not repeated here. Both the distance relationship determining apparatus 5 and the distance relationship determining apparatus 6 can perform the steps performed by the arbitration device in the distance relationship determining method shown in FIG. 2.
An embodiment of the present application provides a device control apparatus, which may be an arbitration device. Specifically, the device control apparatus is configured to perform the steps performed by the target device in the above device control method. The device control apparatus provided in this embodiment may include modules corresponding to the corresponding steps.
In this embodiment of the present application, the device control apparatus may be divided into functional modules according to the above method examples. For example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. The division of modules in the embodiments of the present application is schematic and is only a logical functional division; other division manners are possible in actual implementation.
In the case where each functional module is divided according to each function, FIG. 7 shows a possible schematic structural diagram of the device control apparatus involved in the above embodiment. As shown in FIG. 7, the device control apparatus 7 is applied to a target device, and includes:
an acquisition unit 70, configured to acquire indication information from an arbitration device, the indication information being generated when the arbitration device determines, according to at least one relative distance identifier in one-to-one correspondence with at least one of multiple devices, that the target device among the multiple devices is to execute the voice instruction associated with the sound of a sound source target, the at least one relative distance identifier being obtained by the arbitration device performing the following operations: acquiring multiple sound collection data in one-to-one correspondence with the multiple devices, where each of the multiple sound collection data includes reference voice data obtained by the corresponding device collecting the sound of the sound source target and the device identifier of that device; and determining the at least one relative distance identifier according to the reference voice data and device identifiers in the multiple sound collection data and a pre-trained distance comparison model, where each of the at least one relative distance identifier indicates the position of the corresponding device in a device distance relationship sequence, the device distance relationship sequence being a sequence formed by sorting the multiple devices according to a preset distance-based sorting strategy, and the distance being the distance between a device and the sound source target; and
an execution unit 71, configured to perform, according to the indication information, the operation indicated by the voice instruction associated with the sound of the sound source target.
In a possible example, the target device is the device closest to the sound source target among the multiple devices.
In a possible example, the target device is the arbitration device; alternatively, the target device is a device other than the arbitration device among the multiple devices.
In the case where an integrated unit is used, FIG. 8 shows a schematic structural diagram of another device control apparatus provided by an embodiment of the present application. In FIG. 8, the device control apparatus 8 includes a processing module 80 and a communication module 81. The processing module 80 controls and manages the actions of the apparatus, for example the steps performed by the acquisition unit 70 and the execution unit 71, and/or other processes of the techniques described herein. The communication module 81 supports interaction between the apparatus and other devices. As shown in FIG. 8, the device control apparatus may further include a storage module 82, configured to store the program code and data of the apparatus.
The processing module 80 may be a processor or controller, for example a central processing unit (CPU), a general-purpose processor, a digital signal processor (DSP), an ASIC, an FPGA or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various exemplary logical blocks, modules and circuits described in connection with the present disclosure. The processor may also be a combination that implements a computing function, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor. The communication module 81 may be a transceiver, an RF circuit, a communication interface, or the like. The storage module 82 may be a memory.
All relevant content of the scenarios involved in the above method embodiments can be cited in the functional descriptions of the corresponding functional modules and is not repeated here. Both the device control apparatus 7 and the device control apparatus 8 can perform the steps performed by the target device in the device control method shown in FIG. 3.
An embodiment of the present application provides a training apparatus for a distance comparison model, which may be a model training device for training the model. Specifically, the training apparatus is configured to perform the steps performed by the model training device in the above training method for the distance comparison model. The training apparatus provided in this embodiment may include modules corresponding to the corresponding steps.
In this embodiment of the present application, the training apparatus for the distance comparison model may be divided into functional modules according to the above method examples. For example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. The division of modules in the embodiments of the present application is schematic and is only a logical functional division; other division manners are possible in actual implementation.
In the case where each functional module is divided according to each function, FIG. 9 shows a possible schematic structural diagram of the training apparatus for the distance comparison model involved in the above embodiment. As shown in FIG. 9, the training apparatus 9 for the distance comparison model is applied to a model training device, and includes:
an acquisition unit 90, configured to acquire training data, the training data including multiple voice data sets, each of which contains multiple reference voice data in one-to-one correspondence with multiple devices, where each of the multiple reference voice data is the voice data obtained by the corresponding device collecting the sound of a sound source target, the multiple voice data sets correspond to voice data sets collected in different sound collection environments, and a sound collection environment includes at least the location of the sound source target; and
a training unit 91, configured to train a preset distance comparison model according to the reference voice data of the plurality of voice data sets and a preset loss function to obtain a trained distance comparison model, where the loss function characterizes the loss of the distance comparison model in terms of the accuracy of predicting the relative distance relationship, in the same sound collection environment, between the two devices of a device pairing group and the sound source target, and a device pairing group consists of any two of the plurality of devices.
In a possible example, the relative distance relationship is characterized by a score defined for the event that, of the two devices, the first device is closer to the sound source target than the second device, and the value of the score is associated with a distance difference, where the distance difference is the difference between a first distance and a second distance, the first distance is the distance between the first device and the sound source target, and the second distance is the distance between the second device and the sound source target.
In a possible example, the score is calculated from at least one score of at least one pair of adjacent devices that form a direct or indirect adjacency relationship between the two devices;
the score of a pair of adjacent devices is calculated from the two relative distance identifiers of the two devices in that pair, where a relative distance identifier corresponds to a prediction result of the distance comparison model and is used to indicate the position of the corresponding device in a device distance relationship sequence, the device distance relationship sequence is a sequence formed by sorting the plurality of devices according to a preset distance-based sorting strategy, and the distance refers to the distance between a device and the sound source target.
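By way of a non-limiting illustration of the score construction above, the sketch below ties the closeness score to the distance difference through a sigmoid and composes the score of a possibly non-adjacent device pair from the adjacent pairs that connect them in the distance-sorted sequence. The sigmoid link and the averaging rule are assumed choices for illustration only; this application does not fix a particular scoring function.

```python
import math

def closeness_score(delta: float, scale: float = 1.0) -> float:
    """Score of the event 'the first device is closer than the second',
    a monotone function of the distance difference (second minus first)."""
    return 1.0 / (1.0 + math.exp(-scale * delta))

def adjacent_pair_score(rank_near: int, rank_far: int) -> float:
    """Score for one adjacent pair, computed from the two relative
    distance identifiers (positions in the distance-sorted sequence)."""
    return closeness_score(float(rank_far - rank_near))

def score_via_chain(rank: dict, a: str, b: str) -> float:
    """Compose the (a, b) score from the adjacent pairs linking a and b
    in the device distance relationship sequence (assumed composition
    rule: average of the adjacent-pair scores along the chain)."""
    lo, hi = sorted((rank[a], rank[b]))
    steps = [adjacent_pair_score(i, i + 1) for i in range(lo, hi)]
    s = sum(steps) / len(steps)
    # orient: if a is the nearer device keep s, otherwise mirror it
    return s if rank[a] < rank[b] else 1.0 - s
```

With ranks such as {tv: 0, speaker: 1, phone: 2}, `score_via_chain(rank, "tv", "phone")` exceeds 0.5 (the tv is predicted nearer), and the two orientations of a pair score sum to 1.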
In a possible example, as regards training the preset distance comparison model according to the reference voice data of the plurality of voice data sets and the preset loss function to obtain the trained distance comparison model, the training unit 91 is specifically configured to: divide the training data into a training set and a test set, where the training set includes some of the plurality of voice data sets; and train the preset distance comparison model at least once using the training set, until the accuracy with which the trained distance comparison model predicts the distance comparison results of the test set exceeds a preset accuracy.
In a possible example, the training includes forward propagation and back-propagation optimization;
in the forward propagation, predicted relative distance identifiers are calculated using the voice features of a voice data set;
in the back-propagation optimization, a predicted score and a true score are calculated using the predicted relative distance identifiers and the true relative distance identifiers, the loss of the distance comparison model is calculated using the loss function, the predicted score and the true score, and the parameters of the distance comparison model are adjusted according to that loss.
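The forward/back-propagation training just described can be sketched as a pairwise ranking loop. The following is a minimal illustration under assumed choices (synthetic per-device features, a linear scoring model, and a logistic pairwise loss); none of these choices are mandated by this application. The forward pass predicts the score of the event "first device closer", and the backward pass adjusts the parameters using the gradient of the loss between predicted and true scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: each sample is a pair of per-device voice
# feature vectors from the same sound collection environment, labelled 1.0
# if the first device was truly closer to the sound source, else 0.0.
def make_pair(closer_first: bool):
    near = rng.normal(1.0, 0.1, size=4)  # stand-in "near device" features
    far = rng.normal(0.3, 0.1, size=4)   # stand-in "far device" features
    return (near, far, 1.0) if closer_first else (far, near, 0.0)

pairs = [make_pair(bool(i % 2)) for i in range(200)]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(4)  # linear model: higher score = predicted nearer
for _ in range(300):
    grad = np.zeros_like(w)
    for xa, xb, y in pairs:
        p = sigmoid(w @ (xa - xb))   # forward: predicted pair score
        grad += (p - y) * (xa - xb)  # backward: d(logistic loss)/dw
    w -= 0.1 * grad / len(pairs)

def predict_closer(xa, xb) -> bool:
    """Predict whether the first device is closer than the second."""
    return bool(sigmoid(w @ (xa - xb)) > 0.5)
```

After training, the model orders fresh synthetic pairs correctly, which is the pairwise prediction accuracy the loss function targets.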
In the case where an integrated unit is used, FIG. 10 is a schematic structural diagram of another training apparatus for a distance comparison model provided by an embodiment of the present application. In FIG. 10, the training apparatus 10 for the distance comparison model includes a processing module 100 and a communication module 101. The processing module 100 is configured to control and manage the actions of the training apparatus, for example, the steps performed by the obtaining unit 90 and the training unit 91, and/or other processes of the techniques described herein. The communication module 101 is configured to support interaction between the training apparatus and other devices. As shown in FIG. 10, the training apparatus may further include a storage module 102, configured to store the program code and data of the training apparatus.
The processing module 100 may be a processor or a controller, for example a central processing unit (CPU), a general-purpose processor, a digital signal processor (DSP), an ASIC, an FPGA or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and may implement or execute the various exemplary logical blocks, modules and circuits described in connection with the disclosure of the present application. The processor may also be a combination that implements a computing function, for example a combination of one or more microprocessors, a combination of a DSP and a microprocessor, and so on. The communication module 101 may be a transceiver, an RF circuit, a communication interface, or the like. The storage module 102 may be a memory.
All the relevant content of the scenarios involved in the above method embodiments can be cited in the functional descriptions of the corresponding functional modules, and is not repeated here. Both the training apparatus 9 and the training apparatus 10 for the distance comparison model can perform the steps performed by the model training device in the training method for the distance comparison model shown in FIG. 4.
The above embodiments may be implemented in whole or in part by software, hardware, firmware or any combination thereof. When implemented by software, the above embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center in a wired or wireless manner. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or data center containing one or more sets of usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium. The semiconductor medium may be a solid-state drive.
An embodiment of the present application further provides a computer storage medium, where the computer storage medium stores a computer program for electronic data exchange, and the computer program causes a computer to execute some or all of the steps of any method described in the above method embodiments; the computer includes an electronic device.
An embodiment of the present application further provides a computer program product, where the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute some or all of the steps of any method described in the above method embodiments. The computer program product may be a software installation package, and the computer includes an electronic device.
It should be understood that, in the various embodiments of the present application, the magnitudes of the sequence numbers of the above processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed method, apparatus and system may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; the division of the units is only a logical functional division, and other division manners are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses or units, and may be electrical, mechanical or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purposes of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be physically present separately, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute some of the steps of the methods described in the embodiments of the present invention. The storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
Although the present invention is disclosed as above, the present invention is not limited thereto. Any person skilled in the art can easily conceive of changes or substitutions without departing from the spirit and scope of the present invention, and can make various alterations and modifications, including combinations of the above different functions and implementation steps, and implementations in software and hardware, all of which fall within the protection scope of the present invention.
Claims (23)
- A method for determining a distance relationship, characterized in that the method is applied to an arbitration device and comprises:
obtaining a plurality of pieces of sound collection data in one-to-one correspondence with a plurality of devices, wherein each piece of sound collection data comprises reference voice data obtained by the corresponding device collecting the sound of a sound source target and a device identifier of that device;
determining, according to the reference voice data and the device identifiers in the plurality of pieces of sound collection data and a pre-trained distance comparison model, at least one relative distance identifier in one-to-one correspondence with at least one of the plurality of devices, wherein each relative distance identifier is used to indicate the position of the corresponding device in a device distance relationship sequence, the device distance relationship sequence is a sequence formed by sorting the plurality of devices according to a preset distance-based sorting strategy, and the distance refers to the distance between a device and the sound source target.
- The method according to claim 1, characterized in that determining, according to the reference voice data and the device identifiers in the plurality of pieces of sound collection data and the pre-trained distance comparison model, the at least one relative distance identifier in one-to-one correspondence with at least one of the plurality of devices comprises:
performing frequency response capability alignment on each of the plurality of pieces of reference voice data in the plurality of pieces of sound collection data to obtain a plurality of pieces of frequency-response-aligned target voice data in one-to-one correspondence with the plurality of pieces of sound collection data;
determining, according to the plurality of pieces of target voice data, the device identifiers in the plurality of pieces of sound collection data and the pre-trained distance comparison model, the at least one relative distance identifier in one-to-one correspondence with at least one of the plurality of devices.
- The method according to claim 2, characterized in that performing frequency response capability alignment on the plurality of pieces of reference voice data in the plurality of pieces of sound collection data to obtain the plurality of pieces of frequency-response-aligned target voice data comprises:
performing the following operations for each piece of reference voice data in the plurality of pieces of sound collection data to obtain the plurality of pieces of frequency-response-aligned target voice data:
obtaining, according to the device identifier of the currently processed reference voice data, a preset frequency response unit impulse response of the current device relative to a reference device;
performing a convolution operation on the reference voice data and the frequency response unit impulse response and performing gain adjustment, to obtain frequency-response-aligned target voice data.
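As a non-limiting sketch of the alignment operation recited in the preceding claim: look up a preset unit impulse response by device identifier, convolve it with the recording, and apply a gain adjustment. The impulse-response values and the RMS-based gain rule below are illustrative assumptions; the claim does not fix either.

```python
import numpy as np

# Hypothetical per-device presets: unit impulse responses measured against
# a chosen reference device (the values here are made up for illustration).
IMPULSE_BANK = {"device-A": np.array([1.0, 0.25, 0.05])}

def align_frequency_response(reference_audio, device_id, target_rms=0.1):
    """Convolve the recording with the device's preset frequency response
    unit impulse response, then apply a gain adjustment (here: RMS
    normalisation to a common level, one possible gain rule)."""
    ir = IMPULSE_BANK[device_id]
    aligned = np.convolve(reference_audio, ir, mode="full")[: len(reference_audio)]
    rms = np.sqrt(np.mean(aligned ** 2))
    return aligned * (target_rms / rms) if rms > 0 else aligned
```

For example, aligning a 440 Hz test tone keeps the original sample count (the convolution tail is trimmed) and lands exactly on the target RMS level.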
- The method according to claim 2 or 3, characterized in that determining, according to the target voice data, the device identifiers in the plurality of pieces of sound collection data and the pre-trained distance comparison model, the at least one relative distance identifier in one-to-one correspondence with at least one of the plurality of devices comprises:
performing multi-dimensional feature extraction on each piece of target voice data to obtain a plurality of voice feature sets in one-to-one correspondence with the plurality of pieces of target voice data, each voice feature set comprising multi-dimensional feature extraction results;
performing feature fusion on each of the plurality of voice feature sets to obtain a plurality of fused voice features;
determining, according to the plurality of fused voice features, the device identifiers in the plurality of pieces of sound collection data and the pre-trained distance comparison model, the at least one relative distance identifier in one-to-one correspondence with at least one of the plurality of devices.
- The method according to claim 1, characterized in that determining, according to the reference voice data and the device identifiers in the plurality of pieces of sound collection data and the pre-trained distance comparison model, the at least one relative distance identifier in one-to-one correspondence with at least one of the plurality of devices comprises:
performing multi-dimensional feature extraction on each piece of sound collection data in the plurality of pieces of sound collection data to obtain a plurality of voice feature sets in one-to-one correspondence with the plurality of pieces of sound collection data, each voice feature set comprising multi-dimensional feature extraction results;
performing feature fusion on each of the plurality of voice feature sets to obtain a plurality of fused voice features;
determining, according to the plurality of fused voice features, the device identifiers in the plurality of pieces of sound collection data and the pre-trained distance comparison model, the at least one relative distance identifier in one-to-one correspondence with at least one of the plurality of devices.
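A non-limiting sketch of the multi-dimensional feature extraction and fusion recited in the preceding claim. The particular features (frame energy, zero-crossing rate, coarse spectral band energies) and fusion by concatenation are illustrative assumptions; the claim does not fix the feature dimensions or the fusion operator.

```python
import numpy as np

def extract_feature_set(audio):
    """Hypothetical multi-dimensional feature set for one device's audio:
    scalar features (energy, zero-crossing rate) plus a vector feature
    (a coarse summary of the magnitude spectrum in 8 bands)."""
    energy = float(np.mean(audio ** 2))
    zcr = float(np.mean(np.abs(np.diff(np.sign(audio))) > 0))
    spectrum = np.abs(np.fft.rfft(audio))
    band_energy = np.array([float(np.mean(b)) for b in np.array_split(spectrum, 8)])
    return {"scalar": np.array([energy, zcr]), "vector": band_energy}

def fuse(feature_set):
    """Feature fusion by concatenation of the scalar and vector parts
    into one fused voice feature (one simple fusion choice)."""
    return np.concatenate([feature_set["scalar"], feature_set["vector"]])
```

For a 1600-sample tone this yields a 2-element scalar part, an 8-element vector part, and a 10-element fused feature.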
- The method according to claim 5, characterized in that performing multi-dimensional feature extraction on each piece of sound collection data to obtain the plurality of voice feature sets in one-to-one correspondence with the plurality of pieces of sound collection data comprises:
performing frequency response capability alignment on each of the plurality of pieces of reference voice data in the plurality of pieces of sound collection data to obtain a plurality of pieces of frequency-response-aligned target voice data in one-to-one correspondence with the plurality of pieces of sound collection data;
performing multi-dimensional feature extraction on each piece of target voice data to obtain the plurality of voice feature sets in one-to-one correspondence with the plurality of pieces of target voice data.
- The method according to claim 6, characterized in that performing frequency response capability alignment on each of the plurality of pieces of reference voice data in the plurality of pieces of sound collection data to obtain the plurality of pieces of frequency-response-aligned target voice data in one-to-one correspondence with the plurality of pieces of sound collection data comprises:
performing the following operations for each piece of reference voice data in the plurality of pieces of sound collection data to obtain the plurality of pieces of frequency-response-aligned target voice data in one-to-one correspondence with the plurality of pieces of sound collection data:
obtaining, according to the device identifier associated with the currently processed reference voice data, a preset frequency response unit impulse response of the current device relative to a reference device;
performing a convolution operation on the reference voice data and the frequency response unit impulse response and performing gain adjustment, to obtain frequency-response-aligned target voice data.
- The method according to claim 4 or claim 6, characterized in that performing multi-dimensional feature extraction on each piece of target voice data to obtain the plurality of voice feature sets in one-to-one correspondence with the plurality of pieces of target voice data comprises:
performing the following operations for each piece of target voice data to obtain the plurality of voice feature sets:
extracting scalar voice features and vector voice features of the currently processed target voice data;
performing dimensionality reduction and secondary feature extraction on the vector voice features to obtain vector-derived voice features.
- The method according to claim 8, characterized in that extracting the scalar voice features and the vector voice features of the currently processed target voice data comprises:
preprocessing the currently processed target voice data to obtain preprocessed target voice data;
extracting the scalar voice features and the vector voice features of the preprocessed target voice data;
wherein the preprocessing includes at least one of the following: silence suppression, pre-emphasis through a high-frequency filter, framing, and windowing.
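A non-limiting sketch of the preprocessing steps listed in the preceding claim: a crude energy-gated silence suppression (a stand-in for a real voice activity detector), first-order pre-emphasis, overlapping framing, and Hamming windowing. The frame length, hop, pre-emphasis coefficient and threshold are illustrative assumptions, not values from this application.

```python
import numpy as np

def preprocess(audio, frame_len=400, hop=160, alpha=0.97, silence_thresh=1e-4):
    """Return windowed frames after the four preprocessing steps."""
    # silence suppression: drop samples in low-energy stretches
    voiced = audio[np.abs(audio) > silence_thresh]
    # pre-emphasis: first-order high-frequency filter y[n] = x[n] - alpha*x[n-1]
    emphasized = np.append(voiced[0], voiced[1:] - alpha * voiced[:-1])
    # framing with overlap, then a Hamming window applied per frame
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    window = np.hamming(frame_len)
    return np.stack([
        emphasized[i * hop: i * hop + frame_len] * window
        for i in range(n_frames)
    ])
```

Applied to a short non-silent tone, this yields a 2-D array of fixed-length windowed frames ready for the scalar/vector feature extraction of claim 8.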
- The method according to any one of claims 1-9, characterized in that after determining, according to the reference voice data and the device identifiers in the plurality of pieces of sound collection data and the pre-trained distance comparison model, the at least one relative distance identifier in one-to-one correspondence with at least one of the plurality of devices, the method further comprises:
determining, according to the at least one relative distance identifier, a target device among the plurality of devices for executing a voice instruction associated with the sound of the sound source target;
if it is detected that the target device is a device other than the arbitration device, sending indication information to the target device, the indication information being used to instruct the target device to perform the operation indicated by the voice instruction;
if it is detected that the target device is the arbitration device, performing the operation indicated by the voice instruction.
- A device control method, characterized in that the method is applied to a target device and comprises:
obtaining indication information of an arbitration device, wherein the indication information is generated when the arbitration device determines, according to at least one relative distance identifier in one-to-one correspondence with at least one of a plurality of devices, that the target device among the plurality of devices is to execute a voice instruction associated with the sound of a sound source target, and the at least one relative distance identifier is obtained by the arbitration device performing the following operations: obtaining a plurality of pieces of sound collection data in one-to-one correspondence with the plurality of devices, wherein each piece of sound collection data comprises reference voice data obtained by the corresponding device collecting the sound of the sound source target and a device identifier of that device; and determining, according to the reference voice data and the device identifiers in the plurality of pieces of sound collection data and a pre-trained distance comparison model, the at least one relative distance identifier, wherein each relative distance identifier is used to indicate the position of the corresponding device in a device distance relationship sequence, the device distance relationship sequence is a sequence formed by sorting the plurality of devices according to a preset distance-based sorting strategy, and the distance refers to the distance between a device and the sound source target;
performing, according to the indication information, the operation indicated by the voice instruction associated with the sound of the sound source target.
- The method according to claim 11, characterized in that the target device is the device, among the plurality of devices, closest to the sound source target.
- The method according to claim 11 or 12, characterized in that the target device is the arbitration device; or, the target device is a device, among the plurality of devices, other than the arbitration device.
- A method for training a distance comparison model, characterized by comprising:
obtaining training data, wherein the training data comprises a plurality of voice data sets, each voice data set in the plurality of voice data sets comprises a plurality of pieces of reference voice data in one-to-one correspondence with a plurality of devices, each piece of reference voice data is voice data obtained by the corresponding device collecting the sound of a sound source target, the plurality of voice data sets correspond to voice data sets collected in different sound collection environments, and a sound collection environment comprises at least the location of the sound source target;
training a preset distance comparison model according to the reference voice data of the plurality of voice data sets and a preset loss function to obtain a trained distance comparison model, wherein the loss function characterizes the loss of the distance comparison model in terms of the accuracy of predicting the relative distance relationship, in the same sound collection environment, between the two devices of a device pairing group and the sound source target, and a device pairing group consists of any two of the plurality of devices.
- The method according to claim 14, characterized in that the relative distance relationship is characterized by a score defined for the event that, of the two devices, the first device is closer to the sound source target than the second device, and the value of the score is associated with a distance difference, where the distance difference is the difference between a first distance and a second distance, the first distance is the distance between the first device and the sound source target, and the second distance is the distance between the second device and the sound source target.
- The method according to claim 15, characterized in that the score is calculated from at least one score of at least one pair of adjacent devices that form a direct or indirect adjacency relationship between the two devices;
the score of a pair of adjacent devices is calculated from the two relative distance identifiers of the two devices in that pair, where a relative distance identifier corresponds to a prediction result of the distance comparison model and is used to indicate the position of the corresponding device in a device distance relationship sequence, the device distance relationship sequence is a sequence formed by sorting the plurality of devices according to a preset distance-based sorting strategy, and the distance refers to the distance between a device and the sound source target.
- The method according to claim 16, characterized in that training the preset distance comparison model according to the reference voice data of the plurality of voice data sets and the preset loss function to obtain the trained distance comparison model comprises:
dividing the training data into a training set and a test set, where the training set includes some of the plurality of voice data sets;
training the preset distance comparison model at least once using the training set, until the accuracy with which the trained distance comparison model predicts the distance comparison results of the test set exceeds a preset accuracy.
- The method of claim 17, wherein the training comprises forward propagation and back-propagation optimization; in the forward propagation, a predicted relative distance identifier is calculated using the voice features of a voice data set; in the back-propagation optimization, a predicted score and a true score are calculated using the predicted relative distance identifier and the true relative distance identifier, the loss of the distance comparison model is calculated using the loss function, the predicted score and the true score, and the parameters of the distance comparison model are adjusted according to the loss of the distance comparison model.
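Claim 18's forward pass (predicted score from voice features) and back-propagation (loss from predicted vs. true score, then parameter adjustment) can be sketched on a linear pairwise model with a cross-entropy loss. The linear form, the loss choice, and the learning rate are illustrative assumptions; the patent does not fix the model architecture:

```python
import math

def pairwise_loss_and_grad(w, feat_a, feat_b, true_score):
    """Forward pass: predicted score that device A is closer than B,
    from a linear comparison of the two feature vectors.
    Backward pass: cross-entropy loss against the true score and its
    gradient with respect to the weights w."""
    diff = [a - b for a, b in zip(feat_a, feat_b)]
    logit = sum(wi * di for wi, di in zip(w, diff))
    pred = 1.0 / (1.0 + math.exp(-logit))
    eps = 1e-12  # numeric guard against log(0)
    loss = -(true_score * math.log(pred + eps)
             + (1 - true_score) * math.log(1 - pred + eps))
    grad = [(pred - true_score) * di for di in diff]
    return pred, loss, grad

def sgd_step(w, grad, lr=0.1):
    """Adjust model parameters against the gradient of the loss."""
    return [wi - lr * gi for wi, gi in zip(w, grad)]
```

With the true score set to 1 (device A really is closer), repeated steps push the predicted score toward 1 and the loss down, which is the adjustment the claim describes.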
- An apparatus for determining a distance relationship, applied to an arbitration device, the apparatus comprising: an acquisition unit configured to acquire multiple pieces of sound collection data in one-to-one correspondence with multiple devices, each piece of sound collection data including reference voice data obtained by the corresponding device collecting the sound of a sound source target and the device identifier of that device; and a determination unit configured to determine, according to the reference voice data and device identifiers in the multiple pieces of sound collection data and a pre-trained distance comparison model, at least one relative distance identifier in one-to-one correspondence with at least one device among the multiple devices, each relative distance identifier indicating the position of the corresponding device in a device distance relationship sequence, the device distance relationship sequence being a sequence formed by sorting the multiple devices according to a preset distance-based sorting strategy, the distance being the distance between a device and the sound source target.
- A device control apparatus, applied to a target device, the apparatus comprising: an acquisition unit configured to acquire indication information from an arbitration device, the indication information being generated when the arbitration device determines, according to at least one relative distance identifier in one-to-one correspondence with at least one device among multiple devices, that the target device among the multiple devices is to execute a voice instruction associated with the sound of a sound source target, the at least one relative distance identifier being obtained by the arbitration device performing the following operations: acquiring multiple pieces of sound collection data in one-to-one correspondence with the multiple devices, each piece of sound collection data including reference voice data obtained by the corresponding device collecting the sound of the sound source target and the device identifier of that device; and determining the at least one relative distance identifier according to the reference voice data and device identifiers in the multiple pieces of sound collection data and a pre-trained distance comparison model, each relative distance identifier indicating the position of the corresponding device in a device distance relationship sequence, the device distance relationship sequence being a sequence formed by sorting the multiple devices according to a preset distance-based sorting strategy, the distance being the distance between a device and the sound source target; and an execution unit configured to perform, according to the indication information, the operation indicated by the voice instruction associated with the sound of the sound source target.
- A training apparatus for a distance comparison model, comprising: an acquisition unit configured to acquire training data, the training data including multiple voice data sets, each voice data set containing multiple pieces of reference voice data in one-to-one correspondence with multiple devices, each piece of reference voice data being voice data obtained by the corresponding device collecting the sound of a sound source target, and the multiple voice data sets corresponding to voice data sets collected in different sound collection environments, a sound collection environment including at least the location of the sound source target; and a training unit configured to train a preset distance comparison model according to the reference voice data of the multiple voice data sets and a preset loss function to obtain a trained distance comparison model, the loss function characterizing the loss of the distance comparison model in terms of the accuracy of predicting the relative distance relationship, in the same sound collection environment, between the two devices of a device pair and the sound source target, a device pair consisting of any two devices among the multiple devices.
- An electronic device, comprising: one or more processors; and one or more memories for storing a program, the one or more memories and the program being configured such that the one or more processors control the device to perform the steps in the method of any one of claims 1-10, claims 11-13 or claims 14-18.
- A computer-readable storage medium storing a computer program for electronic data exchange, wherein the computer program causes a computer to perform the method of any one of claims 1-10, claims 11-13 or claims 14-18.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110273250.2 | 2021-03-10 | ||
CN202110273250.2A CN115083436A (en) | 2021-03-10 | 2021-03-10 | Distance relation determination method, equipment control method, model training method and related device |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022188560A1 true WO2022188560A1 (en) | 2022-09-15 |
Family
ID=83226337
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/072703 WO2022188560A1 (en) | 2021-03-10 | 2022-01-19 | Methods for distance relationship determination, device control and model training, and related apparatuses |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN115083436A (en) |
WO (1) | WO2022188560A1 (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105843581A (en) * | 2016-03-21 | 2016-08-10 | 腾讯科技(深圳)有限公司 | Frequency response calibration method, server, terminal device, and frequency response calibration system |
CN107507625A (en) * | 2016-06-14 | 2017-12-22 | 讯飞智元信息科技有限公司 | Sound source distance determines method and device |
CN109391528A (en) * | 2018-08-31 | 2019-02-26 | 百度在线网络技术(北京)有限公司 | Awakening method, device, equipment and the storage medium of speech-sound intelligent equipment |
US20190287526A1 (en) * | 2016-11-10 | 2019-09-19 | Nuance Communications, Inc. | Techniques for language independent wake-up word detection |
CN111128169A (en) * | 2019-12-30 | 2020-05-08 | 云知声智能科技股份有限公司 | Voice wake-up method and device |
CN111192589A (en) * | 2020-01-16 | 2020-05-22 | 云知声智能科技股份有限公司 | Voice wake-up method and device |
CN111294704A (en) * | 2020-01-22 | 2020-06-16 | 北京松果电子有限公司 | Audio processing method, device and storage medium |
CN111833863A (en) * | 2019-04-22 | 2020-10-27 | 阿里巴巴集团控股有限公司 | Voice control system, method and apparatus, and computing device and storage medium |
- 2021
  - 2021-03-10: CN CN202110273250.2A patent/CN115083436A/en active Pending
- 2022
  - 2022-01-19: WO PCT/CN2022/072703 patent/WO2022188560A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN115083436A (en) | 2022-09-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110364143B (en) | Voice awakening method and device and intelligent electronic equipment | |
WO2021208287A1 (en) | Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium | |
WO2021093449A1 (en) | Wakeup word detection method and apparatus employing artificial intelligence, device, and medium | |
CN112562691B (en) | Voiceprint recognition method, voiceprint recognition device, computer equipment and storage medium | |
WO2019101123A1 (en) | Voice activity detection method, related device, and apparatus | |
US10861480B2 (en) | Method and device for generating far-field speech data, computer device and computer readable storage medium | |
CN110021307B (en) | Audio verification method and device, storage medium and electronic equipment | |
JP2021516369A (en) | Mixed speech recognition method, device and computer readable storage medium | |
CN107799126A (en) | Sound end detecting method and device based on Supervised machine learning | |
CN108711429B (en) | Electronic device and device control method | |
TW202008352A (en) | Method, device, audio interaction system, and storage medium for azimuth estimation | |
EP3274988A1 (en) | Controlling electronic device based on direction of speech | |
CN105679310A (en) | Method and system for speech recognition | |
CN110706707B (en) | Method, apparatus, device and computer-readable storage medium for voice interaction | |
US11222652B2 (en) | Learning-based distance estimation | |
US9953633B2 (en) | Speaker dependent voiced sound pattern template mapping | |
JP2023546703A (en) | Multichannel voice activity detection | |
WO2023279691A1 (en) | Speech classification method and apparatus, model training method and apparatus, device, medium, and program | |
CN112185425B (en) | Audio signal processing method, device, equipment and storage medium | |
CN104282303A (en) | Method for conducting voice recognition by voiceprint recognition and electronic device thereof | |
US11763806B1 (en) | Speaker recognition adaptation | |
CN114664288A (en) | Voice recognition method, device, equipment and storage medium | |
WO2022188560A1 (en) | Methods for distance relationship determination, device control and model training, and related apparatuses | |
CN114464184B (en) | Method, apparatus and storage medium for speech recognition | |
US20230113883A1 (en) | Digital Signal Processor-Based Continued Conversation |
Legal Events
Date | Code | Title | Description
---|---|---|---
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22766087; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 22766087; Country of ref document: EP; Kind code of ref document: A1 |