WO2022188560A1 - Methods for distance relationship determination, device control and model training, and related apparatuses - Google Patents
Methods for distance relationship determination, device control and model training, and related apparatuses
- Publication number
- WO2022188560A1 PCT/CN2022/072703 CN2022072703W
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- distance
- data
- target
- voice
- devices
- Prior art date
Links
- 238000000034 method Methods 0.000 title claims abstract description 98
- 238000012549 training Methods 0.000 title claims abstract description 84
- 230000004044 response Effects 0.000 claims description 84
- 239000013598 vector Substances 0.000 claims description 65
- 230000006870 function Effects 0.000 claims description 61
- 238000012545 processing Methods 0.000 claims description 39
- 238000000605 extraction Methods 0.000 claims description 32
- 230000015654 memory Effects 0.000 claims description 27
- 238000004590 computer program Methods 0.000 claims description 19
- 230000004927 fusion Effects 0.000 claims description 14
- 238000007781 pre-processing Methods 0.000 claims description 10
- 238000005457 optimization Methods 0.000 claims description 6
- 238000012360 testing method Methods 0.000 claims description 6
- 230000001629 suppression Effects 0.000 claims description 5
- 230000011218 segmentation Effects 0.000 claims description 3
- 230000009467 reduction Effects 0.000 claims description 2
- 238000004891 communication Methods 0.000 description 21
- 238000010586 diagram Methods 0.000 description 17
- 238000004422 calculation algorithm Methods 0.000 description 14
- 230000008569 process Effects 0.000 description 14
- 230000009286 beneficial effect Effects 0.000 description 8
- 238000004364 calculation method Methods 0.000 description 8
- 230000003993 interaction Effects 0.000 description 8
- 230000004807 localization Effects 0.000 description 6
- 230000009471 action Effects 0.000 description 5
- 230000001360 synchronised effect Effects 0.000 description 5
- 238000013528 artificial neural network Methods 0.000 description 4
- 238000012546 transfer Methods 0.000 description 4
- 230000008878 coupling Effects 0.000 description 3
- 238000010168 coupling process Methods 0.000 description 3
- 238000005859 coupling reaction Methods 0.000 description 3
- 238000012216 screening Methods 0.000 description 3
- 230000005540 biological transmission Effects 0.000 description 2
- 238000009432 framing Methods 0.000 description 2
- 239000004065 semiconductor Substances 0.000 description 2
- 230000003595 spectral effect Effects 0.000 description 2
- 238000001228 spectrum Methods 0.000 description 2
- 230000003068 static effect Effects 0.000 description 2
- 238000013459 approach Methods 0.000 description 1
- 238000013473 artificial intelligence Methods 0.000 description 1
- 238000013527 convolutional neural network Methods 0.000 description 1
- 238000013500 data storage Methods 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000001514 detection method Methods 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 238000001914 filtration Methods 0.000 description 1
- 230000037433 frameshift Effects 0.000 description 1
- 238000007499 fusion processing Methods 0.000 description 1
- 238000002955 isolation Methods 0.000 description 1
- 238000010801 machine learning Methods 0.000 description 1
- 239000000203 mixture Substances 0.000 description 1
- 238000010295 mobile communication Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 238000012367 process mapping Methods 0.000 description 1
- 238000010187 selection method Methods 0.000 description 1
- 230000011664 signaling Effects 0.000 description 1
- 239000007787 solid Substances 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/225—Feedback of the input speech
Definitions
- the present application belongs to the technical field of voice assistants, and in particular relates to methods for distance relationship determination, device control, and model training, and related apparatuses.
- As artificial intelligence technology ushers in its third wave, voice assistants have gradually entered all aspects of life and are built into smart devices such as mobile phones, watches, speakers, and TVs. Because of the rich variety of devices, multiple devices with voice assistant functions may be present in the same space.
- in a conventional scheme, a microphone array is used to measure the distance to the sound source; the device closest to the sound source is then determined by distance comparison and woken up to execute the user's instructions.
- the present application provides methods and related apparatuses for distance relationship determination, device control, and model training, in order to improve the comprehensiveness, efficiency, and convenience with which an arbitration device in a nearest-wake-up product solution determines the distance relationship between the sound source target and each device.
- the present application provides a method for determining a distance relationship, which is applied to an arbitration device, and the method includes:
- each relative distance identifier in the at least one relative distance identifier is used to indicate the position of the corresponding device in the device distance relationship sequence
- the device distance relationship sequence is a sequence formed by sorting the plurality of devices according to a preset sorting strategy for distances, The distance refers to the distance between the device and the sound source target.
- the arbitration device first obtains multiple pieces of sound collection data from multiple devices, and then, according to the reference voice data and device identifiers in the multiple pieces of sound collection data and a pre-trained distance comparison model, determines at least one relative distance identifier corresponding one-to-one to at least one of the multiple devices. Each relative distance identifier indicates the position of the corresponding device in the device distance relationship sequence, which is formed by sorting the multiple devices by distance according to a preset sorting strategy, the distance being the distance between the device and the sound source target.
- the model can predict the relative position of each device, such as device 1, device 2, and device 3, in the distance sequence ordered by distance from the sound source target.
- the device distance relationship sequence is device 3 ⁇ device 2 ⁇ device 1
- the prediction result can indicate that device 3 is closest to the sound source target by assigning device 3 the relative distance identifier 1; this contrasts with existing models, which predict each device's absolute distance in isolation.
- the distance comparison model is used to predict the relative distance relationship including global information in the multi-device voice interaction scene.
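As a concrete illustration of the relative-distance prediction described above, the sketch below derives 1-based relative distance identifiers from per-device distance scores. The scores, device names, and the convention that a smaller score means a closer device are hypothetical; the application does not specify the model's raw output format.

```python
# Hypothetical sketch: convert per-device distance scores into relative
# distance identifiers (1 = closest). A real distance comparison model
# would produce the scores from each device's reference voice data.

def relative_distance_ids(scores):
    """Map each device to its 1-based position in the device distance
    relationship sequence, sorting by predicted score (smaller = closer)."""
    ordered = sorted(scores, key=scores.get)
    return {device: rank for rank, device in enumerate(ordered, start=1)}

# Example matching the text: device 3 < device 2 < device 1 by distance.
scores = {"device1": 0.9, "device2": 0.6, "device3": 0.2}
ids = relative_distance_ids(scores)
# ids["device3"] == 1 indicates device 3 is closest to the sound source.
```

The identifier for a device is thus its rank in the global ordering, not an absolute distance, which matches the "relative distance relationship including global information" described above.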
- the present application provides a device control method, which is applied to a target device, and the method includes:
- the indication information is generated when the arbitration device determines, according to the at least one relative distance identifier, the target device among the multiple devices for executing the voice command associated with the sound of the sound source target. The at least one relative distance identifier is obtained by the arbitration device performing the following operations: acquiring multiple pieces of sound collection data corresponding one-to-one to the multiple devices, each piece of sound collection data including the reference voice data obtained by the corresponding device collecting the sound of the sound source target and the device identifier of that device; and determining the at least one relative distance identifier according to the reference voice data and device identifiers in the multiple pieces of sound collection data and the pre-trained distance comparison model, where each relative distance identifier indicates the position of the corresponding device in the device distance relationship sequence, the device distance relationship sequence being a sequence formed by sorting the multiple devices by distance according to a preset sorting strategy.
- the operation indicated by the voice instruction associated with the sound of the sound source target is performed according to the indication information.
- the target device first obtains the indication information from the arbitration device, and then, according to the indication information, executes the operation indicated by the voice instruction associated with the sound of the sound source target.
- the indication information is generated by the arbitration device when it determines, according to the at least one relative distance identifier corresponding to at least one of the multiple devices, the target device among the multiple devices for executing the voice command associated with the sound of the sound source target
- the relative distance identifier is used to indicate the position of the distance between the corresponding device and the sound source target in the distance sequence
- the distance sequence is a sequence formed by sorting the multiple distances according to the preset sorting strategy.
- compared with the existing scheme of determining the nearest wake-up device based on the absolute distance between each device and the sound source target, the present application uses the distance comparison model to predict a relative distance relationship that contains the global information of the multi-device voice interaction scene, and determines the target device to be woken up according to that relative distance relationship.
- since each piece of sound collection data includes the reference voice data obtained by the corresponding device collecting the sound of the sound source target and the device identifier of that device, and the reference voice data imposes no limitation on the number of channels, the scheme overcomes the high hardware requirements and algorithmic complexity of conventional microphone-array sound source localization, which is beneficial to the efficiency and applicability of relative distance relationship prediction.
- the present application provides a method for training a distance comparison model, including:
- acquire training data, where the training data includes multiple voice data sets; each voice data set includes multiple pieces of reference voice data corresponding one-to-one to multiple devices, each piece of reference voice data being the voice data obtained by the corresponding device collecting the sound of the sound source target; the multiple voice data sets are collected under different sound collection environments, and each sound collection environment includes at least the location of the sound source target;
- a preset distance comparison model is trained according to the reference voice data of the multiple voice data sets and a preset loss function to obtain a trained distance comparison model. The loss function represents the loss of the distance comparison model in terms of the accuracy of predicting the relative distance relationship between the two devices of a device pairing group and the sound source target in the same sound collection environment, where a device pairing group consists of any two of the multiple devices.
- the device first obtains training data, and then trains a preset distance comparison model according to the reference voice data of the multiple voice data sets and a preset loss function, obtaining a trained distance comparison model.
- the training data includes multiple voice data sets
- each voice data set includes multiple reference voice data corresponding to multiple devices one-to-one
- each reference voice data is the voice data obtained by the corresponding device collecting the sound of the sound source target
- the multiple voice data sets correspond to the voice data sets collected in different sound collection environments
- the loss function represents the loss of the distance comparison model in terms of the accuracy of predicting the relative distance relationship between the two devices of a device pairing group and the sound source target in the same sound collection environment, and a device pairing group consists of any two of the multiple devices. The trained model therefore has the ability to predict the relative distance between a device and the sound source target, and since the reference voice data imposes no limitation on the number of channels, the scheme overcomes the high hardware requirements and algorithmic complexity of conventional microphone-array sound source localization, which is beneficial to the efficiency and applicability of relative distance relationship prediction.
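The pairwise loss over device pairing groups can be sketched as follows. This RankNet-style logistic loss on the score difference is an assumption for illustration; the application only states that the loss measures the accuracy of the predicted closer/farther relation for each device pair in the same sound collection environment.

```python
# Hypothetical pairwise ranking loss for a device pairing group (i, j):
# the model emits distance scores s_i, s_j (smaller = closer), and the
# loss is small when their ordering matches the ground-truth relation.
import math

def pairwise_rank_loss(s_i, s_j, i_is_closer):
    """Logistic loss on the score difference for one device pair."""
    diff = s_i - s_j               # positive: model says i is farther
    y = -1.0 if i_is_closer else 1.0  # desired sign of diff
    return math.log1p(math.exp(-y * diff))

# Correct ordering gives a small loss, wrong ordering a large one.
good = pairwise_rank_loss(0.2, 0.8, i_is_closer=True)
bad = pairwise_rank_loss(0.8, 0.2, i_is_closer=True)
# good < bad
```

In training, such a term would be summed over all device pairing groups drawn from each voice data set, so the model only ever has to learn relative orderings rather than absolute distances.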
- the present application provides an apparatus for determining a distance relationship, which is applied to an arbitration device, and the apparatus includes:
- an acquisition unit configured to acquire a plurality of pieces of sound collection data corresponding one-to-one to a plurality of devices, each piece of sound collection data including the reference voice data obtained by the corresponding device collecting the sound of the sound source target and the device identifier of that device;
- a determination unit configured to determine, according to the reference voice data and device identifiers in the plurality of pieces of sound collection data and the pre-trained distance comparison model, at least one relative distance identifier corresponding one-to-one to at least one of the plurality of devices, where each relative distance identifier indicates the position of the corresponding device in the device distance relationship sequence, the device distance relationship sequence being a sequence formed by sorting the plurality of devices by distance according to a preset sorting strategy, the distance being the distance between the device and the sound source target.
- the present application provides a device control device, which is applied to an electronic device, and the device includes:
- an acquiring unit configured to acquire indication information from the arbitration device, where the indication information is generated when the arbitration device determines, according to at least one relative distance identifier corresponding to at least one of the plurality of devices, the target device among the plurality of devices for executing the voice command associated with the sound of the sound source target. The at least one relative distance identifier is obtained by the arbitration device performing the following operations: acquiring multiple pieces of sound collection data corresponding one-to-one to the multiple devices, each piece of sound collection data including the reference voice data obtained by the corresponding device collecting the sound of the sound source target and the device identifier of that device; and determining the at least one relative distance identifier according to the reference voice data and device identifiers in the multiple pieces of sound collection data and the pre-trained distance comparison model, where each relative distance identifier indicates the position of the corresponding device in the device distance relationship sequence, the device distance relationship sequence being a sequence formed by sorting the multiple devices by distance according to a preset sorting strategy;
- An execution unit configured to execute the operation indicated by the voice instruction associated with the sound of the sound source target according to the instruction information.
- the present application provides a training device for a distance comparison model, including:
- an acquiring unit configured to acquire training data, where the training data includes multiple voice data sets, each voice data set in the multiple voice data sets includes multiple reference voice data corresponding to multiple devices one-to-one, the Each reference voice data in the multiple reference voice data is voice data obtained by the corresponding device collecting the sound of the sound source target, and the multiple voice data sets correspond to voice data sets collected under different sound collection environments, the The sound collection environment at least includes the location where the sound source target is located;
- a training unit configured to train a preset distance comparison model according to the reference voice data of the plurality of voice data sets and a preset loss function to obtain a trained distance comparison model, the loss function representing the loss of the distance comparison model in terms of the accuracy of predicting the relative distance relationship between the two devices of a device pairing group and the sound source target in the same sound collection environment;
- the device pairing group is composed of any two devices in the plurality of devices.
- the present application provides an electronic device including one or more processors, one or more memories, and one or more programs, where the one or more programs are stored in the one or more memories and are configured to be executed by the one or more processors, and include instructions for performing the steps of any method in the first, second, or third aspect of the embodiments of the present application.
- the present application provides a chip, including a processor for calling and running a computer program from a memory, so that a device installed with the chip executes any method in the first, second, or third aspect of the embodiments of the present application.
- the present application provides a computer-readable storage medium storing a computer program for electronic data exchange, where the computer program causes a computer to execute part or all of the steps of any method described in the embodiments of the present application.
- the present application provides a computer program operable to cause a computer to execute part or all of the steps of the methods described in the first, second, or third aspect of the embodiments of the present application.
- the computer program may be a software installation package.
- 1a is a schematic diagram of user control in a multi-device scenario provided by an embodiment of the present application
- FIG. 1b is an architecture diagram of a device control system 10 provided by an embodiment of the present application.
- 1c is a schematic diagram of a functional interface of an intelligent voice assistant provided by an embodiment of the present application.
- 1d is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
- FIG. 2 is a schematic flowchart of a method for determining a distance relationship provided by an embodiment of the present application
- FIG. 3 is a schematic flowchart of a device control method provided by an embodiment of the present application.
- FIG. 4 is a schematic flowchart of a training method for a distance comparison model provided by an embodiment of the present application
- FIG. 5 is a block diagram of functional units of a device for determining a distance relationship provided by an embodiment of the present application
- FIG. 6 is a block diagram of functional units of another distance relationship determination device provided by an embodiment of the present application.
- FIG. 7 is a block diagram of functional units of a device control device provided by an embodiment of the present application.
- FIG. 8 is a block diagram of functional units of another device control apparatus provided by an embodiment of the present application.
- FIG. 9 is a block diagram of functional units of a training device for a distance comparison model provided by an embodiment of the present application.
- FIG. 10 is a block diagram of functional units of another distance comparison model training device provided by an embodiment of the present application.
- as shown in FIG. 1a, the space where the user is located contains a smart speaker (0.5 m from the user), smart TV 1 (0.6 m from the user), a computer (1.2 m from the user), and smart TV 2 (0.55 m from the user).
- a current intelligent voice assistant can only measure the distance between a device containing a microphone array and the sound source; if the computer does not contain a microphone array, the intelligent voice assistant cannot calculate its distance and cannot accurately realize nearest-device wake-up control.
- the embodiments of the present application provide a method and related apparatus for distance relationship determination, device control, and model training, which are described in detail below with reference to the accompanying drawings.
- FIG. 1b is a device control system 10 provided by an embodiment of the present application.
- the device control system 10 includes an electronic device 100 (such as a smart TV, a smart speaker, a smart phone, etc.) with a sound collection capability, an arbitration device 200 installed with an intelligent voice assistant, and a server 300.
- the arbitration device 200 may be one of the electronic devices 100; it may also be a mobile device such as the user's mobile phone, a dedicated control box in the smart home scene, a server in the cloud, or a device group composed of multiple devices that jointly complete data processing. The arbitration device 200 is communicatively connected to the electronic devices 100 and the server 300 to form a device control network in a smart home scenario.
- the intelligent voice assistant can be installed on various devices such as mobile phones to support the device control method of the present application. The specific function names and interface interaction methods presented by the intelligent voice assistant can vary and are not uniquely limited here; for example, when installed on a mobile phone it may present the settings interface of the "Breeno" smart assistant as shown in FIG. 1c.
- the illustrated interface includes one-key command function settings, including a navigate-home function, a nearby function, an arriving-home reminder function, a framed-screenshot function, and a multi-device control function.
- digital tags can be used to identify the proximity of each device to the sound source target, i.e., the user.
- the arbitration device 200 can exchange data and signaling with other devices (for example, the electronic devices 100 and the server 300) in various ways, which are not uniquely limited here.
- the arbitration device 200 may be directly connected with the electronic device 100 to obtain corresponding information, and the arbitration device 200 may be connected to the server 300 through a mobile communication network to realize corresponding information exchange and the like.
- FIG. 1d is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
- the electronic device is applied to the above-mentioned device control system 10.
- the electronic device includes an application processor 120, a memory 130, a communication module 140, and one or more programs 131.
- the application processor 120 is communicatively connected to the memory 130 and the communication module 140 through an internal communication bus.
- the one or more programs 131 are stored in the above-mentioned memory 130 and configured to be executed by the above-mentioned application processor 120, and include instructions for executing any step in the above-mentioned method embodiments.
- the application processor 120 may be, for example, a central processing unit (CPU), a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and may implement or execute the various exemplary logical blocks, units, and circuits described in connection with this disclosure.
- the processor may also be a combination that implements computing functions, such as a combination of one or more microprocessors, a combination of a DSP and a microprocessor, and the like.
- the communication unit may be a communication module 140 , a transceiver, a transceiver circuit, etc., and the storage unit may be the memory 130 .
- the memory 130 may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory.
- the non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory.
- Volatile memory may be random access memory (RAM), which acts as an external cache.
- static random access memory (SRAM)
- dynamic random access memory (DRAM)
- synchronous dynamic random access memory (SDRAM)
- double data rate synchronous dynamic random access memory (DDR SDRAM)
- enhanced synchronous dynamic random access memory (ESDRAM)
- synchlink dynamic random access memory (SLDRAM)
- direct rambus random access memory (DR RAM)
- the application processor 120 is configured to perform any step performed by the arbitration device, the target device, or the model training device in the method embodiment of the present application.
- FIG. 2 is a schematic flowchart of a method for determining a distance relationship provided by an embodiment of the present application, which is applied to the arbitration device 200 in the device control system 10. As shown in the figure, the device control method includes the following operations.
- Step 201: Acquire a plurality of pieces of sound collection data corresponding one-to-one to a plurality of devices, each piece of sound collection data including reference voice data obtained by the corresponding device collecting the sound of the sound source target and the device identifier of that device.
- the plurality of devices may be the electronic devices 100 in the device control system 10 described above, and neither the number nor the device type is uniquely limited.
- the sound source target includes a user or a pronunciation device, which is not uniquely limited here.
- the sound of the sound source target may be a wake-up voice, such as "Hello Xiaoou" and the like.
- each device can wait for the wake-up voice at the same time.
- the device obtains a monophonic segment of the respective wake-up speech.
- the reference voice data may first be subjected to 4 kHz low-pass filtering to suppress non-speech audio components.
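A minimal sketch of such a 4 kHz low-pass preprocessing step, using SciPy's Butterworth design; the filter order and 16 kHz sample rate are illustrative choices, not values taken from the application.

```python
# Hypothetical sketch: attenuate content above 4 kHz in the reference
# voice data to suppress non-speech audio before further processing.
import numpy as np
from scipy.signal import butter, sosfilt

def lowpass_4khz(samples, sample_rate=16000):
    """Apply an 8th-order Butterworth low-pass filter at 4 kHz."""
    sos = butter(8, 4000, btype="low", fs=sample_rate, output="sos")
    return sosfilt(sos, samples)

# Example: a 6 kHz tone (above the cutoff) is strongly attenuated.
t = np.arange(16000) / 16000.0
tone = np.sin(2 * np.pi * 6000 * t)
filtered = lowpass_4khz(tone)
```

Second-order sections (`output="sos"`) are used here because they are numerically better behaved than transfer-function coefficients at higher filter orders.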
- the reference voice data may be voice data before frequency response capability alignment, feature extraction, and feature fusion processing.
- the arbitration device uniformly performs relevant preprocessing on the reference voice data of multiple devices.
- the reference voice data can also be voice data already processed by each device through frequency response capability alignment, feature extraction, and feature fusion. After the reference voice data and device identifiers of the multiple devices are obtained, the pre-trained distance comparison model is invoked to predict at least one relative distance identifier for at least one device.
- Step 202: Determine, according to the reference voice data and device identifiers in the plurality of pieces of sound collection data and the pre-trained distance comparison model, at least one relative distance identifier corresponding one-to-one to at least one of the multiple devices. Each relative distance identifier indicates the position of the corresponding device in the device distance relationship sequence, which is formed by sorting the plurality of devices by distance according to a preset sorting strategy, the distance being the distance between the device and the sound source target.
- the preset sorting strategy may include from large to small or from small to large, etc., which is not uniquely limited here.
- the reference voice data includes monophonic voice data or multi-channel voice data, that is, there is no need to make higher requirements on the voice collection capability of the device, and the application is more convenient.
- the relative distance identifiers may be numbers (such as 1/2/3/4), graphics (such as line segments with different lengths), etc., which are not uniquely limited here.
- the prediction result can be: the relative distance of device A is identified as 1 (closest), the relative distance of device B is identified as 3, the relative distance of device C is identified as 4 (farthest), and the relative distance of device D is identified as 2.
- the arbitration device can directly select a target device with the closest relative distance to the sound source target as the wake-up device, and execute the user's voice instruction through the target device.
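A minimal sketch of this selection logic, using the hypothetical prediction result above (device names and identifier values are illustrative):

```python
# Hypothetical prediction result: relative distance identifiers per device,
# where 1 marks the device nearest to the sound source target.
relative_ids = {"A": 1, "B": 3, "C": 4, "D": 2}

# Sorting by identifier recovers the device distance relationship sequence.
sequence = sorted(relative_ids, key=relative_ids.get)
print(sequence)               # ['A', 'D', 'B', 'C']  (nearest -> farthest)

target_device = sequence[0]   # nearest device is chosen as the wake-up device
print(target_device)          # 'A'
```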
- the arbitration device may further query the preset device wake-up priority set to determine the device to be woken up first. In this way, the accuracy and success rate of device control can be improved.
- the output of the distance comparison model is the relative distance identifier of each voice data; the arbitration device can then determine the relative distance identifier of the corresponding device according to the correspondence between the voice data and the device identifiers.
- in this case, the model input data does not include the device identifier; the voice data corresponds to the device identifier, the prediction result corresponds one-to-one with the voice data, and the prediction result is therefore associated with the device identifier indirectly through the voice data.
- the input of the distance comparison model contains both audio data and device identifiers. For example, for mobile phone 1, mobile phone 2, and mobile phone 3, where mobile phone 1 and mobile phone 2 are of device type 1 and mobile phone 3 is of device type 2, the identifier of mobile phone 1 can be type 1 + name of mobile phone 1, the identifier of mobile phone 2 is type 1 + name of mobile phone 2, and the identifier of mobile phone 3 is type 2 + name of mobile phone 3. The output of the distance comparison model is then directly the relative distance identifier of each device.
- that is, the relative distance identifier (the prediction result) can directly correspond to the device identifier.
- determining at least one relative distance identifier comprises: aligning the frequency response capability of each reference voice data among the multiple reference voice data in the multiple sound collection data to obtain multiple target voice data with aligned frequency response capabilities in one-to-one correspondence with the multiple sound collection data; and determining, according to the multiple target voice data, the device identifiers in the multiple sound collection data, and the pre-trained distance comparison model, at least one relative distance identifier in one-to-one correspondence with at least one of the multiple devices.
- the voice signals obtained by heterogeneous devices are adapted to the same standard through the frequency response capability alignment process, which provides a unified data basis for distance comparison. Compared with the existing solution that simply normalizes only the speech energy ratio, it is beneficial to improve the accuracy of the model prediction result.
- performing frequency response capability alignment on the multiple reference voice data in the multiple sound collection data to obtain multiple target voice data after frequency response capability alignment includes: performing the following operations on each reference voice data in the multiple sound collection data: obtaining, according to the device identifier of the currently processed reference voice data, the preset frequency response unit impulse response of the current device relative to the reference device; and performing a convolution operation on the reference voice data and the frequency response unit impulse response and adjusting the gain, to obtain the target voice data after frequency response capability alignment.
- the speech content of the reference voice data remains unchanged after the adjustment.
- the reference device can be a device specified in multiple devices.
- the frequency response unit impulse response of the current acquisition device relative to the reference device is determined through the following steps a1 to a5:
- Step a2 obtain the statistical average of all frequency response curves of each device as the final frequency response curve of that device:
- K is the number of points in the frequency response curve, which is a positive integer.
- Step a3 select a device as the reference device, with its type denoted as b ∈ [1,2,...,L]; the selection may be, for example, the device most frequently used by the user. Calculate the ratio of the frequency response curve of each device to that of the reference device to obtain the frequency response transfer function between devices:
- Step a4 perform an inverse discrete Fourier transform (IDFT) on the frequency response transfer function to obtain the frequency response unit impulse response corresponding to the transfer function:
- IDFT: inverse discrete Fourier transform
- Step a5 save the frequency response unit impulse response t l (n) in the corresponding device.
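The ratio/IDFT/convolution chain of steps a3 to a5 can be sketched as follows; the 8-point frequency-response curves are illustrative assumptions:

```python
import numpy as np

# Hypothetical averaged frequency-response curves (K = 8 points).
H_dev = np.array([1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3])  # current device
H_ref = np.array([1.0, 1.0, 0.9, 0.8, 0.8, 0.7, 0.6, 0.5])  # reference device

T = H_dev / H_ref                 # step a3: transfer function between devices
t_n = np.fft.ifft(T).real         # step a4: IDFT -> frequency-response unit impulse response

# Step a5 would save t_n on the device; alignment then convolves the
# device's speech with t_n and adjusts the gain.
speech = np.random.default_rng(0).normal(size=256)
aligned = np.convolve(speech, t_n)[:len(speech)]
aligned *= np.linalg.norm(speech) / np.linalg.norm(aligned)  # gain adjustment
print(aligned.shape)              # (256,)
```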
- determining, according to the multiple target voice data, the device identifiers in the plurality of sound collection data, and the pre-trained distance comparison model, at least one relative distance identifier in one-to-one correspondence with at least one device among the plurality of devices includes: performing multi-dimensional feature extraction on each target voice data among the multiple target voice data to obtain multiple voice feature sets in one-to-one correspondence with the multiple target voice data, each voice feature set including multi-dimensional feature extraction results; performing feature fusion on each voice feature set among the multiple voice feature sets to obtain multiple fused voice features; and determining, according to the multiple fused voice features, the device identifiers in the plurality of sound collection data, and the pre-trained distance comparison model, at least one relative distance identifier corresponding to at least one device among the plurality of devices.
- performing multi-dimensional feature extraction on each of the multiple target voice data to obtain multiple voice feature sets in one-to-one correspondence with the multiple target voice data includes: performing the following operations on each target voice data among the multiple target voice data to obtain multiple voice feature sets: extracting the scalar voice features and vector voice features of the currently processed target voice data; and performing dimensionality reduction and secondary feature extraction on the vector voice features to obtain vector-derived voice features.
- extracting the scalar speech features and vector speech features of the currently processed target speech data includes: preprocessing the currently processed target speech data to obtain preprocessed target speech data; and extracting the scalar voice features and vector voice features of the preprocessed target voice data; wherein the preprocessing includes at least one of the following: silence suppression processing, pre-emphasis processing through a high-frequency filter, frame segmentation processing, and windowing processing.
- the purpose of the silence suppression (Voice Activity Detection, VAD) processing is to identify and eliminate long silence periods therefrom, so as to extract the most effective voice segments for use in subsequent steps, for example, the WebRTC VAD algorithm can be used.
- the purpose of the pre-emphasis processing is to compensate for the high-frequency components of the speech signal lost due to the influence of the pronunciation system, and to highlight the high-frequency formants. Pre-emphasis is achieved by a high-frequency filter whose transfer function is H(z) = 1 − μz⁻¹, where:
- z is the complex variable of the Z-transform of the speech signal
- μ is the pre-emphasis coefficient, generally between 0.9 and 1.0; in this application it may be, for example, 0.97.
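A minimal sketch of the pre-emphasis filter y[n] = x[n] − μ·x[n−1] (the coefficient 0.97 follows the text; the function name is illustrative):

```python
import numpy as np

def pre_emphasis(x, mu=0.97):
    """y[n] = x[n] - mu * x[n-1], i.e. the filter H(z) = 1 - mu * z^-1."""
    return np.append(x[0], x[1:] - mu * x[:-1])

x = np.array([1.0, 1.0, 1.0, 1.0])
y = pre_emphasis(x)
print(y)   # approximately [1., 0.03, 0.03, 0.03]
```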
- the frame length is set to 25ms
- the frame shift is set to 10ms, that is, there is an overlap area of 15ms between two adjacent frames, which can avoid the influence of excessive changes in the speech signals of the two adjacent frames.
- s_i[n] represents the data of the i-th frame.
- N is the length of s_i[n].
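The 25 ms / 10 ms framing can be sketched as follows; the 16 kHz sampling rate and Hamming window are assumptions not stated in this passage:

```python
import numpy as np

def frame_signal(x, fs=16000, frame_ms=25, shift_ms=10):
    frame_len = fs * frame_ms // 1000            # 400 samples at 16 kHz
    shift = fs * shift_ms // 1000                # 160 samples -> 15 ms overlap
    n_frames = 1 + (len(x) - frame_len) // shift
    idx = np.arange(frame_len) + shift * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)        # windowing (Hamming assumed)

x = np.zeros(16000)                              # 1 s of dummy audio
frames = frame_signal(x)                         # frames[i] is s_i[n]
print(frames.shape)                              # (98, 400)
```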
- for the scalar speech features, 5 kinds can be extracted, as shown in Table 1.
- the scalar speech feature is extracted according to its definition, and the specific calculation process will not be repeated.
- the above scalar speech features can all be feature values of shape (1,), which mainly describe the characteristics of the user's current sound field environment, such as reverberation, reflection, sound transmission gain, noise, and the like.
- the form of (1,) represents the data dimension
- (1,) represents a 1×1-dimensional vector
- (245,) represents a 245×1-dimensional vector, of which the 245×1 vector is spliced from the preceding features.
- N is a positive integer
- when the features are batched, the dimension of each feature changes accordingly: (1,) → (1,N), that is, 1×N; similarly, (245,) → (245,N), that is, 245×N.
- the above-mentioned vector speech features can extract 8 kinds of vector speech features, as shown in Table 2.
- the vector speech features are extracted according to existing methods, and the specific calculation process will not be repeated.
- the above vector speech features are all feature vectors of shape (12, N_f), where 12 is the feature dimension, indicating that the vector speech features have 12-dimensional feature components, and N_f represents the number of speech frames.
- the vector speech feature can describe the pickup characteristics of different pickup devices, such as spectrum, pitch, formant, and the like.
- Vector derived feature extraction includes three functional modules: feature component screening, differential feature calculation, and vector feature scalarization, which will be introduced separately below.
- r_Fisher is the Fisher ratio of a feature component; the larger the value, the stronger the distinguishing ability of the feature component in that dimension;
- σ_b represents the inter-class variance of the feature components, that is, the variance of the means of the speech feature components of different distance types;
- σ_w represents the intra-class variance of the feature components, that is, the variance of the speech feature components within the same distance type; the calculation formulas of σ_b and σ_w are as follows:
- M represents the number of samples
- m_k represents the mean value of the k-th dimension component of a certain feature vector over all samples
- n_i represents the number of frames of a certain speech sample
- the feature component screening module uses the Fisher criterion to select the 3 feature components with the largest Fisher ratios from the 8 vector speech features, so that each vector speech feature is converted from (12, N_f) to (3, N_f), which greatly reduces the number of feature parameters.
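A sketch of the Fisher-criterion screening, with σ_b taken as the variance of the per-class means and σ_w as the mean of the per-class variances (the data and class labels are synthetic stand-ins for distance types):

```python
import numpy as np

def fisher_ratio(features, labels):
    """Per-dimension Fisher ratio sigma_b / sigma_w.
    features: (num_samples, dim); labels: distance-type label per sample."""
    classes = np.unique(labels)
    class_means = np.array([features[labels == c].mean(axis=0) for c in classes])
    sigma_b = class_means.var(axis=0)                                   # inter-class variance
    sigma_w = np.mean([features[labels == c].var(axis=0) for c in classes], axis=0)
    return sigma_b / sigma_w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))             # 12-dimensional vector speech feature
y = rng.integers(0, 4, size=200)           # 4 distance types (synthetic labels)
X[:, 2] += y                               # make dimension 2 distance-dependent
r = fisher_ratio(X, y)
top3 = np.argsort(r)[-3:]                  # keep the 3 most discriminative components
print(2 in top3)                           # the informative dimension survives
```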
- the above 8 kinds of vector speech features can only reflect the static characteristics of speech, and their dynamic characteristics can be described by the differential parameters of the vector speech features.
- the formula for calculating the differential parameters is as follows:
- c_l represents the l-th dimension feature component of a certain vector speech feature
- the first-order differences of the eight vector speech features can be obtained, which are represented by ⁇ MFCC, ⁇ LPCC, ⁇ MHEC, ⁇ BFCC, ⁇ LFCC, ⁇ GFCC, ⁇ NGCC, and ⁇ MSRCC, respectively.
- These vector difference features are of shape (3, N_f).
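The first-order difference can be sketched with the common regression formula over ±N neighbouring frames (N = 2 here is an assumption; the application does not fix the window):

```python
import numpy as np

def delta(c, N=2):
    """First-order difference of a (dims, N_f) feature matrix using the
    regression formula d_t = sum_n n*(c_{t+n} - c_{t-n}) / (2*sum_n n^2)."""
    T = c.shape[1]
    padded = np.pad(c, ((0, 0), (N, N)), mode="edge")
    num = sum(n * (padded[:, N + n:T + N + n] - padded[:, N - n:T + N - n])
              for n in range(1, N + 1))
    return num / (2 * sum(n * n for n in range(1, N + 1)))

mfcc = np.random.default_rng(0).normal(size=(3, 50))   # screened (3, N_f) feature
d_mfcc = delta(mfcc)                                   # e.g. the ΔMFCC feature
print(d_mfcc.shape)    # (3, 50)
```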
- this scheme uses the vector feature scalarization module to perform scalarization processing on the above-mentioned vector features.
- Step a use a GMM and its EM algorithm to cluster each dimension's feature component of the MFCC, with the number of clusters set to 4; four cluster center values of each dimension's feature component are obtained, giving a feature of shape (4,), where l ∈ (1,2,3) represents the label of the feature component;
- Step b calculate the maximum value, minimum value, and end value of each dimension's feature component of the MFCC to obtain a feature of shape (3,);
- Step c calculate the maximum value, minimum value, and sum of squares of each dimension's feature component of the vector difference feature ΔMFCC of the MFCC to obtain a feature of shape (3,);
- Step d splice the three feature vectors obtained in steps a, b, and c to obtain a feature of shape (10,), denoted as F_l;
- Step e splice the features F_l of the three feature components to obtain a new feature of shape (30,), denoted as F_MFCC, which characterizes the original MFCC feature;
- Step f apply the above steps a to e to the other 7 kinds of vector speech features to obtain corresponding new features, denoted as F_LPCC, F_MHEC, F_BFCC, F_LFCC, F_GFCC, F_NGCC, and F_MSRCC;
- Step g splice and merge the above 8 new features of shape (30,) to obtain a vector-derived speech feature of shape (240,), denoted as F_D; each feature value in F_D has a specific physical meaning.
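The dimension bookkeeping of steps a to g can be sketched as follows; for brevity the 4 "cluster centres" are quantile stand-ins for a full GMM/EM fit, so only the (4,)+(3,)+(3,) → (10,) → (30,) → (240,) layout is faithful:

```python
import numpy as np

def scalarize_component(comp, dcomp):
    """Steps a-d for one screened feature component (N_f,) and its difference.
    The 4 'cluster centres' are quantile stand-ins for a GMM/EM fit."""
    centers = np.quantile(comp, [0.125, 0.375, 0.625, 0.875])          # step a: (4,)
    stats = np.array([comp.max(), comp.min(), comp[-1]])               # step b: (3,)
    dstats = np.array([dcomp.max(), dcomp.min(), np.sum(dcomp ** 2)])  # step c: (3,)
    return np.concatenate([centers, stats, dstats])                    # step d: (10,)

rng = np.random.default_rng(0)
feat = rng.normal(size=(3, 50))                     # screened (3, N_f) feature, e.g. MFCC
dfeat = np.diff(feat, axis=1, append=feat[:, -1:])  # stand-in difference feature
F_MFCC = np.concatenate([scalarize_component(feat[l], dfeat[l]) for l in range(3)])  # step e
print(F_MFCC.shape)                 # (30,)
F_D = np.concatenate([F_MFCC] * 8)  # step g, with 8 such features
print(F_D.shape)                    # (240,)
```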
- the above feature fusion step is used to fuse scalar speech features and vector-derived speech features.
- the 5 kinds of scalar speech features are spliced to obtain a feature vector of shape (5,), denoted as F_S; then F_S and the vector-derived feature F_D are spliced and fused to finally obtain a fusion feature of shape (245,), denoted as F_Fusion. That is, each speech segment is finally extracted into a feature vector of shape (245,), which is used for training the distance relationship model between the speaker and the heterogeneous devices.
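The splicing described here amounts to a single concatenation (placeholder zero vectors stand in for real features):

```python
import numpy as np

F_S = np.zeros(5)                      # 5 spliced scalar speech features, shape (5,)
F_D = np.zeros(240)                    # vector-derived speech feature, shape (240,)
F_Fusion = np.concatenate([F_S, F_D])  # fusion feature
print(F_Fusion.shape)                  # (245,)
```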
- the fusion speech features are less affected by ambient sound field characteristics and random noise, are suitable for a variety of scenarios and heterogeneous distributed devices, and have strong representational power.
- the fusion speech feature can be used for training the distance relationship model between speakers and distributed devices, so as to judge the distance between different smart devices and the user; in this way, "human-machine distance" can become an important decision-making dimension in multi-device wake-up and improve the user experience.
- the user has multiple distributed devices that support the same wake-up word; after the user speaks the wake-up word, the device closest to the user responds, achieving nearby wake-up. Alternatively, a comprehensive judgment can be made across device service capability, user intent, and other dimensions, and the most suitable device is selected to respond to the user's request.
- determining at least one relative distance identifier comprises: performing multi-dimensional feature extraction on each of the plurality of sound collection data to obtain a plurality of voice feature sets in one-to-one correspondence with the plurality of sound collection data, each voice feature set including multi-dimensional feature extraction results; performing feature fusion on each voice feature set among the multiple voice feature sets to obtain multiple fused voice features; and determining, according to the fused voice features, the device identifiers in the plurality of sound collection data, and the pre-trained distance comparison model, at least one relative distance identifier that corresponds one-to-one with at least one device among the plurality of devices.
- the feature extraction can also be used for other distance relationship determination methods, and the calculation results can be used for other application scenarios.
- performing multi-dimensional feature extraction on each of the plurality of sound collection data to obtain a plurality of voice feature sets in one-to-one correspondence with the plurality of sound collection data includes: performing frequency response capability alignment on each reference voice data among the multiple reference voice data in the multiple sound collection data to obtain multiple target voice data with aligned frequency response capabilities in one-to-one correspondence with the multiple sound collection data; and performing multi-dimensional feature extraction on each of the multiple target voice data to obtain multiple voice feature sets in one-to-one correspondence with the multiple target voice data.
- performing frequency response capability alignment on each of the multiple reference voice data in the multiple sound collection data to obtain multiple target voice data with aligned frequency response capabilities in one-to-one correspondence with the multiple sound collection data includes: performing the following operations on each reference voice data in the multiple sound collection data: obtaining the preset frequency response unit impulse response of the current device relative to the reference device according to the device identifier associated with the currently processed reference voice data; and performing a convolution operation on the reference voice data and the frequency response unit impulse response and adjusting the gain to obtain the target voice data after frequency response capability alignment.
- performing multi-dimensional feature extraction on each of the multiple target voice data to obtain multiple voice feature sets in one-to-one correspondence with the multiple target voice data includes: performing the following operations on each target voice data among the multiple target voice data to obtain multiple voice feature sets: extracting the scalar voice features and vector voice features of the currently processed target voice data; and performing dimensionality reduction and secondary feature extraction on the vector voice features to obtain vector-derived voice features.
- extracting the scalar speech features and vector speech features of the currently processed target speech data includes: preprocessing the currently processed target speech data to obtain preprocessed target speech data; and extracting the scalar voice features and vector voice features of the preprocessed target voice data; wherein the preprocessing includes at least one of the following: silence suppression processing, pre-emphasis processing through a high-frequency filter, frame segmentation processing, and windowing processing.
- the arbitration device can first perform feature extraction and feature fusion on the voice data, and optionally further perform frequency response capability alignment on the processed voice data to improve the flexibility of voice data preprocessing.
- the method further includes: determining, according to the at least one relative distance identifier, a target device among the plurality of devices for executing the voice instruction associated with the sound source target; if it is detected that the target device is a device other than the arbitration device, sending indication information to the target device, where the indication information is used to instruct the target device to perform the operation indicated by the voice instruction; and if it is detected that the target device is the arbitration device, performing the operation indicated by the voice instruction.
- the voice command associated with the sound of the sound source target may be various user commands such as "play music", which is not uniquely limited here.
- the arbitration device preferentially selects the device with the closest distance to the sound source target as the target device. That is to say, in this application scenario, the at least one relative distance identifier should at least include the relative distance identifier of the device that is closest to the sound source target.
- the specific representation form of the at least one relative distance identifier may take various forms, which is not uniquely limited here.
- the arbitration device can intelligently determine the target device for executing the voice command associated with the sound source target according to at least one relative distance identifier, which improves the convenience and intelligence of device control.
- the arbitration device first obtains multiple sound collection data of multiple devices, and then determines, according to the reference voice data and device identifiers in the multiple sound collection data and the pre-trained distance comparison model, at least one relative distance identifier in one-to-one correspondence with at least one device among the plurality of devices. Each relative distance identifier is used to indicate the position of the corresponding device in the device distance relationship sequence, which is a sequence formed by sorting the multiple devices according to the preset distance sorting strategy, where the distance refers to the distance between the device and the sound source target.
- it can be seen that the model can predict the relative position of each device in the device distance relationship sequence. For example, if the device distance relationship sequence formed by device 1, device 2, and device 3, sorted by distance from the sound source target from near to far, is device 3 → device 2 → device 1, then the prediction result can indicate that device 3 is closest to the sound source target by marking the relative distance of device 3 as 1.
- the distance comparison model is used to predict the relative distance relationship containing global information in the multi-device voice interaction scene.
- each sound collection data includes the reference voice data obtained by the corresponding device to collect the sound of the sound source target and the device identification of the device .
- FIG. 3 is a schematic flowchart of a device control method provided by an embodiment of the present application, which is applied to a target device in the device control system 10. As shown in the figure, the device control method includes the following operations.
- Step 301 Acquire indication information from the arbitration device, where the indication information indicates that the arbitration device has determined the target device among the plurality of devices according to at least one relative distance identifier corresponding to at least one device among the plurality of devices.
- the at least one relative distance identifier is obtained by the arbitration device performing the following operations: acquiring a plurality of sound collection data in one-to-one correspondence with the plurality of devices, each sound collection data in the plurality of sound collection data including reference speech data obtained by the corresponding device collecting the sound of the sound source target and the device identifier of that device; and determining the at least one relative distance identifier according to the reference voice data in the plurality of sound collection data, the device identifiers, and the pre-trained distance comparison model, where each relative distance identifier in the at least one relative distance identifier is used to indicate the position of the corresponding device in the device distance relationship sequence, and the device distance relationship sequence is a sequence formed by sorting the plurality of devices according to a preset distance sorting strategy.
- Step 302 Execute the operation indicated by the voice instruction associated with the sound of the sound source target according to the instruction information.
- the target device is the device closest to the sound source target among the multiple devices. This situation applies to the nearest wake-up product scheme.
- the target device is the arbitration device; or, the target device is a device other than the arbitration device among the multiple devices.
- if the target device is the arbitration device itself, the arbitration device directly generates the indication information and executes, according to the indication information, the operation indicated by the voice instruction associated with the sound of the sound source target.
- if the target device is a device other than the arbitration device among the multiple devices, the arbitration device generates the indication information and sends it to the target device, and the target device executes, according to the indication information, the operation indicated by the voice instruction associated with the sound of the sound source target.
- the method further includes: if it is detected that the distance between the target device and the sound source target is greater than a preset distance, outputting a prompt message to prompt the user to approach the target device; and/or, if it is detected that the If the distance between the target device and the sound source target is greater than the preset distance, the output volume of the target device is increased.
- correspondingly, if it is detected that the distance between the target device and the sound source target is less than the preset distance, the output volume of the target device is lowered. In this way, the intelligence of device control and the user experience can be improved.
- the preset distance may be, for example, 5 meters, 10 meters, or the like.
- the target device first obtains the indication information of the arbitration device, and secondly, according to the indication information, executes the operation indicated by the voice command associated with the sound of the sound source target.
- the indication information is generated by the arbitration device when it determines, according to the at least one relative distance identifier corresponding to at least one device among the plurality of devices, the target device among the plurality of devices for executing the voice instruction associated with the sound of the sound source target; each relative distance identifier is used to indicate the position of the distance between the corresponding device and the sound source target in the distance sequence, and the distance sequence is a sequence formed by sorting the multiple distances according to the preset sorting strategy.
- compared with the existing scheme of determining the nearest wake-up device based on the absolute distance between each device and the sound source target, the present application uses the distance comparison model to predict the relative distance relationship containing the global information of the multi-device voice interaction scene, and determines the target device to be woken up according to that relative distance relationship. In addition, since each sound collection data includes the reference voice data obtained by the corresponding device collecting the sound of the sound source target and the device identifier of that device, the collected reference speech data has no limitation on the number of channels; this overcomes the high hardware requirements and complex algorithms of conventional microphone-array sound source localization, which is beneficial to improving the efficiency and applicability of relative distance relationship prediction.
- FIG. 4 is a schematic flowchart of a training method for a distance comparison model provided by an embodiment of the present application, which is applied to a model training device.
- the model training method includes the following operations.
- Step 401 Acquire training data, where the training data includes multiple voice data sets, and each voice data set in the multiple voice data sets includes multiple reference voice data corresponding to multiple devices one-to-one.
- each reference speech data is speech data obtained by the corresponding device collecting the sound of the sound source target, the multiple speech data sets correspond to speech data sets collected in different sound collection environments, and the sound collection environment includes at least the location of the sound source target.
- the sound collection environment refers to the acoustic environment in which sound data is collected, and the sound collection environment can be diversified.
- different sound collection environments can further be constructed by varying at least one of the following characteristics: room size, noise level, etc.
- the wake-up voice in quiet environment and noisy environment is collected.
- Noise is artificially added, and Gaussian white noise, electrical noise (such as fans, air conditioners), and traffic noise in the noise database can be selected.
- the signal-to-noise ratio can be set to -15dB, -10dB, -5dB, 0dB, 5dB, 10dB, 15dB, etc.
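Mixing noise into clean speech at one of the listed signal-to-noise ratios can be sketched as follows (the scaling formula is standard; the 200 Hz test tone is illustrative):

```python
import numpy as np

def add_noise(speech, noise, snr_db):
    """Scale the noise so that the mixture has the requested SNR in dB."""
    noise = noise[: len(speech)]
    p_s = np.mean(speech ** 2)
    p_n = np.mean(noise ** 2)
    scale = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000
speech = np.sin(2 * np.pi * 200 * t)    # illustrative 200 Hz tone
noise = rng.normal(size=16000)          # Gaussian white noise
mixtures = {snr: add_noise(speech, noise, snr)
            for snr in (-15, -10, -5, 0, 5, 10, 15)}   # SNR settings from the text
```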
- Step 402 Train a preset distance comparison model according to the reference speech data of the plurality of speech data sets and a preset loss function to obtain a trained distance comparison model. The loss function represents the loss of the distance comparison model in terms of the prediction accuracy of the relative distance relationship, in the same sound collection environment, between the two devices of a device pairing group and the sound source target, where a device pairing group is composed of any two devices among the plurality of devices.
- the distance comparison model may specifically be a deep neural network, and the deep neural network may be, for example, a convolutional neural network or a deep residual network, which is not uniquely limited here.
- the relative distance relationship is characterized by defining a score for the event that the first device of the two devices is closer to the sound source target than the second device, and the value of the score is associated with the distance difference, where the distance difference is the difference between the first distance and the second distance, the first distance being the distance between the first device and the sound source target and the second distance being the distance between the second device and the sound source target.
- the score can be calculated and expressed in the form of data similar to probability, and the value range of the score falls within the interval (0, 1).
- for example, assume the first device is closer to the sound source target than the second device, and the second device is closer than the third device. Then the score for the event that the first device is closer to the sound source target than the second device may be 0.8, the score for the event that the first device is closer than the third device may be 0.9, the score for the event that the second device is closer than the third device may be 0.7, the score for the event that the second device is closer than the first device may be 0.08, the score for the event that the third device is closer than the first device may be 0.05, and the score for the event that the third device is closer than the second device may be 0.09, etc.
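One way to realize such scores, consistent with the (0, 1) range and the dependence on the distance difference but otherwise an assumption, is a sigmoid of the distance difference:

```python
import numpy as np

def closeness_score(d_i, d_j, scale=1.0):
    """Score for the event 'device i is closer to the sound source target
    than device j', modelled as a sigmoid of the distance difference."""
    return 1.0 / (1.0 + np.exp(-(d_j - d_i) / scale))

# Illustrative distances of three devices from the sound source target.
d1, d2, d3 = 0.5, 2.0, 3.5
s12 = closeness_score(d1, d2)   # first closer than second
s21 = closeness_score(d2, d1)   # second closer than first
print(s12 > 0.5, s21 < 0.5)     # the closer device scores above 0.5
```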
- the score is calculated from at least one score of at least one group of adjacent devices that constitute a direct or indirect adjacency relationship between the two devices; the score of a group of adjacent devices is calculated from the two relative distance identifiers of its two devices. The relative distance identifier corresponds to the prediction result of the distance comparison model and is used to indicate the position of the corresponding device in the device distance relationship sequence, which is a sequence formed by sorting the plurality of devices according to a preset distance sorting strategy, where the distance refers to the distance between the device and the sound source target.
- the device distance relationship sequence is a sequence formed by sorting the multiple devices according to the preset distance-based sorting strategy, where the distance refers to the distance between each device and the sound source target; in this way, the relative distance identifier can represent a model prediction result containing global information, improving the accuracy with which the model prediction result is represented.
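To illustrate how pairwise prediction results can yield a device distance relationship sequence carrying global information, here is a sketch; the nearest-first ordering and the dictionary of pairwise scores are illustrative assumptions, not a specification from the patent:

```python
import functools

def relative_distance_identifiers(devices, score):
    # score[(a, b)] > 0.5 is read as "device a is closer to the sound source
    # target than device b"; a missing pair falls back to the reverse entry.
    def cmp(a, b):
        s = score.get((a, b), 1.0 - score.get((b, a), 0.5))
        return -1 if s > 0.5 else (1 if s < 0.5 else 0)

    sequence = sorted(devices, key=functools.cmp_to_key(cmp))  # nearest first
    # The relative distance identifier indicates each device's position in
    # the device distance relationship sequence.
    return {device: position for position, device in enumerate(sequence)}

ids = relative_distance_identifiers(
    ["A", "B", "C"],
    {("A", "B"): 0.8, ("A", "C"): 0.9, ("B", "C"): 0.7},
)
```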
- the training of the preset distance comparison model according to the reference speech data of the multiple speech data sets and the preset loss function, to obtain a trained distance comparison model, includes: dividing the training data into a training set and a test set, where the training set includes part of the voice data sets among the multiple voice data sets; and training the preset distance comparison model at least once using the training set, until the accuracy of the trained distance comparison model in predicting the distance comparison result of the test set is greater than a preset accuracy.
- the preset accuracy may be, for example, 98%, 99%, etc., which is not limited here.
- In this way, the model has prediction ability that meets the preset accuracy requirement.
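The split-and-train-until-accurate procedure can be sketched as follows; the 80/20 split fraction and the toy accuracy trace are illustrative assumptions (the patent only gives example thresholds such as 98% or 99%):

```python
import random

def split_train_test(voice_data_sets, train_fraction=0.8, seed=0):
    # Divide the training data into a training set and a test set.
    sets = list(voice_data_sets)
    random.Random(seed).shuffle(sets)
    cut = int(len(sets) * train_fraction)
    return sets[:cut], sets[cut:]

def train_until_accurate(train_one_epoch, test_accuracy, preset_accuracy=0.98,
                         max_epochs=100):
    # Train at least once, then keep training until the model's accuracy in
    # predicting the test set exceeds the preset accuracy.
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        if test_accuracy() > preset_accuracy:
            return epoch
    return max_epochs

train_set, test_set = split_train_test(range(100))

# Toy run: accuracy improves each epoch and crosses the threshold at epoch 3.
trace = [0.90, 0.95, 0.99]
state = {"epoch": 0}
def step():
    state["epoch"] += 1
def accuracy():
    return trace[state["epoch"] - 1]
epochs_used = train_until_accurate(step, accuracy)
```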
- the training includes forward propagation and back-propagation optimization; in forward propagation, the predicted relative distance identifier is obtained by computation on the speech features of the speech data set; in back-propagation optimization, the predicted score and the real score are calculated using the predicted relative distance identifier and the real relative distance identifier, the loss of the distance comparison model is calculated using the loss function, the predicted score and the real score, and the parameters of the distance comparison model are adjusted according to the loss of the distance comparison model.
- the design of the loss function is realized through the following steps a to e:
- Step a: any two of the devices can form a pairing.
- Step b: denote the feature vector extracted for each device as x_a, x_b, x_c, x_d, x_e, and denote the feedforward mapping of the deep neural network as f; the output-layer value corresponding to a feature vector (taking x_a as an example) is then f(x_a).
- Step c: for each pairing of two devices, a score can be obtained from their reference voice data.
- Step d: for the real labels, the score between any two device pairings is likewise calculated. Since the distances between the multiple devices stand in a relative relationship, the scores between pairs of adjacent devices are calculated first, and the scores between non-adjacent devices are derived from them: for two adjacent devices the score is computed directly, and for two non-adjacent devices whose common adjacent device is b, the score is computed by combining the scores of the two adjacent pairings.
- Step e: after the input data is fed forward through the deep neural network, back-propagation is performed according to the loss function between the actual output and the real label, and the network parameters are iteratively adjusted to improve network performance.
- the loss function is calculated from the predicted score and the real score.
- the loss function can quantitatively measure the difference between the estimated label and the real label of the speech data of any two devices in the current device set after passing through the distance comparison model, and the parameters of the distance comparison model are adjusted through this difference until the model prediction accuracy meets the requirements.
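The concrete formulas for steps c–e are not reproduced in this text (they appear as images in the original filing). One consistent reading, sketched below under the assumption of a RankNet-style pairwise formulation: the score that device a is closer than device b is a logistic function of the network-output difference f(x_a) − f(x_b); the score for two non-adjacent devices with a common adjacent device is composed from the two adjacent-pair scores; and the loss is the cross-entropy between real and predicted scores. All three formulas are assumptions, not quotations from the patent:

```python
import math

def pair_score(fa, fb):
    # Step c (assumed form): logistic score that device a is closer than b,
    # computed from the network outputs f(x_a), f(x_b); lies in (0, 1).
    return 1.0 / (1.0 + math.exp(-(fa - fb)))

def compose_score(p_ab, p_bc):
    # Step d (assumed form): score for non-adjacent devices a and c whose
    # common adjacent device is b, composed from the two adjacent scores.
    return p_ab * p_bc / (1.0 + 2.0 * p_ab * p_bc - p_ab - p_bc)

def pairwise_loss(p_pred, p_true, eps=1e-12):
    # Step e (assumed form): cross-entropy between the real score (label)
    # and the predicted score; back-propagation minimises this loss.
    p = min(max(p_pred, eps), 1.0 - eps)
    return -(p_true * math.log(p) + (1.0 - p_true) * math.log(1.0 - p))

p_ab = pair_score(2.0, 1.0)          # a predicted closer than b
p_ac = compose_score(p_ab, pair_score(1.0, 0.5))
loss = pairwise_loss(p_ab, 1.0)      # real label: a truly is closer than b
```

The composition rule has the sanity property that two neutral beliefs (0.5, 0.5) compose to 0.5, while two confident "closer" beliefs compose to something even more confident.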
- the device first obtains training data, and then trains a preset distance comparison model according to the reference speech data of the multiple speech data sets and a preset loss function, to obtain a trained distance comparison model.
- Since the training data includes multiple voice data sets, each voice data set includes multiple reference voice data corresponding one-to-one to multiple devices, each reference voice data is the voice data obtained by the corresponding device collecting the sound of the sound source target, and the multiple voice data sets correspond to voice data sets collected in different sound collection environments; at the same time, the loss function represents the loss of the distance comparison model in terms of the accuracy of predicting the relative distance relationship between the two devices of a device pairing group and the sound source target in the same sound collection environment.
- the device pairing group consists of any two devices among the multiple devices, so the distance comparison model has the ability to predict the relative distance between a device and the sound source target, and the reference speech data imposes no requirement on the number of channels; therefore, the high hardware requirements and complex algorithms of conventional microphone-array sound source localization can be overcome, which is beneficial to improving the efficiency and applicability of relative distance relationship prediction.
- An embodiment of the present application provides an apparatus for determining a distance relationship
- the device for determining a distance relationship may be an arbitration device.
- the apparatus for determining a distance relationship is configured to perform the steps performed by the arbitration device in the above method for determining a distance relationship.
- the apparatus for determining a distance relationship provided in this embodiment of the present application may include modules corresponding to corresponding steps.
- the distance relationship determining apparatus may be divided into functional modules according to the above method examples.
- each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module.
- the above-mentioned integrated modules can be implemented in the form of hardware, and can also be implemented in the form of software function modules.
- the division of modules in the embodiments of the present application is schematic, and is only a logical function division, and there may be other division manners in actual implementation.
- FIG. 5 shows a possible schematic structural diagram of the apparatus for determining a distance relationship involved in the above embodiment.
- the distance relationship determination device 5 is applied to the arbitration device 200 in the device control system 10; the device includes:
- the acquisition unit 50 is configured to acquire a plurality of sound collection data corresponding one-to-one to a plurality of devices, where each sound collection data among the plurality of sound collection data includes the reference speech data obtained by the corresponding device collecting the sound of the sound source target and the device identification of that device;
- the determining unit 51 is configured to determine, according to the reference voice data and device identifiers in the multiple sound collection data and the pre-trained distance comparison model, at least one relative distance identifier corresponding one-to-one to at least one device among the multiple devices, where each relative distance identifier among the at least one relative distance identifier is used to indicate the position of the corresponding device in the device distance relationship sequence, the device distance relationship sequence is a sequence formed by sorting the plurality of devices according to the preset distance-based sorting strategy, and the distance refers to the distance between the device and the sound source target.
- the determining unit 51 is specifically configured to align the frequency response capability of each reference voice data among the plurality of reference voice data in the plurality of sound collection data, and obtain a plurality of target voice data after frequency response capability alignment in one-to-one correspondence with the plurality of sound collection data.
- the determining unit 51 is specifically configured to: perform the following operations for each reference voice data in the plurality of sound collection data, to obtain a plurality of target voice data after frequency response capability alignment: obtain, according to the device identifier of the currently processed reference voice data, the preset frequency response unit impulse response of the current device relative to the reference device; perform a convolution operation on the reference speech data and the frequency response unit impulse response, and perform gain adjustment to obtain the target speech data after frequency response capability alignment.
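The convolve-then-adjust-gain operation can be sketched as follows; the RMS-based gain rule and the toy impulse response are assumptions, since the patent only says that convolution with the preset frequency response unit impulse response is followed by gain adjustment:

```python
import math

def convolve(x, h):
    # Full linear convolution of signal x with impulse response h.
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def align_frequency_response(reference_speech, unit_impulse_response,
                             target_rms=0.1):
    # Convolve the reference speech with the preset frequency response unit
    # impulse response of the current device relative to the reference
    # device, then apply gain adjustment. Normalising to a target RMS is an
    # assumed gain rule; the patent does not spell out how the gain is set.
    aligned = convolve(reference_speech, unit_impulse_response)
    rms = math.sqrt(sum(v * v for v in aligned) / len(aligned))
    if rms > 0:
        gain = target_rms / rms
        aligned = [v * gain for v in aligned]
    return aligned

speech = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(160)]
aligned = align_frequency_response(speech, [1.0, 0.25])  # toy impulse response
```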
- the determining unit 51 is specifically configured to: perform multi-dimensional feature extraction on each target voice data among the plurality of target voice data, to obtain multiple voice feature sets corresponding one-to-one to the plurality of target voice data, where each voice feature set includes multi-dimensional feature extraction results; perform feature fusion on each voice feature set among the multiple voice feature sets to obtain multiple fused voice features; and determine, according to the multiple fused voice features, the device identifiers in the plurality of sound collection data, and the pre-trained distance comparison model, at least one relative distance identifier corresponding to at least one device among the plurality of devices.
- the determining unit 51 is specifically configured to: perform multi-dimensional feature extraction on each sound collection data among the plurality of sound collection data, to obtain a plurality of voice feature sets in one-to-one correspondence with the plurality of sound collection data, where each voice feature set among the multiple voice feature sets includes multi-dimensional feature extraction results; perform feature fusion on each voice feature set among the multiple voice feature sets to obtain multiple fused voice features; and determine, according to the multiple fused voice features, the device identifiers in the plurality of sound collection data, and the pre-trained distance comparison model, at least one relative distance identifier corresponding one-to-one to at least one device among the plurality of devices.
- the determining unit 51 is specifically configured to: align the frequency response capability of each reference voice data among the multiple reference voice data in the multiple sound collection data, and obtain a plurality of target speech data after frequency response capability alignment in one-to-one correspondence with the multiple sound collection data; and perform multi-dimensional feature extraction on each target speech data among the plurality of target speech data, to obtain a plurality of speech feature sets in one-to-one correspondence with the plurality of target speech data.
- the determining unit 51 is specifically configured to: perform the following operations for each reference voice data in the multiple sound collection data, to obtain a plurality of target voice data after frequency response capability alignment in one-to-one correspondence with the data: obtain, according to the device identifier associated with the currently processed reference voice data, the preset frequency response unit impulse response of the current device relative to the reference device; and perform a convolution operation between the reference speech data and the frequency response unit impulse response, and perform gain adjustment to obtain the target speech data after frequency response capability alignment.
- the determining unit 51 is specifically configured to: perform the following operations for each target voice data among the multiple target voice data to obtain the multiple voice feature sets: extract the scalar speech features and vector speech features of the currently processed target voice data, and perform dimension reduction and secondary feature extraction on the vector speech features to obtain vector-derived speech features.
- the determining unit 51 is specifically configured to: preprocess the currently processed target speech data to obtain preprocessed target speech data; and extract scalar speech features and vector speech features of the preprocessed target speech data, where the preprocessing includes at least one of the following: silence suppression, pre-emphasis through a high-frequency filter, framing, and windowing.
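A sketch of two of the listed preprocessing steps, pre-emphasis and framing-plus-windowing; the 0.97 pre-emphasis coefficient, 25 ms frames with a 10 ms hop at 16 kHz, and the Hamming window are common speech-processing conventions assumed here rather than values taken from the patent (silence suppression is omitted):

```python
import math

def pre_emphasis(signal, coeff=0.97):
    # High-frequency pre-emphasis filter: y[n] = x[n] - coeff * x[n-1].
    return [signal[0]] + [signal[n] - coeff * signal[n - 1]
                          for n in range(1, len(signal))]

def frame_and_window(signal, frame_len=400, hop=160):
    # Split into overlapping frames and apply a Hamming window to each.
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1)))
                    for n, s in enumerate(frame)]
        frames.append(windowed)
    return frames

# 0.1 s of a 300 Hz tone sampled at 16 kHz.
signal = [math.sin(2 * math.pi * 300 * n / 16000) for n in range(1600)]
emphasized = pre_emphasis(signal)
frames = frame_and_window(emphasized)
```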
- after determining, according to the reference voice data and device identifiers in the plurality of sound collection data and the pre-trained distance comparison model, the at least one relative distance identifier corresponding to at least one device among the plurality of devices, the determining unit 51 is further configured to: determine, according to the at least one relative distance identifier, a target device among the plurality of devices for executing the voice command associated with the sound source target; if the target device is a device other than the arbitration device, send indication information to the target device, where the indication information is used to instruct the target device to perform the operation indicated by the voice instruction; and if the target device is the arbitration device, execute the operation indicated by the voice instruction.
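The determine-then-dispatch behaviour can be sketched as follows; the device names, the smallest-identifier-is-closest convention, and the returned action strings are illustrative placeholders:

```python
def choose_and_dispatch(relative_distance_ids, arbitration_device_id):
    # Pick the device closest to the sound source target (smallest position
    # in the device distance relationship sequence) as the target device,
    # then either execute locally or indicate the remote target device.
    target = min(relative_distance_ids, key=relative_distance_ids.get)
    if target == arbitration_device_id:
        return target, "execute voice instruction locally"
    return target, "send indication information to target device"

# The speaker (position 0) is nearest; the arbitration device is the TV.
target, action = choose_and_dispatch({"speaker": 0, "tv": 1, "phone": 2}, "tv")
local = choose_and_dispatch({"tv": 0, "phone": 1}, "tv")
```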
- the distance relationship determining device 6 includes: a processing module 60 and a communication module 61 .
- the processing module 60 is used to control and manage the actions of the device control apparatus, for example, the steps performed by the acquisition unit 50, the determination unit 51, and/or other processes used to perform the techniques described herein.
- the communication module 61 is used to support the interaction between the device control apparatus and other devices.
- the distance relationship determining apparatus may further include a storage module 62, and the storage module 62 is configured to store program codes and data of the distance relationship determining apparatus.
- the processing module 60 may be a processor or a controller, such as a central processing unit (Central Processing Unit, CPU), a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an ASIC, an FPGA or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various exemplary logical blocks, modules and circuits described in connection with this disclosure.
- the processor may also be a combination that implements computing functions, such as a combination of one or more microprocessors, a combination of a DSP and a microprocessor, and the like.
- the communication module 61 may be a transceiver, an RF circuit, a communication interface, or the like.
- the storage module 62 may be a memory.
- Both the distance relationship determining device 5 and the distance relationship determining device 6 can perform the steps performed by the arbitration device in the distance relationship determining method shown in FIG. 2 .
- An embodiment of the present application provides a device control device, where the device control device may be an arbitration device.
- the device control apparatus is configured to execute the steps performed by the target device in the above device control method.
- the device control apparatus provided in this embodiment of the present application may include modules corresponding to corresponding steps.
- the device control apparatus may be divided into functional modules according to the foregoing method examples.
- each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module.
- the above-mentioned integrated modules can be implemented in the form of hardware, and can also be implemented in the form of software function modules.
- the division of modules in the embodiments of the present application is schematic, and is only a logical function division, and there may be other division manners in actual implementation.
- FIG. 7 shows a possible schematic structural diagram of the device control apparatus involved in the foregoing embodiment. As shown in Figure 7, the device control device 7 is applied to the target device; the device includes:
- the obtaining unit 70 is configured to obtain the indication information of the arbitration device, where the indication information is generated when the arbitration device determines, according to at least one relative distance identifier corresponding to at least one device among the plurality of devices, the target device among the plurality of devices for executing the voice command associated with the sound source target, and the at least one relative distance identifier is obtained by the arbitration device performing the following operations: acquiring a plurality of sound collection data in one-to-one correspondence with the plurality of devices, where each sound collection data among the plurality of sound collection data includes the reference speech data obtained by the corresponding device collecting the sound of the sound source target and the device identification of that device; and determining the at least one relative distance identifier according to the reference voice data and device identifiers in the plurality of sound collection data and the pre-trained distance comparison model, where each relative distance identifier among the at least one relative distance identifier is used to indicate the position of the corresponding device in the device distance relationship sequence;
- the device distance relationship sequence is a sequence formed by sorting the plurality of devices according to the preset distance-based sorting strategy, and the distance refers to the distance between the device and the sound source target.
- the execution unit 71 is configured to execute the operation indicated by the voice instruction associated with the sound of the sound source target according to the instruction information.
- the target device is the device closest to the sound source target among the plurality of devices.
- the target device is the arbitration device; or, the target device is a device other than the arbitration device among the multiple devices.
- the device control apparatus 8 includes: a processing module 80 and a communication module 81 .
- the processing module 80 is used to control and manage the actions of the device control apparatus, for example, the steps performed by the acquisition unit 70, the execution unit 71, and/or other processes used to perform the techniques described herein.
- the communication module 81 is used to support the interaction between the device control apparatus and other devices.
- the device control apparatus may further include a storage module 82, and the storage module 82 is used for storing program codes and data of the device control apparatus.
- the processing module 80 may be a processor or a controller, such as a central processing unit (Central Processing Unit, CPU), a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an ASIC, an FPGA or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various exemplary logical blocks, modules and circuits described in connection with this disclosure.
- the processor may also be a combination that implements computing functions, such as a combination of one or more microprocessors, a combination of a DSP and a microprocessor, and the like.
- the communication module 81 may be a transceiver, an RF circuit, a communication interface, or the like.
- the storage module 82 may be a memory.
- Both the device control device 7 and the device control device 8 can execute the steps performed by the target device in the device control method shown in FIG. 3 .
- An embodiment of the present application provides a training device for a distance comparison model
- the training device for a distance comparison model may be a model training device for training a model.
- the distance comparison model training apparatus is configured to perform the steps performed by the model training device in the above distance comparison model training method.
- the apparatus for training the distance comparison model provided by the embodiment of the present application may include modules corresponding to the corresponding steps.
- the training device of the distance comparison model can be divided into functional modules according to the above method examples.
- each functional module can be divided corresponding to each function, or two or more functions can be integrated into one processing module.
- the above-mentioned integrated modules can be implemented in the form of hardware, and can also be implemented in the form of software function modules.
- the division of modules in the embodiments of the present application is schematic, and is only a logical function division, and there may be other division manners in actual implementation.
- FIG. 9 shows a possible schematic structural diagram of the training device for the distance comparison model involved in the above embodiment.
- the training device 9 of the distance comparison model is applied to the model training equipment; the device includes:
- the obtaining unit 90 is configured to obtain training data, where the training data includes a plurality of voice data sets, and each voice data set in the plurality of voice data sets includes a plurality of reference voice data corresponding to a plurality of devices one-to-one.
- Each reference voice data among the plurality of reference voice data is the voice data obtained by the corresponding device collecting the sound of the sound source target, the multiple voice data sets correspond to voice data sets collected in different sound collection environments, and the sound collection environment at least includes the location where the sound source target is located;
- a training unit 91 configured to train a preset distance comparison model according to the reference speech data of the multiple speech data sets and a preset loss function, to obtain a trained distance comparison model, where the loss function represents the loss of the distance comparison model in terms of the accuracy of predicting the relative distance relationship between the two devices of a device pairing group and the sound source target in the same sound collection environment, and the device pairing group consists of any two devices among the plurality of devices.
- the relative distance relationship is characterized by defining a score of an event in which the first device among the two devices is closer to the sound source target than the second device, and the value of the score is associated with the distance difference; the distance difference is the difference between the first distance and the second distance, the first distance is the distance between the first device and the sound source target, and the second distance is the distance between the second device and the sound source target.
- the score is calculated from at least one score of at least one group of adjacent devices forming a direct or indirect adjacency relationship between the two devices; the score of a group of adjacent devices is calculated from the two relative distance identifiers of its two devices, the relative distance identifier corresponds to the prediction result of the distance comparison model, and the relative distance identifier is used to indicate the position of the corresponding device in the device distance relationship sequence, where the device distance relationship sequence is a sequence formed by sorting the multiple devices according to the preset distance-based sorting strategy, and the distance refers to the distance between the device and the sound source target.
- the training unit 91 is specifically configured to: divide the training data into a training set and a test set, where the training set includes part of the voice data sets among the multiple voice data sets; and train the preset distance comparison model at least once using the training set, until the accuracy of the trained distance comparison model in predicting the distance comparison result of the test set is greater than a preset accuracy.
- the training includes forward propagation and back-propagation optimization; the predicted relative distance identifier and the real relative distance identifier are used to calculate the predicted score and the real score; the loss function, the predicted score and the real score are used to calculate the loss of the distance comparison model; and the parameters of the distance comparison model are adjusted according to the loss of the distance comparison model.
- the training device 10 of the distance comparison model includes: a processing module 100 and a communication module 101 .
- the processing module 100 is used to control and manage the actions of the training device of the distance comparison model, eg, the steps performed by the acquisition unit 90, the training unit 91, and/or other processes for performing the techniques described herein.
- the communication module 101 is used to support the interaction between the training device of the distance comparison model and other devices.
- the training device for the distance comparison model may further include a storage module 102, and the storage module 102 is used for storing program codes and data of the training device for the distance comparison model.
- the processing module 100 may be a processor or a controller, such as a central processing unit (Central Processing Unit, CPU), a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an ASIC, an FPGA or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various exemplary logical blocks, modules and circuits described in connection with this disclosure.
- the processor may also be a combination that implements computing functions, such as a combination of one or more microprocessors, a combination of a DSP and a microprocessor, and the like.
- the communication module 101 may be a transceiver, an RF circuit, a communication interface, or the like.
- the storage module 102 may be a memory.
- Both the distance comparison model training device 9 and the distance comparison model training device 10 can perform the steps performed by the model training device in the distance comparison model training method shown in FIG. 4 .
- the above embodiments may be implemented in whole or in part by software, hardware, firmware or any other combination.
- the above-described embodiments may be implemented in whole or in part in the form of a computer program product.
- the computer program product includes one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, all or part of the processes or functions described in the embodiments of the present application are generated.
- the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
- the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from a website, computer, server or data center to another website, computer, server or data center by wire or wirelessly.
- the computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device such as a server, a data center, or the like that contains one or more sets of available media.
- the usable media may be magnetic media (eg, floppy disks, hard disks, magnetic tapes), optical media (eg, DVDs), or semiconductor media.
- the semiconductor medium may be a solid state drive.
- Embodiments of the present application further provide a computer storage medium, where the computer storage medium stores a computer program for electronic data exchange, and the computer program causes a computer to execute part or all of the steps of any method described in the above method embodiments; the above computer includes an electronic device.
- Embodiments of the present application further provide a computer program product, where the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute any one of the method embodiments described above. some or all of the steps of the method.
- the computer program product may be a software installation package, and the computer includes an electronic device.
- the sequence numbers of the above processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and shall not constitute any limitation on the implementation of the embodiments of the present application.
- the disclosed method, apparatus and system may be implemented in other manners.
- the device embodiments described above are only illustrative; for example, the division of the units is only a logical function division, and there may be other division methods in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
- the mutual coupling or direct coupling or communication connection shown or discussed may be implemented through some interfaces, as indirect coupling or communication connections between devices or units, and may be in electrical, mechanical or other forms.
- the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
- each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may be physically included individually, or two or more units may be integrated into one unit.
- the above-mentioned integrated unit may be implemented in the form of hardware, or may be implemented in the form of hardware plus software functional units.
- the above-mentioned integrated units implemented in the form of software functional units can be stored in a computer-readable storage medium.
- the above-mentioned software functional unit is stored in a storage medium, and includes several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute some steps of the methods described in the various embodiments of the present invention.
- the aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disc, or other media that can store program code.
Abstract
A method and apparatus for distance relationship determination, a method and apparatus for device control, a method and apparatus for training a distance comparison model, an electronic device, and a computer-readable storage medium. The method for distance relationship determination comprises: acquiring a plurality of pieces of sound collection data in one-to-one correspondence with a plurality of devices, each piece of sound collection data comprising reference voice data obtained by the corresponding device collecting the sound of a sound source target, and the device identifier of that device (201); and determining, according to the plurality of pieces of sound collection data and a pre-trained distance comparison model, at least one relative distance identifier in one-to-one correspondence with at least one of the plurality of devices, wherein each relative distance identifier indicates the position of the corresponding device in a device distance relationship sequence, the device distance relationship sequence being formed by sorting the plurality of devices by distance according to a preset sorting strategy, the distance being the distance between a device and the sound source target (202). By predicting a relative distance relationship that contains global information by means of the distance comparison model, the method helps improve the efficiency and applicability of relative distance relationship prediction.
Description
The present application relates to the technical field of voice assistants, and in particular to methods for distance relationship determination, device control, and model training, and to related apparatuses.
As artificial intelligence technology enters its third wave, voice assistants have gradually reached many aspects of daily life and are built into smart devices such as mobile phones, watches, speakers, and televisions. Because of this variety of devices, multiple devices with voice assistant functions may exist in the same space.
At present, most nearest-wakeup product solutions use a microphone array to measure the distance to the sound source, determine the device closest to the sound source by comparing distances, and wake up that device to execute the user's instruction.
SUMMARY OF THE INVENTION
The present application provides methods for distance relationship determination, device control, and model training, and related apparatuses, with a view to improving the comprehensiveness, efficiency, and convenience with which the arbitration device of a nearest-wakeup product solution computes the distance between a sound source target and each device.
In a first aspect, the present application provides a method for determining a distance relationship, applied to an arbitration device, the method including:
acquiring a plurality of pieces of sound collection data in one-to-one correspondence with a plurality of devices, each piece of sound collection data including reference voice data obtained by the corresponding device collecting the sound of a sound source target, and the device identifier of that device;
determining, according to the reference voice data and device identifiers in the plurality of pieces of sound collection data and a pre-trained distance comparison model, at least one relative distance identifier in one-to-one correspondence with at least one of the plurality of devices, each relative distance identifier indicating the position of the corresponding device in a device distance relationship sequence, the device distance relationship sequence being formed by sorting the plurality of devices by distance according to a preset sorting strategy, the distance being the distance between a device and the sound source target.
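The determination step above can be sketched as follows. This is a minimal illustration, not the claimed model: `score_nearness` is a hypothetical stand-in for the trained distance comparison model, and each device's relative distance identifier is derived as its 1-based position in the distance relationship sequence (nearest first).

```python
# Minimal sketch of deriving relative distance identifiers from sound
# collection data. `score_nearness` is a hypothetical stand-in for the
# trained distance comparison model: higher score = predicted closer.

def relative_distance_ids(sound_data, score_nearness):
    """sound_data: list of (device_id, reference_voice_data) pairs.

    Returns {device_id: identifier}, where the identifier is the 1-based
    position in the device distance relationship sequence (1 = nearest).
    """
    scored = [(device_id, score_nearness(voice)) for device_id, voice in sound_data]
    # Sort by descending nearness score to form the distance relationship sequence.
    ordered = sorted(scored, key=lambda item: item[1], reverse=True)
    return {device_id: rank for rank, (device_id, _) in enumerate(ordered, start=1)}

# Toy usage: fixed scores stand in for model outputs on each device's recording.
fake_model = {"speaker": 0.9, "tv1": 0.7, "pc": 0.2}.get
ids = relative_distance_ids(
    [("speaker", "speaker"), ("tv1", "tv1"), ("pc", "pc")],
    fake_model,
)
# ids == {"speaker": 1, "tv1": 2, "pc": 3}
```

Note that only an ordering is produced: no absolute distance is computed for any device, which is the point of the relative distance identifier.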
It can be seen that, in this example, the arbitration device first acquires a plurality of pieces of sound collection data from the plurality of devices, and then, according to the reference voice data and device identifiers in that data together with a pre-trained distance comparison model, determines at least one relative distance identifier in one-to-one correspondence with at least one of the devices. Each relative distance identifier indicates the position of the corresponding device in the device distance relationship sequence, which is formed by sorting the devices by their distance to the sound source target according to a preset sorting strategy; the model therefore predicts each device's relative position in the distance sequence. For example, if device 1, device 2, and device 3 are sorted from nearest to farthest from the sound source target to form the sequence device 3 → device 2 → device 1, the prediction result can indicate that device 3 is closest to the sound source target by assigning device 3 the relative distance identifier 1. In contrast to existing models that predict absolute distances in isolation, the present application uses the distance comparison model to predict a relative distance relationship containing global information about the multi-device voice interaction scene. Moreover, since each piece of sound collection data includes the reference voice data obtained by the corresponding device collecting the sound of the sound source target together with the device identifier, collecting the reference voice data imposes no restriction on the number of audio channels. This overcomes the high hardware requirements and algorithmic complexity of conventional microphone-array sound source localization, and helps improve the efficiency and applicability of relative distance relationship prediction.
In a second aspect, the present application provides a device control method, applied to a target device, the method including:
acquiring indication information from an arbitration device, the indication information being generated when the arbitration device determines, according to at least one relative distance identifier in one-to-one correspondence with at least one of a plurality of devices, that the target device among the plurality of devices is to execute the voice instruction associated with the sound of a sound source target, the at least one relative distance identifier being obtained by the arbitration device performing the following operations: acquiring a plurality of pieces of sound collection data in one-to-one correspondence with the plurality of devices, each piece of sound collection data including reference voice data obtained by the corresponding device collecting the sound of the sound source target and the device identifier of that device; and determining the at least one relative distance identifier according to the reference voice data and device identifiers in the plurality of pieces of sound collection data and a pre-trained distance comparison model, each relative distance identifier indicating the position of the corresponding device in a device distance relationship sequence, the device distance relationship sequence being formed by sorting the plurality of devices by distance according to a preset sorting strategy, the distance being the distance between a device and the sound source target;
performing, according to the indication information, the operation indicated by the voice instruction associated with the sound of the sound source target.
It can be seen that, in this example, the target device first acquires the indication information from the arbitration device and then performs the operation indicated by the voice instruction associated with the sound of the sound source target. The indication information is generated when the arbitration device determines, from the relative distance identifiers, that the target device is to execute that voice instruction, and each relative distance identifier indicates the position, within the distance sequence, of the distance between the corresponding device and the sound source target, the distance sequence being formed by sorting the distances between each of the plurality of devices and the sound source target according to a preset sorting strategy. Compared with existing solutions that determine the nearest device to wake from absolute device-to-source distances, the present application predicts, via the distance comparison model, a relative distance relationship containing global information about the multi-device voice interaction scene, and determines the target device to be woken according to that relative distance relationship. At the same time, since each piece of sound collection data includes the reference voice data obtained by the corresponding device collecting the sound of the sound source target together with the device identifier, collecting the reference voice data imposes no restriction on the number of audio channels. This overcomes the high hardware requirements and algorithmic complexity of conventional microphone-array sound source localization, and helps improve the efficiency and applicability of relative distance relationship prediction.
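The arbitration and notification flow of this aspect can be sketched minimally as follows. The `send_indication` callback is a hypothetical transport stub (not part of the source); the sketch only shows selecting the device whose relative distance identifier marks it as nearest and notifying it.

```python
# Hedged sketch of the arbitration step: pick the device whose relative
# distance identifier places it first (nearest) in the device distance
# relationship sequence, then notify it so it executes the voice instruction.
# `send_indication` is a hypothetical stub standing in for the real transport.

def pick_target(relative_ids):
    """relative_ids: {device_id: 1-based position in the distance sequence}."""
    return min(relative_ids, key=relative_ids.get)

def arbitrate(relative_ids, send_indication):
    target = pick_target(relative_ids)
    send_indication(target)  # the target device then executes the voice instruction
    return target

# Toy usage: "speaker" holds identifier 1, so it is selected and notified.
notified = []
target = arbitrate({"tv1": 2, "speaker": 1, "pc": 3}, notified.append)
# target == "speaker"; notified == ["speaker"]
```

A nearest-first policy is assumed here; the preset sorting strategy in the claims could equally select by a different position in the sequence.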
In a third aspect, the present application provides a method for training a distance comparison model, including:
acquiring training data, the training data including a plurality of voice data sets, each voice data set containing a plurality of pieces of reference voice data in one-to-one correspondence with a plurality of devices, each piece of reference voice data being voice data obtained by the corresponding device collecting the sound of a sound source target, the plurality of voice data sets corresponding to voice data sets collected in different sound collection environments, and a sound collection environment including at least the location of the sound source target;
training a preset distance comparison model according to the reference voice data of the plurality of voice data sets and a preset loss function to obtain a trained distance comparison model, the loss function characterizing the loss of the distance comparison model in terms of the accuracy with which it predicts, for a device pairing group, the relative distance relationship between the two devices of the group and the sound source target in the same sound collection environment, a device pairing group consisting of any two of the plurality of devices.
It can be seen that, in this example, the device first acquires training data and then trains a preset distance comparison model according to the reference voice data of the plurality of voice data sets and a preset loss function, obtaining a trained distance comparison model. Because the training data includes multiple voice data sets, each containing reference voice data in one-to-one correspondence with the devices and collected in different sound collection environments, and because the loss function characterizes the model's loss in terms of how accurately it predicts the relative distance relationship between the two devices of each pairing group and the sound source target in the same environment, the trained model acquires the ability to predict the relative nearness of devices to the sound source target. The reference voice data imposes no restriction on the number of audio channels, so the approach overcomes the high hardware requirements and algorithmic complexity of conventional microphone-array sound source localization, and helps improve the efficiency and applicability of relative distance relationship prediction.
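A pairwise loss of the kind described above can be sketched as follows. The patent does not fix the exact form of the loss function, so a logistic (RankNet-style) pairwise term is assumed here purely for illustration: for every device pairing group in the same environment, the model's nearness scores should rank the truly closer device higher, and the loss is small exactly when they do.

```python
import math
from itertools import combinations

# Illustrative pairwise comparison loss (an assumed logistic form, not the
# patent's exact function). For each device pairing group in one sound
# collection environment, the loss penalizes scoring the truly farther
# device as nearer than the truly closer one.

def pairwise_loss(scores, true_distances):
    """scores: {device_id: model nearness score, higher = predicted closer}.
    true_distances: {device_id: ground-truth distance to the sound source}.
    Returns the mean pairwise loss over all device pairing groups.
    """
    total, pairs = 0.0, 0
    for a, b in combinations(scores, 2):
        if true_distances[a] == true_distances[b]:
            continue  # no relative distance relationship to learn from this pair
        # Order the pair so `near` is the device that is truly closer.
        near, far = (a, b) if true_distances[a] < true_distances[b] else (b, a)
        margin = scores[near] - scores[far]
        total += math.log1p(math.exp(-margin))  # small when the ranking is correct
        pairs += 1
    return total / max(pairs, 1)

# A model that ranks the devices correctly incurs a lower loss than one
# that ranks them in reverse.
truth = {"speaker": 0.5, "tv1": 0.6, "pc": 1.2}
good = pairwise_loss({"speaker": 2.0, "tv1": 1.0, "pc": -1.0}, truth)
bad = pairwise_loss({"speaker": -1.0, "tv1": 1.0, "pc": 2.0}, truth)
# good < bad
```

Minimizing such a loss requires only relative ordering supervision per environment, which matches the claim's framing of accuracy over device pairing groups rather than absolute distance regression.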
In a fourth aspect, the present application provides an apparatus for determining a distance relationship, applied to an arbitration device, the apparatus including:
an acquiring unit, configured to acquire a plurality of pieces of sound collection data in one-to-one correspondence with a plurality of devices, each piece of sound collection data including reference voice data obtained by the corresponding device collecting the sound of a sound source target, and the device identifier of that device;
a determining unit, configured to determine, according to the reference voice data and device identifiers in the plurality of pieces of sound collection data and a pre-trained distance comparison model, at least one relative distance identifier in one-to-one correspondence with at least one of the plurality of devices, each relative distance identifier indicating the position of the corresponding device in a device distance relationship sequence, the device distance relationship sequence being formed by sorting the plurality of devices by distance according to a preset sorting strategy, the distance being the distance between a device and the sound source target.
In a fifth aspect, the present application provides a device control apparatus, applied to an electronic device, the apparatus including:
an acquiring unit, configured to acquire indication information from an arbitration device, the indication information being generated when the arbitration device determines, according to at least one relative distance identifier in one-to-one correspondence with at least one of a plurality of devices, that the target device among the plurality of devices is to execute the voice instruction associated with the sound of a sound source target, the at least one relative distance identifier being obtained by the arbitration device performing the following operations: acquiring a plurality of pieces of sound collection data in one-to-one correspondence with the plurality of devices, each piece of sound collection data including reference voice data obtained by the corresponding device collecting the sound of the sound source target and the device identifier of that device; and determining the at least one relative distance identifier according to the reference voice data and device identifiers in the plurality of pieces of sound collection data and a pre-trained distance comparison model, each relative distance identifier indicating the position of the corresponding device in a device distance relationship sequence, the device distance relationship sequence being formed by sorting the plurality of devices by distance according to a preset sorting strategy, the distance being the distance between a device and the sound source target;
an executing unit, configured to perform, according to the indication information, the operation indicated by the voice instruction associated with the sound of the sound source target.
In a sixth aspect, the present application provides an apparatus for training a distance comparison model, including:
an acquiring unit, configured to acquire training data, the training data including a plurality of voice data sets, each voice data set containing a plurality of pieces of reference voice data in one-to-one correspondence with a plurality of devices, each piece of reference voice data being voice data obtained by the corresponding device collecting the sound of a sound source target, the plurality of voice data sets corresponding to voice data sets collected in different sound collection environments, and a sound collection environment including at least the location of the sound source target;
a training unit, configured to train a preset distance comparison model according to the reference voice data of the plurality of voice data sets and a preset loss function to obtain a trained distance comparison model, the loss function characterizing the loss of the distance comparison model in terms of the accuracy with which it predicts, for a device pairing group, the relative distance relationship between the two devices of the group and the sound source target in the same sound collection environment, a device pairing group consisting of any two of the plurality of devices.
In a seventh aspect, the present application provides an electronic device, including: one or more processors;
one or more memories for storing programs,
the one or more memories and the programs being configured such that the one or more processors control the electronic device to execute instructions for the steps of any method of the first, second, or third aspect of the embodiments of the present application.
In an eighth aspect, the present application provides a chip, including: a processor configured to call and run a computer program from a memory, so that a device on which the chip is installed executes some or all of the steps described in any method of the first, second, or third aspect of the embodiments of the present application.
In a ninth aspect, the present application provides a computer-readable storage medium storing a computer program for electronic data exchange, the computer program causing a computer to execute some or all of the steps described in any method of the first, second, or third aspect of the embodiments of the present application.
In a tenth aspect, the present application provides a computer program operable to cause a computer to execute some or all of the steps described in any method of the first, second, or third aspect of the embodiments of the present application. The computer program may be a software installation package.
In order to explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1a is a schematic diagram of user control in a multi-device scenario provided by an embodiment of the present application;
FIG. 1b is an architecture diagram of a device control system 10 provided by an embodiment of the present application;
FIG. 1c is a schematic diagram of a functional interface of an intelligent voice assistant provided by an embodiment of the present application;
FIG. 1d is a schematic structural diagram of an electronic device provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of a method for determining a distance relationship provided by an embodiment of the present application;
FIG. 3 is a schematic flowchart of a device control method provided by an embodiment of the present application;
FIG. 4 is a schematic flowchart of a method for training a distance comparison model provided by an embodiment of the present application;
FIG. 5 is a block diagram of the functional units of an apparatus for determining a distance relationship provided by an embodiment of the present application;
FIG. 6 is a block diagram of the functional units of another apparatus for determining a distance relationship provided by an embodiment of the present application;
FIG. 7 is a block diagram of the functional units of a device control apparatus provided by an embodiment of the present application;
FIG. 8 is a block diagram of the functional units of another device control apparatus provided by an embodiment of the present application;
FIG. 9 is a block diagram of the functional units of an apparatus for training a distance comparison model provided by an embodiment of the present application;
FIG. 10 is a block diagram of the functional units of another apparatus for training a distance comparison model provided by an embodiment of the present application.
To enable those skilled in the art to better understand the solutions of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present application.
The terms "first", "second", and the like in the specification, claims, and drawings of the present application are used to distinguish different objects, not to describe a specific order. Furthermore, the terms "including" and "having", and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device comprising a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or other steps or units inherent to the process, method, product, or device.
Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. The appearance of this phrase in various places in the specification does not necessarily refer to the same embodiment, nor to a separate or alternative embodiment mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
At present, as shown in FIG. 1a, the space in which the user is located contains a smart speaker (0.5 m from the user), smart TV 1 (0.6 m from the user), a computer (1.2 m from the user), and smart TV 2 (0.55 m from the user). When the user wants to listen to music and issues a "play music" instruction, the current intelligent voice assistant can only measure the distance between a device that contains a microphone array and the sound source; if the computer does not contain a microphone array, the assistant cannot compute its distance and therefore cannot accurately implement nearest-device wakeup control.
In response to the above problems, the embodiments of the present application provide methods for distance relationship determination, device control, and model training, and related apparatuses, which are described in detail below with reference to the accompanying drawings.
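The desired behavior in the FIG. 1a scenario can be made concrete with a small sketch: given the distances stated in the figure description, the device distance relationship sequence (nearest first) and hence the device that should be woken follow directly by sorting.

```python
# Ground-truth ordering for the FIG. 1a scenario: sort the devices by their
# stated distance to the user (the sound source target), nearest first.
distances = {
    "smart speaker": 0.5,
    "smart TV 1": 0.6,
    "computer": 1.2,
    "smart TV 2": 0.55,
}

sequence = sorted(distances, key=distances.get)  # device distance relationship sequence
nearest = sequence[0]                            # device that should be woken

# sequence == ["smart speaker", "smart TV 2", "smart TV 1", "computer"]
# nearest == "smart speaker"
```

The point of the embodiments below is to recover this ordering from the devices' own audio recordings when absolute distances like these are unavailable, e.g. for the computer without a microphone array.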
Please refer to FIG. 1b, which shows a device control system 10 provided by an embodiment of the present application. The device control system 10 includes electronic devices 100 with sound collection capability (for example, a smart TV, a smart speaker, or a smartphone), an arbitration device 200 on which an intelligent voice assistant is installed, and a server 300. The arbitration device may be any one of the electronic devices 100; it may also be any mobile device such as the user's mobile phone, a dedicated control box in a smart home scenario, a server in the cloud, or a device group consisting of multiple devices that jointly perform data processing. The arbitration device 200 is communicatively connected to the electronic devices 100 and the server 300, forming a device control network in a smart home scenario.
The intelligent voice assistant can be installed on various devices, such as mobile phones, to support the device control method of the present application; the specific function names and interface interaction modes it presents may vary and are not uniquely limited here. For example, it may be installed on a mobile phone and present the settings interface of the "Breeno" smart assistant shown in FIG. 1c. The illustrated interface includes function settings for one-key instructions, including a navigate-home function, a nearby function, an arrival-at-home reminder function, a screenshot-with-frame function, and a multi-device control function. In the graphic label of the multi-device control function, the numeric tag on each device can be used to indicate how near that device is to the sound source target, i.e., the user.
It should be noted that the arbitration device 200, as the policy-executing device of the embodiments of the present application, may exchange data and signaling with other devices (such as the electronic devices 100 and the server 300) in a variety of ways, which are not uniquely limited here. For example, the arbitration device 200 may connect directly to an electronic device 100 to obtain corresponding information, or may connect to the server 300 through a mobile communication network to implement the corresponding information exchange.
Referring to FIG. 1d, FIG. 1d is a schematic structural diagram of an electronic device provided by an embodiment of the present application. The electronic device is applied to the above device control system 10 and includes an application processor 120, a memory 130, a communication module 140, and one or more programs 131. The application processor 120 is communicatively connected to the memory 130 and the communication module 140 through an internal communication bus.
The one or more programs 131 are stored in the memory 130 and are configured to be executed by the application processor 120. The one or more programs 131 include instructions for performing any step in the above method embodiments.
The application processor 120 may be, for example, a central processing unit (CPU), a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It can implement or execute the various exemplary logical blocks, units, and circuits described in connection with this disclosure. The processor may also be a combination that implements computing functions, such as a combination of one or more microprocessors or a combination of a DSP and a microprocessor. The communication unit may be the communication module 140, a transceiver, a transceiver circuit, or the like, and the storage unit may be the memory 130.
The memory 130 may be a volatile memory or a non-volatile memory, or may include both. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
In specific implementation, the application processor 120 is configured to perform any step performed by the arbitration device, the target device, or the model training device in the method embodiments of the present application.
Referring to FIG. 2, FIG. 2 is a schematic flowchart of a distance relationship determination method provided by an embodiment of the present application, applied to the arbitration device 200 in the above device control system 10. As shown in the figure, the method includes the following operations.
Step 201: Acquire multiple pieces of sound collection data in one-to-one correspondence with multiple devices, where each piece of sound collection data includes reference voice data obtained by the corresponding device collecting the sound of a sound source target, and the device identifier of that device.
The multiple devices may be the electronic devices 100 in the above device control system 10; neither their number nor their type is uniquely limited.
The sound source target includes a user or a sound-producing device, which is not uniquely limited here. The sound of the sound source target may be a wake-up utterance, for example "Hello Xiaoou". In addition, in a heterogeneous distributed scenario, the devices can wait for the wake-up utterance simultaneously. When the user speaks the wake-up utterance, each device obtains its own monophonic segment of that utterance.
The reference voice data may first be passed through a 4 kHz low-pass filter to suppress the non-human-voice portion of the audio.
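As an illustration of such a low-pass step, the following sketch builds a windowed-sinc FIR low-pass filter at 4 kHz. The patent does not specify a filter design, so the particulars here (101 taps, Hamming window, 16 kHz sampling rate) are assumptions made only for illustration.

```python
import numpy as np

def lowpass_fir(cutoff_hz, fs_hz, num_taps=101):
    """Windowed-sinc FIR low-pass filter coefficients (Hamming window)."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = 2 * cutoff_hz / fs_hz * np.sinc(2 * cutoff_hz / fs_hz * n)
    h *= np.hamming(num_taps)
    return h / h.sum()  # normalize to unity gain at DC

fs = 16000
h = lowpass_fir(4000, fs)

t = np.arange(fs) / fs
speech_band = np.sin(2 * np.pi * 1000 * t)  # 1 kHz tone: inside the passband
noise_band = np.sin(2 * np.pi * 6000 * t)   # 6 kHz tone: inside the stopband
kept = np.convolve(speech_band, h, mode="same")
cut = np.convolve(noise_band, h, mode="same")
```

Applying the same filter to a real reference voice signal would attenuate energy above roughly 4 kHz while leaving the voice band largely intact.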
In specific implementation, the reference voice data may be voice data that has not yet undergone frequency response capability alignment and/or feature extraction and feature fusion. In this case, the arbitration device uniformly performs the relevant preprocessing on the reference voice data of the multiple devices.
Alternatively, the reference voice data may be voice data that each device has already processed through frequency response capability alignment and/or feature extraction and feature fusion. In this case, the arbitration device no longer performs unified preprocessing; instead, after obtaining the reference voice data and device identifiers of the multiple devices, it invokes the pre-trained distance comparison model to predict at least one relative distance identifier for at least one device.
Step 202: Determine, according to the reference voice data and device identifiers in the multiple pieces of sound collection data and a pre-trained distance comparison model, at least one relative distance identifier in one-to-one correspondence with at least one of the multiple devices. Each relative distance identifier indicates the position of the corresponding device in a device distance relationship sequence, which is the sequence formed by sorting the multiple devices according to a preset sorting strategy for distance, the distance being that between a device and the sound source target.
The preset sorting strategy may include sorting from largest to smallest or from smallest to largest, which is not uniquely limited here.
The reference voice data includes monophonic voice data or multi-channel voice data; that is, no high requirement is imposed on the voice collection capability of the devices, making the method more convenient to apply.
The relative distance identifier may be a number (such as 1/2/3/4), a graphic (such as line segments of different lengths), or the like, which is not uniquely limited here.
For example, suppose the multiple devices include device A, device B, device C, and device D, the device distance relationship sequence is device A→device D→device B→device C, and the current sorting relationship of the distances is from near to far, i.e., device A is closest to the sound source target and device C is farthest. The prediction result may then be: the relative distance identifier of device A is 1 (closest), that of device D is 2, that of device B is 3, and that of device C is 4 (farthest).
In specific implementation, the arbitration device can directly select the target device closest to the sound source target as the wake-up device and execute the user's voice instruction through that target device. In addition, for the special case where two devices receive the same prediction result in the distance comparison, the arbitration device can further query a preset device wake-up priority set to determine which device to wake up first. This can improve the accuracy and success rate of device control.
In specific implementation, if the input of the distance comparison model contains only audio data, the output of the model is a relative distance identifier for each piece of voice data, and the arbitration device can further determine the relative distance identifier of the corresponding device according to the correspondence between the voice data and the device identifiers. In this case the device identifier is needed only to resolve which device the voice data in the prediction result belongs to. That is, the voice data corresponds to the device identifier, the model input does not include the device identifier, the prediction results correspond one-to-one with the voice data, and the prediction results correspond to the device identifiers indirectly through the voice data.
If the input of the distance comparison model contains both audio data and device identifiers, for example mobile phone 1, mobile phone 2, and mobile phone 3, where mobile phones 1 and 2 are of device type 1 and mobile phone 3 is of device type 2, then the identifier of mobile phone 1 may be type 1 + name of mobile phone 1, that of mobile phone 2 may be type 1 + name of mobile phone 2, and that of mobile phone 3 may be type 2 + name of mobile phone 3. The output of the distance comparison model is then directly the relative distance identifier of each device, i.e., the prediction results correspond directly to the device identifiers.
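The wake-up selection logic described above, i.e., picking the closest device and falling back to a preset priority on ties, can be sketched as follows. The device names, rank values, and priority table are hypothetical and serve only to illustrate the rule.

```python
# Hypothetical relative distance identifiers predicted by the model
# (lower value = closer to the sound source target).
predicted_rank = {"speaker": 1, "tv": 2, "phone": 2, "watch": 4}

# Hypothetical preset wake-up priorities, used only to break ties
# (higher value = preferred device).
wake_priority = {"phone": 3, "speaker": 2, "tv": 1, "watch": 0}

def pick_wakeup_device(ranks, priority):
    """Pick the closest device; resolve equal ranks by wake-up priority."""
    return min(ranks, key=lambda d: (ranks[d], -priority[d]))

closest = pick_wakeup_device(predicted_rank, wake_priority)
```

With the values above, `closest` is "speaker"; if "speaker" were absent, the tie between "tv" and "phone" (both rank 2) would be resolved in favor of "phone" by its higher priority.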
In a possible example, determining, according to the reference voice data and device identifiers in the multiple pieces of sound collection data and the pre-trained distance comparison model, at least one relative distance identifier in one-to-one correspondence with at least one of the multiple devices includes: performing frequency response capability alignment on each piece of reference voice data in the multiple pieces of sound collection data to obtain multiple pieces of aligned target voice data in one-to-one correspondence with the multiple pieces of sound collection data; and determining, according to the multiple pieces of target voice data, the device identifiers in the multiple pieces of sound collection data, and the pre-trained distance comparison model, at least one relative distance identifier in one-to-one correspondence with at least one of the multiple devices.
It can be seen that in this example, because sensor performance differs across device types and models, the activation voice signals collected at the same distance differ significantly. Frequency response capability alignment therefore adapts the voice signals obtained by heterogeneous devices to the same standard, providing a unified data basis for distance comparison. Compared with existing solutions that simply normalize by voice energy ratio, this is beneficial for improving the accuracy of the model prediction results.
In this possible example, performing frequency response capability alignment on the multiple pieces of reference voice data to obtain multiple pieces of aligned target voice data includes performing the following operations for each piece of reference voice data: obtaining, according to the device identifier of the currently processed reference voice data, the preset frequency response unit impulse response of the current device relative to a reference device; and convolving the reference voice data with that frequency response unit impulse response to perform gain adjustment, obtaining target voice data with aligned frequency response capability.
In specific implementation, if the current device is the reference device itself, the reference voice data remains unchanged after adjustment. The reference device may be a designated device among the multiple devices.
The frequency response unit impulse response of the current collection device relative to the reference device is determined through the following steps a1 to a5:

Step a1: Suppose the total number of device types to be aligned is L, where L is a positive integer. Place the devices at the same distance from a sound source, play a 0-8 kHz swept-frequency signal, and record each device's frequency response curve. N recordings are made at each distance, yielding N frequency response curves per device, and M batches are collected under different sound source distance conditions (e.g., 0.5 m, 0.8 m, 1 m, 1.2 m, 1.5 m, 1.8 m, 2 m, 2.2 m, 2.5 m, 2.8 m, 3 m). Denote by F_{i,j}^l(k) the frequency response curve of the j-th recording in the i-th batch for the l-th device type, where l = 1, 2, ..., L; i = 1, 2, ..., M; j = 1, 2, ..., N; and L, M, N are positive integers.

Step a2: Compute the statistical average of all frequency response curves of each device as its final frequency response curve:

F^l(k) = (1/(M·N)) Σ_{i=1}^{M} Σ_{j=1}^{N} F_{i,j}^l(k), k = 1, 2, ..., K,

where K, a positive integer, is the number of points in the frequency response curve.

Step a3: Select one device type as the reference device, denoted b ∈ [1, 2, ..., L]; it may, for example, be the device the user uses most frequently. Compute the ratio between the reference device's frequency response curve and each device's curve to obtain the inter-device frequency response transfer function:

T^l(k) = F^b(k) / F^l(k).

If l = b, the reference device does not need to align its capability with itself.

Step a4: Apply the inverse discrete Fourier transform (IDFT) to the frequency response transfer function to obtain the corresponding frequency response unit impulse response:

t_l(n) = IDFT{T^l(k)}.

Step a5: Store the frequency response unit impulse response t_l(n) on the corresponding device.
It can be seen that in this example, because sensor performance differs across device types and models, the activation voice signals collected at the same distance differ significantly. The frequency response unit impulse response is therefore obtained from the frequency response curves, adapting the signals obtained by heterogeneous devices to the same standard and providing a unified data basis for distance comparison. Compared with existing solutions that simply normalize by voice energy ratio, this is beneficial for improving the accuracy of the model prediction results.
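Steps a2-a5 and the subsequent alignment convolution can be sketched as follows. The frequency response curves here are synthetic stand-ins (an assumption, since real curves come from the sweep recordings), with the device-under-test curve set to exactly half the reference curve so that the effect of the alignment is easy to verify.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 256  # number of points per frequency response curve

# Stand-ins for the averaged curves of step a2 (magnitudes > 0).
F_ref = 1.0 + 0.3 * rng.random(K)  # reference device b
F_dev = 0.5 * F_ref                # device l: uniformly half the response

# Step a3: inter-device frequency response transfer function.
H = F_ref / F_dev

# Step a4: inverse DFT gives the unit impulse response t_l(n).
t_l = np.fft.ifft(H).real

# Alignment: convolve the captured speech with t_l(n).
speech = rng.standard_normal(1024)
aligned = np.convolve(speech, t_l, mode="full")
```

Because the synthetic device curve is exactly half the reference curve, the transfer function is a constant 2, its impulse response is a single scaled impulse, and the aligned signal is the captured signal with a 2x gain, which is the intended behavior of the alignment step.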
In a possible example, determining, according to the target voice data, the device identifiers in the multiple pieces of sound collection data, and the pre-trained distance comparison model, at least one relative distance identifier in one-to-one correspondence with at least one of the multiple devices includes: performing multi-dimensional feature extraction on each piece of target voice data to obtain multiple voice feature sets in one-to-one correspondence with the multiple pieces of target voice data, where each voice feature set includes multi-dimensional feature extraction results; performing feature fusion on each voice feature set to obtain multiple fused voice features; and determining, according to the multiple fused voice features, the device identifiers in the multiple pieces of sound collection data, and the pre-trained distance comparison model, at least one relative distance identifier in one-to-one correspondence with at least one of the multiple devices.
In this possible example, performing multi-dimensional feature extraction on each piece of target voice data to obtain multiple voice feature sets includes performing the following operations for each piece of target voice data: extracting scalar voice features and vector voice features of the currently processed target voice data; and performing dimensionality reduction and secondary feature extraction on the vector voice features to obtain vector-derived voice features.
In this possible example, extracting the scalar voice features and vector voice features of the currently processed target voice data includes: preprocessing the currently processed target voice data to obtain preprocessed target voice data, and extracting the scalar voice features and vector voice features of the preprocessed target voice data, where the preprocessing includes at least one of the following: silence suppression processing, pre-emphasis processing through a high-frequency filter, framing processing, and windowing processing.
The purpose of the silence suppression (voice activity detection, VAD) processing is to identify and eliminate long silent periods so as to extract the most effective voice segments for use in subsequent steps; for example, the WebRTC VAD algorithm can be used. The purpose of the pre-emphasis processing is to compensate for the high-frequency components of the voice signal lost due to the influence of the vocal system and to highlight the high-frequency formants. Pre-emphasis is implemented through a high-frequency filter whose transfer function is:

H(z) = 1 - μz^{-1},

where z is the Z-transform variable of the voice signal and μ is the pre-emphasis coefficient, generally between 0.9 and 1.0; in this application it may be, for example, 0.97.

In addition, to facilitate processing, the voice signal needs to be divided into frames according to its short-time stationarity. In this solution, the frame length is set to 25 ms and the frame shift to 10 ms, i.e., two adjacent frames overlap by 15 ms, which avoids the effect of excessive change between the voice signals of adjacent frames. For convenience, s_i[n] denotes the data of the i-th frame.

After framing, to eliminate possible signal discontinuities at both ends of each frame and prevent spectral leakage, windowing is applied to obtain s_{i,w}[n] = s_i[n] × w[n], where w[n] is the window function. The embodiments of this application use the Hamming window, whose formula is:

w[n] = 0.54 - 0.46 cos(2πn / (N - 1)), 0 ≤ n ≤ N - 1,

where N is the length of s_i[n].
In specific implementation, the scalar voice features listed in Table 1 can be extracted. The scalar voice features are extracted according to their definitions, and the specific calculation processes are not repeated here.
Table 1 Scalar voice features

Feature type | English name
LP | Linear Prediction
LPRR | LP Residual Ratio (peak-to-RMS)
LPRK | LP Residual Kurtosis
LPRHP | LP Residual Histogram Peak
SPSK | Spectrogram Skewness
SHPP | Spectrogram Histogram Peak Position
The above scalar voice features are all feature values of shape (1,), mainly characterizing the properties of the user's current sound field, such as reverberation, reflection, sound transmission gain, and noise. Here the notation (1,) denotes the data dimension: (1,) denotes a 1x1 vector and (245,) a 245x1 vector, where the 245x1 vector is formed by concatenating several preceding features. For example, given features of shapes (a,), (b,), ..., the final concatenation is (a+b+...,) = (245,). This notation is for a single piece of voice data, which yields a (245,), i.e., 245x1, feature vector. In actual calculation, multiple voice recordings are processed in batches; if the batch size is N (a positive integer), each feature correspondingly becomes a batch of features, and the dimensions change: (1,) → (1, N), i.e., 1xN, and likewise (245,) → (245, N), i.e., 245xN.
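The shape bookkeeping described above can be sketched as follows; the feature values are hypothetical placeholders.

```python
import numpy as np

# Hypothetical per-utterance scalar features, each of shape (1,).
lp = np.array([0.4])
lprr = np.array([2.1])
lprk = np.array([3.7])

# Concatenation follows the (a,), (b,), ... -> (a+b+...,) pattern.
utterance_vec = np.concatenate([lp, lprr, lprk])

# Processing a batch of N utterances turns each (d,) vector into (d, N).
N = 5
batch = np.stack([utterance_vec] * N, axis=1)
```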
In specific implementation, the 8 vector voice features shown in Table 2 can be extracted. The vector voice features are extracted using existing methods, and the specific calculation processes are not repeated here.
Table 2 Vector voice features

Feature type | English name
MFCC | Mel-Frequency Cepstral Coefficients
LPCC | Linear Predictive Cepstral Coefficients
MHEC | Mean Hilbert Envelope Coefficients
BFCC | Bark-Frequency Cepstral Coefficients
LFCC | Linear-Frequency Cepstral Coefficients
GFCC | Gammatone-Frequency Cepstral Coefficients
NGCC | Normalized Gammachirp Cepstral Coefficients
MSRCC | Magnitude-based Spectral Root Cepstral Coefficients
The above vector voice features are all feature matrices of shape (12, N_f), where 12 is the feature dimension, i.e., each vector voice feature has 12-dimensional feature components, and N_f is the number of voice frames. The vector voice features can characterize the pickup properties of different sound collection devices, such as spectrum, pitch, and formants.
In addition, the above 8 vector voice features have a large number of parameters, which makes them inconvenient to combine directly with the scalar features; they can be further processed through vector-derived feature extraction. Vector-derived feature extraction includes three functional modules: feature component screening, differential feature calculation, and vector feature scalarization, which are introduced below.
(1) Feature component screening module
The 8 vector voice features are all feature matrices of shape (12, N_f), whose parameter count is too large for direct use. Moreover, the feature components of each dimension contribute differently to model training: some carry little information and some may carry redundant information, so the components of the vector voice features need to be screened. This solution uses the Fisher criterion to evaluate the discriminative ability of each feature component; the Fisher criterion is:

r_Fisher = σ_b / σ_w,

where r_Fisher is the Fisher ratio of a feature component (the larger the value, the stronger the discriminative ability of that component); σ_b is the between-class variance of the feature component, i.e., the variance of the means of the voice feature component across different distance types; and σ_w is the within-class variance of the feature component, i.e., the variance of the voice feature component around its mean within the same distance type. σ_b and σ_w are computed as:

σ_b = (1/M) Σ_{i=1}^{M} (m̄_k^{(i)} - m_k)²,

σ_w = (1/M) Σ_{i=1}^{M} (1/n_i) Σ_{c=1}^{n_i} (x_{k,c}^{(i)} - m̄_k^{(i)})²,

where M is the number of samples, m̄_k^{(i)} is the mean of the k-th dimension component of a feature vector of the i-th voice sample, m_k is the mean of the k-th dimension component over all samples, n_i is the number of frames of the i-th voice sample, and x_{k,c}^{(i)} is the value of the k-th dimension component at frame c of the i-th voice sample.
As described above, the feature component screening module uses the Fisher criterion to select, from each of the 8 vector voice features, the 3 feature components with the largest Fisher ratios, so that each vector voice feature is reduced from (12, N_f) to (3, N_f), greatly reducing the number of feature parameters.
(2) Differential feature calculation module
The above 8 vector voice features can only reflect the static characteristics of the voice; their dynamic characteristics can be described by the differential parameters of the vector voice features, computed as:

Δc_l^t = Σ_{θ=1}^{Θ} θ (c_l^{t+θ} - c_l^{t-θ}) / (2 Σ_{θ=1}^{Θ} θ²),

where c_l denotes the l-th dimension feature component of a vector voice feature, Δc_l^t denotes the first-order difference value of that component at frame t, and Θ is a constant indicating that the difference window size is 2Θ + 1; in this solution Θ = 2. Through this formula, the first-order differences of the 8 vector voice features are obtained, denoted ΔMFCC, ΔLPCC, ΔMHEC, ΔBFCC, ΔLFCC, ΔGFCC, ΔNGCC, and ΔMSRCC respectively; these vector differential features are feature matrices of shape (3, N_f).
(3) Vector feature scalarization module
For the 8 vector voice features such as MFCC and the 8 vector differential features such as ΔMFCC, the number of frames N_f of a long voice recording is often very large, i.e., the feature dimension in the "frame" direction remains high; moreover, recordings of different durations have different frame counts N_f, which is inconvenient for training a machine learning model. To solve these two problems, this solution uses the vector feature scalarization module to scalarize the above vector features. Taking MFCC as an example, the specific scalarization process is described in detail below:
Step a. Use a GMM and its EM algorithm to cluster each dimension feature component of the MFCC with 4 clusters, obtaining the 4 cluster-center values of each feature component, i.e. a feature of shape (4,), where l ∈ {1, 2, 3} denotes the index of the feature component;
Step b. Compute the maximum value, minimum value and final value of each dimension feature component of the MFCC, obtaining a feature of shape (3,);
Step c. Compute the maximum value, minimum value and sum of squares of each dimension feature component of the vector difference feature ΔMFCC, obtaining a feature of shape (3,);
Step d. Concatenate the three feature vectors obtained in steps a, b and c to obtain a feature of shape (10,), denoted F_l;
Step e. Concatenate the features F_l of the individual feature components to obtain a new feature of shape (30,), denoted F_MFCC, which characterizes the original MFCC feature;
Step f. Applying steps a to e to the other 7 vector speech features yields their corresponding new features, denoted F_LPCC, F_MHEC, F_BFCC, F_LFCC, F_GFCC, F_NGCC and F_MSRCC respectively;
Step g. Concatenate and fuse the above 8 new features of shape (30,) to obtain a vector-derived speech feature of shape (240,), denoted F_D; each feature value in F_D has a specific physical meaning.
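Steps a to e for one feature can be sketched as below. A small hand-rolled one-dimensional EM fit stands in for "GMM and its EM algorithm" (a production system would typically use a library implementation); initialization and iteration counts are illustrative assumptions:

```python
import numpy as np

def gmm_centers_1d(x, k=4, iters=50, seed=0):
    """Fit a 1-D Gaussian mixture with EM and return its k sorted means
    (step a: the cluster-center values of one feature component)."""
    rng = np.random.default_rng(seed)
    mu = np.sort(rng.choice(x, k, replace=False))
    sigma = np.full(k, x.std() + 1e-6)
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibilities of the k Gaussians for each point
        d = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma * pi
        r = d / (d.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate weights, means and variances
        nk = r.sum(axis=0) + 1e-12
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
        pi = nk / nk.sum()
    return np.sort(mu)

def scalarize(feat, dfeat, k=4):
    """feat, dfeat: (3, N_f) screened feature and its delta.
    Returns the (30,) scalarized feature of steps a-e."""
    parts = []
    for l in range(feat.shape[0]):
        centers = gmm_centers_1d(feat[l], k)                          # step a: 4 centers
        stats = [feat[l].max(), feat[l].min(), feat[l][-1]]           # step b: max, min, final
        dstats = [dfeat[l].max(), dfeat[l].min(), (dfeat[l] ** 2).sum()]  # step c
        parts.append(np.concatenate([centers, stats, dstats]))        # step d: (10,)
    return np.concatenate(parts)                                      # step e: (30,)
```

Repeating this for all 8 features and concatenating (step g) yields the (240,) vector-derived feature.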
In a specific implementation, the above feature fusion step fuses the scalar speech features and the vector-derived speech features. First, the 5 scalar speech features are concatenated into a feature vector of shape (5,), denoted F_S; then F_S and the vector-derived feature F_D are concatenated and fused, finally yielding a fused feature of shape (245,), denoted F_Fusion. That is, a feature vector of shape (245,) is ultimately extracted from each speech utterance and used to train the speaker-heterogeneous-device distance relationship model.
It can be seen that, in this example, the fused speech feature is little affected by ambient sound-field characteristics and random noise, is applicable to a variety of scenarios and heterogeneous distributed devices, and is more robust and generalizes better than single features such as energy or signal-to-noise ratio. The fused speech feature can be used to train the speaker-distributed-device distance relationship model, so as to judge how near or far different smart devices are from the user, making "human-machine distance" an important decision dimension in multi-device wake-up and improving the user experience. Specifically, in the same space a user may own multiple distributed devices supporting the same wake-up word; after the user speaks the wake-up word, the device closest to the user responds, achieving nearest-device wake-up. Alternatively, human-machine distance can be combined with other dimensions such as device state, device service capability and user intent to make a comprehensive decision and select the most suitable device to respond to the user's request.
In a possible example, determining the at least one relative distance identifier corresponding one-to-one to at least one device among the multiple devices, according to the reference voice data and device identifiers in the multiple sound collection data and the pre-trained distance comparison model, includes: performing multi-dimensional feature extraction on each of the multiple sound collection data to obtain multiple voice feature sets corresponding one-to-one to the multiple sound collection data, where each voice feature set includes multi-dimensional feature extraction results; performing feature fusion on each of the multiple voice feature sets to obtain multiple fused voice features; and determining, according to the multiple fused voice features, the device identifiers in the multiple sound collection data, and the pre-trained distance comparison model, the at least one relative distance identifier corresponding one-to-one to at least one device among the multiple devices.
The feature extraction can also be used in other distance-relationship determination methods, and the computation results can be used in other application scenarios, etc.
In this possible example, performing multi-dimensional feature extraction on each of the multiple sound collection data to obtain multiple voice feature sets corresponding one-to-one to the multiple sound collection data includes: performing frequency response capability alignment on each of the multiple reference voice data in the multiple sound collection data to obtain multiple frequency-response-aligned target voice data corresponding one-to-one to the multiple sound collection data; and performing multi-dimensional feature extraction on each of the multiple target voice data to obtain multiple voice feature sets corresponding one-to-one to the multiple target voice data.
In this possible example, performing frequency response capability alignment on each of the multiple reference voice data in the multiple sound collection data to obtain multiple frequency-response-aligned target voice data corresponding one-to-one to the multiple sound collection data includes performing the following operations for each reference voice data in the multiple sound collection data: obtaining, according to the device identifier associated with the currently processed reference voice data, the preset frequency response unit impulse response of the current device relative to the reference device; and convolving the reference voice data with the frequency response unit impulse response and performing gain adjustment to obtain the frequency-response-aligned target voice data.
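The convolution-plus-gain alignment can be sketched as below. The patent does not fix the exact gain rule, so matching the output's RMS level to the input's is an assumed, reasonable choice:

```python
import numpy as np

def align_frequency_response(ref_speech, unit_impulse_response):
    """Align a device's recording to the reference device's frequency
    response: convolve with the preset relative unit impulse response,
    then apply a gain so the output keeps the input's RMS level
    (the gain rule is an assumption; the patent only says 'gain adjustment')."""
    y = np.convolve(ref_speech, unit_impulse_response, mode="full")[: len(ref_speech)]
    rms_in = np.sqrt(np.mean(ref_speech ** 2)) + 1e-12
    rms_out = np.sqrt(np.mean(y ** 2)) + 1e-12
    return y * (rms_in / rms_out)
```

With an identity impulse response the signal passes through unchanged, so devices already matching the reference are unaffected.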
In this possible example, performing multi-dimensional feature extraction on each of the multiple target voice data to obtain multiple voice feature sets corresponding one-to-one to the multiple target voice data includes performing the following operations for each target voice data to obtain the multiple voice feature sets: extracting the scalar voice features and vector voice features of the currently processed target voice data; and performing dimensionality reduction and secondary feature extraction on the vector voice features to obtain vector-derived voice features.
In this possible example, extracting the scalar voice features and vector voice features of the currently processed target voice data includes: preprocessing the currently processed target voice data to obtain preprocessed target voice data; and extracting the scalar voice features and vector voice features of the preprocessed target voice data, where the preprocessing includes at least one of the following: silence suppression, pre-emphasis through a high-frequency filter, framing, and windowing.
It should be noted that the implementation principles of the feature extraction, feature fusion and frequency response capability alignment involved in this branch embodiment are similar to the corresponding content in the foregoing embodiments and are not repeated here.
It can be seen that, in this example, the arbitration device can first perform feature extraction and feature fusion on the voice data, and optionally further perform frequency response capability alignment on the processed voice data, improving the flexibility of voice data preprocessing.
In a possible example, after determining the at least one relative distance identifier corresponding one-to-one to at least one device among the multiple devices according to the reference voice data and device identifiers in the multiple sound collection data and the pre-trained distance comparison model, the method further includes: determining, according to the at least one relative distance identifier, the target device among the multiple devices for executing the voice instruction associated with the sound of the sound source target; if it is detected that the target device is a device other than the arbitration device, sending indication information to the target device, the indication information being used to instruct the target device to perform the operation indicated by the voice instruction; and if it is detected that the target device is the arbitration device, performing the operation indicated by the voice instruction.
The voice instruction associated with the sound of the sound source target may be any of various user instructions such as "play music", which is not uniquely limited here.
In a specific implementation, in the nearest-device wake-up scheme, the arbitration device preferentially selects the device closest to the sound source target as the target device. That is, in this application scenario, the at least one relative distance identifier should at least include the relative distance identifier of the device closest to the sound source target. In addition, depending on the application scenario, the specific form of the at least one relative distance identifier may vary, which is not uniquely limited here.
It can be seen that, in this example, the arbitration device can intelligently determine, according to at least one relative distance identifier, the target device for executing the voice instruction associated with the sound of the sound source target, improving the convenience and intelligence of device control.
It can be seen that, in this embodiment of the present application, the arbitration device first obtains multiple sound collection data of multiple devices, and then determines, according to the reference voice data and device identifiers in the multiple sound collection data and the pre-trained distance comparison model, at least one relative distance identifier corresponding one-to-one to at least one device among the multiple devices. Each relative distance identifier indicates the position of the corresponding device in the device distance relationship sequence, which is a sequence formed by sorting the multiple devices according to a preset distance sorting strategy, the distance being that between a device and the sound source target; the model can thus predict a device's relative position in the sequence. For example, if device 1, device 2 and device 3, ordered from nearest to farthest from the sound source target, form the device distance relationship sequence device 3 → device 2 → device 1, the prediction result can indicate that device 3 is closest to the sound source target by setting its relative distance identifier to 1. Compared with existing schemes that predict absolute distances in isolation, the present application uses the distance comparison model to predict relative distance relationships containing global information in a multi-device voice interaction scenario. Meanwhile, since each sound collection data includes the reference voice data obtained by the corresponding device collecting the sound of the sound source target and the device identifier of that device, there is no limitation on the number of channels for collecting the reference voice data, so the high hardware requirements and algorithmic complexity of conventional microphone-array sound source localization can be avoided, which is beneficial to improving the efficiency and applicability of relative distance relationship prediction.
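Turning the model's pairwise comparisons into relative distance identifiers can be sketched as below. How the patent aggregates pairwise outputs into a full ordering is not specified here, so counting pairwise "wins" (score above 0.5) is an assumed simple aggregation:

```python
def relative_distance_ids(devices, closer_score):
    """devices: list of device identifiers.
    closer_score(a, b): model's score in (0, 1) that device a is closer
    to the sound source target than device b (assumed interface).
    Returns {device: relative distance identifier}, 1 = nearest."""
    wins = {d: sum(closer_score(d, o) > 0.5 for o in devices if o != d)
            for d in devices}
    order = sorted(devices, key=lambda d: wins[d], reverse=True)  # most wins = nearest
    return {d: i + 1 for i, d in enumerate(order)}
```

With the three-device example above (device 3 nearest), device 3 wins both of its comparisons and receives identifier 1, while device 1 wins none and receives identifier 3.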
Please refer to FIG. 3, which is a schematic flowchart of a device control method provided by an embodiment of the present application, applied to a target device in the device control system 10. As shown in the figure, the device control method includes the following operations.
Step 301: Obtain indication information from the arbitration device, the indication information being generated when the arbitration device determines, according to at least one relative distance identifier corresponding one-to-one to at least one device among multiple devices, that the target device among the multiple devices is to execute the voice instruction associated with the sound of the sound source target. The at least one relative distance identifier is obtained by the arbitration device performing the following operations: obtaining multiple sound collection data corresponding one-to-one to the multiple devices, each sound collection data including reference voice data obtained by the corresponding device collecting the sound of the sound source target and the device identifier of that device; and determining the at least one relative distance identifier according to the reference voice data and device identifiers in the multiple sound collection data and the pre-trained distance comparison model, each relative distance identifier indicating the position of the corresponding device in the device distance relationship sequence, which is a sequence formed by sorting the multiple devices according to a preset distance sorting strategy, the distance being that between a device and the sound source target;
Step 302: Perform, according to the indication information, the operation indicated by the voice instruction associated with the sound of the sound source target.
The target device is the device closest to the sound source target among the multiple devices. This case applies to nearest-device wake-up product schemes.
The target device is the arbitration device; or, the target device is a device other than the arbitration device among the multiple devices.
In a specific implementation, if the target device is the arbitration device, the arbitration device directly generates the indication information and, according to it, performs the operation indicated by the voice instruction associated with the sound of the sound source target.
If the target device is a device other than the arbitration device among the multiple devices, the arbitration device generates the indication information and sends it to the target device, and the target device performs, according to the indication information, the operation indicated by the voice instruction associated with the sound of the sound source target.
In addition, the method further includes: if it is detected that the distance between the target device and the sound source target is greater than a preset distance, outputting a prompt message to prompt the user to move closer to the target device; and/or, if it is detected that the distance between the target device and the sound source target is greater than the preset distance, increasing the output volume of the target device.
If it is detected that the distance between the target device and the sound source target is less than or equal to the preset distance, the output volume of the target device is lowered. This improves the intelligence of device control and the user experience.
The preset distance may be, for example, 5 meters, 10 meters, or the like.
It can be seen that, in this embodiment of the present application, the target device first obtains the indication information from the arbitration device and then performs, according to it, the operation indicated by the voice instruction associated with the sound of the sound source target. The indication information is generated when the arbitration device determines, according to at least one relative distance identifier corresponding one-to-one to at least one device among the multiple devices, that the target device is to execute the voice instruction associated with the sound of the sound source target; each relative distance identifier indicates the position, in the distance sequence, of the distance between the corresponding device and the sound source target, the distance sequence being formed by sorting the distances between each device and the sound source target according to a preset sorting strategy. Compared with existing schemes that determine the nearest wake-up device from the absolute distance between a device and the sound source target, the present application uses the distance comparison model to predict relative distance relationships containing global information in a multi-device voice interaction scenario and determines the target device to be woken up from that relative distance relationship. Meanwhile, since each sound collection data includes the reference voice data obtained by the corresponding device collecting the sound of the sound source target and the device identifier of that device, there is no limitation on the number of channels for collecting the reference voice data, so the high hardware requirements and algorithmic complexity of conventional microphone-array sound source localization can be avoided, which is beneficial to improving the efficiency and applicability of relative distance relationship prediction.
Please refer to FIG. 4, which is a schematic flowchart of a training method for a distance comparison model provided by an embodiment of the present application, applied to a model training device. As shown in FIG. 4, the model training method includes the following operations.
Step 401: Obtain training data, the training data including multiple voice data sets, each of which contains multiple reference voice data corresponding one-to-one to multiple devices, each reference voice data being voice data obtained by the corresponding device collecting the sound of the sound source target. The multiple voice data sets correspond to voice data sets collected in different sound collection environments, a sound collection environment including at least the position of the sound source target.
The sound collection environment refers to the acoustic environment in which the sound data is collected, and it can be diversified: besides the position of the sound source target, differentiated sound collection environments can further be constructed through differences in at least one of the following characteristics: room area, noise level, and so on.
For example, wake-up speech in quiet and noisy environments is collected in rooms with areas of 10 m², 20 m², 30 m², and 40 m². Noise is added artificially; Gaussian white noise, appliance noise (e.g. fans, air conditioners), traffic noise, etc. from a noise database can be selected. Based on the sound pressure level of the clean wake-up speech, the signal-to-noise ratio can be set to -15 dB, -10 dB, -5 dB, 0 dB, 5 dB, 10 dB, 15 dB, and so on.
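Mixing noise into clean wake-up speech at a target SNR, as in the training-data construction above, can be sketched as follows (looping or trimming the noise to the speech length is an assumed detail):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so the mixture clean + noise has the requested
    signal-to-noise ratio (in dB) relative to the clean wake-up speech."""
    noise = np.resize(noise, clean.shape)           # loop/trim noise to length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise
```

Sweeping `snr_db` over -15 dB to 15 dB produces the range of noisy conditions listed above from a single clean recording.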
Step 402: Train a preset distance comparison model according to the reference voice data of the multiple voice data sets and a preset loss function to obtain a trained distance comparison model. The loss function characterizes the loss of the distance comparison model in terms of the accuracy of predicting the relative distance relationship, with respect to the sound source target, of the two devices of a device pair under the same sound collection environment, a device pair consisting of any two of the multiple devices.
The distance comparison model may specifically be a deep neural network, which may be, for example, a convolutional neural network or a deep residual network; this is not uniquely limited here.
In a possible example, the relative distance relationship is characterized by defining a score for the event that the first of the two devices is closer to the sound source target than the second, and the value of the score is associated with a distance difference, the distance difference being the difference between a first distance and a second distance, where the first distance is the distance between the first device and the sound source target, and the second distance is the distance between the second device and the sound source target.
The score can be calculated and expressed in a probability-like data form, with values falling in the interval (0, 1).
For example, if the first distance is 50 cm, the second distance is 80 cm, and the third distance (the distance between a third device and the sound source target) is 90 cm, then the score for the event that the first device is closer to the sound source target than the second device may be 0.8; the score for the first device being closer than the third device may be 0.9; the score for the second device being closer than the third device may be 0.7; the score for the second device being closer than the first device may be 0.08; the score for the third device being closer than the first device may be 0.05; and the score for the third device being closer than the second device may be 0.09, and so on.
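A score tied to the distance difference in this way can be sketched with a logistic function; the patent does not give the exact mapping, so both the logistic form and the steepness constant are assumptions chosen to reproduce the qualitative behavior of the example (larger gaps give scores closer to 1, and reversing the pair roughly complements the score):

```python
import math

def closeness_score(d1, d2, scale=0.1):
    """Score of the event 'device 1 is closer to the sound source target
    than device 2', as a function of the distance difference d1 - d2 (cm).
    A negative difference (device 1 closer) yields a score above 0.5;
    'scale' is an assumed illustrative steepness."""
    return 1.0 / (1.0 + math.exp(scale * (d1 - d2)))
```

With the distances of the example (50 cm, 80 cm, 90 cm), the 50-vs-90 pair scores higher than the 50-vs-80 pair, matching the graded relationship the model is meant to learn.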
It can be seen that, in this example, since the devices stand in relative distance relationships to the sound source that may be strong or weak, associating the score of the event that the first device is closer to the sound source target than the second device with the distance difference between the first and second distances allows the distance comparison model to learn this gradation rather than only the coarse-grained near/far relationship.
In a possible example, the score is calculated from at least one score of at least one group of adjacent devices that form a direct or indirect adjacency between the two devices; the score of a group of adjacent devices is calculated from the two relative distance identifiers of its two devices, the relative distance identifiers corresponding to the prediction result of the distance comparison model, each relative distance identifier indicating the position of the corresponding device in the device distance relationship sequence, which is a sequence formed by sorting the multiple devices according to a preset distance sorting strategy, the distance being that between a device and the sound source target.
It can be seen that, in this example, since a relative distance identifier can indicate the position of the corresponding device in the device distance relationship sequence, which is formed by sorting the multiple devices according to a preset sorting strategy on their distances to the sound source target, the relative distance identifier can characterize a model prediction result containing global information, improving the accuracy of that characterization.
In this possible example, training the preset distance comparison model according to the reference voice data of the multiple voice data sets and the preset loss function to obtain the trained distance comparison model includes: dividing the training data into a training set and a test set, the training set including part of the multiple voice data sets; and training the preset distance comparison model at least once using the training set, until the accuracy with which the trained distance comparison model predicts the distance comparison results on the test set is greater than a preset accuracy.
The preset accuracy may be, for example, 98%, 99%, etc., which is not uniquely limited here.
It can be seen that, in this example, training the distance comparison model gives the model a prediction capability that meets the preset accuracy requirement.
In this possible example, the training includes forward propagation and back-propagation optimization. In forward propagation, predicted relative distance identifiers are computed from the voice features of a voice data set. In back-propagation optimization, the predicted score and the true score are computed from the predicted relative distance identifiers and the true relative distance identifiers; the loss of the distance comparison model is computed from the loss function, the predicted score and the true score; and the parameters of the distance comparison model are adjusted according to that loss.
In a specific implementation, the loss function is designed through the following steps a to e:
Step a: for the data collected for the same group of wake-up actions in the training data, any two devices can form a pair. Without loss of generality, assume the current group contains data from 5 devices a, b, c, d and e, ordered by relative distance such that device a is closer to the sound source target than device b, and so on. The relative distance identifiers of the devices are then L_a = 1, L_b = 2, L_c = 3, L_d = 4 and L_e = 5.
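The score equations referenced in steps c and d are given in the source only as figures, so the sketch below is a loudly hypothetical reconstruction: it assumes the score of an adjacent pair is simply the difference of their relative distance identifiers, and that the score of a non-adjacent pair is the sum of adjacent-pair scores along the chain between them (consistent with step d's "compute adjacent scores first, then non-adjacent scores from them"). The actual formulas in the patent may differ.

```python
ORDER = ["a", "b", "c", "d", "e"]                      # nearest to farthest
LABELS = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}      # relative distance identifiers

def adjacent_score(near_id: int, far_id: int) -> int:
    """Assumed score of an adjacent device pair: identifier difference."""
    return far_id - near_id

def pair_score(a: str, b: str) -> int:
    """Assumed score of any pair: sum of adjacent scores along the chain between them."""
    i, j = sorted((ORDER.index(a), ORDER.index(b)))
    return sum(adjacent_score(LABELS[ORDER[k]], LABELS[ORDER[k + 1]])
               for k in range(i, j))

print(pair_score("b", "c"))  # 1  (adjacent pair)
print(pair_score("a", "c"))  # 2  (= pair_score("a","b") + pair_score("b","c"))
print(pair_score("a", "e"))  # 4
```

Under this additive assumption the chain sum telescopes, so the non-adjacent score equals the identifier difference of the endpoints.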
Step b: denote the feature vector extracted for each device as x_a, x_b, x_c, x_d and x_e, and denote the feedforward mapping of the deep neural network as f. The output-layer result corresponding to a feature vector (taking x_a as an example) is then o_a = f(x_a).
Step c: for each pair of devices, a score can be obtained from the reference voice data:
Step d: for the true labels, scores are still computed between any two paired devices. Since the distances of multiple devices to the target are relative to one another, the scores between pairs of adjacent devices are computed first, and the scores between non-adjacent devices are then computed from them.
For adjacent devices (taking b and c as an example), the score is:
For non-adjacent devices (taking a and c as an example, with b as their common adjacent device), the score is:
Further, for the device pair (a, e):
Step e: after the input data has been fed forward through the deep neural network, back-propagation is performed according to the loss function between the actual output and the true labels, iteratively adjusting the network parameters to improve network performance.
Taking an arbitrary device pair (i, j) as an example, the loss function is computed as follows:
It can be seen that in this example, the loss function quantitatively measures, for the voice data of any two devices in the current device set, the difference between the label estimated by the distance comparison model and the true label, and that difference is used to adjust the parameters of the distance comparison model until the model's prediction accuracy meets the requirement.
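The loss equation itself is given in the source only as a figure. A plausible sketch, assuming a RankNet-style formulation in which predicted and true pair scores are mapped to probabilities by a sigmoid and compared with cross-entropy, is shown below; the exact equation in the patent may differ.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def pair_loss(predicted_score: float, true_score: float) -> float:
    """Cross-entropy between sigmoid-normalized predicted and true pair scores
    (an assumed RankNet-style pairwise loss, not the patent's exact equation)."""
    p = sigmoid(predicted_score)  # model's belief about the pair's distance order
    t = sigmoid(true_score)       # target probability derived from the true labels
    eps = 1e-12                   # numerical guard against log(0)
    return -(t * math.log(p + eps) + (1.0 - t) * math.log(1.0 - p + eps))

# A prediction matching the true score incurs less loss than a contradicting one:
print(pair_loss(2.0, 2.0) < pair_loss(-2.0, 2.0))  # True
```

Minimizing this loss over all device pairs pushes the network outputs toward reproducing the true relative distance ordering.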
It can be seen that in this embodiment of the present application, the device first obtains training data and then trains a preset distance comparison model according to the reference voice data of the multiple voice data sets and a preset loss function, obtaining a trained distance comparison model. The training data includes multiple voice data sets, each containing multiple reference voice data in one-to-one correspondence with multiple devices, where each reference voice datum is the voice data obtained by the corresponding device collecting the sound of the sound source target, and the multiple voice data sets correspond to voice data collected in different sound collection environments. Meanwhile, the loss function characterizes the loss of the distance comparison model in terms of how accurately it predicts the relative distance relationship, in the same sound collection environment, between the two devices of a device pair and the sound source target, a device pair consisting of any two of the multiple devices. The distance comparison model thus acquires the ability to predict the relative nearness of devices to the sound source target, and since the reference voice data are not restricted in the number of channels, the high hardware requirements and algorithmic complexity of conventional microphone-array sound source localization algorithms can be overcome, which is beneficial to the efficiency and applicability of relative distance relationship prediction.
An embodiment of the present application provides an apparatus for determining a distance relationship, which may be an arbitration device. Specifically, the apparatus is configured to perform the steps performed by the arbitration device in the above method for determining a distance relationship. The apparatus provided in this embodiment may include modules corresponding to the corresponding steps.
In this embodiment of the present application, the apparatus for determining a distance relationship may be divided into functional modules according to the above method examples. For example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. The division of modules in the embodiments of the present application is schematic and is only a logical functional division; other division manners are possible in actual implementation.
In the case where each functional module is divided according to each function, FIG. 5 shows a possible schematic structural diagram of the apparatus for determining a distance relationship involved in the above embodiment. As shown in FIG. 5, the distance relationship determining apparatus 5 is applied to the arbitration device 200 in the device control system 10, and includes:
an acquisition unit 50, configured to acquire multiple sound collection data in one-to-one correspondence with multiple devices, where each of the multiple sound collection data includes reference voice data obtained by the corresponding device collecting the sound of a sound source target, and the device identifier of that device; and
a determining unit 51, configured to determine, according to the reference voice data and device identifiers in the multiple sound collection data and a pre-trained distance comparison model, at least one relative distance identifier in one-to-one correspondence with at least one of the multiple devices, where each of the at least one relative distance identifier indicates the position of the corresponding device in a device distance relationship sequence, the device distance relationship sequence being a sequence formed by sorting the multiple devices according to a preset distance-based sorting strategy, and the distance being the distance between a device and the sound source target.
In a possible example, in determining the at least one relative distance identifier in one-to-one correspondence with at least one of the multiple devices according to the reference voice data and device identifiers in the multiple sound collection data and the pre-trained distance comparison model, the determining unit 51 is specifically configured to: align the frequency response capability of each of the multiple reference voice data in the multiple sound collection data, obtaining multiple frequency-response-aligned target voice data in one-to-one correspondence with the multiple sound collection data; and determine, according to the multiple target voice data, the device identifiers in the multiple sound collection data and the pre-trained distance comparison model, the at least one relative distance identifier in one-to-one correspondence with at least one of the multiple devices.
In a possible example, in aligning the frequency response capability of the multiple reference voice data in the multiple sound collection data to obtain multiple frequency-response-aligned target voice data, the determining unit 51 is specifically configured to perform the following operations for each reference voice datum in the multiple sound collection data, obtaining the multiple frequency-response-aligned target voice data: obtain, according to the device identifier of the currently processed reference voice datum, the preset frequency-response unit impulse response of the current device relative to a reference device; and convolve the reference voice datum with that unit impulse response and perform gain adjustment, obtaining a frequency-response-aligned target voice datum.
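The convolve-then-adjust-gain step above can be sketched as follows. The impulse response values and the RMS-based gain rule are illustrative assumptions; the patent does not specify a particular gain strategy.

```python
import numpy as np

def align_frequency_response(signal: np.ndarray,
                             impulse_response: np.ndarray,
                             target_rms: float = 0.1) -> np.ndarray:
    """Convolve the reference voice signal with the device's unit impulse response
    (measured relative to a reference device), then rescale to a target RMS level."""
    aligned = np.convolve(signal, impulse_response, mode="full")[: len(signal)]
    rms = np.sqrt(np.mean(aligned ** 2))
    if rms > 0:
        aligned = aligned * (target_rms / rms)  # gain adjustment
    return aligned
```

With an identity impulse response (`[1.0]`) the convolution leaves the waveform unchanged and only the gain adjustment applies, which makes the behavior easy to verify.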
In a possible example, in determining the at least one relative distance identifier in one-to-one correspondence with at least one of the multiple devices according to the target voice data, the device identifiers in the multiple sound collection data and the pre-trained distance comparison model, the determining unit 51 is specifically configured to: perform multi-dimensional feature extraction on each of the multiple target voice data, obtaining multiple voice feature sets in one-to-one correspondence with the multiple target voice data, each voice feature set including multi-dimensional feature extraction results; perform feature fusion on each of the multiple voice feature sets, obtaining multiple fused voice features; and determine, according to the multiple fused voice features, the device identifiers in the multiple sound collection data and the pre-trained distance comparison model, the at least one relative distance identifier in one-to-one correspondence with at least one of the multiple devices.
In a possible example, in determining the at least one relative distance identifier in one-to-one correspondence with at least one of the multiple devices according to the reference voice data and device identifiers in the multiple sound collection data and the pre-trained distance comparison model, the determining unit 51 is specifically configured to: perform multi-dimensional feature extraction on each of the multiple sound collection data, obtaining multiple voice feature sets in one-to-one correspondence with the multiple sound collection data, each of the multiple voice feature sets including multi-dimensional feature extraction results; perform feature fusion on each of the multiple voice feature sets, obtaining multiple fused voice features; and determine, according to the multiple fused voice features, the device identifiers in the multiple sound collection data and the pre-trained distance comparison model, the at least one relative distance identifier in one-to-one correspondence with at least one of the multiple devices.
In a possible example, in performing multi-dimensional feature extraction on each of the multiple sound collection data to obtain multiple voice feature sets in one-to-one correspondence with the multiple sound collection data, the determining unit 51 is specifically configured to: align the frequency response capability of each of the multiple reference voice data in the multiple sound collection data, obtaining multiple frequency-response-aligned target voice data in one-to-one correspondence with the multiple sound collection data; and perform multi-dimensional feature extraction on each of the multiple target voice data, obtaining multiple voice feature sets in one-to-one correspondence with the multiple target voice data.
In a possible example, in aligning the frequency response capability of each of the multiple reference voice data in the multiple sound collection data to obtain multiple frequency-response-aligned target voice data in one-to-one correspondence with the multiple sound collection data, the determining unit 51 is specifically configured to perform the following operations for each reference voice datum in the multiple sound collection data, obtaining the multiple frequency-response-aligned target voice data: obtain, according to the device identifier associated with the currently processed reference voice datum, the preset frequency-response unit impulse response of the current device relative to a reference device; and convolve the reference voice datum with that unit impulse response and perform gain adjustment, obtaining a frequency-response-aligned target voice datum.
In a possible example, in performing multi-dimensional feature extraction on each of the multiple target voice data to obtain multiple voice feature sets in one-to-one correspondence with the multiple target voice data, the determining unit 51 is specifically configured to perform the following operations for each of the multiple target voice data, obtaining the multiple voice feature sets: extract scalar voice features and vector voice features of the currently processed target voice datum; and perform dimensionality reduction and secondary feature extraction on the vector voice features, obtaining vector-derived voice features.
In a possible example, in extracting the scalar voice features and vector voice features of the currently processed target voice datum, the determining unit 51 is specifically configured to: preprocess the currently processed target voice datum, obtaining preprocessed target voice data; and extract scalar voice features and vector voice features of the preprocessed target voice data, where the preprocessing includes at least one of the following: silence suppression, pre-emphasis through a high-frequency filter, framing, and windowing.
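The pre-emphasis, framing, and windowing steps named above can be sketched as follows, using conventional speech-processing defaults (0.97 pre-emphasis coefficient, 25 ms frames with a 10 ms hop at 16 kHz, Hamming window); these values are illustrative assumptions, not values specified in the patent.

```python
import numpy as np

def preprocess(signal: np.ndarray, frame_len: int = 400, hop: int = 160,
               pre_emphasis: float = 0.97) -> np.ndarray:
    """Return an array of windowed frames from a raw mono voice signal."""
    # Pre-emphasis (a simple high-pass-like filter): y[n] = x[n] - a * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
    # Framing: split into overlapping fixed-length frames
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    frames = np.stack([emphasized[i * hop: i * hop + frame_len]
                       for i in range(n_frames)])
    # Windowing: apply a Hamming window to each frame to reduce spectral leakage
    return frames * np.hamming(frame_len)
```

Silence suppression (voice activity detection) would typically run before this chain and is omitted here for brevity.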
In a possible example, after determining the at least one relative distance identifier in one-to-one correspondence with at least one of the multiple devices according to the reference voice data and device identifiers in the multiple sound collection data and the pre-trained distance comparison model, the determining unit is further configured to: determine, according to the at least one relative distance identifier, the target device among the multiple devices for executing the voice instruction associated with the sound of the sound source target; if it is detected that the target device is a device other than the arbitration device, send indication information to the target device, the indication information instructing the target device to perform the operation indicated by the voice instruction; and if it is detected that the target device is the arbitration device, perform the operation indicated by the voice instruction.
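The arbitration decision above reduces to selecting a device from the relative distance identifiers. A minimal sketch, assuming (as elsewhere in the document) that identifier 1 marks the device nearest the sound source target, and using hypothetical device names:

```python
def choose_target(relative_ids: dict) -> str:
    """Return the device whose relative distance identifier is smallest,
    i.e. (by assumption) the device nearest the sound source target."""
    return min(relative_ids, key=relative_ids.get)

# Hypothetical identifiers produced by the distance comparison model:
ids = {"speaker": 2, "tv": 1, "phone": 3}
print(choose_target(ids))  # tv
```

The arbitration device would then either execute the voice instruction itself (if it is the chosen device) or forward the indication information to the chosen device.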
In the case where an integrated unit is used, FIG. 6 shows a schematic structural diagram of another apparatus for determining a distance relationship provided by an embodiment of the present application. In FIG. 6, the distance relationship determining apparatus 6 includes a processing module 60 and a communication module 61. The processing module 60 controls and manages the actions of the apparatus, for example the steps performed by the acquisition unit 50 and the determining unit 51, and/or other processes of the techniques described herein. The communication module 61 supports interaction between the apparatus and other devices. As shown in FIG. 6, the distance relationship determining apparatus may further include a storage module 62, configured to store the program code and data of the apparatus.
The processing module 60 may be a processor or controller, for example a central processing unit (CPU), a general-purpose processor, a digital signal processor (DSP), an ASIC, an FPGA or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various exemplary logical blocks, modules and circuits described in connection with the present disclosure. The processor may also be a combination that implements a computing function, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor. The communication module 61 may be a transceiver, an RF circuit, a communication interface, or the like. The storage module 62 may be a memory.
All relevant content of the scenarios involved in the above method embodiments can be cited in the functional descriptions of the corresponding functional modules and is not repeated here. Both the distance relationship determining apparatus 5 and the distance relationship determining apparatus 6 can perform the steps performed by the arbitration device in the distance relationship determining method shown in FIG. 2.
An embodiment of the present application provides a device control apparatus, which may be an arbitration device. Specifically, the device control apparatus is configured to perform the steps performed by the target device in the above device control method. The device control apparatus provided in this embodiment may include modules corresponding to the corresponding steps.
In this embodiment of the present application, the device control apparatus may be divided into functional modules according to the above method examples. For example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. The division of modules in the embodiments of the present application is schematic and is only a logical functional division; other division manners are possible in actual implementation.
In the case where each functional module is divided according to each function, FIG. 7 shows a possible schematic structural diagram of the device control apparatus involved in the above embodiment. As shown in FIG. 7, the device control apparatus 7 is applied to a target device, and includes:
an acquisition unit 70, configured to acquire indication information from an arbitration device, the indication information being generated when the arbitration device determines, according to at least one relative distance identifier in one-to-one correspondence with at least one of multiple devices, that the target device among the multiple devices is to execute the voice instruction associated with the sound of a sound source target, the at least one relative distance identifier being obtained by the arbitration device performing the following operations: acquiring multiple sound collection data in one-to-one correspondence with the multiple devices, where each of the multiple sound collection data includes reference voice data obtained by the corresponding device collecting the sound of the sound source target and the device identifier of that device; and determining the at least one relative distance identifier according to the reference voice data and device identifiers in the multiple sound collection data and a pre-trained distance comparison model, where each of the at least one relative distance identifier indicates the position of the corresponding device in a device distance relationship sequence, the device distance relationship sequence being a sequence formed by sorting the multiple devices according to a preset distance-based sorting strategy, and the distance being the distance between a device and the sound source target; and
an execution unit 71, configured to perform, according to the indication information, the operation indicated by the voice instruction associated with the sound of the sound source target.
In a possible example, the target device is the device closest to the sound source target among the multiple devices.
In a possible example, the target device is the arbitration device; alternatively, the target device is a device other than the arbitration device among the multiple devices.
In the case where an integrated unit is used, FIG. 8 shows a schematic structural diagram of another device control apparatus provided by an embodiment of the present application. In FIG. 8, the device control apparatus 8 includes a processing module 80 and a communication module 81. The processing module 80 controls and manages the actions of the apparatus, for example the steps performed by the acquisition unit 70 and the execution unit 71, and/or other processes of the techniques described herein. The communication module 81 supports interaction between the apparatus and other devices. As shown in FIG. 8, the device control apparatus may further include a storage module 82, configured to store the program code and data of the apparatus.
The processing module 80 may be a processor or controller, for example a central processing unit (CPU), a general-purpose processor, a digital signal processor (DSP), an ASIC, an FPGA or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various exemplary logical blocks, modules and circuits described in connection with the present disclosure. The processor may also be a combination that implements a computing function, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor. The communication module 81 may be a transceiver, an RF circuit, a communication interface, or the like. The storage module 82 may be a memory.
All relevant content of the scenarios involved in the above method embodiments can be cited in the functional descriptions of the corresponding functional modules and is not repeated here. Both the device control apparatus 7 and the device control apparatus 8 can perform the steps performed by the target device in the device control method shown in FIG. 3.
An embodiment of the present application provides a training apparatus for a distance comparison model, which may be a model training device for training the model. Specifically, the training apparatus is configured to perform the steps performed by the model training device in the above training method for the distance comparison model. The training apparatus provided in this embodiment may include modules corresponding to the corresponding steps.
In this embodiment of the present application, the training apparatus for the distance comparison model may be divided into functional modules according to the above method examples. For example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. The division of modules in the embodiments of the present application is schematic and is only a logical functional division; other division manners are possible in actual implementation.
In the case where each functional module is divided according to each function, FIG. 9 shows a possible schematic structural diagram of the training apparatus for the distance comparison model involved in the above embodiment. As shown in FIG. 9, the training apparatus 9 for the distance comparison model is applied to a model training device, and includes:
an acquisition unit 90, configured to acquire training data, the training data including multiple voice data sets, each of which contains multiple reference voice data in one-to-one correspondence with multiple devices, where each of the multiple reference voice data is the voice data obtained by the corresponding device collecting the sound of a sound source target, the multiple voice data sets correspond to voice data sets collected in different sound collection environments, and a sound collection environment includes at least the location of the sound source target; and
a training unit 91, configured to train a preset distance comparison model according to the reference voice data of the plurality of voice data sets and a preset loss function to obtain a trained distance comparison model, where the loss function characterizes the loss of the distance comparison model in terms of the accuracy of predicting the relative distance relationship, in the same sound collection environment, between the two devices of a device pairing group and the sound source target, and a device pairing group consists of any two of the plurality of devices.
In a possible example, the relative distance relationship is characterized by a score defined for the event that, of the two devices, the first device is closer to the sound source target than the second device, and the value of the score is associated with a distance difference, where the distance difference is the difference between a first distance and a second distance, the first distance is the distance between the first device and the sound source target, and the second distance is the distance between the second device and the sound source target.
In a possible example, the score is calculated from at least one score of at least one pair of adjacent devices that form a direct or indirect adjacency relationship between the two devices;
the score of a pair of adjacent devices is calculated from the two relative distance identifiers of the two devices in that pair, where a relative distance identifier corresponds to a prediction result of the distance comparison model and is used to indicate the position of the corresponding device in a device distance relationship sequence, the device distance relationship sequence is a sequence formed by sorting the plurality of devices according to a preset distance-based sorting strategy, and the distance refers to the distance between a device and the sound source target.
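By way of a non-limiting illustration of the score construction above, the sketch below ties the closeness score to the distance difference through a sigmoid and composes the score of a possibly non-adjacent device pair from the adjacent pairs that connect them in the distance-sorted sequence. The sigmoid link and the averaging rule are assumed choices for illustration only; this application does not fix a particular scoring function.

```python
import math

def closeness_score(delta: float, scale: float = 1.0) -> float:
    """Score of the event 'the first device is closer than the second',
    a monotone function of the distance difference (second minus first)."""
    return 1.0 / (1.0 + math.exp(-scale * delta))

def adjacent_pair_score(rank_near: int, rank_far: int) -> float:
    """Score for one adjacent pair, computed from the two relative
    distance identifiers (positions in the distance-sorted sequence)."""
    return closeness_score(float(rank_far - rank_near))

def score_via_chain(rank: dict, a: str, b: str) -> float:
    """Compose the (a, b) score from the adjacent pairs linking a and b
    in the device distance relationship sequence (assumed composition
    rule: average of the adjacent-pair scores along the chain)."""
    lo, hi = sorted((rank[a], rank[b]))
    steps = [adjacent_pair_score(i, i + 1) for i in range(lo, hi)]
    s = sum(steps) / len(steps)
    # orient: if a is the nearer device keep s, otherwise mirror it
    return s if rank[a] < rank[b] else 1.0 - s
```

With ranks such as {tv: 0, speaker: 1, phone: 2}, `score_via_chain(rank, "tv", "phone")` exceeds 0.5 (the tv is predicted nearer), and the two orientations of a pair score sum to 1.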
In a possible example, as regards training the preset distance comparison model according to the reference voice data of the plurality of voice data sets and the preset loss function to obtain the trained distance comparison model, the training unit 91 is specifically configured to: divide the training data into a training set and a test set, where the training set includes some of the plurality of voice data sets; and train the preset distance comparison model at least once using the training set, until the accuracy with which the trained distance comparison model predicts the distance comparison results of the test set exceeds a preset accuracy.
In a possible example, the training includes forward propagation and back-propagation optimization;
in the forward propagation, predicted relative distance identifiers are calculated using the voice features of a voice data set;
in the back-propagation optimization, a predicted score and a true score are calculated using the predicted relative distance identifiers and the true relative distance identifiers, the loss of the distance comparison model is calculated using the loss function, the predicted score and the true score, and the parameters of the distance comparison model are adjusted according to that loss.
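The forward/back-propagation training just described can be sketched as a pairwise ranking loop. The following is a minimal illustration under assumed choices (synthetic per-device features, a linear scoring model, and a logistic pairwise loss); none of these choices are mandated by this application. The forward pass predicts the score of the event "first device closer", and the backward pass adjusts the parameters using the gradient of the loss between predicted and true scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: each sample is a pair of per-device voice
# feature vectors from the same sound collection environment, labelled 1.0
# if the first device was truly closer to the sound source, else 0.0.
def make_pair(closer_first: bool):
    near = rng.normal(1.0, 0.1, size=4)  # stand-in "near device" features
    far = rng.normal(0.3, 0.1, size=4)   # stand-in "far device" features
    return (near, far, 1.0) if closer_first else (far, near, 0.0)

pairs = [make_pair(bool(i % 2)) for i in range(200)]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(4)  # linear model: higher score = predicted nearer
for _ in range(300):
    grad = np.zeros_like(w)
    for xa, xb, y in pairs:
        p = sigmoid(w @ (xa - xb))   # forward: predicted pair score
        grad += (p - y) * (xa - xb)  # backward: d(logistic loss)/dw
    w -= 0.1 * grad / len(pairs)

def predict_closer(xa, xb) -> bool:
    """Predict whether the first device is closer than the second."""
    return bool(sigmoid(w @ (xa - xb)) > 0.5)
```

After training, the model orders fresh synthetic pairs correctly, which is the pairwise prediction accuracy the loss function targets.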
In the case where an integrated unit is used, FIG. 10 is a schematic structural diagram of another training apparatus for a distance comparison model provided by an embodiment of the present application. In FIG. 10, the training apparatus 10 for the distance comparison model includes a processing module 100 and a communication module 101. The processing module 100 is configured to control and manage the actions of the training apparatus, for example, the steps performed by the obtaining unit 90 and the training unit 91, and/or other processes of the techniques described herein. The communication module 101 is configured to support interaction between the training apparatus and other devices. As shown in FIG. 10, the training apparatus may further include a storage module 102, configured to store the program code and data of the training apparatus.
The processing module 100 may be a processor or a controller, for example a central processing unit (CPU), a general-purpose processor, a digital signal processor (DSP), an ASIC, an FPGA or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and may implement or execute the various exemplary logical blocks, modules and circuits described in connection with the disclosure of the present application. The processor may also be a combination that implements a computing function, for example a combination of one or more microprocessors, a combination of a DSP and a microprocessor, and so on. The communication module 101 may be a transceiver, an RF circuit, a communication interface, or the like. The storage module 102 may be a memory.
All the relevant content of the scenarios involved in the above method embodiments can be cited in the functional descriptions of the corresponding functional modules, and is not repeated here. Both the training apparatus 9 and the training apparatus 10 for the distance comparison model can perform the steps performed by the model training device in the training method for the distance comparison model shown in FIG. 4.
The above embodiments may be implemented in whole or in part by software, hardware, firmware or any combination thereof. When implemented by software, the above embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center in a wired or wireless manner. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or data center containing one or more sets of usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium. The semiconductor medium may be a solid-state drive.
An embodiment of the present application further provides a computer storage medium, where the computer storage medium stores a computer program for electronic data exchange, and the computer program causes a computer to execute some or all of the steps of any method described in the above method embodiments; the computer includes an electronic device.
An embodiment of the present application further provides a computer program product, where the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute some or all of the steps of any method described in the above method embodiments. The computer program product may be a software installation package, and the computer includes an electronic device.
It should be understood that, in the various embodiments of the present application, the magnitudes of the sequence numbers of the above processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed method, apparatus and system may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; the division of the units is only a logical functional division, and other division manners are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses or units, and may be electrical, mechanical or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purposes of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be physically present separately, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute some of the steps of the methods described in the embodiments of the present invention. The storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
Although the present invention is disclosed as above, the present invention is not limited thereto. Any person skilled in the art can easily conceive of changes or substitutions without departing from the spirit and scope of the present invention, and can make various alterations and modifications, including combinations of the above different functions and implementation steps, and implementations in software and hardware, all of which fall within the protection scope of the present invention.
Claims (23)
- A method for determining a distance relationship, characterized in that the method is applied to an arbitration device and comprises:
obtaining a plurality of pieces of sound collection data in one-to-one correspondence with a plurality of devices, wherein each piece of sound collection data comprises reference voice data obtained by the corresponding device collecting the sound of a sound source target and a device identifier of that device;
determining, according to the reference voice data and the device identifiers in the plurality of pieces of sound collection data and a pre-trained distance comparison model, at least one relative distance identifier in one-to-one correspondence with at least one of the plurality of devices, wherein each relative distance identifier is used to indicate the position of the corresponding device in a device distance relationship sequence, the device distance relationship sequence is a sequence formed by sorting the plurality of devices according to a preset distance-based sorting strategy, and the distance refers to the distance between a device and the sound source target.
- The method according to claim 1, characterized in that determining, according to the reference voice data and the device identifiers in the plurality of pieces of sound collection data and the pre-trained distance comparison model, the at least one relative distance identifier in one-to-one correspondence with at least one of the plurality of devices comprises:
performing frequency response capability alignment on each of the plurality of pieces of reference voice data in the plurality of pieces of sound collection data to obtain a plurality of pieces of frequency-response-aligned target voice data in one-to-one correspondence with the plurality of pieces of sound collection data;
determining, according to the plurality of pieces of target voice data, the device identifiers in the plurality of pieces of sound collection data and the pre-trained distance comparison model, the at least one relative distance identifier in one-to-one correspondence with at least one of the plurality of devices.
- The method according to claim 2, characterized in that performing frequency response capability alignment on the plurality of pieces of reference voice data in the plurality of pieces of sound collection data to obtain the plurality of pieces of frequency-response-aligned target voice data comprises:
performing the following operations for each piece of reference voice data in the plurality of pieces of sound collection data to obtain the plurality of pieces of frequency-response-aligned target voice data:
obtaining, according to the device identifier of the currently processed reference voice data, a preset frequency response unit impulse response of the current device relative to a reference device;
performing a convolution operation on the reference voice data and the frequency response unit impulse response and performing gain adjustment, to obtain frequency-response-aligned target voice data.
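As a non-limiting sketch of the alignment operation recited in the preceding claim: look up a preset unit impulse response by device identifier, convolve it with the recording, and apply a gain adjustment. The impulse-response values and the RMS-based gain rule below are illustrative assumptions; the claim does not fix either.

```python
import numpy as np

# Hypothetical per-device presets: unit impulse responses measured against
# a chosen reference device (the values here are made up for illustration).
IMPULSE_BANK = {"device-A": np.array([1.0, 0.25, 0.05])}

def align_frequency_response(reference_audio, device_id, target_rms=0.1):
    """Convolve the recording with the device's preset frequency response
    unit impulse response, then apply a gain adjustment (here: RMS
    normalisation to a common level, one possible gain rule)."""
    ir = IMPULSE_BANK[device_id]
    aligned = np.convolve(reference_audio, ir, mode="full")[: len(reference_audio)]
    rms = np.sqrt(np.mean(aligned ** 2))
    return aligned * (target_rms / rms) if rms > 0 else aligned
```

For example, aligning a 440 Hz test tone keeps the original sample count (the convolution tail is trimmed) and lands exactly on the target RMS level.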
- The method according to claim 2 or 3, characterized in that determining, according to the target voice data, the device identifiers in the plurality of pieces of sound collection data and the pre-trained distance comparison model, the at least one relative distance identifier in one-to-one correspondence with at least one of the plurality of devices comprises:
performing multi-dimensional feature extraction on each piece of target voice data to obtain a plurality of voice feature sets in one-to-one correspondence with the plurality of pieces of target voice data, each voice feature set comprising multi-dimensional feature extraction results;
performing feature fusion on each of the plurality of voice feature sets to obtain a plurality of fused voice features;
determining, according to the plurality of fused voice features, the device identifiers in the plurality of pieces of sound collection data and the pre-trained distance comparison model, the at least one relative distance identifier in one-to-one correspondence with at least one of the plurality of devices.
- The method according to claim 1, characterized in that determining, according to the reference voice data and the device identifiers in the plurality of pieces of sound collection data and the pre-trained distance comparison model, the at least one relative distance identifier in one-to-one correspondence with at least one of the plurality of devices comprises:
performing multi-dimensional feature extraction on each piece of sound collection data in the plurality of pieces of sound collection data to obtain a plurality of voice feature sets in one-to-one correspondence with the plurality of pieces of sound collection data, each voice feature set comprising multi-dimensional feature extraction results;
performing feature fusion on each of the plurality of voice feature sets to obtain a plurality of fused voice features;
determining, according to the plurality of fused voice features, the device identifiers in the plurality of pieces of sound collection data and the pre-trained distance comparison model, the at least one relative distance identifier in one-to-one correspondence with at least one of the plurality of devices.
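A non-limiting sketch of the multi-dimensional feature extraction and fusion recited in the preceding claim. The particular features (frame energy, zero-crossing rate, coarse spectral band energies) and fusion by concatenation are illustrative assumptions; the claim does not fix the feature dimensions or the fusion operator.

```python
import numpy as np

def extract_feature_set(audio):
    """Hypothetical multi-dimensional feature set for one device's audio:
    scalar features (energy, zero-crossing rate) plus a vector feature
    (a coarse summary of the magnitude spectrum in 8 bands)."""
    energy = float(np.mean(audio ** 2))
    zcr = float(np.mean(np.abs(np.diff(np.sign(audio))) > 0))
    spectrum = np.abs(np.fft.rfft(audio))
    band_energy = np.array([float(np.mean(b)) for b in np.array_split(spectrum, 8)])
    return {"scalar": np.array([energy, zcr]), "vector": band_energy}

def fuse(feature_set):
    """Feature fusion by concatenation of the scalar and vector parts
    into one fused voice feature (one simple fusion choice)."""
    return np.concatenate([feature_set["scalar"], feature_set["vector"]])
```

For a 1600-sample tone this yields a 2-element scalar part, an 8-element vector part, and a 10-element fused feature.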
- The method according to claim 5, characterized in that performing multi-dimensional feature extraction on each piece of sound collection data to obtain the plurality of voice feature sets in one-to-one correspondence with the plurality of pieces of sound collection data comprises:
performing frequency response capability alignment on each of the plurality of pieces of reference voice data in the plurality of pieces of sound collection data to obtain a plurality of pieces of frequency-response-aligned target voice data in one-to-one correspondence with the plurality of pieces of sound collection data;
performing multi-dimensional feature extraction on each piece of target voice data to obtain the plurality of voice feature sets in one-to-one correspondence with the plurality of pieces of target voice data.
- The method according to claim 6, characterized in that performing frequency response capability alignment on each of the plurality of pieces of reference voice data in the plurality of pieces of sound collection data to obtain the plurality of pieces of frequency-response-aligned target voice data in one-to-one correspondence with the plurality of pieces of sound collection data comprises:
performing the following operations for each piece of reference voice data in the plurality of pieces of sound collection data to obtain the plurality of pieces of frequency-response-aligned target voice data in one-to-one correspondence with the plurality of pieces of sound collection data:
obtaining, according to the device identifier associated with the currently processed reference voice data, a preset frequency response unit impulse response of the current device relative to a reference device;
performing a convolution operation on the reference voice data and the frequency response unit impulse response and performing gain adjustment, to obtain frequency-response-aligned target voice data.
- The method according to claim 4 or claim 6, characterized in that performing multi-dimensional feature extraction on each piece of target voice data to obtain the plurality of voice feature sets in one-to-one correspondence with the plurality of pieces of target voice data comprises:
performing the following operations for each piece of target voice data to obtain the plurality of voice feature sets:
extracting scalar voice features and vector voice features of the currently processed target voice data;
performing dimensionality reduction and secondary feature extraction on the vector voice features to obtain vector-derived voice features.
- The method according to claim 8, characterized in that extracting the scalar voice features and the vector voice features of the currently processed target voice data comprises:
preprocessing the currently processed target voice data to obtain preprocessed target voice data;
extracting the scalar voice features and the vector voice features of the preprocessed target voice data;
wherein the preprocessing includes at least one of the following: silence suppression, pre-emphasis through a high-frequency filter, framing, and windowing.
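A non-limiting sketch of the preprocessing steps listed in the preceding claim: a crude energy-gated silence suppression (a stand-in for a real voice activity detector), first-order pre-emphasis, overlapping framing, and Hamming windowing. The frame length, hop, pre-emphasis coefficient and threshold are illustrative assumptions, not values from this application.

```python
import numpy as np

def preprocess(audio, frame_len=400, hop=160, alpha=0.97, silence_thresh=1e-4):
    """Return windowed frames after the four preprocessing steps."""
    # silence suppression: drop samples in low-energy stretches
    voiced = audio[np.abs(audio) > silence_thresh]
    # pre-emphasis: first-order high-frequency filter y[n] = x[n] - alpha*x[n-1]
    emphasized = np.append(voiced[0], voiced[1:] - alpha * voiced[:-1])
    # framing with overlap, then a Hamming window applied per frame
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    window = np.hamming(frame_len)
    return np.stack([
        emphasized[i * hop: i * hop + frame_len] * window
        for i in range(n_frames)
    ])
```

Applied to a short non-silent tone, this yields a 2-D array of fixed-length windowed frames ready for the scalar/vector feature extraction of claim 8.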
- The method according to any one of claims 1-9, characterized in that after determining, according to the reference voice data and the device identifiers in the plurality of pieces of sound collection data and the pre-trained distance comparison model, the at least one relative distance identifier in one-to-one correspondence with at least one of the plurality of devices, the method further comprises:
determining, according to the at least one relative distance identifier, a target device among the plurality of devices for executing a voice instruction associated with the sound of the sound source target;
if it is detected that the target device is a device other than the arbitration device, sending indication information to the target device, the indication information being used to instruct the target device to perform the operation indicated by the voice instruction;
if it is detected that the target device is the arbitration device, performing the operation indicated by the voice instruction.
- A device control method, characterized in that the method is applied to a target device and comprises:
obtaining indication information of an arbitration device, wherein the indication information is generated when the arbitration device determines, according to at least one relative distance identifier in one-to-one correspondence with at least one of a plurality of devices, that the target device among the plurality of devices is to execute a voice instruction associated with the sound of a sound source target, and the at least one relative distance identifier is obtained by the arbitration device performing the following operations: obtaining a plurality of pieces of sound collection data in one-to-one correspondence with the plurality of devices, wherein each piece of sound collection data comprises reference voice data obtained by the corresponding device collecting the sound of the sound source target and a device identifier of that device; and determining, according to the reference voice data and the device identifiers in the plurality of pieces of sound collection data and a pre-trained distance comparison model, the at least one relative distance identifier, wherein each relative distance identifier is used to indicate the position of the corresponding device in a device distance relationship sequence, the device distance relationship sequence is a sequence formed by sorting the plurality of devices according to a preset distance-based sorting strategy, and the distance refers to the distance between a device and the sound source target;
performing, according to the indication information, the operation indicated by the voice instruction associated with the sound of the sound source target.
- The method according to claim 11, characterized in that the target device is the device, among the plurality of devices, closest to the sound source target.
- The method according to claim 11 or 12, characterized in that the target device is the arbitration device; or, the target device is a device, among the plurality of devices, other than the arbitration device.
- A method for training a distance comparison model, characterized by comprising:
obtaining training data, wherein the training data comprises a plurality of voice data sets, each voice data set in the plurality of voice data sets comprises a plurality of pieces of reference voice data in one-to-one correspondence with a plurality of devices, each piece of reference voice data is voice data obtained by the corresponding device collecting the sound of a sound source target, the plurality of voice data sets correspond to voice data sets collected in different sound collection environments, and a sound collection environment comprises at least the location of the sound source target;
training a preset distance comparison model according to the reference voice data of the plurality of voice data sets and a preset loss function to obtain a trained distance comparison model, wherein the loss function characterizes the loss of the distance comparison model in terms of the accuracy of predicting the relative distance relationship, in the same sound collection environment, between the two devices of a device pairing group and the sound source target, and a device pairing group consists of any two of the plurality of devices.
- The method according to claim 14, characterized in that the relative distance relationship is characterized by a score defined for the event that, of the two devices, the first device is closer to the sound source target than the second device, and the value of the score is associated with a distance difference, where the distance difference is the difference between a first distance and a second distance, the first distance is the distance between the first device and the sound source target, and the second distance is the distance between the second device and the sound source target.
- The method according to claim 15, characterized in that the score is calculated from at least one score of at least one pair of adjacent devices that form a direct or indirect adjacency relationship between the two devices;
the score of a pair of adjacent devices is calculated from the two relative distance identifiers of the two devices in that pair, where a relative distance identifier corresponds to a prediction result of the distance comparison model and is used to indicate the position of the corresponding device in a device distance relationship sequence, the device distance relationship sequence is a sequence formed by sorting the plurality of devices according to a preset distance-based sorting strategy, and the distance refers to the distance between a device and the sound source target.
- The method according to claim 16, characterized in that training the preset distance comparison model according to the reference voice data of the plurality of voice data sets and the preset loss function to obtain the trained distance comparison model comprises:
dividing the training data into a training set and a test set, where the training set includes some of the plurality of voice data sets;
training the preset distance comparison model at least once using the training set, until the accuracy with which the trained distance comparison model predicts the distance comparison results of the test set exceeds a preset accuracy.
- The method of claim 17, wherein the training comprises forward propagation and back-propagation optimization; in the forward propagation, a predicted relative distance identifier is calculated using the voice features of a voice data set; in the back-propagation optimization, a predicted score and a true score are calculated using the predicted relative distance identifier and the true relative distance identifier, the loss of the distance comparison model is calculated using the loss function, the predicted score and the true score, and the parameters of the distance comparison model are adjusted according to the loss of the distance comparison model.
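Claim 18's forward pass (predicted score from voice features) and back-propagation (loss from predicted vs. true score, then parameter adjustment) can be sketched on a linear pairwise model with a cross-entropy loss. The linear form, the loss choice, and the learning rate are illustrative assumptions; the patent does not fix the model architecture:

```python
import math

def pairwise_loss_and_grad(w, feat_a, feat_b, true_score):
    """Forward pass: predicted score that device A is closer than B,
    from a linear comparison of the two feature vectors.
    Backward pass: cross-entropy loss against the true score and its
    gradient with respect to the weights w."""
    diff = [a - b for a, b in zip(feat_a, feat_b)]
    logit = sum(wi * di for wi, di in zip(w, diff))
    pred = 1.0 / (1.0 + math.exp(-logit))
    eps = 1e-12  # numeric guard against log(0)
    loss = -(true_score * math.log(pred + eps)
             + (1 - true_score) * math.log(1 - pred + eps))
    grad = [(pred - true_score) * di for di in diff]
    return pred, loss, grad

def sgd_step(w, grad, lr=0.1):
    """Adjust model parameters against the gradient of the loss."""
    return [wi - lr * gi for wi, gi in zip(w, grad)]
```

With the true score set to 1 (device A really is closer), repeated steps push the predicted score toward 1 and the loss down, which is the adjustment the claim describes.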
- An apparatus for determining a distance relationship, applied to an arbitration device, the apparatus comprising: an acquisition unit configured to acquire multiple pieces of sound collection data in one-to-one correspondence with multiple devices, each piece of sound collection data including reference voice data obtained by the corresponding device collecting the sound of a sound source target and the device identifier of that device; and a determination unit configured to determine, according to the reference voice data and device identifiers in the multiple pieces of sound collection data and a pre-trained distance comparison model, at least one relative distance identifier in one-to-one correspondence with at least one device among the multiple devices, each relative distance identifier indicating the position of the corresponding device in a device distance relationship sequence, the device distance relationship sequence being a sequence formed by sorting the multiple devices according to a preset distance-based sorting strategy, the distance being the distance between a device and the sound source target.
- A device control apparatus, applied to a target device, the apparatus comprising: an acquisition unit configured to acquire indication information from an arbitration device, the indication information being generated when the arbitration device determines, according to at least one relative distance identifier in one-to-one correspondence with at least one device among multiple devices, that the target device among the multiple devices is to execute a voice instruction associated with the sound of a sound source target, the at least one relative distance identifier being obtained by the arbitration device performing the following operations: acquiring multiple pieces of sound collection data in one-to-one correspondence with the multiple devices, each piece of sound collection data including reference voice data obtained by the corresponding device collecting the sound of the sound source target and the device identifier of that device; and determining the at least one relative distance identifier according to the reference voice data and device identifiers in the multiple pieces of sound collection data and a pre-trained distance comparison model, each relative distance identifier indicating the position of the corresponding device in a device distance relationship sequence, the device distance relationship sequence being a sequence formed by sorting the multiple devices according to a preset distance-based sorting strategy, the distance being the distance between a device and the sound source target; and an execution unit configured to perform, according to the indication information, the operation indicated by the voice instruction associated with the sound of the sound source target.
- A training apparatus for a distance comparison model, comprising: an acquisition unit configured to acquire training data, the training data including multiple voice data sets, each voice data set containing multiple pieces of reference voice data in one-to-one correspondence with multiple devices, each piece of reference voice data being voice data obtained by the corresponding device collecting the sound of a sound source target, and the multiple voice data sets corresponding to voice data sets collected in different sound collection environments, a sound collection environment including at least the location of the sound source target; and a training unit configured to train a preset distance comparison model according to the reference voice data of the multiple voice data sets and a preset loss function to obtain a trained distance comparison model, the loss function characterizing the loss of the distance comparison model in terms of the accuracy of predicting the relative distance relationship, in the same sound collection environment, between the two devices of a device pair and the sound source target, a device pair consisting of any two devices among the multiple devices.
- An electronic device, comprising: one or more processors; and one or more memories for storing a program, the one or more memories and the program being configured such that the one or more processors control the device to perform the steps in the method of any one of claims 1-10, claims 11-13 or claims 14-18.
- A computer-readable storage medium storing a computer program for electronic data exchange, wherein the computer program causes a computer to perform the method of any one of claims 1-10, claims 11-13 or claims 14-18.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110273250.2 | 2021-03-10 | ||
CN202110273250.2A CN115083436A (en) | 2021-03-10 | 2021-03-10 | Distance relation determination method, equipment control method, model training method and related device |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022188560A1 true WO2022188560A1 (en) | 2022-09-15 |
Family
ID=83226337
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/072703 WO2022188560A1 (en) | 2021-03-10 | 2022-01-19 | Methods for distance relationship determination, device control and model training, and related apparatuses |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN115083436A (en) |
WO (1) | WO2022188560A1 (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105843581A (en) * | 2016-03-21 | 2016-08-10 | 腾讯科技(深圳)有限公司 | Frequency response calibration method, server, terminal device, and frequency response calibration system |
CN107507625A (en) * | 2016-06-14 | 2017-12-22 | 讯飞智元信息科技有限公司 | Sound source distance determines method and device |
CN109391528A (en) * | 2018-08-31 | 2019-02-26 | 百度在线网络技术(北京)有限公司 | Awakening method, device, equipment and the storage medium of speech-sound intelligent equipment |
US20190287526A1 (en) * | 2016-11-10 | 2019-09-19 | Nuance Communications, Inc. | Techniques for language independent wake-up word detection |
CN111128169A (en) * | 2019-12-30 | 2020-05-08 | 云知声智能科技股份有限公司 | Voice wake-up method and device |
CN111192589A (en) * | 2020-01-16 | 2020-05-22 | 云知声智能科技股份有限公司 | Voice wake-up method and device |
CN111294704A (en) * | 2020-01-22 | 2020-06-16 | 北京松果电子有限公司 | Audio processing method, device and storage medium |
CN111833863A (en) * | 2019-04-22 | 2020-10-27 | 阿里巴巴集团控股有限公司 | Voice control system, method and apparatus, and computing device and storage medium |
- 2021
  - 2021-03-10: CN CN202110273250.2A patent/CN115083436A/en active Pending
- 2022
  - 2022-01-19: WO PCT/CN2022/072703 patent/WO2022188560A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN115083436A (en) | 2022-09-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110364143B (en) | Voice awakening method and device and intelligent electronic equipment | |
WO2021208287A1 (en) | Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium | |
WO2021093449A1 (en) | Wakeup word detection method and apparatus employing artificial intelligence, device, and medium | |
CN112562691B (en) | Voiceprint recognition method, voiceprint recognition device, computer equipment and storage medium | |
WO2019101123A1 (en) | Voice activity detection method, related device, and apparatus | |
US10861480B2 (en) | Method and device for generating far-field speech data, computer device and computer readable storage medium | |
CN110021307B (en) | Audio verification method and device, storage medium and electronic equipment | |
JP2021516369A (en) | Mixed speech recognition method, device and computer readable storage medium | |
CN107799126A (en) | Sound end detecting method and device based on Supervised machine learning | |
CN108711429B (en) | Electronic device and device control method | |
TW202008352A (en) | Method, device, audio interaction system, and storage medium for azimuth estimation | |
EP3274988A1 (en) | Controlling electronic device based on direction of speech | |
CN105679310A (en) | Method and system for speech recognition | |
CN110706707B (en) | Method, apparatus, device and computer-readable storage medium for voice interaction | |
US11222652B2 (en) | Learning-based distance estimation | |
US9953633B2 (en) | Speaker dependent voiced sound pattern template mapping | |
JP2023546703A (en) | Multichannel voice activity detection | |
WO2023279691A1 (en) | Speech classification method and apparatus, model training method and apparatus, device, medium, and program | |
CN112185425B (en) | Audio signal processing method, device, equipment and storage medium | |
CN104282303A (en) | Method for conducting voice recognition by voiceprint recognition and electronic device thereof | |
US11763806B1 (en) | Speaker recognition adaptation | |
CN114664288A (en) | Voice recognition method, device, equipment and storage medium | |
WO2022188560A1 (en) | Methods for distance relationship determination, device control and model training, and related apparatuses | |
CN114464184B (en) | Method, apparatus and storage medium for speech recognition | |
US20230113883A1 (en) | Digital Signal Processor-Based Continued Conversation |
Legal Events
Date | Code | Title | Description
---|---|---|---
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22766087; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 22766087; Country of ref document: EP; Kind code of ref document: A1 |