WO2020006699A1 - Method and apparatus for speech processing - Google Patents

Method and apparatus for speech processing

Info

Publication number
WO2020006699A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
voice information
detection parameter
information
playback device
Prior art date
Application number
PCT/CN2018/094464
Other languages
English (en)
French (fr)
Inventor
陈尚松
屈亚新
李智
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to PCT/CN2018/094464 priority Critical patent/WO2020006699A1/zh
Priority to CN201880095355.XA priority patent/CN112400205A/zh
Publication of WO2020006699A1 publication Critical patent/WO2020006699A1/zh


Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation

Definitions

  • the present application relates to the field of speech processing, and more particularly, to a method and an apparatus for speech processing.
  • One of the key technologies of smart speaker products is echo cancellation, that is, during speech recognition, the smart speaker can cancel the sound collected from the microphone, such as the sound output by the smart speaker together with a person's voice command, against the sound output by the audio decoder in the smart speaker, so as to retain only the voice command of the person that needs to be recognized and to improve the speech recognition rate.
  • the present application provides a method and a device for speech processing, which can improve the speech recognition rate.
  • a method for speech processing includes: acquiring a first detection parameter, the first detection parameter being used to indicate a signal change amount of signal transmission between an electronic device and a voice playback device; and determining a target voice according to the first detection parameter, first voice information and second voice information, where the first voice information is voice information to be played that is sent by the electronic device to the voice playback device, the second voice information is voice information obtained by mixing the target voice with third voice information, and the third voice information is voice information played by the voice playback device after receiving the voice information to be played.
  • the embodiment of the present application obtains a first detection parameter for indicating a signal change amount of signal transmission between the electronic device and the voice playback device, and determines the target voice according to the first detection parameter, the first voice information, and the second voice information obtained by mixing the target voice with the third voice information. Since the first detection parameter is introduced in the process of determining the target voice, the echo cancellation effect is improved, so that the determined target voice contains fewer echoes, thereby improving the recognition rate of the target voice.
  • determining the target voice according to the first detection parameter, the first voice information, and the second voice information includes: determining the third voice information according to the first detection parameter and the first voice information; and determining the target voice according to the third voice information and the second voice information.
  • the third voice information may be determined based on the first voice information and the first detection parameter, and the target voice may be determined based on the third voice information and the second voice information; that is, the target voice is determined from the mixed voice information that includes the third voice information and the target voice, thereby improving the speech recognition rate.
  • before determining the target voice, the method further includes: determining the first detection parameter corresponding to the device identification ID of the voice playback device by searching a mapping table, where the mapping table includes a mapping relationship between the device ID of at least one voice playback device and at least one detection parameter.
  • the mapping table is stored in a storage device of the electronic device.
  • a mapping table of different voice playback devices and detection parameters that are measured in advance at the time of starting work or before leaving the factory can be stored in the electronic device, which helps the electronic device quickly obtain the corresponding detection parameter for a given voice playback device and improves speech recognition efficiency.
  • mapping table sent by the server is received.
  • a mapping table of different voice playback devices and detection parameters measured in advance at the time of starting work or before leaving the factory may be sent to the server and stored in the storage device of the server; when the electronic device needs to determine the target voice, the mapping table is sent to the electronic device, thereby saving the storage space of the electronic device and improving the compatibility of the electronic device with various voice devices.
  • the acquiring the first detection parameter includes: sending a device identification ID of the voice playback device to a server, so that the server determines the first detection parameter according to the device ID of the voice playback device; and receiving the first detection parameter sent by the server.
  • the electronic device may send the device ID of the voice playback device to the server, so that the server determines the first detection parameter according to the device ID of the voice playback device and the mapping table, and sends the first detection parameter to the electronic device, thereby saving the storage space of the electronic device while improving speech recognition efficiency.
  • the method before storing the mapping table, the method further includes: generating the mapping table.
  • the generating the mapping table includes: determining the first detection parameter according to the fourth voice information and the fifth voice information, where the fourth voice information is voice information sent by the electronic device to the voice playback device The fifth voice information is voice information received by the electronic device in response to the fourth voice information; and a mapping relationship between the device ID of the voice playback device and the first detection parameter is stored in the mapping table.
  • the electronic device sends fourth voice information to the voice playback device, receives fifth voice information in response to the fourth voice information, determines the first detection parameter according to the fourth voice information and the fifth voice information, and stores the mapping relationship between the device ID of the first voice device and the first detection parameter in the mapping table, thereby helping to improve the voice recognition rate.
  • the first detection parameter includes at least one of a delay amount, a volume change amount, and a frequency change amount caused by signal transmission.
  • the electronic device can determine the third voice information from the first voice information in combination with at least one of the delay amount, volume change amount, and frequency change amount of the voice information, and then determine the target voice more accurately based on the third voice information and the second voice information, so that the electronic device can accurately determine the target voice, thereby improving the recognition rate of the target voice.
  • the method further includes: identifying the target voice; and when the target voice is a voice instruction, instructing the electronic device to perform a command operation.
  • the electronic device can identify whether the target voice is a voice command, and in the case that the target voice is a voice command, perform an accurate command operation according to the voice command, thereby improving the user experience.
  • a method for speech processing includes: receiving a device identification ID of a voice playback device sent by an electronic device; determining a first detection parameter corresponding to the voice playback device according to the device ID of the voice playback device and a mapping table, where the mapping table includes a mapping relationship between the device ID of at least one voice playback device and at least one detection parameter; and sending the first detection parameter to the electronic device.
  • the server may store a mapping table of at least one voice playback device and at least one detection parameter. After receiving the device ID of the voice playback device sent by the electronic device, the server may determine the first detection parameter according to the device ID of the voice playback device and the mapping table, and send the first detection parameter to the electronic device, thereby saving the storage space of the electronic device.
  • the method before the electronic device sends the first detection parameter to the voice playback device, the method further includes: receiving a mapping table of a device ID of the voice playback device and the first detection parameter.
  • the mapping relationship stored by the server may be sent to the server by the electronic device after being determined, so that the storage space of the electronic device can be saved.
  • the mapping relationship stored by the server may be sent to the server by the electronic device after determining the mapping relationship between the device ID and the detection parameter of a pair of playback devices.
  • before the electronic device sends the first detection parameter to the voice playback device, the method further includes: receiving a mapping table, where the mapping table includes a mapping relationship between the device ID of at least one voice playback device and at least one detection parameter.
  • the mapping relationships stored by the server may be sent to the server only after the electronic device has determined the mapping relationships between the device IDs and detection parameters of multiple playback devices, thereby saving signaling overhead.
  • a voice processing device may be a device or a chip in the device.
  • the device has the functions of implementing the embodiments of the first aspect described above. This function can be realized by hardware, and can also be implemented by hardware executing corresponding software.
  • the hardware or software includes one or more units corresponding to the functions described above.
  • the device when the device is a device, the device includes a processing module and an acquisition module, and the processing module and the acquisition module may be implemented by a processor.
  • optionally, the device further includes a storage sub-module, which can be, for example, a memory. When the electronic device includes the storage sub-module, the storage sub-module is used to store computer-executable instructions, the processing module is connected to the storage sub-module, and the processing module executes the computer-executable instructions stored by the storage sub-module, so that the device performs the method of speech processing according to any one of the first aspect.
  • the chip when the device is a chip, the chip includes: a processing module and an acquisition module, the processing module and the acquisition module may be implemented by a processor, and the chip may further include input / output Interface, pin, or circuit.
  • the processor may execute computer execution instructions stored in the storage module, so that a chip in the terminal executes the voice processing method of any one of the first aspects.
  • the storage module is a storage sub-module in the chip, such as a register, a cache, etc.
  • the storage module may also be a storage module in an electronic device outside the chip, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM), etc.
  • the processor mentioned above may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling program execution of the speech processing method of the first aspect.
  • a voice processing apparatus which includes:
  • a memory for storing a mapping table, where the mapping table includes a mapping relationship between a device ID of at least one voice playback device and at least one detection parameter;
  • a processor, configured to: acquire, according to the mapping table stored in the memory, a first detection parameter corresponding to the device identification ID of a voice playback device, where the first detection parameter is used to indicate a signal change amount of signal transmission between the voice processing apparatus and the voice playback device; and determine the target voice according to the first detection parameter, the first voice information, and the second voice information, where the first voice information is voice information to be played that is sent by the electronic device to the voice playback device, the second voice information is voice information obtained by mixing the target voice with third voice information, and the third voice information is voice information played by the voice playback device after receiving the voice information to be played.
  • the processor is used to: determine the third voice information according to the first detection parameter and the first voice information; and determine the target voice according to the third voice information and the second voice information.
  • the first detection parameter includes at least one of a delay amount, a volume change amount, and a frequency change amount caused by signal transmission.
  • a television set-top box includes an audio decoder, a sound collector, and the voice processing device described in the third aspect; the audio decoder is configured to decode received audio information to obtain the first voice information, and the sound collector is used to collect the second voice information.
  • a computer storage medium stores program code, where the program code is used to instruct instructions to execute the method in the first aspect or the second aspect or in any possible implementation manner thereof.
  • a computer program product containing instructions which when run on a computer, causes the computer to execute the method in the first or second aspect or any possible implementation thereof.
  • a voice processing system includes the device according to the third aspect and a voice playback device.
  • a processor is provided, which is coupled to a memory, and is configured to execute the method in the first or second aspect or any possible implementation manner thereof.
  • the electronic device obtains a first detection parameter for indicating a signal change amount of signal transmission between the electronic device and the voice playback device, and determines the target voice according to the first detection parameter, the first voice information, and the second voice information obtained by mixing the target voice with the third voice information. Since the first detection parameter is introduced in the process of determining the target voice, the echo cancellation effect is improved, so that the determined target voice contains fewer echoes. This further improves the recognition rate of the target speech.
  • FIG. 1 is a schematic diagram of a television playback system provided by the present application
  • FIG. 2 is a structural diagram of a television playback system provided by the present application.
  • FIG. 3 is a schematic flowchart of a voice processing provided by an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a voice processing apparatus according to an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a speech processing apparatus according to another embodiment of the present application.
  • the voice playback device in the embodiment of the present application may refer to a device with a voice playback function, such as a television, a mobile phone, a tablet computer, or an LCD display; the electronic device may refer to a processing device such as a set-top box, a stereo, an in-vehicle device, or a handheld device with a wireless communication function. This embodiment of the present application is not limited thereto.
  • This application can be applied to a home TV broadcast scenario, a car speech recognition scenario, or a scenario where the sound environment is complicated and a recordable tracking mark is provided, which is not limited in this application.
  • FIG. 1 is a television broadcast system provided by the present application.
  • the television broadcast system includes a television 110 and a set-top box 120.
  • the television 110 and the set-top box 120 may be connected through an interface such as a high definition multimedia interface (HDMI).
  • the television 110, as the voice playback device, receives the voice information sent by the set-top box 120 and plays it; the set-top box 120, as the electronic device, can collect the target voice uttered by the user 130, recognize the voice instructions in the target voice, and perform the relevant instruction operations.
  • the target voice may also be issued by the device.
  • FIG. 2 provides a structure of a set-top box in the above-mentioned television broadcasting system.
  • the set-top box 120 includes an audio decoder 121, a sound collector 122, an echo cancellation module 123, and a voice recognition module 124.
  • the audio decoder 121 decodes the received audio information to generate first voice information, and transmits the first voice information to the television 110 through, for example, HDMI, and simultaneously transmits the first voice information to the echo cancellation module 123;
  • the television 110 plays the first audio information transmitted by the audio decoder 121 for voice playback, such as voice playback through a speaker;
  • the sound collector 122 collects the second voice information. The second voice information is a mixed sound, which may include the third voice information played by the television 110 and the target voice. The target voice may be voice information uttered by a person, such as a voice instruction used to instruct the set-top box 120 to perform a related operation. It should be noted that the sound collector may be a microphone.
  • the echo cancellation module 123 obtains a first detection parameter, and the first detection parameter is used to indicate a signal change amount of the signal transmission between the set-top box 120 and the television 110.
  • the echo cancellation module 123 performs echo cancellation processing according to the first detection parameter, the first voice information, and the second voice information collected by the sound collector 122, that is, performs echo cancellation between the decoded audio information and the mixed sound collected by the sound collector to determine a target voice, and provides the determined target voice to the voice recognition module 124;
  • FIG. 2 provides an implementation manner in which the echo cancellation module 123 obtains the first detection parameter.
  • the storage module 125 in the set-top box 120 stores, in advance, a mapping table of the device identities (IDs) and detection parameters of different televisions; the set-top box 120 may determine the first detection parameter according to the device ID of the television and the mapping table, so that different detection parameters can be provided for televisions with different IDs.
  • because televisions of different brands and models have different acoustic structures, and the relative position of the set-top box and the television also differs between households, the difference between the third voice information played by the television and collected through the microphone in the set-top box and the first voice information sent by the audio decoder in the set-top box to the television via HDMI is not the same in every case. This application performs echo cancellation processing on the sound collected by the microphone according to the detection parameters adapted to the currently used television and the first voice information, which can improve echo cancellation.
  • the voice recognition module 124 recognizes the target voice, and instructs the set-top box 120 to perform related command operations when the target voice is recognized as a voice command.
  • the echo cancellation module 123 acquires the first detection parameter and determines the target voice according to the first detection parameter, the first voice information, and the second voice information. Since the first detection parameter is introduced in the process of determining the target voice, echo cancellation is improved, so that the determined target voice contains fewer echoes. Therefore, the voice recognition module 124 can recognize the target voice more accurately, improving the voice recognition rate.
  • FIG. 3 shows a schematic flowchart of speech processing according to an embodiment of the present application.
  • 301 Acquire a first detection parameter, where the first detection parameter is used to indicate a signal change amount of a signal transmission between an electronic device and a voice playback device.
  • the first detection parameter is used to indicate a signal change amount during signal transmission between the electronic device and the voice playback device, for example, signal loss, signal interference, and the like due to signal transmission between the electronic device and the voice playback device.
  • the voice playback device may be a device having a function of playing a voice, and may also be a device having a function of playing a voice and an image (that is, a device playing a video).
  • the electronic device may be a television set-top box.
  • step 301 may be performed by the echo cancellation module 123 shown in FIG. 2.
  • the electronic device in this embodiment of the present application may be a set-top box, a digital video disc (DVD) player, or another device capable of sending audio information to a voice playback device so that the voice playback device can play the audio information; this is not limited in this application.
  • the voice playback device may be a device capable of playing only voice, or a display (eg, a television) capable of playing both voice and video.
  • 302. Determine a target voice according to the first detection parameter, the first voice information, and the second voice information.
  • the first voice information is voice information to be played sent by the electronic device to the voice playback device.
  • the second voice information includes the target voice and third voice information, and the third voice information is voice information played by the voice playback device after receiving the voice information to be played.
  • the electronic device and the voice playback device can communicate through, for example, an HDMI interface. Due to the distance between the electronic device and the voice playback device, the first voice information is affected during the communication process; for example, the first voice information may suffer signal loss or be interfered with by other signals, so that the voice playback device receives the affected first voice information.
  • after the electronic device sends the first voice information to the voice playback device, it can receive the third voice information with which the voice playback device responds to the first voice information.
  • when the voice playback device plays the third voice information, if a target voice is present, the electronic device receives the mixed voice information of the third voice information and the target voice (that is, the second voice information). Therefore, since the first detection parameter is introduced in the process of determining the target voice, the echo cancellation effect is improved, so that the determined target voice contains fewer echoes, thereby improving the recognition rate of the target voice.
  • the third voice information may be the voice information obtained when the first voice information sent by the electronic device reaches the voice playback device after attenuation between the electronic device and the voice playback device, is played by the voice playback device, and then reaches the electronic device after further attenuation between the voice playback device and the electronic device.
  • the electronic device recognizes whether the target voice is a voice instruction through the voice recognition module 124, and in the case where the target voice is a voice instruction, performs an accurate command operation according to the voice instruction, thereby Improved user experience. If the target voice is not a voice command, the electronic device does not need to perform a command operation.
  • determining the target voice according to the first detection parameter, the first voice information, and the second voice information may specifically be: determining the third voice information according to the first voice information and the first detection parameter, and then determining the target voice according to the third voice information and the second voice information; that is, the voice information that the voice player can receive after the electronic device sends the first voice information is first determined according to the first detection parameter, and then the target voice in the mixed voice information that includes the third voice information and the target voice is determined according to the third voice information, thereby improving the voice recognition rate.
  • the execution subject of step 302 may be the echo cancellation module 123 shown in FIG. 2, that is, the echo cancellation module 123 determines the target voice according to the first voice information, the second voice information, and the first detection parameter, thereby improving Echo cancellation effect and improved speech recognition rate.
  • the voice instruction may be a voice instruction issued by a person or a voice instruction issued by another electronic device, for example, a voice instruction issued by a person may be recorded in advance.
  • the first voice information may be audio information decoded by the audio decoder 121 shown in FIG. 2.
  • the second voice information may be received by the sound collector 122 shown in FIG. 2 and sent to the echo cancellation module 123 shown in FIG. 2.
  • the sound collector may be a microphone.
  • the first detection parameter includes at least one of a delay amount, a volume change amount, and a frequency change amount caused by signal transmission.
  • the signal change amount of the signal transmission between the electronic device and the voice playback device may be the data characteristics by which the voice information changes as it is sent from the electronic device to the voice playback device; the data characteristics may be at least one of a delay amount, a volume change amount, and a frequency change amount, so that the echo cancellation module can determine the third voice information from the first voice information in combination with the signal change amount, and then determine the target voice more accurately according to the third voice information and the second voice information, so that the electronic device can accurately recognize the target voice, thereby improving the voice recognition rate.
  • the embodiment of the present application may store the first detection parameter in advance.
  • a mapping relationship between device identifiers (IDs) of different voice playback devices and detection parameters may be stored in advance, and the electronic device may determine the first detection parameter according to the device ID of the voice playback device and the mapping relationship; in this way, different detection parameters can be provided for different voice playback devices, so that the cancellation processing is more accurate, further improving the voice recognition rate.
  • the mapping relationship may be a mapping table, and the mapping table may be stored in a storage sub-module within the echo cancellation module, or in a storage module in the electronic device outside the echo cancellation module (for example, the storage module 125 shown in FIG. 2).
  • the storage module may be, for example, flash memory space.
  • mapping table can be determined before leaving the factory and stored statically. For example, in the production process, the audio characteristics of various models of voice playback devices are collected, and the mapping table of the models and detection parameters of different voice playback devices is obtained by simulating the working environment.
  • the mapping table is measured first. For example, when the electronic device is in the working environment for the first time, an echo detection is first performed to obtain a mapping table between the current model of the voice playback device and the electronic device.
  • for example, when the voice playback device is a television, the embodiment of the present application can handle echo cancellation in home scenarios for televisions of different brands, or for different televisions of the same brand, thereby improving the voice recognition rate.
  • the electronic device may generate the mapping table in advance.
  • the electronic device sends fourth voice information to the voice playback device, receives fifth voice information in response to the fourth voice information, determines the first detection parameter according to the fourth voice information and the fifth voice information, and stores the device ID of the first voice device and the first detection parameter in a mapping table.
  • the fourth voice information may be the same as the aforementioned first voice information, and accordingly, the fifth voice information is the same as the aforementioned third voice information.
  • for example, the electronic device is a set-top box, the first voice device is a TV, and the fourth voice information may be a pre-prepared audio file. The set-top box is connected to the TV through HDMI; the set-top box collects sound that includes the sound played by the TV and instruction information uttered by a person, compares the collected sound with the sound obtained by the set-top box decoding the audio file, records the first detection parameters (for example, the delay of the sound collected by the microphone, the change in audio volume, the change in frequency, and the like), and stores the first detection parameter paired with the device ID of the TV in the set-top box, thereby completing echo detection.
  • the frequency of the audio file is within the frequency range covered by the electronic device.
  • mapping table of device IDs and detection parameters of multiple different voice devices may be measured in advance.
  • mapping table may also be stored on the server.
  • the electronic device may send the device ID of the voice playback device to the server, so that the server determines the first detection parameter according to the device ID of the voice playback device and the mapping table, and sends the first detection parameter to the electronic device, thereby Save the storage space of the electronic device.
  • the server stores the mapping table, which improves the compatibility of the electronic device with various voice devices.
  • mapping table in the server may be sent to the server for storage after the echo cancellation module determines it.
  • the electronic device sends fourth voice information to the voice playback device, receives fifth voice information in response to the fourth voice information, and determines a second detection parameter according to the fourth voice information and the fifth voice information, and Sending the mapping table between the device ID of the second voice device and the second detection parameter to the server.
  • the electronic device may send the mapping table of the device ID and detection parameter of each voice device to the server as each pair is measured, or send the mapping tables to the server together after the measurement is completed; this is not limited in this application.
  • the server may be a physical server or a cloud server, which is not limited in this application.
  • the electronic device or server may periodically or according to requirements update the original mapping table or add a new mapping relationship between the device ID and detection parameters of the new voice playback device, so that each voice device The corresponding detection parameters are more accurate, thereby further improving the speech recognition rate.
  • the voice processing method in the embodiment of the present application obtains a first detection parameter for indicating a signal change amount of signal transmission between the electronic device and the voice playback device, and according to the first detection parameter, the first The voice information and the second voice information obtained by mixing the target voice with the third voice information determine the target voice. Since the first detection parameter is introduced in the process of determining the target voice, the echo cancellation effect is improved, so that the determined target voice Contains fewer echoes, which improves the recognition rate of the target speech.
  • FIG. 4 shows a schematic block diagram of a voice processing apparatus according to an embodiment of the present application.
  • the apparatus 400 for voice processing may include an obtaining module 410 and a determining module 420.
  • the obtaining module 410 and the determining module 420 may be the echo cancellation module 123 shown in FIG. 2.
  • the obtaining module 410 is configured to obtain a first detection parameter, where the first detection parameter is used to indicate a signal change amount of a signal transmission between an electronic device and a voice playback device;
  • the determining module 420 is configured to determine a target voice according to the first detection parameter, the first voice information, and the second voice information.
  • the first voice information is voice information to be played sent by the electronic device to the voice playback device.
  • the second voice information is voice information obtained by mixing the target voice and third voice information, and the third voice information is voice information played by the voice playback device after receiving the voice information to be played.
  • the determining module 420 is specifically configured to: determine the third voice information according to the first detection parameter and the first voice information; and determine the target voice according to the third voice information and the second voice information.
  • the obtaining module 410 is specifically configured to:
  • the first detection parameter corresponding to the device identification ID of the voice playback device is determined by looking up a mapping table, and the mapping table includes a mapping relationship between the device ID of at least one voice playback device and at least one detection parameter.
  • mapping table is stored in a storage device of the electronic device.
  • the apparatus 400 further includes:
  • a receiving module configured to receive the mapping table sent by the server.
  • the obtaining module 410 is specifically configured to: send the device identification ID of the voice playback device to a server, so that the server determines the first detection parameter according to the device ID of the voice playback device; and receive the first detection parameter sent by the server.
  • the first detection parameter includes at least one of a delay amount, a volume change amount, and a frequency change amount caused by signal transmission.
  • the apparatus 400 further includes a voice recognition module 430 for identifying the target voice, and instructing the electronic device to perform a command operation when the target voice is a voice command.
  • the voice recognition module 430 may be the voice recognition module 124 shown in FIG. 2.
  • the apparatus 400 for voice processing in the embodiment of the present application may be a device or a chip in the device.
  • FIG. 5 is a schematic structural diagram of a voice processing apparatus according to an embodiment of the present application.
  • the apparatus 500 may include an input-output interface 510 and a processor 520.
  • the apparatus 500 may further include a memory 530.
  • in this embodiment of the present application, the HDMI between the audio decoder 121 and the television 110 shown in FIG. 2, that is, the communication of the apparatus 500 with other modules, may be implemented through the input/output interface 510; the acquisition module 410 and the determination module 420 may be implemented by the processor 520, the speech recognition module 430 may also be implemented by the processor 520, and the storage sub-module storing the mapping table may be implemented by the memory 530.
  • the memory 530 may be used to store instruction information, and may also be used to store code, instructions, and the like executed by the processor 520.
  • when the device includes a storage sub-module, the storage sub-module is used to store computer-executable instructions, the processing module 320 is connected to the storage sub-module, and the processing module 320 executes the computer-executable instructions stored by the storage sub-module, so that the electronic device performs the foregoing method of speech processing.
  • the voice processing apparatus 500 may also execute the instructions in the storage module 125 shown in FIG. 2.
  • the storage module 125 may also be implemented by the memory 530, which is not limited in this application.
  • the speech processing apparatus 400 is a chip
  • the chip includes an obtaining module 410 and a determining module 420, and the obtaining module 410 and the determining module 420 may be implemented by the processor 520.
  • the chip further includes an input / output interface, a pin or a circuit, etc., for implementing the function of the HDMI.
  • the processor 520 may execute computer execution instructions stored in the storage module 125 shown in FIG. 2.
  • the storage module 125 may also be used to store a mapping table.
  • the storage module is a storage module in the chip, such as a register, a cache, etc.
  • the storage module may also be a storage module located in the electronic device outside the chip, for example the storage module 125, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM), etc.
  • the processor 520 may be an integrated circuit chip and has a signal processing capability. In the implementation process, each step of the foregoing method embodiment may be completed by using an integrated logic circuit of hardware in a processor or an instruction in a form of software.
  • the aforementioned processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general-purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • a software module may be located in a mature storage medium such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, or an electrically erasable programmable memory, a register, and the like.
  • the storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
  • the memory 530 in the embodiment of the present application may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory.
  • the non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory.
  • the volatile memory may be random access memory (RAM), which is used as an external cache.
  • by way of example but not limitation, many forms of RAM are available, such as a static random access memory (SRAM), a dynamic random access memory (DRAM), a synchronous dynamic random access memory (SDRAM), a double data rate synchronous dynamic random access memory (DDR SDRAM), an enhanced synchronous dynamic random access memory (ESDRAM), a synchronous link dynamic random access memory (SLDRAM), and a direct rambus random access memory (DR RAM).
  • An embodiment of the present application further provides a computer storage medium, and the computer storage medium may store program instructions for instructing any of the foregoing methods.
  • the storage medium may specifically be a memory 530.
  • An embodiment of the present application further provides a chip system including a processor, which is configured to support a distributed unit, a centralized unit, a terminal device, and an electronic device in implementing the functions involved in the foregoing embodiments, for example, generating or processing the data and/or information involved in the foregoing methods.
  • the chip system further includes a memory, where the memory is configured to store program instructions and data necessary for the distributed unit, the centralized unit, and the terminal device and the electronic device.
  • the chip system can be composed of chips, and can also include chips and other discrete devices.
  • the disclosed systems, devices, and methods may be implemented in other ways.
  • the device embodiments described above are only schematic.
  • the division of the unit is only a logical function division.
  • multiple units or components may be combined or integrated into another system, or some features can be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, which may be electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objective of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each of the units may exist separately physically, or two or more units may be integrated into one unit.
  • the functions are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • the technical solution of the present application is essentially a part that contributes to the existing technology or a part of the technical solution can be embodied in the form of a software product.
  • the computer software product is stored in a storage medium, including several instructions used to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method described in the embodiments of the present application.
  • the aforementioned storage media include: U disks, mobile hard disks, read-only memories (ROM), random access memories (RAM), magnetic disks or optical disks, and other media that can store program codes .

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)
  • Telephone Function (AREA)

Abstract

This application provides a method and an apparatus for speech processing. The method obtains a first detection parameter used to indicate the signal change amount of signal transmission between an electronic device and a voice playback device, and determines a target voice according to the first detection parameter, first voice information, and second voice information obtained by mixing the target voice with third voice information. Because the first detection parameter is introduced in the process of determining the target voice, the echo cancellation effect is improved, so that the determined target voice contains fewer echoes, thereby improving the recognition rate of the target voice.

Description

Method and apparatus for speech processing
Technical field
This application relates to the field of speech processing, and more specifically, to a method and an apparatus for speech processing.
Background
The strong sales of smart speaker products have driven the development of a large number of voice interaction products. One of the key technologies of smart speaker products is echo cancellation: during speech recognition, the smart speaker cancels the sound collected from the microphone, for example the sound output by the smart speaker together with a person's voice command, against the sound output by the audio decoder in the smart speaker, so as to retain only the voice command of the person that needs to be recognized and to improve the speech recognition rate.
In a smart set-top box (STB) product form that has a speaker and is connected to a television through a high definition multimedia interface (HDMI), echo cancellation of the television sound needs to be supported. However, because there is a certain distance between the set-top box and the television, a conventional solution that performs echo cancellation on the sound collected by the microphone based only on the sound output by the audio decoder in the smart set-top box results in a low speech recognition rate.
Summary
This application provides a method and an apparatus for speech processing, which can improve the speech recognition rate.
According to a first aspect, a method for speech processing is provided, including: acquiring a first detection parameter, where the first detection parameter is used to indicate a signal change amount of signal transmission between an electronic device and a voice playback device; and determining a target voice according to the first detection parameter, first voice information and second voice information, where the first voice information is voice information to be played that is sent by the electronic device to the voice playback device, the second voice information is voice information obtained by mixing the target voice with third voice information, and the third voice information is voice information played by the voice playback device after receiving the voice information to be played.
In the embodiments of this application, the first detection parameter, which indicates the signal change amount of signal transmission between the electronic device and the voice playback device, is acquired, and the target voice is determined according to the first detection parameter, the first voice information and the second voice information obtained by mixing the target voice with the third voice information. Because the first detection parameter is introduced in the process of determining the target voice, the echo cancellation effect is improved, so that the determined target voice contains fewer echoes, thereby improving the recognition rate of the target voice.
In some possible implementations, determining the target voice according to the first detection parameter, the first voice information and the second voice information includes: determining the third voice information according to the first detection parameter and the first voice information; and determining the target voice according to the third voice information and the second voice information.
In the embodiments of this application, the third voice information may be determined according to the first voice information and the first detection parameter, and the target voice may then be determined according to the third voice information and the second voice information; that is, the target voice is determined from the mixed voice information that includes the third voice information and the target voice, thereby improving the speech recognition rate.
In some possible implementations, before the target voice is determined, the method further includes: determining the first detection parameter corresponding to the device identification (ID) of the voice playback device by looking up a mapping table, where the mapping table includes a mapping relationship between the device ID of at least one voice playback device and at least one detection parameter.
In this way, different detection parameters can be provided for different voice playback devices, so that the cancellation processing is performed more accurately, further improving the speech recognition rate.
In some possible implementations, the mapping table is stored in a storage device of the electronic device.
In the embodiments of this application, a mapping table of different voice playback devices and detection parameters, measured in advance when the device starts working or before it leaves the factory, may be stored in the electronic device, which helps the electronic device quickly obtain the corresponding detection parameter for a given voice playback device and improves speech recognition efficiency.
In some possible implementations, the mapping table sent by a server is received.
In the embodiments of this application, a mapping table of different voice playback devices and detection parameters, measured in advance when the device starts working or before it leaves the factory, may be sent to a server and stored in a storage device of the server; when the electronic device needs to determine the target voice, the mapping table is sent to the electronic device, which saves storage space of the electronic device and improves the compatibility of the electronic device with various voice devices.
In some possible implementations, acquiring the first detection parameter includes: sending the device identification ID of the voice playback device to a server, so that the server determines the first detection parameter according to the device ID of the voice playback device; and receiving the first detection parameter sent by the server.
The electronic device may send the device ID of the voice playback device to the server, so that the server determines the first detection parameter according to the device ID of the voice playback device and the mapping table and sends the first detection parameter to the electronic device, thereby saving storage space of the electronic device while improving speech recognition efficiency.
In some possible implementations, before the mapping table is stored, the method further includes: generating the mapping table.
In some possible implementations, generating the mapping table includes: determining the first detection parameter according to fourth voice information and fifth voice information, where the fourth voice information is voice information sent by the electronic device to the voice playback device and the fifth voice information is voice information received by the electronic device in response to the fourth voice information; and storing a mapping relationship between the device ID of the voice playback device and the first detection parameter in the mapping table.
The electronic device sends the fourth voice information to the voice playback device, receives the fifth voice information in response to the fourth voice information, determines the first detection parameter according to the fourth voice information and the fifth voice information, and stores the mapping relationship between the device ID of the first voice device and the first detection parameter in the mapping table, which helps improve the speech recognition rate.
In some possible implementations, the first detection parameter includes at least one of a delay amount, a volume change amount and a frequency change amount caused by signal transmission.
In this way, the electronic device can determine the third voice information from the first voice information in combination with at least one of the delay amount, volume change amount and frequency change amount of the voice information, and then determine the target voice more accurately according to the third voice information and the second voice information, so that the electronic device can accurately determine the target voice, improving the recognition rate of the target voice.
In some possible implementations, after the target voice is determined, the method further includes: recognizing the target voice; and when the target voice is a voice instruction, instructing the electronic device to perform the instruction operation.
After determining the target voice, the electronic device can recognize whether the target voice is a voice instruction and, if it is, perform the correct instruction operation according to the voice instruction, thereby improving the user experience.
According to a second aspect, a method for speech processing is provided, including:
receiving a device identification ID of a voice playback device sent by an electronic device;
determining, according to the device ID of the voice playback device and a mapping table, a first detection parameter corresponding to the voice playback device, where the mapping table includes a mapping relationship between the device ID of at least one voice playback device and at least one detection parameter; and
sending the first detection parameter to the electronic device.
The server may store a mapping table of at least one voice playback device and at least one detection parameter. After receiving the device ID of the voice playback device sent by the electronic device, the server may determine the first detection parameter according to the device ID of the voice playback device and the mapping table, and send the first detection parameter to the electronic device, thereby saving storage space of the electronic device.
In some possible implementations, before the electronic device sends the first detection parameter to the voice playback device, the method further includes: receiving a mapping table of the device ID of the voice playback device and the first detection parameter.
The mapping relationship stored by the server may be determined by the electronic device and then sent to the server, which saves storage space of the electronic device. In addition, the electronic device may send the mapping relationship between the device ID of one playback device and the corresponding detection parameter to the server as soon as that pair is determined.
In some possible implementations, before the electronic device sends the first detection parameter to the voice playback device, the method further includes:
receiving a mapping table, where the mapping table includes a mapping relationship between the device ID of at least one voice playback device and at least one detection parameter.
The mapping relationships stored by the server may also be sent to the server only after the electronic device has determined the mapping relationships between the device IDs and detection parameters of multiple playback devices, which saves signaling overhead.
According to a third aspect, an apparatus for speech processing is provided. The apparatus may be a device, or a chip in a device. The apparatus has the functions of implementing the embodiments of the first aspect. The functions may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more units corresponding to the foregoing functions.
In one possible design, when the apparatus is a device, the device includes a processing module and an acquisition module, and the processing module and the acquisition module may be implemented by a processor. Optionally, the device further includes a storage sub-module, which may be, for example, a memory. When the electronic device includes the storage sub-module, the storage sub-module is configured to store computer-executable instructions, the processing module is connected to the storage sub-module, and the processing module executes the computer-executable instructions stored in the storage sub-module, so that the device performs the method for speech processing according to any one of the first aspect.
In another possible design, when the apparatus is a chip, the chip includes a processing module and an acquisition module, and the processing module and the acquisition module may be implemented by a processor; the chip may further include an input/output interface, pins, or a circuit. The processor may execute the computer-executable instructions stored in a storage module, so that the chip in the terminal performs the method for speech processing according to any one of the first aspect. Optionally, the storage module is a storage sub-module in the chip, such as a register or a cache; the storage module may also be a storage module located outside the chip in the electronic device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM).
The processor mentioned anywhere above may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling program execution of the method for speech processing of the first aspect.
According to a fourth aspect, an apparatus for speech processing is provided, including:
a memory, configured to store a mapping table, where the mapping table includes a mapping relationship between the device ID of at least one voice playback device and at least one detection parameter; and
a processor, configured to:
acquire, according to the mapping table stored in the memory, a first detection parameter corresponding to the device identification ID of a voice playback device, where the first detection parameter is used to indicate a signal change amount of signal transmission between the speech processing apparatus and the voice playback device; and
determine a target voice according to the first detection parameter, first voice information and second voice information, where the first voice information is voice information to be played that is sent by the electronic device to the voice playback device, the second voice information is voice information obtained by mixing the target voice with third voice information, and the third voice information is voice information played by the voice playback device after receiving the voice information to be played.
In some possible implementations, the processor is configured to:
determine the third voice information according to the first detection parameter and the first voice information; and
determine the target voice according to the third voice information and the second voice information.
In some possible implementations, the first detection parameter includes at least one of a delay amount, a volume change amount and a frequency change amount caused by signal transmission.
According to a fifth aspect, a television set-top box is provided, including an audio decoder, a sound collector and the apparatus for speech processing according to the third aspect, where the audio decoder is configured to decode received audio information to obtain the first voice information, and the sound collector is configured to collect the second voice information.
According to a sixth aspect, a computer storage medium is provided, where the computer storage medium stores program code, and the program code includes instructions for performing the method in the first aspect or the second aspect or any possible implementation thereof.
According to a seventh aspect, a computer program product containing instructions is provided, which, when run on a computer, causes the computer to perform the method in the first aspect or the second aspect or any possible implementation thereof.
According to an eighth aspect, a speech processing system is provided, where the system includes the apparatus according to the third aspect and a voice playback device.
According to a ninth aspect, a processor is provided, which is coupled to a memory and configured to perform the method in the first aspect or the second aspect or any possible implementation thereof.
Based on the foregoing solutions, the electronic device acquires the first detection parameter, which indicates the signal change amount of signal transmission between the electronic device and the voice playback device, and determines the target voice according to the first detection parameter, the first voice information and the second voice information obtained by mixing the target voice with the third voice information. Because the first detection parameter is introduced in the process of determining the target voice, the echo cancellation effect is improved, so that the determined target voice contains fewer echoes, thereby improving the recognition rate of the target voice.
Brief description of the drawings
FIG. 1 is a schematic diagram of a television playback system provided by this application;
FIG. 2 is a structural diagram of a television playback system provided by this application;
FIG. 3 is a schematic flowchart of speech processing provided by an embodiment of this application;
FIG. 4 is a schematic structural diagram of a speech processing apparatus provided by an embodiment of this application;
FIG. 5 is a schematic structural diagram of a speech processing apparatus provided by another embodiment of this application.
Detailed description of embodiments
The technical solutions of this application are described below with reference to the accompanying drawings.
The voice playback device in the embodiments of this application may be a device with a voice playback function, such as a television, a mobile phone, a tablet computer or a liquid crystal display; the electronic device may be a processing device such as a set-top box, a stereo, an in-vehicle device or a handheld device with a wireless communication function. This is not limited in the embodiments of this application.
This application may be applied to a home television playback scenario, to an in-vehicle speech recognition scenario, or to a scenario in which the sound environment is complex and a recordable, trackable identifier is available; this is not limited in this application.
FIG. 1 shows a television playback system provided by this application. The television playback system includes a television 110 and a set-top box 120, and the television 110 and the set-top box 120 may be connected through an interface such as a high definition multimedia interface (HDMI). In this television playback system, the television 110 serves as the voice playback device and receives and plays the voice information sent by the set-top box 120; the set-top box 120 serves as the electronic device and can collect the target voice uttered by a user 130, recognize the voice instruction in the target voice and perform the corresponding instruction operation.
Optionally, the target voice may also be issued by a device.
For the foregoing television playback system, FIG. 2 shows a structure of the set-top box in the television playback system. The set-top box 120 includes an audio decoder 121, a sound collector 122, an echo cancellation module 123 and a voice recognition module 124.
The audio decoder 121 decodes received audio information to generate first voice information, transmits the first voice information to the television 110 through, for example, HDMI, and at the same time transmits the first voice information to the echo cancellation module 123;
the television 110 plays the first voice information transmitted by the audio decoder 121, for example through a loudspeaker;
the sound collector 122 collects second voice information. The second voice information is a mixed sound, which may include the third voice information played by the television 110 and the target voice; the target voice may be voice information uttered by a person, such as a voice instruction used to instruct the set-top box 120 to perform a related operation. It should be noted that the sound collector may be a microphone.
The echo cancellation module 123 acquires a first detection parameter, where the first detection parameter is used to indicate the signal change amount of signal transmission between the set-top box 120 and the television 110. The echo cancellation module 123 performs echo cancellation processing according to the first detection parameter, the first voice information and the second voice information collected by the sound collector 122, that is, performs echo cancellation between the decoded audio information and the mixed sound collected by the sound collector to determine the target voice, and provides the determined target voice to the voice recognition module 124;
in one implementation, FIG. 2 provides an implementation in which the echo cancellation module 123 acquires the first detection parameter: the storage module 125 in the set-top box 120 stores, in advance, a mapping table of the device identities (IDs) and detection parameters of different televisions, and the set-top box 120 may determine the first detection parameter according to the device ID of the television and the mapping table, so that different detection parameters can be provided for televisions with different IDs.
It should be noted that, because televisions of different brands and models have different acoustic structures, and the relative position of the set-top box and the television also differs between households, the difference between the third voice information played by the television and collected through the microphone in the set-top box and the first voice information sent to the television over HDMI by the audio decoder in the set-top box is not the same in every case. This application performs echo cancellation processing on the sound collected by the microphone according to the detection parameter adapted to the currently used television and the first voice information, which can improve echo cancellation.
The voice recognition module 124 recognizes the target voice and, when the target voice is recognized as a voice instruction, instructs the set-top box 120 to perform the related instruction operation.
In this set-top box, because the echo cancellation module 123 acquires the first detection parameter and determines the target voice according to the first detection parameter, the first voice information and the second voice information, and because the first detection parameter is introduced in the process of determining the target voice, echo cancellation is improved, so that the determined target voice contains fewer echoes; therefore, the voice recognition module 124 can recognize the target voice more accurately, which improves the speech recognition rate.
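To make the data flow of FIG. 2 concrete, the following sketch outlines, under assumed interfaces, how the decoded reference audio, the microphone capture, and the detection parameter looked up for the connected television could be combined before recognition. The function names and the cancel/recognize/execute callables are hypothetical illustrations and are not defined by this application.

```python
from typing import Callable

def process(decoded_audio, mic_capture, tv_device_id: str, mapping_table: dict,
            cancel: Callable, recognize: Callable, execute: Callable) -> None:
    """Sketch of the FIG. 2 flow: decode -> send to TV (not shown) -> cancel echo -> recognize."""
    params = mapping_table[tv_device_id]                         # storage module 125: first detection parameter
    target_voice = cancel(mic_capture, decoded_audio, params)    # echo cancellation module 123
    command = recognize(target_voice)                            # voice recognition module 124
    if command is not None:                                      # act only when a voice instruction is found
        execute(command)                                         # set-top box performs the instruction operation
```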
With reference to the foregoing television playback system, FIG. 3 shows a schematic flowchart of speech processing according to an embodiment of this application.
301. Acquire a first detection parameter, where the first detection parameter is used to indicate a signal change amount of signal transmission between an electronic device and a voice playback device.
Specifically, the first detection parameter is used to indicate the signal change amount that occurs during signal transmission between the electronic device and the voice playback device, for example, the signal change caused by signal loss, signal interference and the like arising from the signal transmission between the electronic device and the voice playback device.
It should be noted that the voice playback device may be a device with a voice playback function, or a device that plays both voice and images (that is, a device that plays video). The electronic device may be a television set-top box.
Optionally, step 301 may be performed by the echo cancellation module 123 shown in FIG. 2.
Optionally, the electronic device in this embodiment of this application may be a set-top box, a digital video disc (DVD) player, or another device capable of sending audio information to a voice playback device so that the voice playback device can play the audio information; this is not limited in this application.
Optionally, the voice playback device may be a device that can play only voice, or a display (for example, a television) that can play both voice and images.
302. Determine a target voice according to the first detection parameter, first voice information and second voice information, where the first voice information is voice information to be played that is sent by the electronic device to the voice playback device, the second voice information includes the target voice and third voice information, and the third voice information is voice information played by the voice playback device after receiving the voice information to be played.
Specifically, the electronic device and the voice playback device may communicate through, for example, an HDMI interface. Because of the distance between the electronic device and the voice playback device, the first voice information is affected during the communication process; for example, the first voice information may suffer signal loss or be interfered with by other signals, so that what the voice playback device receives is the affected first voice information. After the electronic device sends the first voice information to the voice playback device, it can receive the third voice information with which the voice playback device responds to the first voice information. When the voice playback device plays the third voice information, if a target voice is present, what the electronic device receives is the mixed voice information of the third voice information and the target voice (that is, the second voice information). Therefore, because the first detection parameter is introduced in the process of determining the target voice, the echo cancellation effect is improved, so that the determined target voice contains fewer echoes, thereby improving the recognition rate of the target voice.
It should be understood that the third voice information may be the voice information obtained when the first voice information sent by the electronic device reaches the voice playback device after attenuation between the electronic device and the voice playback device, is played by the voice playback device, and then reaches the electronic device after further attenuation between the voice playback device and the electronic device.
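As a simplified way to read the preceding paragraph, and only as an illustrative assumption rather than a formulation given in this application, the third voice information can be modeled as a delayed and attenuated copy of the first voice information:

    s3(t) ≈ g · s1(t − τ)

where τ corresponds to the delay amount and g to the volume change amount carried by the first detection parameter. The collected second voice information is then s2(t) = s3(t) + s_target(t), so under this model the target voice is recovered approximately as s_target(t) ≈ s2(t) − g · s1(t − τ).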
Optionally, after determining the target voice, the electronic device uses the voice recognition module 124 to recognize whether the target voice is a voice instruction and, when the target voice is a voice instruction, performs the correct instruction operation according to the voice instruction, thereby improving the user experience. If the target voice is not a voice instruction, the electronic device does not need to perform an instruction operation.
Optionally, determining the target voice according to the first detection parameter, the first voice information and the second voice information may specifically be: determining the third voice information according to the first voice information and the first detection parameter, and then determining the target voice according to the third voice information and the second voice information; that is, the voice information that the voice player can receive after the electronic device sends the first voice information (that is, the third voice information) is first determined according to the first detection parameter, and then the target voice in the mixed voice information that includes the third voice information and the target voice is determined according to the third voice information, thereby improving the speech recognition rate.
Optionally, step 302 may be performed by the echo cancellation module 123 shown in FIG. 2; that is, the echo cancellation module 123 determines the target voice according to the first voice information, the second voice information and the first detection parameter, which improves the echo cancellation effect and the speech recognition rate.
Optionally, the voice instruction may be a voice instruction uttered by a person, or a voice instruction issued by another electronic device; for example, a voice instruction uttered by a person may be recorded in advance.
Optionally, the first voice information may be the audio information decoded by the audio decoder 121 shown in FIG. 2.
Optionally, the second voice information may be received by the sound collector 122 shown in FIG. 2 and sent to the echo cancellation module 123 shown in FIG. 2.
It should be understood that the sound collector may be a microphone.
Optionally, the first detection parameter includes at least one of a delay amount, a volume change amount and a frequency change amount caused by signal transmission.
Specifically, the signal change amount of the signal transmission between the electronic device and the voice playback device may be the data characteristics by which the voice information changes as it is sent from the electronic device to the voice playback device; the data characteristics may be at least one of a delay amount, a volume change amount and a frequency change amount. In this way, the echo cancellation module can determine the third voice information from the first voice information in combination with the signal change amount, and then determine the target voice more accurately according to the third voice information and the second voice information, so that the electronic device can accurately recognize the target voice, which improves the speech recognition rate.
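The following is a minimal sketch of how such a detection parameter could be applied, assuming it reduces to a delay expressed in samples and a linear volume gain; the frequency change amount mentioned above is omitted for brevity, and the function names are hypothetical rather than part of this application.

```python
import numpy as np

def estimate_third_voice(first_voice: np.ndarray, delay_samples: int, gain: float) -> np.ndarray:
    """Estimate the third voice information by delaying and attenuating the first voice information."""
    third = np.zeros_like(first_voice)
    if delay_samples < len(first_voice):
        third[delay_samples:] = gain * first_voice[:len(first_voice) - delay_samples]
    return third

def recover_target_voice(second_voice: np.ndarray, first_voice: np.ndarray,
                         delay_samples: int, gain: float) -> np.ndarray:
    """Cancel the estimated echo from the collected mixture to approximate the target voice."""
    third = estimate_third_voice(first_voice, delay_samples, gain)
    n = min(len(second_voice), len(third))
    return second_voice[:n] - third[:n]
```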
Optionally, in this embodiment of this application, the first detection parameter may be stored in advance.
Optionally, in this embodiment of this application, a mapping relationship between the device identities (IDs) of different voice playback devices and detection parameters may be stored in advance, and the electronic device may determine the first detection parameter according to the device ID of the voice playback device and the mapping relationship. In this way, different detection parameters can be provided for different voice playback devices, so that the cancellation processing is performed more accurately, further improving the speech recognition rate.
Specifically, the mapping relationship may be a mapping table, and the mapping table may be stored in a storage sub-module within the echo cancellation module, or in a storage module in the electronic device outside the echo cancellation module (for example, the storage module 125 shown in FIG. 2); this is not limited in this application. For example, the storage module may be flash memory space.
It should be noted that the mapping table may be determined before the device leaves the factory and stored statically. For example, during production, the audio characteristics of various models of voice playback devices are collected, and the mapping table of the models of different voice playback devices and detection parameters is obtained by simulating the working environment.
Alternatively, the electronic device first measures the mapping table when it starts working. For example, when the electronic device is in its working environment for the first time, an echo detection is first performed to obtain the mapping between the model of the current voice playback device and the detection parameter for this electronic device.
For example, when the voice playback device is a television, this embodiment of this application can handle echo cancellation in home scenarios for televisions of different brands, or for different televisions of the same brand, which improves the speech recognition rate.
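A minimal sketch of such a mapping table is shown below. The field names and example entries are hypothetical and chosen only for illustration; in practice the table could equally be a file in flash memory or a table held on a server.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DetectionParameter:
    delay_ms: float        # delay amount caused by signal transmission
    volume_gain: float     # volume change amount (linear ratio)
    freq_shift_hz: float   # frequency change amount

# Hypothetical mapping table: television device ID -> detection parameter.
MAPPING_TABLE = {
    "TV_BRAND_A_MODEL_1": DetectionParameter(20.0, 0.55, 0.0),
    "TV_BRAND_B_MODEL_7": DetectionParameter(35.0, 0.40, 0.0),
}

def get_detection_parameter(device_id: str) -> Optional[DetectionParameter]:
    """Return the stored parameter, or None so that echo detection can be triggered."""
    return MAPPING_TABLE.get(device_id)
```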
Optionally, the electronic device may generate the mapping table in advance.
Specifically, the electronic device sends fourth voice information to the voice playback device, receives fifth voice information in response to the fourth voice information, determines the first detection parameter according to the fourth voice information and the fifth voice information, and stores the device ID of the first voice device and the first detection parameter in the mapping table.
It should be understood that the fourth voice information may be the same as the foregoing first voice information, and correspondingly the fifth voice information is the same as the foregoing third voice information.
For example, the electronic device is a set-top box, the first voice device is a television, and the fourth voice information may be a pre-prepared audio file. The set-top box is connected to the television through HDMI; the set-top box collects sound that includes the sound played by the television and instruction information uttered by a person, compares the collected sound with the sound obtained by the set-top box decoding the audio file, records the first detection parameters (for example, the delay of the sound collected by the microphone, the change in audio volume, the change in frequency, and the like), and stores the first detection parameter paired with the device ID of the television in the set-top box, thereby completing echo detection.
It should be understood that the frequency of the audio file is within the frequency range covered by the electronic device.
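A minimal sketch of how the echo detection described above could estimate two of the detection parameters is given below; it assumes the delay is taken from the peak of the cross-correlation between the captured sound and the decoded test audio, and the volume change from the ratio of their RMS levels. The frequency change amount is not measured here, and all names are hypothetical.

```python
import numpy as np

def measure_detection_parameter(decoded: np.ndarray, captured: np.ndarray, sample_rate: int) -> dict:
    """Estimate delay (ms) and volume change (linear gain) between the decoded
    test audio file and the sound captured by the set-top box microphone."""
    # Delay: lag of the cross-correlation peak between capture and reference.
    corr = np.correlate(captured, decoded, mode="full")
    lag = int(np.argmax(corr)) - (len(decoded) - 1)
    delay_ms = 1000.0 * max(lag, 0) / sample_rate

    # Volume change: ratio of RMS levels of the captured and decoded signals.
    def rms(x: np.ndarray) -> float:
        return float(np.sqrt(np.mean(np.square(x))))

    gain = rms(captured) / rms(decoded) if rms(decoded) > 0 else 0.0
    return {"delay_ms": delay_ms, "volume_gain": gain}
```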
It should be noted that, in this embodiment of this application, mapping tables of the device IDs and detection parameters of multiple different voice devices may be measured in advance.
Optionally, the mapping table may also be stored on a server.
Specifically, the electronic device may send the device ID of the voice playback device to the server, so that the server determines the first detection parameter according to the device ID of the voice playback device and the mapping table and sends the first detection parameter to the electronic device, thereby saving storage space of the electronic device. In addition, storing the mapping table on the server improves the compatibility of the electronic device with various voice devices.
Optionally, the mapping table on the server may be determined by the echo cancellation module and then sent to the server for storage.
Specifically, the electronic device sends fourth voice information to the voice playback device, receives fifth voice information in response to the fourth voice information, determines a second detection parameter according to the fourth voice information and the fifth voice information, and sends the mapping table between the device ID of the second voice device and the second detection parameter to the server.
It should be noted that the electronic device may send the mapping table of the device ID and detection parameter of each voice device to the server as each pair is measured, or send the mapping tables to the server together after the measurement of multiple pairs is completed; this is not limited in this application.
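The exchange between the set-top box and the server can be sketched as follows. To keep the example self-contained, no real network API is used; the in-memory ParameterServer class and its method names are hypothetical stand-ins for whatever protocol an implementation would actually use.

```python
class ParameterServer:
    """Hypothetical server-side store for the device ID -> detection parameter mapping table."""

    def __init__(self):
        self.mapping_table = {}

    def upload_entry(self, device_id: str, parameter: dict) -> None:
        """The electronic device reports a measured (device ID, detection parameter) pair."""
        self.mapping_table[device_id] = parameter

    def query(self, device_id: str):
        """The electronic device asks for the first detection parameter by device ID."""
        return self.mapping_table.get(device_id)

# Client side: the set-top box uploads a freshly measured parameter, then later
# queries the server by device ID instead of keeping the whole table locally.
server = ParameterServer()
server.upload_entry("TV_BRAND_A_MODEL_1", {"delay_ms": 20.0, "volume_gain": 0.55})
first_detection_parameter = server.query("TV_BRAND_A_MODEL_1")
```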
Optionally, the server may be a physical server or a cloud server; this is not limited in this application.
Optionally, the electronic device or the server may, periodically or as required, update the existing mapping table or add to it a new mapping relationship between the device ID of a new voice playback device and its detection parameter, so that the detection parameter corresponding to each voice device is more accurate, thereby further improving the speech recognition rate.
Therefore, the method for speech processing in this embodiment of this application acquires the first detection parameter, which indicates the signal change amount of signal transmission between the electronic device and the voice playback device, and determines the target voice according to the first detection parameter, the first voice information and the second voice information obtained by mixing the target voice with the third voice information. Because the first detection parameter is introduced in the process of determining the target voice, the echo cancellation effect is improved, so that the determined target voice contains fewer echoes, thereby improving the recognition rate of the target voice.
FIG. 4 shows a schematic block diagram of an apparatus for speech processing according to an embodiment of this application.
It should be understood that the apparatus 400 for speech processing may include an obtaining module 410 and a determining module 420.
It should be noted that the obtaining module 410 and the determining module 420 may be the echo cancellation module 123 shown in FIG. 2.
The obtaining module 410 is configured to acquire a first detection parameter, where the first detection parameter is used to indicate a signal change amount of signal transmission between an electronic device and a voice playback device;
the determining module 420 is configured to determine a target voice according to the first detection parameter, first voice information and second voice information, where the first voice information is voice information to be played that is sent by the electronic device to the voice playback device, the second voice information is voice information obtained by mixing the target voice with third voice information, and the third voice information is voice information played by the voice playback device after receiving the voice information to be played.
Optionally, the determining module 420 is specifically configured to: determine the third voice information according to the first detection parameter and the first voice information; and determine the target voice according to the third voice information and the second voice information.
Optionally, the obtaining module 410 is specifically configured to: determine, by looking up a mapping table, the first detection parameter corresponding to the device identification ID of the voice playback device, where the mapping table includes a mapping relationship between the device ID of at least one voice playback device and at least one detection parameter.
Optionally, the mapping table is stored in a storage device of the electronic device.
Optionally, the apparatus 400 further includes: a receiving module, configured to receive the mapping table sent by a server.
Optionally, the obtaining module 410 is specifically configured to: send the device identification ID of the voice playback device to a server, so that the server determines the first detection parameter according to the device ID of the voice playback device; and receive the first detection parameter sent by the server.
Optionally, the first detection parameter includes at least one of a delay amount, a volume change amount and a frequency change amount caused by signal transmission.
Optionally, the apparatus 400 further includes a voice recognition module 430, configured to recognize the target voice and, when the target voice is a voice instruction, instruct the electronic device to perform the instruction operation.
It should be understood that the voice recognition module 430 may be the voice recognition module 124 shown in FIG. 2.
Optionally, the apparatus 400 for speech processing in this embodiment of this application may be a device, or a chip in a device.
It should be understood that the foregoing and other management operations and/or functions of the modules in the apparatus 400 for speech processing according to this embodiment of this application are respectively used to implement the corresponding steps of the foregoing methods; for brevity, details are not repeated here.
Optionally, if the apparatus 400 for speech processing is a device, the obtaining module 410 and the determining module 420 in this embodiment of this application may be implemented by a processor 520. FIG. 5 shows a schematic structural diagram of an apparatus for speech processing according to an embodiment of this application. As shown in FIG. 5, the apparatus 500 may include an input/output interface 510 and a processor 520. Optionally, the apparatus 500 may further include a memory 530. In this embodiment of this application, the HDMI between the audio decoder 121 and the television 110 shown in FIG. 2, that is, the communication of the apparatus 500 with other modules, may be implemented through the input/output interface 510; the obtaining module 410 and the determining module 420 may be implemented by the processor 520, the voice recognition module 430 may also be implemented by the processor 520, and the storage sub-module that stores the mapping table may be implemented by the memory 530. The memory 530 may be configured to store indication information, and may also be configured to store code, instructions and the like executed by the processor 520.
When the device includes a storage sub-module, the storage sub-module is configured to store computer-executable instructions, the processing module 320 is connected to the storage sub-module, and the processing module 320 executes the computer-executable instructions stored in the storage sub-module, so that the electronic device performs the foregoing method for speech processing.
Optionally, the apparatus 500 for speech processing may also execute the instructions in the storage module 125 shown in FIG. 2. The storage module 125 may also be implemented by the memory 530; this is not limited in this application.
Optionally, if the apparatus 400 for speech processing is a chip, the chip includes an obtaining module 410 and a determining module 420, and the obtaining module 410 and the determining module 420 may be implemented by the processor 520. Optionally, the chip further includes an input/output interface, pins, a circuit or the like, for implementing the HDMI function. The processor 520 may execute the computer-executable instructions stored in the storage module 125 shown in FIG. 2. The storage module 125 may also be used to store the mapping table.
Optionally, the storage module is a storage module in the chip, such as a register or a cache; the storage module may also be a storage module in the electronic device located outside the chip, for example the storage module 125, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM).
It should be understood that the processor 520 may be an integrated circuit chip with signal processing capability. In an implementation process, the steps of the foregoing method embodiments may be completed by using an integrated logic circuit of hardware in the processor or instructions in the form of software. The foregoing processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps and logical block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed with reference to the embodiments of this application may be directly performed by a hardware decoding processor, or performed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory or a register. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the foregoing methods in combination with its hardware.
It can be understood that the memory 530 in this embodiment of this application may be a volatile memory or a non-volatile memory, or may include both. The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM) or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example but not limitation, many forms of RAM are available, such as a static random access memory (SRAM), a dynamic random access memory (DRAM), a synchronous dynamic random access memory (SDRAM), a double data rate synchronous dynamic random access memory (DDR SDRAM), an enhanced synchronous dynamic random access memory (ESDRAM), a synchronous link dynamic random access memory (SLDRAM) and a direct rambus random access memory (DR RAM). It should be noted that the memories of the systems and methods described in this specification are intended to include, but not be limited to, these and any other suitable types of memory.
An embodiment of the present application further provides a computer storage medium, and the computer storage medium may store program instructions for instructing any one of the foregoing methods.
Optionally, the storage medium may specifically be the memory 530.
An embodiment of the present application further provides a chip system. The chip system includes a processor, configured to support the distributed unit, the centralized unit, the terminal device and the electronic device in implementing the functions involved in the foregoing embodiments, for example, generating or processing the data and/or information involved in the foregoing methods.
In a possible design, the chip system further includes a memory, configured to store the program instructions and data necessary for the distributed unit, the centralized unit, the terminal device and the electronic device. The chip system may consist of a chip, or may include a chip and other discrete components.
A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed by hardware or software depends on the particular application and design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each particular application, but such an implementation should not be considered to go beyond the scope of the present application.
A person skilled in the art may clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not described here again.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, apparatuses and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; for example, the division of units is merely a logical function division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses or units, and may be in electrical, mechanical or other forms.
The units described as separate components may or may not be physically separated, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
The foregoing descriptions are merely specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any variation or replacement that can be readily conceived by a person skilled in the art within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (21)

  1. A speech processing method, comprising:
    acquiring a first detection parameter, wherein the first detection parameter is used to indicate a signal change amount of signal transmission between an electronic device and a voice playback device;
    determining a target voice according to the first detection parameter, first voice information and second voice information, wherein the first voice information is voice information to be played that the electronic device sends to the voice playback device, the second voice information is voice information obtained by mixing the target voice and third voice information, and the third voice information is voice information played by the voice playback device after receiving the voice information to be played.
  2. The method according to claim 1, wherein the determining a target voice according to the first detection parameter, first voice information and second voice information comprises:
    determining the third voice information according to the first detection parameter and the first voice information;
    determining the target voice according to the third voice information and the second voice information.
  3. The method according to claim 1 or 2, wherein the acquiring a first detection parameter comprises:
    determining, by looking up a mapping table, the first detection parameter corresponding to a device identifier (ID) of the voice playback device, wherein the mapping table comprises a mapping relationship between the device ID of at least one voice playback device and at least one detection parameter.
  4. The method according to claim 3, wherein the mapping table is stored in a storage device of the electronic device.
  5. The method according to claim 3, wherein the method further comprises:
    receiving the mapping table sent by a server.
  6. The method according to claim 1 or 2, wherein the acquiring a first detection parameter comprises:
    sending the device identifier (ID) of the voice playback device to a server, so that the server determines the first detection parameter according to the device ID of the voice playback device;
    receiving the first detection parameter sent by the server.
  7. The method according to any one of claims 1 to 6, wherein, after the target voice is determined, the method further comprises:
    recognizing the target voice;
    when the target voice is a voice instruction, instructing the electronic device to perform the instructed operation.
  8. The method according to any one of claims 1 to 7, wherein the first detection parameter comprises at least one of a delay amount, a volume change amount, and a frequency change amount caused by signal transmission.
  9. A speech processing apparatus, comprising:
    an acquisition module, configured to acquire a first detection parameter, wherein the first detection parameter is used to indicate a signal change amount of signal transmission between an electronic device and a voice playback device;
    a determination module, configured to determine a target voice according to the first detection parameter, first voice information and second voice information, wherein the first voice information is voice information to be played that the electronic device sends to the voice playback device, the second voice information is voice information obtained by mixing the target voice and third voice information, and the third voice information is voice information played by the voice playback device after receiving the voice information to be played.
  10. The apparatus according to claim 9, wherein the determination module is specifically configured to:
    determine the third voice information according to the first detection parameter and the first voice information;
    determine the target voice according to the third voice information and the second voice information.
  11. The apparatus according to claim 9 or 10, wherein the acquisition module is specifically configured to:
    determine, by looking up a mapping table, the first detection parameter corresponding to a device identifier (ID) of the voice playback device, wherein the mapping table comprises a mapping relationship between the device ID of at least one voice playback device and at least one detection parameter.
  12. The apparatus according to claim 11, wherein the mapping table is stored in a storage device of the electronic device.
  13. The apparatus according to claim 11, wherein the apparatus further comprises:
    a receiving module, configured to receive the mapping table sent by a server.
  14. The apparatus according to claim 9 or 10, wherein the acquisition module is specifically configured to:
    send the device identifier (ID) of the voice playback device to a server, so that the server determines the first detection parameter according to the device ID of the voice playback device;
    receive the first detection parameter sent by the server.
  15. The apparatus according to any one of claims 9 to 14, wherein the apparatus further comprises a speech recognition module, and the speech recognition module is configured to:
    recognize the target voice;
    when the target voice is a voice instruction, instruct the electronic device to perform the instructed operation.
  16. The apparatus according to any one of claims 9 to 15, wherein the first detection parameter comprises at least one of a delay amount, a volume change amount, and a frequency change amount caused by signal transmission.
  17. A speech processing apparatus, comprising a memory and a processor, wherein the memory is configured to store a computer program, and the processor is configured to invoke and run the computer program from the memory, so that the speech processing apparatus performs the method according to any one of claims 1 to 8.
  18. An electronic device, comprising an audio decoder, a sound collector, and the speech processing apparatus according to any one of claims 9 to 17, wherein
    the audio decoder is configured to decode audio data to obtain the first voice information, and to transmit the first voice information to a voice playback device through an input/output interface; and
    the sound collector is configured to collect the second voice information.
  19. A speech processing system, comprising the apparatus according to any one of claims 9 to 18 and a voice playback device.
  20. A computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 8.
  21. A computer program product which, when run on a computer, causes the computer to perform the method according to any one of claims 1 to 8.
PCT/CN2018/094464 2018-07-04 2018-07-04 语音处理的方法和装置 WO2020006699A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2018/094464 WO2020006699A1 (zh) 2018-07-04 2018-07-04 语音处理的方法和装置
CN201880095355.XA CN112400205A (zh) 2018-07-04 2018-07-04 语音处理的方法和装置

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/094464 WO2020006699A1 (zh) 2018-07-04 2018-07-04 语音处理的方法和装置

Publications (1)

Publication Number Publication Date
WO2020006699A1 true WO2020006699A1 (zh) 2020-01-09

Family

ID=69059645

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/094464 WO2020006699A1 (zh) 2018-07-04 2018-07-04 语音处理的方法和装置

Country Status (2)

Country Link
CN (1) CN112400205A (zh)
WO (1) WO2020006699A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113031904B (zh) * 2021-03-25 2023-10-24 联想(北京)有限公司 一种控制方法及电子设备

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105516859A (zh) * 2015-11-27 2016-04-20 深圳Tcl数字技术有限公司 消除回声的方法和系统
CN105825862A (zh) * 2015-01-05 2016-08-03 沈阳新松机器人自动化股份有限公司 一种机器人人机对话回声消除系统
US20170263247A1 (en) * 2014-08-21 2017-09-14 Lg Electronics Inc. Digital device and method for controlling same
CN107613428A (zh) * 2017-09-15 2018-01-19 北京地平线信息技术有限公司 声音处理方法、装置和电子设备
CN207354519U (zh) * 2017-10-20 2018-05-11 深圳暴风统帅科技有限公司 一种远场语音控制机顶盒

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5695447B2 (ja) * 2011-03-01 2015-04-08 株式会社東芝 テレビジョン装置及び遠隔操作装置
CN106782598A (zh) * 2016-12-15 2017-05-31 深圳Tcl数字技术有限公司 电视画面和外设声音同步控制方法和装置
CN107452395B (zh) * 2017-08-23 2021-06-18 深圳创维-Rgb电子有限公司 一种语音信号回声消除装置及电视机
CN107566874A (zh) * 2017-09-22 2018-01-09 百度在线网络技术(北京)有限公司 基于电视设备的远场语音控制系统


Also Published As

Publication number Publication date
CN112400205A (zh) 2021-02-23

Similar Documents

Publication Publication Date Title
JP6700344B2 (ja) 情報交換方法、装置、オーディオ端末、コンピュータ可読記憶媒体及びプログラム
US11282520B2 (en) Method, apparatus and device for interaction of intelligent voice devices, and storage medium
TWI637630B (zh) 一種在多媒體裝置之間用於促進音訊及視訊同步化之裝置
US20160065791A1 (en) Sound image play method and apparatus
US10148900B2 (en) Controlling one or more source terminals based on remote control information
TW201132122A (en) System and method in a television for providing user-selection of objects in a television program
CN108124172B (zh) 云投影的方法、装置及系统
EP3147730B1 (en) Sound box parameter configuration method, mobile terminal, server, and system
US11057664B1 (en) Learning multi-device controller with personalized voice control
CN104834623A (zh) 音频播放方法及装置
US9288421B2 (en) Method for controlling external input and broadcast receiving apparatus
CN104918069A (zh) 一种播放场景还原方法、系统、播放终端及控制终端
CN110007893A (zh) 一种音频输出的方法及电子设备
WO2016202242A1 (zh) Arc控制方法、控制单元和显示设备
KR20180110033A (ko) 멀티미디어 정보 재생 방법 및 시스템, 표준화 서버, 생방송 단말기
WO2017166685A1 (zh) 外接扬声器切换方法及装置
KR20110037680A (ko) 포터블 디바이스의 멀티 채널 오디오 출력 장치 및 방법
WO2020006699A1 (zh) 语音处理的方法和装置
JP2019514050A5 (zh)
US20180152497A1 (en) Method and multi-media device for video communication
CN111653284B (zh) 交互以及识别方法、装置、终端设备及计算机存储介质
US10375340B1 (en) Personalizing the learning home multi-device controller
CN111034206A (zh) 显示装置及其提供内容的方法
CN112055238B (zh) 视频播放的控制方法、设备及系统
WO2020114369A1 (zh) 无线通信延迟测试方法、装置、计算机设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18925689

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18925689

Country of ref document: EP

Kind code of ref document: A1