WO2020024508A1

WO2020024508A1 - Voice information obtaining method and apparatus

Info

Publication number: WO2020024508A1
Application number: PCT/CN2018/120368
Authority: WO
Inventors: 廖湖锋; 王子; 刘健军
Original assignee: 珠海格力电器股份有限公司
Priority date: 2018-08-01
Filing date: 2018-12-11
Publication date: 2020-02-06
Also published as: CN110797048A; CN110797048B

Abstract

A voice information obtaining method and apparatus. The method comprises: a device acquires first voice information in an environment where the device is located (S202); the device determines a first sound frequency corresponding to the first voice information and a second sound frequency corresponding to second voice information, wherein the second voice information is voice played by the device itself (S204); and the device determines third voice information in the first voice information according to the similarity between the first sound frequency and the second sound frequency, and deletes the third voice information from the first voice information to obtain target voice information (S206). The present invention solves the problem in the prior art that sound played by a device itself is difficult to distinguish from voice information acquired by the device, and enables said sound and said voice to be accurately separated from each other according to the sound frequencies, so that the device can accurately obtain voice information of the user, thereby implementing voice interaction with the device.

Description

Method and device for acquiring voice information

Technical field

This application relates to, but is not limited to, the field of electrical appliances, and in particular, to a method and device for acquiring voice information.

Background

In related technologies, online voice devices already occupy a considerable share of the market, and that share will continue to increase. Generally, online voice devices support voice interaction and additional functions, such as singing and broadcasting the weather. However, the sound that the voice device itself plays interferes with voice communication with the device.

In the related art, there is no effective solution to the problem that the sound broadcast by the device itself is difficult to distinguish from the voice information collected by the device.

Summary of the invention

The embodiments of the present application provide a method and an apparatus for acquiring voice information, so as to at least solve the problem that it is difficult to distinguish between the sound broadcast by the device itself and the voice information collected by the device in the related art.

According to an embodiment of the present application, a method for acquiring voice information is provided, including: a device collects first voice information in an environment where the device is located; the device determines a first sound frequency corresponding to the first voice information and a second sound frequency corresponding to second voice information, wherein the second voice information is a voice played by the device itself; and third voice information in the first voice information is determined according to the similarity between the first sound frequency and the second sound frequency, and the third voice information is deleted from the first voice information to obtain target voice information.

According to another embodiment of the present application, a method for acquiring voice information is also provided, which includes: a first device collects first voice information in an environment in which it is located, and acquires, from a network side, second voice information currently played by all voice playback devices in the current environment, wherein the environment includes a plurality of voice playback devices; the first device determines a first sound frequency corresponding to the first voice information and a second sound frequency corresponding to the second voice information; and third voice information in the first voice information is determined according to the similarity between the first sound frequency and the second sound frequency, and the third voice information is deleted from the first voice information to obtain target voice information.

According to another embodiment of the present application, a method for acquiring voice information is also provided, which includes: a device collects first voice information in an environment in which the device is located; the device determines first feature information corresponding to the first voice information and second feature information corresponding to second voice information, wherein the second voice information is a voice played by the device itself; and third voice information in the first voice information is determined according to the similarity between the first feature information and the second feature information, and the third voice information is deleted from the first voice information to obtain target voice information.

According to another embodiment of the present application, a device for acquiring voice information is further provided, including: a first acquisition module configured to acquire first voice information in an environment where the device is located; a first determining module configured to determine a first sound frequency corresponding to the first voice information and a second sound frequency corresponding to second voice information, wherein the second voice information is a voice played by the device itself; and a second determining module configured to determine third voice information in the first voice information according to the similarity between the first sound frequency and the second sound frequency, and to delete the third voice information from the first voice information to obtain target voice information.

According to another embodiment of the present application, a device for acquiring voice information is further provided, including: a second acquisition module configured to acquire first voice information in an environment in which the device is located, and to acquire, from a network side, second voice information currently played by all voice playback devices in the current environment, wherein the environment includes a plurality of voice playback devices; a third determining module configured to determine a first sound frequency corresponding to the first voice information and a second sound frequency corresponding to the second voice information; and a fourth determining module configured to determine third voice information in the first voice information according to the similarity between the first sound frequency and the second sound frequency, and to delete the third voice information from the first voice information to obtain target voice information.

According to another embodiment of the present application, a device for acquiring voice information is further provided, including: a third acquisition module configured to acquire first voice information in an environment where the device is located; a fifth determining module configured to determine first feature information corresponding to the first voice information and second feature information corresponding to second voice information, wherein the second voice information is a voice played by the device itself; and a sixth determining module configured to determine third voice information in the first voice information according to the similarity between the first feature information and the second feature information, and to delete the third voice information from the first voice information to obtain target voice information.

According to yet another embodiment of the present application, a storage medium is also provided. The storage medium stores a computer program, and the computer program is configured to execute the steps in any one of the foregoing method embodiments when running.

According to another embodiment of the present application, an electronic device is further provided, which includes a memory and a processor. The memory stores a computer program, and the processor is configured to run the computer program to execute the steps in any one of the foregoing method embodiments.

Through this application, a device collects first voice information in an environment in which the device is located; the device determines a first sound frequency corresponding to the first voice information and a second sound frequency corresponding to second voice information, wherein the second voice information is the voice played by the device itself; third voice information in the first voice information is determined according to the similarity between the first sound frequency and the second sound frequency, and the third voice information is deleted from the first voice information to obtain target voice information. This technical solution solves the problem in the related technology that the sound broadcast by the device itself is difficult to distinguish from the voice information collected by the device. The two are accurately separated according to sound frequency, so that the device can accurately obtain the user's voice information and voice interaction with the device is realized.

Brief description of the drawings

The drawings described here are used to provide a further understanding of the present application and constitute a part of the present application. The schematic embodiments of the present application and the description thereof are used to explain the present application, and do not constitute an improper limitation on the present application. In the drawings:

FIG. 1 is a block diagram of a hardware structure of a home appliance for a method for acquiring voice information according to an embodiment of the present application;

FIG. 2 is a flowchart of a method for acquiring voice information according to an embodiment of the present application;

FIG. 3 is a schematic structural diagram of a voice device according to the present application.

Detailed description

Hereinafter, the present application will be described in detail with reference to the drawings and embodiments. It should be noted that, in the case of no conflict, the embodiments in the present application and the features in the embodiments can be combined with each other.

It should be noted that the terms “first” and “second” in the specification and claims of the present application and the above-mentioned drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence.

Embodiment one

The method embodiments provided in the first embodiment of the present application may be executed in a home appliance, a computer terminal, or a similar computing device. Taking a home appliance as an example, FIG. 1 is a block diagram of a hardware structure of a home appliance for a method for acquiring voice information according to an embodiment of the present application. As shown in FIG. 1, the home appliance 10 may include one or more processors 102 (only one is shown in FIG. 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 configured to store data. Optionally, the home appliance may further include a transmission device 106 configured for communication functions and an input/output device 108. Persons of ordinary skill in the art can understand that the structure shown in FIG. 1 is only schematic and does not limit the structure of the home appliance. For example, the home appliance 10 may include more or fewer components than those shown in FIG. 1, or have a configuration different from that shown in FIG. 1.

The memory 104 may be configured to store software programs and modules of application software, such as the program instructions/modules corresponding to the method for acquiring voice information in the embodiments of the present application. The processor 102 runs the software programs and modules stored in the memory 104, thereby performing various functional applications and data processing, that is, implementing the method described above. The memory 104 may include a high-speed random access memory, and may further include a non-volatile memory, such as one or more magnetic storage devices, a flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory disposed remotely from the processor 102, and such remote memory may be connected to the home appliance 10 through a network. Examples of the above network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.

The transmission device 106 is configured to receive or transmit data via a network. A specific example of the above network may include a wireless network provided by a communication provider of the home appliance 10. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC), which can be connected to other network devices through a base station so as to communicate with the Internet. In one example, the transmission device 106 may be a radio frequency (RF) module, which is configured to communicate with the Internet in a wireless manner.

In this embodiment, a method for acquiring voice information running on the home appliance is provided. FIG. 2 is a flowchart of a method for acquiring voice information according to an embodiment of the present application. As shown in FIG. 2, the process includes the following steps:

Step S202: The device collects first voice information in an environment where the device is located;

The first voice information may include sound played by the device itself, such as music, and may also include the user's control instructions for the device.

Step S204: The device determines a first sound frequency corresponding to the first voice information and a second sound frequency corresponding to the second voice information, where the second voice information is a voice played by the device itself;

Step S206: Determine the third voice information in the first voice information according to the similarity between the first sound frequency and the second sound frequency, and delete the third voice information from the first voice information to obtain the target voice information.

After the target voice information is obtained, the semantics of the target voice information can be identified, and the control instruction of the user can be determined.

Through the above steps, the device collects the first voice information in the environment where the device is located; the device determines a first sound frequency corresponding to the first voice information and a second sound frequency corresponding to the second voice information, wherein the second voice information is the voice played by the device itself; the third voice information in the first voice information is determined according to the similarity between the first sound frequency and the second sound frequency, and the third voice information is deleted from the first voice information to obtain the target voice information. This technical solution solves the problem in the related technology that the sound broadcast by the device itself is difficult to distinguish from the voice information collected by the device. The two are accurately separated according to sound frequency, so that the device can accurately obtain the user's voice information and voice interaction with the device is realized.

Optionally, the entity that executes the above steps may be a home appliance such as an air conditioner or a refrigerator, but is not limited thereto.

Optionally, the second sound frequency is determined by obtaining the second sound frequency from a buffer of the device. The voice information played by the device itself is generally stored in the cache in advance, or it may be obtained from other connected storage media, such as a USB flash drive.
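A minimal Python sketch of such a buffer is given here for illustration only. The class name PlaybackCache, its methods, and the choice of a fixed-size buffer are assumptions made for this sketch; the application does not specify how the buffer is organized.

```python
from collections import deque


class PlaybackCache:
    """Hypothetical bounded buffer of the frames the device most recently played."""

    def __init__(self, max_frames=256):
        # A deque with maxlen silently discards the oldest frame when full.
        self._frames = deque(maxlen=max_frames)

    def record(self, frame):
        """Store a copy of a frame at the moment it is sent to the loudspeaker."""
        self._frames.append(list(frame))

    def recent(self, count):
        """Return up to `count` of the most recently played frames, oldest first."""
        if count <= 0:
            return []
        return list(self._frames)[-count:]

    def is_empty(self):
        """True when nothing has been recorded, i.e. there is nothing to compare against."""
        return not self._frames
```

Bounding the buffer keeps memory use fixed on a small appliance; how many frames need to be kept depends on the delay between playback and capture, which this sketch does not model.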

Optionally, determining the third voice information in the first voice information according to the similarity between the first sound frequency and the second sound frequency, and deleting the third voice information from the first voice information to obtain the target voice information, includes: determining, within the first sound frequency, a sound frequency whose similarity to the second sound frequency is higher than a threshold, and using the determined sound frequency as a third sound frequency; and deleting the third voice information corresponding to the third sound frequency from the first voice information to obtain the target voice information.

The portion of the first sound frequency that has a high degree of similarity to the second sound frequency may be determined to be the portion of the sound that it plays itself, and deleted, and the rest is the user's voice information.
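The threshold comparison described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the application: the captured and played audio are assumed to be split into time-aligned frames of equal length, "sound frequency" is represented by the magnitude spectrum of each frame, similarity is taken to be cosine similarity, and the threshold value 0.9 is arbitrary. All function names are hypothetical.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Return the magnitudes of DFT bins 0..N/2 of a real-valued frame."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc))
    return spectrum


def cosine_similarity(a, b):
    """Cosine similarity of two spectra; 0.0 if either one is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def remove_self_playback(captured_frames, played_frames, threshold=0.9):
    """Keep only captured frames whose spectrum is not similar to the playback.

    Assumption: captured_frames[i] and played_frames[i] cover the same time span.
    """
    target = []
    for captured, played in zip(captured_frames, played_frames):
        similarity = cosine_similarity(magnitude_spectrum(captured),
                                       magnitude_spectrum(played))
        if similarity <= threshold:  # not the device's own sound, so keep it
            target.append(captured)
    return target
```

This sketch deletes whole frames. It does not handle a frame in which the user's voice and the device's own sound overlap; separating those would require removing only the matching spectral components, which the text above leaves open.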

Optionally, after the device collects the first voice information in the environment in which the device is located, when detecting that the device is not currently playing a voice, it is determined that the first voice information is the target voice information.
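As a small illustration of this shortcut, the check can be placed in front of the comparison step. The names select_target, is_playing, and remove_playback are hypothetical, and the filtering step is passed in as a callable so that the sketch does not assume any particular comparison method.

```python
def select_target(captured_frames, is_playing, remove_playback):
    """Return the target voice information for one capture.

    captured_frames: the first voice information, as a list of frames.
    is_playing:      True if the device is currently playing a voice.
    remove_playback: hypothetical callable that deletes the device's own sound.
    """
    if not is_playing:
        # Nothing is being played, so the capture is already the target.
        return captured_frames
    return remove_playback(captured_frames)
```

Skipping the comparison when the device is silent avoids unnecessary computation and removes any risk of deleting user speech by mistake in that case.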

Optionally, the device collecting the first voice information in the environment where the device is located includes: the device collecting the first voice information through a microphone.

According to another embodiment of the application document, a method for acquiring voice information is also provided, including the following steps:

Step 1: The first device collects first voice information in an environment where the first device is located, and obtains, from a network side, second voice information currently played by all voice playback devices in the current environment, where the environment includes multiple voice playback devices;

Step 2: the first device determines a first sound frequency corresponding to the first voice information and a second sound frequency corresponding to the second voice information;

Step 3: Determine the third voice information in the first voice information according to the similarity between the first sound frequency and the second sound frequency, and delete the third voice information from the first voice information to obtain the target voice information.

When there are multiple voice playback devices in the current environment, each of them shares the voice information it is playing with a network-side device. Other devices refer to this shared information when identifying the user's control command, so that only the user's voice information remains after deletion.
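The sharing arrangement can be sketched as follows, for illustration only. The application does not describe the network-side interface, so the class PlaybackRegistry and its methods are hypothetical. For brevity, each device's voice information is reduced to a set of dominant frequencies in hertz, and "similar" is taken to mean "within a fixed tolerance"; both are assumptions of this sketch.

```python
class PlaybackRegistry:
    """Hypothetical network-side store of what each playback device is playing."""

    def __init__(self):
        self._now_playing = {}

    def publish(self, device_id, dominant_frequencies_hz):
        """A playback device shares the frequencies of what it currently plays."""
        self._now_playing[device_id] = set(dominant_frequencies_hz)

    def stop(self, device_id):
        """A playback device reports that it has stopped playing."""
        self._now_playing.pop(device_id, None)

    def all_playing(self):
        """Union of the frequencies currently played by every registered device."""
        merged = set()
        for frequencies in self._now_playing.values():
            merged |= frequencies
        return merged


def filter_captured(captured_frequencies_hz, registry, tolerance_hz=10.0):
    """Drop captured components that lie within tolerance of any shared component."""
    playing = registry.all_playing()
    return [f for f in captured_frequencies_hz
            if all(abs(f - p) > tolerance_hz for p in playing)]
```

For example, if a speaker publishes 440 Hz and 880 Hz and a television publishes 1200 Hz, a capture containing 442, 300, 1195, and 2000 Hz is reduced to 300 and 2000 Hz.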

This technical solution solves the problem in the related technology that the sound broadcast by the device itself is difficult to distinguish from the voice information collected by the device. The two are accurately separated according to sound frequency, so that the device can accurately obtain the user's voice information and voice interaction with the device is realized.

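According to another embodiment of the present application, a method for acquiring voice information is also provided, including the following steps:

Step 1: The device collects first voice information in an environment where the device is located;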

Step 2: The device determines first feature information corresponding to the first voice information and second feature information corresponding to the second voice information, wherein the second voice information is a voice played by the device itself;

Step 3: Determine the third voice information in the first voice information according to the similarity between the first feature information and the second feature information, and delete the third voice information from the first voice information to obtain the target voice information.

Optionally, the first feature information and the second feature information each include at least one of the following: sound frequency, tone, timbre, and volume.
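A minimal sketch of a feature-based comparison is given below for illustration. It covers only two of the listed features: sound frequency, estimated from the zero-crossing rate, and volume, taken as the root-mean-square amplitude. Tone and timbre are omitted. The feature definitions, the similarity measure, and all names are assumptions of this sketch and are not taken from the application.

```python
import math


def extract_features(frame, sample_rate):
    """Return (frequency_hz, volume) estimated from one frame of samples."""
    # Each full period of a tone produces two sign changes.
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    duration = len(frame) / sample_rate
    frequency_hz = crossings / (2 * duration)
    volume = math.sqrt(sum(x * x for x in frame) / len(frame))  # RMS amplitude
    return frequency_hz, volume


def feature_similarity(first, second):
    """Average of per-feature min/max ratios; 1.0 means identical features."""
    scores = []
    for a, b in zip(first, second):
        larger = max(abs(a), abs(b))
        scores.append(1.0 if larger == 0 else min(abs(a), abs(b)) / larger)
    return sum(scores) / len(scores)
```

A frame whose similarity to the played audio exceeds a chosen threshold would then be treated as third voice information and deleted. Note that with this particular measure a quieter copy of the same sound scores lower because of the volume term, so a practical design might weight the features differently.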

The following description is made with reference to another embodiment of the application document.

This application addresses the following technical issue: how to keep the voice signals received by an online voice device from being affected by the sound that the device itself broadcasts.

The equipment in this application file supports online voice functions, as well as voice broadcast and interactive functions.

The overall system in this application includes a voice acquisition part, a control unit, and a voice playback part. FIG. 3 is a schematic structural diagram of a voice device according to the present application. As shown in FIG. 3, the device includes a voice acquisition module, a control unit, and a voice playback module. When the device broadcasts a voice, the control unit buffers the frequency of the broadcast sound and, at the same time, receives the audio captured by the voice acquisition module. The control unit compares the captured audio with the buffered broadcast audio and deletes from the captured audio the part that is highly similar to the broadcast audio; the remaining part is the audio content actually collected from the environment.
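The interaction between the three parts of FIG. 3 can be sketched as follows, for illustration only. The class ControlUnit and its method names are hypothetical, and the similarity measure is passed in as a callable because the application does not fix one.

```python
class ControlUnit:
    """Sketch of the control unit: buffers broadcast audio and filters captured audio."""

    def __init__(self, similarity, threshold=0.9):
        # similarity: hypothetical callable(captured, broadcast) -> float in [0, 1]
        self._similarity = similarity
        self._threshold = threshold
        self._broadcast_buffer = []

    def on_broadcast(self, frame):
        """Called by the voice playback module for every frame it plays."""
        self._broadcast_buffer.append(frame)

    def on_capture(self, frames):
        """Called by the voice acquisition module.

        Returns the captured frames that do not closely match any buffered
        broadcast frame, i.e. the audio actually collected from the environment.
        """
        kept = []
        for frame in frames:
            matches = any(self._similarity(frame, played) > self._threshold
                          for played in self._broadcast_buffer)
            if not matches:
                kept.append(frame)
        return kept
```

In this arrangement neither the playback module nor the acquisition module needs to know about the other; both report to the control unit, which owns the buffer and the comparison.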

By adopting the above technical solution, the influence of the sound broadcast by the online voice device itself is eliminated, and the accuracy of sound sampling by the online voice device is improved.

Through the description of the above embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on such an understanding, the part of the technical solution of this application that is essential, or that contributes to the existing technology, can be embodied in the form of a software product, which is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to execute the methods described in the embodiments of the present application.

Embodiment two

A device for acquiring voice information is also provided in this embodiment. The device is configured to implement the foregoing embodiments and preferred implementations, and what has already been described will not be repeated. As used below, the term "module" may refer to a combination of software and/or hardware that implements a predetermined function. Although the devices described in the following embodiments are preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

According to another embodiment of the present application document, a device for acquiring voice information is further provided, including:

A first acquisition module configured to acquire first voice information in an environment in which the device is located;

A first determining module configured to determine a first sound frequency corresponding to the first voice information and a second sound frequency corresponding to the second voice information, wherein the second voice information is a voice played by the device itself;

A second determining module configured to determine the third voice information in the first voice information according to the similarity between the first sound frequency and the second sound frequency, and to delete the third voice information from the first voice information to obtain the target voice information.

This technical solution solves the problem in the related technology that the sound broadcast by the device itself is difficult to distinguish from the voice information collected by the device. The two are accurately separated according to sound frequency, so that the device can accurately obtain the user's voice information and voice interaction with the device is realized.

According to another embodiment of the present application, a device for acquiring voice information is further provided, including:

A second acquisition module configured to collect first voice information in an environment in which the device is located, and to obtain, from the network side, second voice information currently played by all voice playback devices in the current environment, where the environment includes multiple voice playback devices;

A third determining module, configured to determine a first sound frequency corresponding to the first voice information and a second sound frequency corresponding to the second voice information;

A fourth determining module configured to determine the third voice information in the first voice information according to the similarity between the first sound frequency and the second sound frequency, and to delete the third voice information from the first voice information to obtain the target voice information.

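According to another embodiment of the present application, a device for acquiring voice information is further provided, including:

A third acquisition module configured to acquire first voice information in an environment in which the device is located;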

A fifth determining module is configured to determine first feature information corresponding to the first voice information and second feature information corresponding to the second voice information, where the second voice information is a voice played by the device itself;

A sixth determining module configured to determine the third voice information in the first voice information according to the similarity between the first feature information and the second feature information, and to delete the third voice information from the first voice information to obtain the target voice information.

It should be noted that the above modules can be implemented by software or hardware. For the latter, this can be done in, but is not limited to, the following ways: the above modules are all located in the same processor; or the above modules are located in different processors in any combination.

Embodiment three

An embodiment of the present application further provides a storage medium. Optionally, in this embodiment, the foregoing storage medium may be configured to store program code configured to perform the following steps:

S1. The device collects first voice information in an environment where the device is located.

S2. The device determines a first sound frequency corresponding to the first voice information and a second sound frequency corresponding to the second voice information, wherein the second voice information is a voice played by the device itself;

S3. Determine the third voice information in the first voice information according to the similarity between the first sound frequency and the second sound frequency, and delete the third voice information from the first voice information to obtain the target voice information.

Optionally, in this embodiment, the foregoing storage medium may include, but is not limited to, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, and various other media that can store program code.

An embodiment of the present application further provides an electronic device including a memory and a processor. The memory stores a computer program, and the processor is configured to run the computer program to perform the steps in any one of the foregoing method embodiments.

Optionally, the electronic device may further include a transmission device and an input / output device, wherein the transmission device is connected to the processor, and the input / output device is connected to the processor.

Optionally, in this embodiment, the foregoing processor may be configured to execute the following steps by a computer program:

Optionally, for specific examples in this embodiment, reference may be made to the examples described in the foregoing embodiments and optional implementation manners, and details are not described in this embodiment.

Obviously, those skilled in the art should understand that the above-mentioned modules or steps of the present application may be implemented by a general-purpose computing device, and they may be concentrated on a single computing device or distributed over a network composed of multiple computing devices. Optionally, they may be implemented with program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device. In some cases, the steps shown or described may be performed in an order different from that given here, or they may each be made into individual integrated circuit modules, or multiple modules or steps among them may be made into a single integrated circuit module. As such, this application is not limited to any particular combination of hardware and software.

The above description is only a preferred embodiment of the present application, and is not intended to limit the present application. For those skilled in the art, this application may have various modifications and changes. Any modification, equivalent replacement, or improvement made within the spirit and principle of this application shall be included in the protection scope of this application.

Industrial applicability

In the above technical solution provided by the present application, the sound played by the device itself is deleted from the environmental voice information collected by the device, so as to eliminate interference from the device's own sound as far as possible. This solves the problem in the related technology that the sound broadcast by the device itself is difficult to distinguish from the voice information collected by the device. The two are accurately separated according to sound frequency, so that the device can accurately obtain the user's voice information and voice interaction with the device is realized.

Claims

A method for acquiring voice information, including:

The device collects first voice information in an environment in which the device is located;

Determining, by the device, a first sound frequency corresponding to the first voice information and a second sound frequency corresponding to the second voice information, wherein the second voice information is a voice played by the device itself;

Determine the third voice information in the first voice information according to the similarity between the first voice frequency and the second voice frequency, and delete the third voice information from the first voice information to obtain a target voice information.
The method of claim 1, wherein the second sound frequency is determined by:

Acquiring the second sound frequency from a buffer of the device.
The method according to claim 1, wherein the third voice information in the first voice information is determined according to the similarity between the first voice frequency and the second voice frequency, and from the first voice The third voice information is deleted from the information to obtain the target voice information, including:

In the first sound frequency, determining a sound frequency with a similarity to the second sound frequency higher than a threshold, and using the determined sound frequency as the third sound frequency;

Deleting the third voice information corresponding to the third voice frequency from the first voice information to obtain the target voice information.
The method according to claim 1, wherein after the device collects the first voice information in the environment in which the device is located, the method further comprises:

When it is detected that the device is not currently playing a voice, it is determined that the first voice information is the target voice information.
The method according to claim 1, wherein the device collecting the first voice information in an environment in which the device is located comprises:

The device collects the first voice information through a microphone.
A method for acquiring voice information, including:

The first device collects first voice information in an environment in which the first device is located, and acquires second voice information currently played by all voice playback devices in the current environment from the network side, where the environment includes multiple voice playback devices;

Determining, by the first device, a first sound frequency corresponding to the first voice information and a second sound frequency corresponding to the second voice information;

Determining third voice information in the first voice information according to the similarity between the first sound frequency and the second sound frequency, and deleting the third voice information from the first voice information to obtain target voice information.
A method for acquiring voice information, including:

The device collects first voice information in an environment in which the device is located;

Determining, by the device, first feature information corresponding to the first voice information and second feature information corresponding to second voice information, wherein the second voice information is a voice played by the device itself;

Determining third voice information in the first voice information according to the similarity between the first feature information and the second feature information, and deleting the third voice information from the first voice information to obtain target voice information.
The method according to claim 7, wherein the first feature information and the second feature information each include at least one of the following:

Sound frequency, tone, timbre, volume.
An apparatus for acquiring voice information, including:

A first acquisition module configured to acquire first voice information in an environment where the device is located;

A first determining module configured to determine a first sound frequency corresponding to the first voice information and a second sound frequency corresponding to the second voice information, wherein the second voice information is a voice played by the device itself;

A second determining module configured to determine third voice information in the first voice information according to the similarity between the first sound frequency and the second sound frequency, and delete the third voice information from the first voice information to obtain target voice information.
An apparatus for acquiring voice information, including:

A second acquisition module configured to collect first voice information in an environment where the device is located, and obtain, from the network side, second voice information currently played by all voice playback devices in the current environment, where the environment includes multiple voice playback devices;

A third determining module, configured to determine a first sound frequency corresponding to the first voice information and a second sound frequency corresponding to the second voice information;

A fourth determining module, configured to determine third voice information in the first voice information according to the similarity between the first sound frequency and the second sound frequency, and delete the third voice information from the first voice information to obtain target voice information.
An apparatus for acquiring voice information, including:

A third acquisition module, configured to acquire first voice information in an environment where the device is located;

A fifth determining module, configured to determine the first feature information corresponding to the first voice information and the second feature information corresponding to the second voice information, wherein the second voice information is a voice played by the device itself;

A sixth determining module, configured to determine third voice information in the first voice information according to the similarity between the first feature information and the second feature information, and delete the third voice information from the first voice information to obtain target voice information.
A storage medium, wherein a computer program is stored in the storage medium, and the computer program is configured to execute the method according to any one of claims 1 to 8 when running.
An electronic device, including a memory and a processor, wherein the memory stores a computer program, and the processor is configured to run the computer program to execute the method of any one of claims 1 to 8.