CN109346067B - Voice information processing method and device and storage medium

Info

Publication number: CN109346067B
Application number: CN201811307605.XA
Authority: CN
Inventors: 王慧君; 刘健军; 毛跃辉; 张新; 韩雪; 廖海霖; 郑文成; 李保水; 文皓
Original assignee: Gree Electric Appliances Inc of Zhuhai
Current assignee: Gree Electric Appliances Inc of Zhuhai
Priority date: 2018-11-05
Filing date: 2018-11-05
Publication date: 2021-02-26
Anticipated expiration: 2038-11-05
Also published as: CN109346067A

Abstract

The invention provides a method and a device for processing voice information, and a storage medium. The method comprises: acquiring first voice information and second voice information generated within a preset range, wherein the first voice information carries a designated voice wake-up word; and performing de-duplication processing on the first voice information and the second voice information according to a de-duplication weight determined by the relative distance between the sound source position of the first voice information and the sound source position of the second voice information, to obtain third voice information. The invention addresses the low recognition accuracy of voice control information in the related art, thereby improving the voice recognition rate.

Description

Voice information processing method and device and storage medium

Technical Field

The invention relates to the field of computers, in particular to a method and a device for processing voice information and a storage medium.

Background

Because sound waves attenuate as they propagate, the voice-controllable range covered by a single positioning microphone is limited. In a single-user environment the ambient noise is low, and a single microphone acquisition device can meet the accuracy required for semantic recognition of voice control. In a multi-user environment, however, there are more users and the environment is relatively noisy, so a single-microphone acquisition device activated based on the user's position recognizes voice control information with low accuracy and is prone to misrecognition.

In view of the above problems in the related art, no effective solution exists at present.

Disclosure of Invention

Embodiments of the invention provide a method and a device for processing voice information, and a storage medium, which at least address the low recognition accuracy of voice control information in the related art.

According to an embodiment of the present invention, there is provided a method for processing voice information, including: acquiring first voice information and second voice information generated within a preset range, wherein the first voice information carries a designated voice wake-up word; and performing de-duplication processing on the first voice information and the second voice information according to a de-duplication weight determined by the relative distance between the sound source position of the first voice information and the sound source position of the second voice information, to obtain third voice information.

Optionally, performing de-duplication processing on the first voice information and the second voice information according to the de-duplication weight determined by the relative distance between the sound source position of the first voice information and the sound source position of the second voice information to obtain the third voice information includes: acquiring coincident voice information in which the first voice information and the second voice information coincide with each other; and performing de-duplication processing on the first voice information according to the de-duplication weight and the coincident voice information to obtain the processed third voice information.

Optionally, the magnitude of the de-duplication weight is inversely proportional to the relative distance.

Optionally, acquiring the first voice information and the second voice information generated within the preset range includes: turning on the microphone closest to the sound source position of the first voice information to collect the first voice information; and turning on the microphone closest to the sound source position of the second voice information to collect the second voice information.

Optionally, the relative distance is obtained by: acquiring the sound source position of the first voice information and the sound source position of the second voice information by means of camera positioning and/or voice position analysis; and determining the relative distance according to the sound source position of the first voice information and the sound source position of the second voice information.

According to another aspect of the present invention, there is provided a device for processing voice information, including: an acquisition module, configured to acquire first voice information and second voice information generated within a preset range, wherein the first voice information carries a designated voice wake-up word; and a processing module, configured to perform de-duplication processing on the first voice information and the second voice information according to a de-duplication weight determined by the relative distance between the sound source position of the first voice information and the sound source position of the second voice information, to obtain third voice information.

Optionally, the processing module includes: a first obtaining unit, configured to obtain coincident voice information in which the first voice information and the second voice information coincide with each other; and a processing unit, configured to perform de-duplication processing on the first voice information according to the de-duplication weight and the coincident voice information to obtain the processed third voice information.

Optionally, the acquisition module includes: a first collection unit, configured to turn on the microphone closest to the sound source position of the first voice information so as to collect the first voice information; and a second collection unit, configured to turn on the microphone closest to the sound source position of the second voice information so as to collect the second voice information.

Optionally, the processing module further includes: a second obtaining unit, configured to obtain the sound source position of the first voice information and the sound source position of the second voice information by means of camera positioning and/or voice position analysis; and a determining unit, configured to determine the relative distance according to the sound source position of the first voice information and the sound source position of the second voice information.

According to a further embodiment of the present invention, there is also provided a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.

According to the invention, first voice information carrying the designated voice wake-up word and second voice information, both generated within the preset range, are acquired, and de-duplication processing is then performed on the first voice information and the second voice information according to the de-duplication weight determined by the relative distance between the sound source position of the first voice information and the sound source position of the second voice information. The de-duplication processing filters noise out of the first voice information, so that the resulting third voice information is purer and can be recognized more accurately. This improves the recognition rate of the voice information and addresses the low recognition accuracy of voice control information in the related art.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a block diagram of the hardware configuration of a terminal for a voice information processing method according to an embodiment of the present invention;

FIG. 2 is a flowchart of a method for processing voice information according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a device for processing voice information according to an embodiment of the present invention.

Detailed Description

The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.

Example 1

The method provided by the first embodiment of the present application may be executed in a terminal, a computer terminal, or a similar computing device. Taking execution on a terminal as an example, FIG. 1 is a block diagram of the hardware structure of a terminal for the method for processing voice information according to an embodiment of the present invention. As shown in FIG. 1, the terminal 10 may include one or more processors 102 (only one is shown in FIG. 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data, and optionally may also include a transmission device 106 for communication functions and an input-output device 108. It will be understood by those skilled in the art that the structure shown in FIG. 1 is only illustrative and does not limit the structure of the terminal. For example, the terminal 10 may include more or fewer components than shown in FIG. 1, or have a different configuration from that shown in FIG. 1.

The memory 104 may be used to store computer programs, for example, software programs and modules of application software, such as a computer program corresponding to the method for processing voice information in the embodiment of the present invention, and the processor 102 executes various functional applications and data processing by running the computer programs stored in the memory 104, so as to implement the method described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network may include a wireless network provided by a communication provider of the terminal 10. In one example, the transmission device 106 includes a network interface controller (NIC), which can be connected to other network devices through a base station so as to communicate with the internet. In one example, the transmission device 106 may be a radio frequency (RF) module, which communicates with the internet wirelessly.

This embodiment provides a method for processing voice information that runs on the terminal described above. FIG. 2 is a flowchart of a method for processing voice information according to an embodiment of the present invention. As shown in FIG. 2, the flow includes the following steps:

Step S202, acquiring first voice information and second voice information generated within a preset range, wherein the first voice information carries a designated voice wake-up word;

Step S204, performing de-duplication processing on the first voice information and the second voice information according to a de-duplication weight determined by the relative distance between the sound source position of the first voice information and the sound source position of the second voice information, to obtain third voice information.

Through steps S202 and S204, first voice information carrying the designated voice wake-up word and second voice information, both generated within the preset range, are acquired, and de-duplication processing is then performed on them according to the de-duplication weight determined by the relative distance between the sound source position of the first voice information and the sound source position of the second voice information. The de-duplication processing filters noise out of the first voice information, so that the resulting third voice information is purer and can be recognized more accurately. This improves the recognition rate of the voice information and addresses the low recognition accuracy of voice control information in the related art.

In an optional implementation of this embodiment, step S204, in which de-duplication processing is performed on the first voice information and the second voice information according to the de-duplication weight determined by the relative distance between the sound source position of the first voice information and the sound source position of the second voice information to obtain the third voice information, may be implemented as follows:

Step S204-1, acquiring coincident voice information in which the first voice information and the second voice information coincide with each other;

Step S204-2, performing de-duplication processing on the first voice information according to the de-duplication weight and the coincident voice information to obtain the processed third voice information, wherein the magnitude of the de-duplication weight is inversely proportional to the relative distance.

Here, "inversely proportional" means the following: the larger the relative distance between the sound source position of the first voice information and the sound source position of the second voice information, the smaller the de-duplication weight; the smaller that relative distance, the larger the de-duplication weight. The specific values can be set according to the actual situation, as long as this rule is satisfied.
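As a minimal sketch of steps S204-1 and S204-2, the following Python code assumes that both signals are time-aligned lists of samples, that the coincident voice information is the span where the two signals overlap, and that de-duplication means subtracting the weighted second signal from the first. The function names, the `scale_m` parameter, and the formula `scale_m / (scale_m + distance)` are illustrative assumptions; the text only requires that the weight decrease as the relative distance increases.

```python
from typing import List, Sequence


def dedup_weight(relative_distance_m: float, scale_m: float = 1.0) -> float:
    """Return a weight in (0, 1] that shrinks as the relative distance grows.

    Assumption: scale_m / (scale_m + d) is one of many functions that obey
    the "larger distance, smaller weight" rule; it also avoids dividing by zero.
    """
    if relative_distance_m < 0:
        raise ValueError("relative distance must be non-negative")
    return scale_m / (scale_m + relative_distance_m)


def deduplicate(first: Sequence[float],
                second: Sequence[float],
                relative_distance_m: float,
                scale_m: float = 1.0) -> List[float]:
    """Subtract the weighted coincident part of `second` from `first`."""
    weight = dedup_weight(relative_distance_m, scale_m)
    # Assumption (S204-1): the coincident span is the common prefix length
    # of the two time-aligned signals.
    overlap = min(len(first), len(second))
    third = list(first)
    for i in range(overlap):
        # S204-2: remove the weighted contribution of the second signal.
        third[i] = first[i] - weight * second[i]
    return third


if __name__ == "__main__":
    print(deduplicate([1.0, 1.0, 1.0], [0.5, 0.5], relative_distance_m=1.0))
```

With this choice a source right next to the speaker (distance 0) gets weight 1, and the weight approaches 0 for distant sources, so remote talkers are subtracted only lightly.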

In another optional implementation of this embodiment, acquiring the first voice information and the second voice information generated within the preset range in step S202 may be implemented as follows:

Step S202-1, turning on the microphone closest to the sound source position of the first voice information to collect the first voice information;

Step S202-2, turning on the microphone closest to the sound source position of the second voice information to collect the second voice information.

For steps S202-1 and S202-2, a specific application scenario may be as follows: a plurality of microphones are arranged within the preset range; when voice information is generated, the microphone closest to its sound source position is turned on, and the same is done for any other voice information. In this way, the microphones do not all have to be turned on at the same time, which saves power, and the voice can be collected more accurately.
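The microphone selection in steps S202-1 and S202-2 can be sketched as a nearest-neighbor lookup. The code below is an illustration under assumptions: positions are 2-D coordinates in meters, and the `Microphone` class and both function names are hypothetical rather than taken from the patent.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Sequence, Set, Tuple

Position = Tuple[float, float]


@dataclass(frozen=True)
class Microphone:
    name: str
    position: Position


def nearest_microphone(source: Position,
                       microphones: Sequence[Microphone]) -> Microphone:
    """Return the microphone with the smallest Euclidean distance to `source`."""
    if not microphones:
        raise ValueError("at least one microphone is required")
    return min(microphones, key=lambda mic: math.dist(mic.position, source))


def microphones_to_enable(sources: Iterable[Position],
                          microphones: Sequence[Microphone]) -> Set[str]:
    """Names of the microphones to turn on: one nearest microphone per source."""
    return {nearest_microphone(src, microphones).name for src in sources}


if __name__ == "__main__":
    mics = [
        Microphone("hall", (0.0, 0.0)),
        Microphone("kitchen", (5.0, 0.0)),
        Microphone("bedroom", (0.0, 5.0)),
    ]
    # Two talkers: only the microphones nearest to them are enabled.
    print(microphones_to_enable([(4.0, 1.0), (0.5, 0.5)], mics))
```

Microphones that are not nearest to any active source stay off, which matches the power-saving rationale given above.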

It should be noted that the relative distance involved in this embodiment is obtained by: acquiring the sound source position of the first voice information and the sound source position of the second voice information by means of camera positioning and/or voice position analysis; and determining the relative distance according to the sound source position of the first voice information and the sound source position of the second voice information.

The present embodiment is described in detail below with reference to a specific implementation.

This embodiment provides a multi-microphone voice acquisition control and noise reduction method, which comprises the following steps:
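A short sketch of how the relative distance might be computed follows. It assumes 2-D positions in meters and Euclidean distance. The `locate` helper, and in particular its choice to average the two estimates when both camera positioning and voice position analysis are available, is a hypothetical illustration of the "and/or" wording; the text does not specify how the two modes are combined.

```python
import math
from typing import Optional, Tuple

Position = Tuple[float, float]


def locate(camera_estimate: Optional[Position],
           voice_estimate: Optional[Position]) -> Position:
    """Pick a sound source position from camera and/or voice-based estimates."""
    if camera_estimate is not None and voice_estimate is not None:
        # Assumption: when both modes report a position, take the midpoint.
        return ((camera_estimate[0] + voice_estimate[0]) / 2,
                (camera_estimate[1] + voice_estimate[1]) / 2)
    if camera_estimate is not None:
        return camera_estimate
    if voice_estimate is not None:
        return voice_estimate
    raise ValueError("no position estimate available")


def relative_distance(first_source: Position, second_source: Position) -> float:
    """Euclidean distance between the two sound source positions."""
    return math.dist(first_source, second_source)


if __name__ == "__main__":
    first = locate(camera_estimate=(0.0, 0.0), voice_estimate=None)
    second = locate(camera_estimate=None, voice_estimate=(3.0, 4.0))
    print(relative_distance(first, second))
```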

Step S302, acquiring the position of each user based on camera positioning recognition or voice position analysis.

In step S302, the voice wake-up word is also acquired. The voice information corresponding to the voice wake-up word is the control voice information (the first voice information in the above embodiment); it is acquired by turning on the microphone device closest to the user who uttered it, and this user's control voice information serves as the control sound source.

Step S304, turning on the microphone device closest to each of the other users to collect the corresponding voice information (the second voice information in the above embodiment) as a noise source.

It should be noted that in the case of a single user, no noise source is acquired.

Step S306, determining the degree of coincidence between the noise source and the control sound source, and removing the part where the noise source and the control sound source coincide, thereby obtaining a de-duplicated sound source.

Step S308, performing attenuation processing on the de-duplicated sound source based on the distance between the noise source user and the control sound source user.

The larger the distance between the noise source user and the control sound source user, the larger the attenuation weight applied to the de-duplicated sound source; the smaller the distance, the smaller the attenuation weight.

Step S310, screening the de-duplicated sound source against the control sound source to obtain a relatively pure processed sound source sample.

Step S312, performing subsequent sound source processing and semantic analysis based on the processed sound source sample.
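One possible reading of steps S304 to S310 is sketched below in Python. It rests on several assumptions that the text does not fix: signals are time-aligned lists of samples, the coincident part is the overlapping span, the attenuation weight is `d / (scale_m + d)` (growing with distance), and screening means subtracting the attenuated noise from the control sound source. All names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Position = Tuple[float, float]
Samples = Sequence[float]


def attenuation_weight(distance_m: float, scale_m: float = 1.0) -> float:
    """Attenuation in [0, 1) that grows with distance (assumed formula)."""
    return distance_m / (scale_m + distance_m)


def process_control_source(control_pos: Position,
                           control_samples: Samples,
                           noise_sources: Sequence[Tuple[Position, Samples]],
                           scale_m: float = 1.0) -> List[float]:
    """Return the processed sound source sample for the control user."""
    processed = list(control_samples)
    # With a single user, noise_sources is empty and nothing is removed.
    for noise_pos, noise_samples in noise_sources:
        # S308: attenuate more strongly the farther away the noise user is.
        distance = math.dist(control_pos, noise_pos)
        remaining_gain = 1.0 - attenuation_weight(distance, scale_m)
        # S306: only the span where both signals exist can coincide.
        overlap = min(len(processed), len(noise_samples))
        for i in range(overlap):
            # S310: screen the attenuated noise out of the control source.
            processed[i] -= remaining_gain * noise_samples[i]
    return processed


if __name__ == "__main__":
    control = [1.0, 1.0]
    noise = [((3.0, 4.0), [0.6, 0.6])]  # noise user 5 m away
    print(process_control_source((0.0, 0.0), control, noise))
```

Under this reading the remaining gain `1 - attenuation_weight(d)` equals `scale_m / (scale_m + d)`, so a nearby noise user is subtracted almost in full while a distant one barely changes the control sound source.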

Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.

Example 2

In this embodiment, a device for processing voice information is further provided, and the device is used to implement the foregoing embodiments and preferred embodiments, and details of which have been already described are omitted. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.

FIG. 3 is a schematic structural diagram of a device for processing voice information according to an embodiment of the present invention. As shown in FIG. 3, the device includes: an acquisition module 32, configured to acquire first voice information and second voice information generated within a preset range, wherein the first voice information carries a designated voice wake-up word; and a processing module 34, coupled to the acquisition module 32 and configured to perform de-duplication processing on the first voice information and the second voice information according to the de-duplication weight determined by the relative distance between the sound source position of the first voice information and the sound source position of the second voice information, to obtain third voice information.

Optionally, the processing module 34 in this embodiment includes: a first obtaining unit, configured to obtain coincident voice information in which the first voice information and the second voice information coincide with each other; and a processing unit, coupled to the first obtaining unit and configured to perform de-duplication processing on the first voice information according to the de-duplication weight and the coincident voice information to obtain the processed third voice information, wherein the magnitude of the de-duplication weight is inversely proportional to the relative distance.

Optionally, the acquisition module 32 in this embodiment includes: a first collection unit, configured to turn on the microphone closest to the sound source position of the first voice information so as to collect the first voice information; and a second collection unit, configured to turn on the microphone closest to the sound source position of the second voice information so as to collect the second voice information.

Optionally, the processing module 34 in this embodiment may further include: a second obtaining unit, configured to obtain the sound source position of the first voice information and the sound source position of the second voice information by means of camera positioning and/or voice position analysis; and a determining unit, configured to determine the relative distance according to the sound source position of the first voice information and the sound source position of the second voice information.

It should be noted that the above modules may be implemented by software or hardware. In the latter case, they may be implemented in, but are not limited to, the following ways: the modules are all located in the same processor; or the modules are located in different processors in any combination.
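For a software implementation, the module structure of FIG. 3 could be sketched as two small Python classes. This is an illustration only: the class names mirror modules 32 and 34, but the `record` callback, the 2-D positions, the overlap-by-length assumption, and the weight formula `scale_m / (scale_m + distance)` are hypothetical choices that the description leaves open.

```python
import math
from typing import Callable, Dict, List, Tuple

Position = Tuple[float, float]
Samples = List[float]


class AcquisitionModule:
    """Software counterpart of acquisition module 32."""

    def __init__(self, microphones: Dict[str, Position],
                 record: Callable[[str], Samples]) -> None:
        self.microphones = microphones
        self.record = record  # hypothetical hook: microphone name -> samples

    def acquire(self, source_position: Position) -> Samples:
        # Collection units: turn on only the microphone nearest to the source.
        nearest = min(
            self.microphones,
            key=lambda name: math.dist(self.microphones[name], source_position),
        )
        return self.record(nearest)


class ProcessingModule:
    """Software counterpart of processing module 34."""

    def __init__(self, scale_m: float = 1.0) -> None:
        self.scale_m = scale_m

    def process(self, first: Samples, first_pos: Position,
                second: Samples, second_pos: Position) -> Samples:
        # Determining unit: relative distance between the two sources.
        distance = math.dist(first_pos, second_pos)
        # Assumed weight that decreases as the distance increases.
        weight = self.scale_m / (self.scale_m + distance)
        # First obtaining unit: the coincident span of the two signals.
        overlap = min(len(first), len(second))
        # Processing unit: weighted removal over the coincident span.
        return [s - weight * second[i] if i < overlap else s
                for i, s in enumerate(first)]


if __name__ == "__main__":
    mics = {"near_speaker": (0.0, 0.0), "near_tv": (1.0, 0.0)}
    recordings = {"near_speaker": [1.0, 1.0], "near_tv": [0.5, 0.5]}
    acquisition = AcquisitionModule(mics, recordings.__getitem__)
    first = acquisition.acquire((0.0, 0.0))
    second = acquisition.acquire((1.0, 0.0))
    print(ProcessingModule().process(first, (0.0, 0.0), second, (1.0, 0.0)))
```

Keeping the recording hook injectable means the same classes can run in one processor or be split across several, in line with the note above.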

Example 3

Embodiments of the present invention also provide a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.

Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the steps of:

S1, acquiring first voice information and second voice information generated within a preset range, wherein the first voice information carries a designated voice wake-up word;

S2, performing de-duplication processing on the first voice information and the second voice information according to the de-duplication weight determined by the relative distance between the sound source position of the first voice information and the sound source position of the second voice information, to obtain third voice information.

Optionally, the storage medium is further arranged to store a computer program for performing the steps of:

Step S1, acquiring coincident voice information in which the first voice information and the second voice information coincide with each other;

Step S2, performing de-duplication processing on the first voice information according to the de-duplication weight and the coincident voice information to obtain the processed third voice information, wherein the magnitude of the de-duplication weight is inversely proportional to the relative distance.

Step S1, turning on the microphone closest to the sound source position of the first voice information to collect the first voice information;

Step S2, turning on the microphone closest to the sound source position of the second voice information to collect the second voice information.

Optionally, in this embodiment, the storage medium may include, but is not limited to, various media capable of storing computer programs, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.

Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.

It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for processing voice information, comprising:

acquiring first voice information and second voice information generated within a preset range, wherein the first voice information carries a designated voice wake-up word;

performing de-duplication processing on the first voice information and the second voice information according to a de-duplication weight determined by the relative distance between the sound source position of the first voice information and the sound source position of the second voice information, to obtain third voice information;

wherein acquiring the first voice information and the second voice information generated within the preset range comprises:

turning on the microphone closest to the sound source position of the first voice information to collect the first voice information;

and turning on the microphone closest to the sound source position of the second voice information to collect the second voice information.

2. The method according to claim 1, wherein performing de-duplication processing on the first voice information and the second voice information according to the de-duplication weight determined by the relative distance between the sound source position of the first voice information and the sound source position of the second voice information to obtain the third voice information comprises:

acquiring coincident voice information in which the first voice information and the second voice information coincide with each other;

and performing de-duplication processing on the first voice information according to the de-duplication weight and the coincident voice information to obtain the processed third voice information.

3. The method of claim 2, wherein the magnitude of the de-duplication weight is inversely proportional to the relative distance.

4. The method of claim 1, wherein the relative distance is obtained by:

acquiring the sound source position of the first voice information and the sound source position of the second voice information by means of camera positioning and/or voice position analysis;

and determining the relative distance according to the sound source position of the first voice information and the sound source position of the second voice information.

5. A device for processing voice information, comprising:

an acquisition module, configured to acquire first voice information and second voice information generated within a preset range, wherein the first voice information carries a designated voice wake-up word;

a processing module, configured to perform de-duplication processing on the first voice information and the second voice information according to a de-duplication weight determined by the relative distance between the sound source position of the first voice information and the sound source position of the second voice information, to obtain third voice information;

wherein the acquisition module comprises:

a first collection unit, configured to turn on the microphone closest to the sound source position of the first voice information so as to collect the first voice information;

and a second collection unit, configured to turn on the microphone closest to the sound source position of the second voice information so as to collect the second voice information.

6. The device of claim 5, wherein the processing module comprises:

a first obtaining unit, configured to obtain coincident voice information in which the first voice information and the second voice information coincide with each other;

and a processing unit, configured to perform de-duplication processing on the first voice information according to the de-duplication weight and the coincident voice information to obtain the processed third voice information.

7. The device of claim 6, wherein the magnitude of the de-duplication weight is inversely proportional to the relative distance.

8. The device of claim 5, wherein the processing module further comprises:

a second obtaining unit, configured to obtain the sound source position of the first voice information and the sound source position of the second voice information by means of camera positioning and/or voice position analysis;

and a determining unit, configured to determine the relative distance according to the sound source position of the first voice information and the sound source position of the second voice information.

9. A storage medium, in which a computer program is stored, wherein the computer program is arranged to perform the method of any of claims 1 to 4 when executed.