CN108053833A

CN108053833A - Method, apparatus, electronic device and storage medium for processing voice howling

Info

Publication number: CN108053833A
Application number: CN201711243578.XA
Authority: CN
Inventors: 杨宗业
Original assignee: Guangdong Oppo Mobile Telecommunications Corp Ltd
Current assignee: Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date: 2017-11-30
Filing date: 2017-11-30
Publication date: 2018-05-18

Abstract

This application discloses a method, an apparatus, an electronic device and a computer-readable storage medium for processing voice howling. The method includes: detecting whether the current scene is a close-range, multi-person, hands-free voice chat scenario; if so, collecting the voice signal in the chat scenario and converting the voice signal of the collected current frame into a frequency-domain signal; extracting features of the frequency-domain signal; judging, according to the features of the frequency-domain signal, whether the voice signal of the current frame is a howling signal; and if so, removing the voice signal of the current frame. The method can avoid the problem that excessive howling leaves echo suppression incomplete and in turn causes loopback self-excited amplification, and it improves double-talk performance and the user experience.

Description

Method, apparatus, electronic device and storage medium for processing voice howling

Technical field

This application relates to the field of voice processing technology, and in particular to a method, an apparatus, an electronic device and a computer-readable storage medium for processing voice howling.

Background

With the development of digital networks, numerous mobile phone voice intercom systems have appeared: mobile phone software makes it easy to reproduce a traditional intercom. The transmitting phone captures a voice signal through its microphone and sends it over the data network to the receiving end, and the receiving end plays the received voice signal through its loudspeaker, which forms a basic voice intercom system. In practice, however, for example when gaming, several people may use hands-free voice with one another indoors or at close range. The loudspeaker at the receiving end keeps emitting sound, that sound is picked up again by the microphone at the transmitting end, and the continuous loop often produces self-excitation of the sound. The result is a self-excited howling problem that gives users a poor experience.

In the related art, close-range howling is typically handled by strengthening echo suppression. However, strengthening echo suppression makes double-talk performance very poor in scenes with normal background sound, and the audio becomes intermittent and hard to hear, so the user experience deteriorates.

Summary of the invention

The present application aims to solve, at least to some extent, one of the technical problems described above.

To this end, a first object of the present application is to propose a method for processing voice howling. The method can avoid the problem that excessive howling leaves echo suppression incomplete and in turn causes loopback self-excited amplification, and it improves double-talk performance and the user experience.

A second object of the present application is to propose an apparatus for processing voice howling.

A third object of the present application is to propose an electronic device.

A fourth object of the present application is to propose a computer-readable storage medium.

To achieve the above objects, the method for processing voice howling proposed by an embodiment of the first aspect of the present application includes: detecting whether the current scene is a close-range, multi-person, hands-free voice chat scenario; if so, collecting the voice signal in the chat scenario and converting the voice signal of the collected current frame into a frequency-domain signal; extracting features of the frequency-domain signal; judging, according to the features of the frequency-domain signal, whether the voice signal of the current frame is a howling signal; and if so, removing the voice signal of the current frame.

To achieve the above objects, the apparatus for processing voice howling proposed by an embodiment of the second aspect of the present application includes: a detection module, configured to detect whether the current scene is a close-range, multi-person, hands-free voice chat scenario; a collection module, configured to collect the voice signal in the chat scenario when the current scene is detected to be such a scenario; a signal conversion module, configured to convert the voice signal of the collected current frame into a frequency-domain signal; a feature extraction module, configured to extract features of the frequency-domain signal; a judgment module, configured to judge, according to the features of the frequency-domain signal, whether the voice signal of the current frame is a howling signal; and a removal module, configured to remove the voice signal of the current frame when it is judged to be a howling signal.

To achieve the above objects, the electronic device proposed by an embodiment of the third aspect of the present application includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, it implements the method for processing voice howling described in the embodiment of the first aspect of the present application.

To achieve the above objects, the non-transitory computer-readable storage medium proposed by an embodiment of the fourth aspect of the present application stores a computer program which, when executed by a processor, implements the method for processing voice howling described in the embodiment of the first aspect of the present application.

According to the method, apparatus, electronic device and computer-readable storage medium for processing voice howling of the embodiments of the present application, it can be detected whether the current scene is a close-range, multi-person, hands-free voice chat scenario. If so, the voice signal in the chat scenario is collected, the voice signal of the collected current frame is converted into a frequency-domain signal, features of the frequency-domain signal are extracted, and those features are used to judge whether the voice signal of the current frame is a howling signal; if so, the voice signal of the current frame is removed. In other words, in a close-range multi-person hands-free voice chat scenario, close-range hands-free use produces an excessive amount of echo, while the background sound and the human voice signal are generally smaller. When a howling signal is detected from the features of the frequency-domain signal, that signal can be removed. This avoids the problem that excessive howling leaves echo suppression incomplete and in turn causes loopback self-excited amplification, and it improves the user experience.

Additional aspects and advantages of the present application will be set forth in part in the following description, and in part will become apparent from that description or be learned through practice of the application.

Brief description of the drawings

To explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of a method for processing voice howling according to one embodiment of the present application;

Fig. 2 is a structural diagram of an apparatus for processing voice howling according to one embodiment of the present application;

Fig. 3 is a structural diagram of an apparatus for processing voice howling according to a specific embodiment of the present application;

Fig. 4 is a structural diagram of an electronic device according to one embodiment of the present application.

Detailed description

Embodiments of the present application are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the present application and should not be construed as limiting it.

The method, apparatus, electronic device and computer-readable storage medium for processing voice howling of the embodiments of the present application are described below with reference to the drawings.

Fig. 1 is a flowchart of a method for processing voice howling according to one embodiment of the present application. It should be noted that the method of the embodiments of the present application can be applied to the apparatus for processing voice howling of the embodiments of the present application, and that the apparatus can be configured in an electronic device. In the embodiments of the present application, the electronic device can be a mobile terminal (such as a mobile phone, a tablet computer, a personal digital assistant, or another hardware device running any of various operating systems).

As shown in Fig. 1, the method for processing voice howling can include:

S110: detect whether the current scene is a close-range, multi-person, hands-free voice chat scenario. In the embodiments of the present application, this scenario refers to a scene in which several people at close range are voice chatting in hands-free mode.

It should be noted that the method of the embodiments of the present application is suited to the close-range multi-person hands-free voice chat scenario, for example the close-range hands-free scene of a multi-player competitive game. Because close-range hands-free use produces an excessive amount of echo, self-excited howling can arise in the hands-free scene.

In this step, it can be detected whether the current scene is the close-range multi-person hands-free voice chat scenario; if it is, step S120 is performed, that is, howling processing is applied to the collected voice signal. As an exemplary implementation, test audio can be sent to the current scene, and it can then be detected whether, within a certain period of time, the microphone picks up the test audio as output by the loudspeaker of another device that received it first. If so, the current scene can be determined to be the close-range multi-person hands-free voice chat scenario.
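The test-audio check can be illustrated with a small sketch. The source does not say how the microphone capture is compared against the test audio, so the sliding normalized correlation used here, the function name `contains_test_audio` and the `min_similarity` value are all assumptions made for illustration only.

```python
import math


def contains_test_audio(mic_samples, test_samples, min_similarity=0.8):
    """Return True if the test audio appears somewhere in the microphone capture.

    Assumption: similarity is measured by normalized cross-correlation over a
    sliding window; the patent text does not specify the comparison method.
    """
    m = len(test_samples)
    test_norm = math.sqrt(sum(x * x for x in test_samples))
    if test_norm == 0 or len(mic_samples) < m:
        return False
    best = 0.0
    for start in range(len(mic_samples) - m + 1):
        window = mic_samples[start:start + m]
        win_norm = math.sqrt(sum(x * x for x in window))
        if win_norm == 0:
            continue  # silent window, nothing to compare
        corr = sum(a * b for a, b in zip(window, test_samples)) / (win_norm * test_norm)
        best = max(best, corr)
    return best >= min_similarity


# Hypothetical test tone and a capture in which it returns attenuated and delayed.
test_tone = [math.sin(2 * math.pi * 3 * t / 32) for t in range(32)]
capture = [0.0] * 20 + [0.5 * x for x in test_tone] + [0.0] * 20
scenario_detected = contains_test_audio(capture, test_tone)
```

In this sketch the capture would be the microphone samples recorded during the detection period after the test audio was sent; a capture containing only silence yields `False`.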

S120: if so, collect the voice signal in the chat scenario and convert the voice signal of the collected current frame into a frequency-domain signal.

Optionally, when the current scene is detected to be the close-range multi-person hands-free voice chat scenario, the voice signal in the chat scenario can be collected through the microphone, and the voice signal of the collected current frame can be converted into a frequency-domain signal using a discrete Fourier transform, a discrete cosine transform, or a modified cosine transform. In the embodiments of the present application, the current frame is the frame of signal at the current moment, obtained by dividing the received voice signal into frames.
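A minimal sketch of the framing and transform step follows, using a naive discrete Fourier transform so that it needs only the standard library. The frame length, the sampling rate, the non-overlapping framing and the helper names `split_frames` and `dft` are assumptions; the source fixes none of them.

```python
import cmath
import math


def split_frames(samples, frame_len):
    """Split a list of samples into consecutive, non-overlapping frames.

    A trailing partial frame is dropped (an assumption of this sketch).
    """
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def dft(frame):
    """Naive O(N^2) discrete Fourier transform of one real-valued frame."""
    n = len(frame)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(frame))
            for k in range(n)]


# Hypothetical input: a 1 kHz tone sampled at 8 kHz, framed into 64-sample frames.
sample_rate = 8000
frame_len = 64
signal = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(256)]

frames = split_frames(signal, frame_len)
current_frame = frames[-1]       # the most recently received frame
spectrum = dft(current_frame)    # frequency-domain signal of the current frame
```

With these values the bin spacing is 8000 / 64 = 125 Hz, so the 1 kHz tone lands in bin 8. A production implementation would more likely use an FFT with windowing and overlapping frames.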

S130: extract features of the frequency-domain signal.

As an exemplary implementation, the frequency points in the frequency-domain signal and the single-frequency energy of each frequency point can be extracted. Optionally, multiple sampling frequency points in the frequency-domain signal can be determined according to a preset sampling frequency, where each sampling frequency point corresponds to one frequency.

For example, the collected voice signal is divided into frames, and at the preset sampling frequency each frame contains multiple sampling frequency points. A frequency point denotes a specific absolute frequency value: it is the number obtained by sampling each frame at the preset sampling frequency and then sorting all the frequencies collected in that frame. Each sampling frequency point therefore corresponds to one frequency.

After the frequency points in the frequency-domain signal have been extracted, the single-frequency energy of each frequency point can be determined. Because each sampling frequency point corresponds to one specific signal frequency, the single-frequency energy of a sampling frequency point is the energy of the signal frequency corresponding to that point (that is, the amplitude value of that signal frequency).
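The feature extraction described above might be sketched as follows. The function name `bin_features` is hypothetical, and the sketch assumes the frequency-domain signal is a DFT of a real-valued frame, so only bins 0 to N/2 carry distinct information. Following the text, the single-frequency energy is taken to be the amplitude of the bin.

```python
def bin_features(spectrum, sample_rate):
    """Return (frequency_hz, single_frequency_energy) for each frequency point.

    Bin k of an N-point DFT corresponds to the frequency k * sample_rate / N.
    The energy is taken as the bin amplitude, as stated in the description.
    """
    n = len(spectrum)
    return [(k * sample_rate / n, abs(spectrum[k])) for k in range(n // 2 + 1)]


# Hypothetical 8-point spectrum with all of its energy in bin 1 (and its mirror, bin 7).
example_spectrum = [0j, 3 + 4j, 0j, 0j, 0j, 0j, 0j, 3 - 4j]
features = bin_features(example_spectrum, sample_rate=8000)
```

Here `features[1]` is the pair for the 1000 Hz frequency point, whose amplitude is `abs(3 + 4j)`, that is 5.0.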

S140: judge, according to the features of the frequency-domain signal, whether the voice signal of the current frame is a howling signal.

Optionally, the frequency points in the frequency-domain signal and their single-frequency energy can be examined to judge whether the voice signal of the current frame is a howling signal. As an exemplary implementation, it can be judged whether the single-frequency energy of a frequency point in the frequency-domain signal rises exponentially within a preset time period; if so, it is further judged whether the single-frequency energy value after the exponential rise is greater than a preset threshold; and if so, the voice signal of the current frame can be determined to be a howling signal.

That is, the single-frequency energy of the frequency point can be tracked over the preset time period, and it can be judged whether that energy rises exponentially within the period and exceeds the preset threshold. If so, the voice signal of the current frame can be determined to be a howling signal.
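The two-stage judgment can be sketched as below. The source does not define how an exponential rise is tested, so this sketch assumes one simple criterion: every frame-to-frame energy ratio within the window must be at least `min_ratio`, which corresponds to geometric (or faster) growth. The function name `is_howling` and the default values of `min_ratio` and `threshold` are hypothetical.

```python
def is_howling(energy_history, min_ratio=1.5, threshold=100.0):
    """Judge howling from one frequency point's energy over a preset time window.

    energy_history: single-frequency energy of the same frequency point in
    consecutive frames, oldest first.
    Assumed criterion: each frame-to-frame ratio is at least min_ratio
    (geometric growth), and the latest energy exceeds the threshold.
    """
    if len(energy_history) < 3:
        return False  # too few frames to speak of a trend
    if any(e <= 0 for e in energy_history):
        return False  # ratios are undefined for non-positive energy
    ratios = [later / earlier
              for earlier, later in zip(energy_history, energy_history[1:])]
    rises_exponentially = all(r >= min_ratio for r in ratios)
    return rises_exponentially and energy_history[-1] > threshold


doubling = [1, 2, 4, 8, 16, 32, 64, 128]   # doubles every frame and ends above 100
steady_speech = [50, 60, 55, 200]          # loud at the end, but not a steady rise
result_doubling = is_howling(doubling)
result_speech = is_howling(steady_speech)
```

Checking the growth pattern before the threshold mirrors the order in the description: a loud but non-growing tone, or a growing but still quiet one, is not flagged. Fitting a line to the logarithm of the energy would be a more noise-tolerant alternative.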

S150: if so, remove the voice signal of the current frame.

Optionally, when the voice signal of the current frame is judged to be a howling signal, the signal on that path can be removed. As an example, the frequency-domain signal can be zeroed and converted back to the time domain to eliminate the howling signal. This avoids the problem that excessive howling leaves echo suppression incomplete.
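The zero-and-convert-back step might look like the following sketch. The source does not say whether the whole frame's spectrum or only the howling frequency points are zeroed, so the hypothetical function `remove_howling_frame` supports both: with no bins given it zeroes the entire frame, otherwise only the listed bins and their mirror bins. The naive transforms are included only to keep the example self-contained.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of one frame."""
    n = len(frame)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(frame))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse DFT; returns the real part of each time-domain sample."""
    n = len(spectrum)
    return [(sum(c * cmath.exp(2j * cmath.pi * k * t / n)
                 for k, c in enumerate(spectrum)) / n).real
            for t in range(n)]


def remove_howling_frame(frame, howling_bins=None):
    """Zero the frequency-domain signal and convert it back to the time domain.

    howling_bins=None zeroes the whole spectrum (the frame becomes silence).
    Otherwise only the listed bins and their conjugate-symmetric partners are
    zeroed; this selective variant is an assumption, not stated in the source.
    """
    spectrum = dft(frame)
    n = len(spectrum)
    if howling_bins is None:
        spectrum = [0j] * n
    else:
        for k in howling_bins:
            spectrum[k] = 0j
            spectrum[(n - k) % n] = 0j  # mirror bin of a real-valued signal
    return idft(spectrum)


# Hypothetical 16-sample frame: a wanted tone in bin 2 plus a howling tone in bin 5.
n = 16
wanted = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]
frame = [w + math.sin(2 * math.pi * 5 * t / n) for t, w in enumerate(wanted)]

cleaned = remove_howling_frame(frame, howling_bins=[5])
silenced = remove_howling_frame(frame)
```

After selective zeroing, `cleaned` matches the wanted tone up to floating-point error, while `silenced` contains only zeros.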

In summary, the method for processing voice howling of the embodiments of the present application targets the close-range multi-person hands-free voice chat scenario. Close-range hands-free use produces an excessive amount of echo, while the background sound and the human voice signal are generally smaller. When a signal whose single-frequency energy rises exponentially is detected and its amplitude exceeds a certain threshold, that signal can be filtered out. This reduces the amount of echo fed back in, avoids loopback self-excitation, and improves double-talk performance.

According to the method for processing voice howling of the embodiments of the present application, it can be detected whether the current scene is a close-range, multi-person, hands-free voice chat scenario. If so, the voice signal in the chat scenario is collected, the voice signal of the collected current frame is converted into a frequency-domain signal, features of the frequency-domain signal are extracted, and those features are used to judge whether the voice signal of the current frame is a howling signal; if so, the voice signal of the current frame is removed. In other words, in this scenario close-range hands-free use produces an excessive amount of echo, while the background sound and the human voice signal are generally smaller. When a howling signal is detected from the features of the frequency-domain signal, that signal can be removed. This avoids the problem that excessive howling leaves echo suppression incomplete and in turn causes loopback self-excited amplification, and it improves double-talk performance and the user experience.

Corresponding to the methods for processing voice howling provided by the above embodiments, an embodiment of the present application also provides an apparatus for processing voice howling. Because the apparatus corresponds to the methods provided by the above embodiments, the implementations of the method described above also apply to the apparatus provided in this embodiment and are not described in detail here. Fig. 2 is a structural diagram of an apparatus for processing voice howling according to one embodiment of the present application. It should be noted that the apparatus of the embodiments of the present application can be configured in an electronic device. In the embodiments of the present application, the electronic device can be a mobile terminal (such as a mobile phone, a tablet computer, a personal digital assistant, or another hardware device running any of various operating systems).

As shown in Fig. 2, the apparatus 200 for processing voice howling can include: a detection module 210, a collection module 220, a signal conversion module 230, a feature extraction module 240, a judgment module 250 and a removal module 260.

Specifically, the detection module 210 is configured to detect whether the current scene is a close-range, multi-person, hands-free voice chat scenario.

The collection module 220 is configured to collect the voice signal in the chat scenario when the current scene is detected to be the close-range multi-person hands-free voice chat scenario.

The signal conversion module 230 is configured to convert the voice signal of the collected current frame into a frequency-domain signal.

The feature extraction module 240 is configured to extract features of the frequency-domain signal. As an example, the feature extraction module 240 can extract the frequency points in the frequency-domain signal and the single-frequency energy of each frequency point.

The judgment module 250 is configured to judge, according to the features of the frequency-domain signal, whether the voice signal of the current frame is a howling signal. As an example, as shown in Fig. 3, the judgment module 250 may include: a first judging unit 251, a second judging unit 252 and a determination unit 253.

The first judging unit 251 is configured to judge whether the single-frequency energy of a frequency point in the frequency-domain signal rises exponentially within a preset time period. The second judging unit 252 is configured to judge, when that single-frequency energy rises exponentially within the preset time period, whether the single-frequency energy value after the exponential rise is greater than a preset threshold. The determination unit 253 is configured to determine that the voice signal of the current frame is a howling signal when the single-frequency energy value after the exponential rise is greater than the preset threshold.

The removal module 260 is configured to remove the voice signal of the current frame when it is judged to be a howling signal. As an example, the removal module 260 can zero the frequency-domain signal and convert it back to the time domain to eliminate the howling signal.

According to the apparatus for processing voice howling of the embodiments of the present application, the detection module detects whether the current scene is a close-range, multi-person, hands-free voice chat scenario. If so, the collection module collects the voice signal in the chat scenario, the signal conversion module converts the voice signal of the collected current frame into a frequency-domain signal, the feature extraction module extracts features of the frequency-domain signal, and the judgment module judges from those features whether the voice signal of the current frame is a howling signal; if so, the removal module removes the voice signal of the current frame. In other words, in this scenario close-range hands-free use produces an excessive amount of echo, while the background sound and the human voice signal are generally smaller. When a howling signal is detected from the features of the frequency-domain signal, that signal can be removed. This avoids the problem that excessive howling leaves echo suppression incomplete and in turn causes loopback self-excited amplification, and it improves double-talk performance and the user experience.

To implement the above embodiments, the present application also proposes an electronic device.

Fig. 4 is a structural diagram of an electronic device according to one embodiment of the present application. It should be noted that, in the embodiments of the present application, the electronic device can be a mobile terminal (such as a mobile phone, a tablet computer, a personal digital assistant, or another hardware device running any of various operating systems).

As shown in Fig. 4, the electronic device 400 can include: a memory 410, a processor 420, and a computer program 430 stored in the memory 410 and executable on the processor 420. When the processor 420 executes the program 430, it implements the method for processing voice howling described in any of the above embodiments of the present application.

To implement the above embodiments, the present application also proposes a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for processing voice howling described in any of the above embodiments of the present application.

In the description of the present application, it should be understood that the terms "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. A feature defined as "first" or "second" may therefore explicitly or implicitly include at least one such feature. In the description of the present application, "multiple" means at least two, for example two or three, unless specifically defined otherwise.

In the description of this specification, a description referring to the terms "one embodiment", "some embodiments", "example", "specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present application. In this specification, such expressions do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, provided they do not contradict one another, a person skilled in the art may combine the different embodiments or examples described in this specification and the features of those embodiments or examples.

Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment or portion of code that includes one or more executable instructions for implementing a specific logical function or step of the process. The scope of the preferred embodiments of the present application includes other implementations in which functions may be performed out of the order shown or discussed, including substantially simultaneously or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present application belong.

The logic and/or steps represented in a flowchart or otherwise described herein, for example an ordered list of executable instructions that can be regarded as implementing logical functions, may be embodied in any computer-readable medium for use by, or in combination with, an instruction execution system, apparatus or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus or device and execute them). For the purposes of this specification, a "computer-readable medium" can be any means that can contain, store, communicate, propagate or transmit a program for use by, or in combination with, an instruction execution system, apparatus or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a fiber-optic device, and a portable compact disc read-only memory (CDROM). The computer-readable medium can even be paper or another suitable medium on which the program is printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting or, if necessary, otherwise suitably processing it, and then stored in a computer memory.

It should be understood that the parts of the present application can be implemented in hardware, software, firmware or a combination thereof. In the above embodiments, multiple steps or methods can be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they can be implemented by any one or a combination of the following techniques well known in the art: a discrete logic circuit having logic gates for implementing logical functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.

A person of ordinary skill in the art will understand that all or some of the steps of the above embodiment methods can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination of them.

In addition, the functional units in the embodiments of the present application can be integrated into one processing module, can each exist physically on their own, or can be integrated two or more to a module. The integrated module can be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it can also be stored in a computer-readable storage medium.

The storage medium mentioned above can be a read-only memory, a magnetic disk, an optical disc, or the like. Although embodiments of the present application have been shown and described above, it should be understood that the above embodiments are exemplary and shall not be construed as limiting the present application; a person of ordinary skill in the art can change, modify, replace and vary the above embodiments within the scope of the present application.

Claims

1. A method for processing voice howling, characterized by comprising the following steps:

detecting whether the current scene is a close-range, multi-person, hands-free voice chat scenario;

if so, collecting the voice signal in the chat scenario, and converting the voice signal of the collected current frame into a frequency-domain signal;

extracting features of the frequency-domain signal;

judging, according to the features of the frequency-domain signal, whether the voice signal of the current frame is a howling signal;

if so, removing the voice signal of the current frame.

2. The method according to claim 1, characterized in that extracting the features of the frequency-domain signal comprises:

extracting the frequency points in the frequency-domain signal and the single-frequency energy of the frequency points.

3. The method according to claim 2, characterized in that judging, according to the features of the frequency-domain signal, whether the voice signal of the current frame is a howling signal comprises:

judging whether the single-frequency energy of a frequency point in the frequency-domain signal rises exponentially within a preset time period;

if so, further judging whether the single-frequency energy value after the exponential rise is greater than a preset threshold;

if so, determining that the voice signal of the current frame is a howling signal.

4. The method according to claim 1, characterized in that removing the voice signal of the current frame comprises:

zeroing the frequency-domain signal and converting it back to the time domain to eliminate the howling signal.

5. An apparatus for processing voice howling, characterized by comprising:

a detection module, configured to detect whether the current scene is a close-range, multi-person, hands-free voice chat scenario;

a collection module, configured to collect the voice signal in the chat scenario when the current scene is detected to be the close-range multi-person hands-free voice chat scenario;

a signal conversion module, configured to convert the voice signal of the collected current frame into a frequency-domain signal;

a feature extraction module, configured to extract features of the frequency-domain signal;

a judgment module, configured to judge, according to the features of the frequency-domain signal, whether the voice signal of the current frame is a howling signal;

a removal module, configured to remove the voice signal of the current frame when it is judged to be a howling signal.

6. The apparatus according to claim 5, characterized in that the feature extraction module is specifically configured to:

7. The apparatus according to claim 6, characterized in that the judgment module comprises:

a first judging unit, configured to judge whether the single-frequency energy of a frequency point in the frequency-domain signal rises exponentially within a preset time period;

a second judging unit, configured to judge, when the single-frequency energy of the frequency point in the frequency-domain signal rises exponentially within the preset time period, whether the single-frequency energy value after the exponential rise is greater than a preset threshold;

a determination unit, configured to determine that the voice signal of the current frame is a howling signal when the single-frequency energy value after the exponential rise is greater than the preset threshold.

8. The apparatus according to claim 5, characterized in that the removal module is specifically configured to:

9. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, the method for processing voice howling according to any one of claims 1 to 4 is implemented.

10. A non-transitory computer-readable storage medium storing a computer program, characterized in that, when the program is executed by a processor, the method for processing voice howling according to any one of claims 1 to 4 is implemented.