CN111899726A - Audio processing method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN111899726A
CN111899726A
Authority
CN
China
Prior art keywords
audio
time length
duration
identified
threshold value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010737225.0A
Other languages
Chinese (zh)
Inventor
李�杰
成凯
郭少军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Xiri Electronic Technology Co ltd
Original Assignee
Shanghai Xiri Electronic Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Xiri Electronic Technology Co ltd filed Critical Shanghai Xiri Electronic Technology Co ltd
Priority to CN202010737225.0A priority Critical patent/CN111899726A/en
Publication of CN111899726A publication Critical patent/CN111899726A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/84 Detection of presence or absence of voice signals for discriminating voice from noise
    • G10L2025/783 Detection of presence or absence of voice signals based on threshold decision

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Telephone Function (AREA)

Abstract

The embodiments of the present application provide an audio processing method and apparatus, an electronic device, and a storage medium. The audio processing method analyzes acquired audio segments based on the difference in duration between control voice and interfering sounds, and thereby determines valid audio segments. Because interfering sounds and control voice in the audio data can be distinguished, the accuracy of voice interaction with the device can be improved.

Description

Audio processing method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of data processing, and in particular, to an audio processing method and apparatus, an electronic device, and a storage medium.
Background
When a user interacts with a device through control voice, other interfering sounds in the environment interfere to some extent with the control voice uttered by the user. During voice interaction, it is difficult for the device to determine whether the audio data it acquires is control voice, which reduces the accuracy of the voice interaction.
Disclosure of Invention
To overcome at least one of the deficiencies in the prior art, an object of an embodiment of the present application is to provide an audio processing method, including:
acquiring audio data during voice control;
determining, from the audio data, an audio segment to be identified in which each silence duration does not exceed a first duration threshold, wherein within a silence duration the sound intensity of the audio data does not exceed a silence threshold;
and comparing the acquisition duration of the audio segment to be identified with a second duration threshold, and determining as a valid audio segment a segment whose acquisition duration exceeds the second duration threshold.
Optionally, the step of comparing the acquisition duration of the audio segment to be identified with a second duration threshold and determining a valid audio segment exceeding the second duration threshold includes:
comparing the acquisition time of the audio clip to be identified with the second time threshold;
if the acquisition duration of the audio segment to be identified exceeds the second duration threshold, determining the audio segment to be identified as the valid audio segment;
and if the acquisition time length of the audio clip to be identified does not exceed the second time length threshold, acquiring new audio data, and determining the audio clip to be identified with the silence time length not exceeding the first time length threshold from the audio data again.
Optionally, the step of determining, from the audio data, an audio segment to be identified whose silence duration does not exceed a first duration threshold each time includes:
determining at least one audio clip to be identified, of which each silence duration does not exceed a first duration threshold, from the audio data;
the step of comparing the acquisition time length of the audio clip to be identified with a second time length threshold value and determining the effective audio clip exceeding the second time length threshold value comprises the following steps:
and for each audio segment to be identified, comparing the recording duration of that segment with the second duration threshold.
Optionally, the method further comprises:
counting the recording durations of a plurality of valid speech segments;
and adjusting the second duration threshold according to the statistics of those recording durations to obtain a new second duration threshold.
Optionally, the step of adjusting the second duration threshold according to the statistical result of the recording durations of the plurality of valid voice segments includes:
fitting a Gaussian distribution to the recording durations of the plurality of valid speech segments;
and determining, according to the fitted Gaussian distribution, the duration whose confidence level exceeds a confidence threshold as the new second duration threshold.
Optionally, the step of adjusting the second duration threshold according to the statistical result of the recording durations of the plurality of valid voice segments includes:
sequencing the recording durations of the effective voice fragments according to an increasing or decreasing sequence to obtain a sequencing result;
and selecting the recording time length of a preset position in the sorting result as the new second time length threshold value according to the sorting result.
Optionally, a correspondence between voiceprint information and candidate second duration thresholds is recorded, and before comparing the acquisition duration of the audio segment to be identified with the second duration threshold, the method further includes:
acquiring voiceprint information of the audio clip to be identified;
and determining the second time length threshold value from the second time length threshold value to be matched according to the voiceprint information of the audio clip to be identified.
It is another object of the embodiments of the present application to provide an audio processing apparatus, including:
the audio acquisition module is used for acquiring audio data during voice control;
the segment determining module is used for determining an audio segment to be identified, of which each silence duration does not exceed a first duration threshold, from the audio data, wherein the sound intensity corresponding to the audio data does not exceed the silence threshold within the silence duration;
and the segment identification module is used for comparing the acquisition time length of the audio segment to be identified with a second time length threshold value and determining the effective audio segment exceeding the second time length threshold value.
Optionally, the audio processing apparatus further includes:
the duration counting module is used for counting the recording durations of the effective voice fragments;
and the duration adjusting module is used for adjusting the second duration threshold according to the statistical result of the recording durations of the effective voice fragments.
It is a further object of the embodiments of the present application to provide an electronic device, which includes a processor and a memory, where the memory stores machine executable instructions, and the machine executable instructions, when executed by the processor, implement the audio processing method.
It is a fourth object of the embodiments of the present application to provide a storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the audio processing method.
Compared with the prior art, the method has the following beneficial effects:
the embodiment of the application provides an audio processing method, an audio processing device, electronic equipment and a storage medium. According to the audio processing method, the obtained audio segments are analyzed by controlling the difference between the voice and the interference sound in the duration, and the effective audio segments are determined. Because the interference sound and the control voice in the audio data can be distinguished, the accuracy in the voice interaction process with the equipment can be improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
Fig. 1 is a flowchart illustrating steps of an audio processing method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of audio data provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of an effective audio clip acquisition provided by an embodiment of the present application;
fig. 4 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Icon: 500-a first audio piece; 600-a second audio piece; 1101-an audio acquisition module; 1102-a fragment determination module; 1103-a fragment recognition module; 1104-duration statistics module; 1105-duration adjustment module; 130-a processor; 120-a memory; 110-audio processing means.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
In the description of the present application, it is noted that the terms "first", "second", "third", and the like are used merely for distinguishing between descriptions and are not intended to indicate or imply relative importance.
As described in the background, when a user interacts with a device through control voice, other interfering sounds in the environment interfere to some extent with the control voice uttered by the user. During voice interaction, it is difficult for the device to determine whether the audio data it acquires is control voice, which reduces the accuracy of the voice interaction.
In view of this, an embodiment of the present application provides an audio processing method that distinguishes interfering sounds in the environment from control voice, improving the accuracy of speech recognition.
The audio processing method can be applied to the intelligent terminal. For example, smart speakers, smart televisions, smart phones, and the like. Of course, the audio processing method can also be applied to a server in communication connection with the intelligent terminal. It should be noted that, the embodiment of the present application is not limited to specific application objects of the audio processing method.
Referring to fig. 1, a flowchart of steps of the audio processing method according to an embodiment of the present application is shown. Taking the intelligent terminal as an example, the method including each step will be described in detail below.
Step S100, audio data during voice control is acquired.
Step S200, determining, from the audio data, an audio segment to be identified in which each silence duration does not exceed a first duration threshold, wherein within a silence duration the sound intensity of the audio data does not exceed a silence threshold.
It should be understood that when the user interacts with the intelligent terminal through control voice, the intelligent terminal needs to judge whether the user has finished a complete control sentence in order to obtain the corresponding audio segment to be recognized.
The pauses between the words of a complete control utterance are usually short, whereas after finishing a complete control utterance the user usually pauses for a longer time. These pauses in the user's speech are the silence durations. Based on this characteristic, the intelligent terminal can determine, from the silence durations in the audio data, the audio segments to be identified in which no silence duration exceeds the first duration threshold.
Taking 500 ms as the first duration threshold, please refer to fig. 2, a schematic diagram of audio data. The audio data includes a first audio segment 500 and a second audio segment 600. The first audio segment 500 contains several silence durations of 100 ms, 110 ms, and 105 ms, none exceeding 500 ms; the second audio segment 600 contains silence durations of 120 ms and 110 ms, likewise not exceeding 500 ms. Since the silence duration between the first audio segment 500 and the second audio segment 600 is 600 ms, the first audio segment 500 can be considered to correspond to one complete utterance and the second audio segment 600 to another.
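The segmentation just illustrated can be sketched in code. The sketch below is illustrative only: the frame length, the per-frame intensity representation, and all threshold values are assumptions, not details taken from the patent.

```python
def split_segments(frames, silence_threshold, first_duration_ms, frame_ms=10):
    """Split framed audio into candidate segments to be identified.

    frames: per-frame sound intensities (one value per frame_ms of audio).
    A frame is 'silent' if its intensity does not exceed silence_threshold.
    A silence run longer than first_duration_ms ends the current segment.
    """
    max_silent_frames = first_duration_ms // frame_ms
    segments, current, silent_run = [], [], 0
    for intensity in frames:
        if intensity <= silence_threshold:
            silent_run += 1
            if silent_run > max_silent_frames:
                # Long silence: close the segment and skip the silent tail.
                if current:
                    segments.append(current)
                    current = []
                continue
        else:
            silent_run = 0
        current.append(intensity)
    if current:
        segments.append(current)
    return segments
```

With a 500 ms first duration threshold and 10 ms frames, a 600 ms silence run splits the stream into two candidate segments, mirroring the two segments of fig. 2.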
Step S300, comparing the acquisition time length of the audio clip to be identified with a second time length threshold value, and determining the effective audio clip exceeding the second time length threshold value.
When a user interacts with the intelligent terminal through control voice, the audio data collected by the intelligent terminal may include both the user's control voice and interfering sounds from the environment. Research shows that the text corresponding to a complete control utterance usually includes a subject, a predicate, and an object.
For example: "I want to listen to a song by XX", "play today's news", and "set the air conditioner to 23 °C". That is, the text corresponding to a complete control utterance contains at least four characters, so at a normal speaking rate it takes the user at least a certain amount of time to finish a complete control utterance. Statistics show that the average duration of a complete control utterance spoken at normal speed is 800 ms.
The interfering sounds in the environment, by contrast, mainly include conversation between other users, tapping sounds, sounds produced by running machines, and the like.
For example, while the user speaks the control voice "I want to listen to a song by XX", other users may say "listen to what", "let's eat", "go shopping", "turn on the light", and so on. If control voice and interfering sounds are not distinguished, the text carried by the audio data collected by the intelligent terminal becomes "I want to listen to a song by XX listen to what", "I want to listen to a song by XX let's eat", "I want to listen to a song by XX go shopping", and the like. The intelligent terminal can then hardly determine the user's real intention from the audio data, and correct control becomes difficult.
Research shows that interfering sounds such as "listen to what", "let's eat", "go shopping", and "turn on the light" are not complete control utterances: they contain fewer characters than a complete control utterance, and correspondingly their duration is also shorter than the duration a complete control utterance requires.
Therefore, the intelligent terminal distinguishes control voice from interfering sound based on this difference in duration.
According to the audio processing method provided by the embodiments of the present application, the acquired audio segments are analyzed based on the difference in duration between control voice and interfering sounds, and valid audio segments are determined. Because interfering sounds and control voice in the audio data can be distinguished, the accuracy of voice interaction with the device can be improved.
In one application scenario, the intelligent terminal collects audio data in real time when determining valid audio segments. During collection, the intelligent terminal obtains audio segments to be identified one at a time, judges whether each is a valid audio segment, and stops audio collection once a valid audio segment has been determined.
Specifically, after the intelligent terminal is woken up by a specific control instruction, it starts to collect audio data in real time. For each collected audio segment to be identified, the intelligent terminal compares its recording duration with the second duration threshold to judge whether it is a valid audio segment.
If the recording duration of the audio segment to be identified exceeds the second duration threshold, the intelligent terminal determines it to be a valid audio segment and stops recording. That is, the audio segment is long enough to satisfy the duration required for a complete control utterance.
If the recording duration of the audio segment to be identified does not exceed the second duration threshold, the intelligent terminal collects a new audio segment to be identified and judges again whether it is valid. That is, the audio segment is too short to satisfy the duration required for a complete control utterance.
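The collect-and-check loop described above can be sketched as follows. `record_segment` is a hypothetical stand-in for the terminal's recorder, and the threshold value in the usage example is an assumption.

```python
def acquire_valid_segment(record_segment, second_duration_ms):
    """Keep recording until a segment long enough to be a full command arrives.

    record_segment: callable returning (audio, duration_ms) for the next
    candidate segment to be identified.
    """
    while True:
        audio, duration_ms = record_segment()
        if duration_ms > second_duration_ms:
            return audio  # long enough: treat as a valid audio segment
        # Too short: likely an interfering sound, discard and keep listening.
```

For example, a 200 ms knock would be discarded and the loop would continue until a 900 ms command segment is captured.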
For example, referring to fig. 3, suppose that while the intelligent terminal is collecting audio data in real time, a knocking sound appears in the environment. Because the duration of the knocking sound does not reach the second duration threshold, the intelligent terminal judges it to be an interfering sound, discards the corresponding audio segment to be identified, and continues real-time audio collection.
If, during subsequent collection, the audio segment "I want to listen to a song by xxx" is acquired, its duration reaches the second duration threshold. The intelligent terminal therefore judges it to be a valid audio segment, recognizes it locally or sends it to a server for recognition, and plays the corresponding song according to the recognition result.
In another application scenario, the intelligent terminal obtains audio data of a preset duration, which may contain several audio segments to be identified. The intelligent terminal determines, from the audio data, at least one audio segment to be identified in which no silence duration exceeds the first duration threshold, and then, for each such segment, compares its recording duration with the second duration threshold.
If the intelligent terminal determines several valid audio segments among the audio segments to be identified, then as one possible implementation it selects the first valid audio segment for recognition and performs the corresponding control according to the recognition result.
As another possible implementation, the intelligent terminal recognizes each valid audio segment and obtains a corresponding recognition result. It should be understood that the recognition result of a valid audio segment is the text corresponding to that segment. The intelligent terminal selects the recognition result with the most characters as the target recognition result and performs control accordingly.
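The second implementation, recognizing every valid segment and keeping the text with the most characters, can be sketched as below; `recognize` is a hypothetical recognition callable, not an API named in the patent.

```python
def pick_target_result(valid_segments, recognize):
    """Recognize every valid audio segment and keep the longest text.

    valid_segments: the valid audio segments determined in step S300.
    recognize: callable mapping an audio segment to its recognized text.
    """
    results = [recognize(segment) for segment in valid_segments]
    # The result with the most characters is taken as the target result.
    return max(results, key=len)
```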
Speaking habits differ between users: some users speak quickly, others slowly. If the same second duration threshold (e.g., 800 ms) is used for every user, these differing habits cannot be accommodated.
In view of this, the intelligent terminal counts the recording durations of a plurality of valid speech segments and adjusts the second duration threshold according to the statistics of those durations to obtain a new second duration threshold. Since different second duration thresholds can then be provided for different users, their differing speaking habits can be accommodated.
As one possible implementation of this statistical adjustment, the intelligent terminal fits a Gaussian distribution to the recording durations of the plurality of valid speech segments and, according to the fitted distribution, determines the duration whose confidence level exceeds the confidence threshold as the new second duration threshold.
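One way to read the Gaussian variant is to fit a normal distribution to the observed durations and take the lower bound of a one-sided confidence interval as the new threshold. This is a sketch under that reading; the confidence level (here z = 1.645) is an assumed design parameter that the text does not fix.

```python
import statistics

def gaussian_threshold(durations_ms, z=1.645):
    """Fit a Gaussian to observed command durations and return the lower
    confidence bound as a candidate second duration threshold.

    z = 1.645 corresponds roughly to a 95% one-sided interval; the exact
    confidence threshold is an assumption for illustration.
    """
    mu = statistics.mean(durations_ms)
    sigma = statistics.stdev(durations_ms)
    return mu - z * sigma
```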
As another implementation, the intelligent terminal sorts the recording durations of the plurality of valid speech segments in increasing or decreasing order and selects the recording duration at a preset position in the sorted result as the new second duration threshold. Compared with fitting a Gaussian distribution to the recording durations, this implementation demands relatively little computing power and improves computational efficiency.
For example, the intelligent terminal sorts the recording durations of 1000 valid speech segments in ascending order and, based on the sorted result, selects the 100th shortest recording duration as the new second duration threshold.
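The sorting-based variant in the example above can be sketched as follows. The 10% position mirrors the 100th-of-1000 example; treating the preset position as a fraction of the list length is an assumption.

```python
def percentile_threshold(durations_ms, position=0.1):
    """Sort observed durations ascending and take the duration at a preset
    position (here the 10% mark) as the new second duration threshold."""
    ordered = sorted(durations_ms)
    index = int(len(ordered) * position)
    return ordered[index]
```

This avoids estimating distribution parameters entirely, which is the efficiency advantage the description claims over the Gaussian variant.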
In addition, considering that the same intelligent terminal may be used by several users, the intelligent terminal records a correspondence between voiceprint information and candidate second duration thresholds. Before comparing the acquisition duration of an audio segment to be identified with the second duration threshold, the intelligent terminal therefore obtains the voiceprint information of that segment and, according to this voiceprint information, selects the second duration threshold from the candidate thresholds.
Therefore, the same intelligent terminal can adapt to the speaking habits of different users.
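The per-speaker matching step can be read as a lookup from a speaker's voiceprint identifier to the threshold recorded for that speaker. The table layout and the 800 ms fallback for unknown speakers are assumptions for illustration.

```python
DEFAULT_THRESHOLD_MS = 800  # assumed fallback when the speaker is unknown

def threshold_for_speaker(voiceprint_id, threshold_table):
    """Select the second duration threshold recorded for a voiceprint,
    falling back to a shared default for unrecognized speakers."""
    return threshold_table.get(voiceprint_id, DEFAULT_THRESHOLD_MS)
```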
The embodiment of the present application further provides an audio processing apparatus, which includes at least one functional module that can be stored in the memory 120 in the form of software. Referring to fig. 4, functionally, the audio processing apparatus may include:
an audio obtaining module 1101, configured to obtain audio data during voice control.
In the embodiment of the present application, the audio acquisition module 1101 is configured to perform step S100 in fig. 1, and as to the detailed description of the audio acquisition module 1101, reference may be made to the detailed description of step S100.
The segment determining module 1102 is configured to determine, from the audio data, an audio segment to be identified, where a silence duration does not exceed a first duration threshold each time, where, within the silence duration, a sound intensity corresponding to the audio data does not exceed a silence threshold.
In the embodiment of the present application, the fragment determining module 1102 is configured to execute step S200 in fig. 1, and as to the detailed description of the fragment determining module 1102, reference may be made to the detailed description of step S200.
The segment identifying module 1103 is configured to compare the acquisition duration of the audio segment to be identified with a second duration threshold, and determine an effective audio segment exceeding the second duration threshold.
In this embodiment of the application, the fragment recognition module 1103 is configured to perform step S300 in fig. 1, and as to the detailed description of the fragment recognition module 1103, reference may be made to the detailed description of step S300.
Wherein, this audio processing apparatus still includes:
and a duration counting module 1104, configured to count recording durations of the multiple valid voice segments.
A duration adjusting module 1105, configured to adjust the second duration threshold according to the statistical result of the recording durations of the multiple effective speech segments.
Referring to fig. 5, the electronic device includes an audio processing apparatus 110, a memory 120, and a processor 130. The electronic device may be an intelligent terminal, or a server in communication connection with the intelligent terminal.
The memory 120, the processor 130, and other components are electrically connected to each other directly or indirectly to enable data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines. The audio processing apparatus 110 includes at least one software function module which can be stored in the memory 120 in the form of software or firmware (firmware) or solidified in an Operating System (OS) of the electronic device. The processor 130 is used for executing executable modules stored in the memory 120, such as software functional modules and computer programs included in the audio processing device 110.
The Memory 120 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The memory 120 is used for storing a program, and the processor 130 executes the program after receiving an execution instruction.
The processor 130 may be an integrated circuit chip having signal processing capabilities. The Processor may be a general-purpose Processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like; but may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
An embodiment of the present application further provides a storage medium, where a computer program is stored, and when the computer program is executed by the processor 130, the audio processing method is implemented.
In summary, the embodiments of the present application provide an audio processing method and apparatus, an electronic device, and a storage medium. The audio processing method analyzes acquired audio segments based on the difference in duration between control voice and interfering sounds, and thereby determines valid audio segments. Because interfering sounds and control voice in the audio data can be distinguished, the accuracy of voice interaction with the device can be improved.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above description is only for various embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present application, and all such changes or substitutions are included in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (11)

1. A method of audio processing, the method comprising:
acquiring audio data during voice control;
determining, from the audio data, an audio segment to be identified in which each silence duration does not exceed a first duration threshold, wherein during a silence duration the sound intensity of the audio data does not exceed a silence threshold;
and comparing the acquisition duration of the audio segment to be identified with a second duration threshold, and determining a valid audio segment whose acquisition duration exceeds the second duration threshold.
2. The audio processing method according to claim 1, wherein the step of comparing the acquisition duration of the audio segment to be identified with a second duration threshold to determine a valid audio segment exceeding the second duration threshold comprises:
comparing the acquisition duration of the audio segment to be identified with the second duration threshold;
if the acquisition duration of the audio segment to be identified exceeds the second duration threshold, determining the audio segment to be identified as the valid audio segment;
and if the acquisition duration of the audio segment to be identified does not exceed the second duration threshold, acquiring new audio data, and again determining, from the audio data, an audio segment to be identified whose silence durations do not exceed the first duration threshold.
3. The audio processing method according to claim 1, wherein the step of determining, from the audio data, an audio segment to be identified in which each silence duration does not exceed a first duration threshold comprises:
determining, from the audio data, at least one audio segment to be identified in which each silence duration does not exceed the first duration threshold;
and the step of comparing the acquisition duration of the audio segment to be identified with a second duration threshold and determining a valid audio segment exceeding the second duration threshold comprises:
for each audio segment to be identified, comparing the recording duration of that audio segment with the second duration threshold.
4. The audio processing method of claim 1, wherein the method further comprises:
counting the recording durations of a plurality of valid speech segments;
and adjusting the second duration threshold according to the statistical result of the recording durations of the plurality of valid speech segments to obtain a new second duration threshold.
5. The audio processing method of claim 4, wherein the step of adjusting the second duration threshold according to the statistical result of the recording durations of the plurality of valid speech segments comprises:
obtaining a Gaussian distribution of the recording durations from the recording durations of the plurality of valid speech segments;
and determining, from the Gaussian distribution, the duration at which the confidence interval exceeds a confidence threshold as the new second duration threshold.
6. The audio processing method of claim 4, wherein the step of adjusting the second duration threshold according to the statistical result of the recording durations of the plurality of valid speech segments comprises:
sorting the recording durations of the plurality of valid speech segments in increasing or decreasing order to obtain a sorted result;
and selecting, from the sorted result, the recording duration at a preset position as the new second duration threshold.
7. The audio processing method according to claim 1, wherein a correspondence between voiceprint information and candidate second duration thresholds is recorded, and before comparing the acquisition duration of the audio segment to be identified with the second duration threshold, the method further comprises:
acquiring voiceprint information of the audio segment to be identified;
and determining the second duration threshold from the candidate second duration thresholds according to the voiceprint information of the audio segment to be identified.
8. An audio processing apparatus, characterized in that the audio processing apparatus comprises:
an audio acquisition module configured to acquire audio data during voice control;
a segment determination module configured to determine, from the audio data, an audio segment to be identified in which each silence duration does not exceed a first duration threshold, wherein during a silence duration the sound intensity of the audio data does not exceed a silence threshold;
and a segment identification module configured to compare the acquisition duration of the audio segment to be identified with a second duration threshold and determine a valid audio segment exceeding the second duration threshold.
9. The audio processing device according to claim 8, characterized in that the audio processing device further comprises:
a duration statistics module configured to count the recording durations of a plurality of valid speech segments;
and a duration adjustment module configured to adjust the second duration threshold according to the statistical result of the recording durations of the plurality of valid speech segments.
10. An electronic device comprising a processor and a memory, the memory storing machine executable instructions that, when executed by the processor, implement the audio processing method of any of claims 1-6.
11. A storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements an audio processing method according to any one of claims 1 to 7.
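The two threshold-adaptation strategies of claims 5 and 6 can be illustrated with a short sketch. The function names, the one-sided z-value, and the percentile position are assumptions chosen for illustration, not values specified by the patent:

```python
import statistics


def adapt_threshold_gaussian(durations, z=1.645):
    """Per claim 5: fit a Gaussian to the recording durations of valid
    speech segments and take the lower bound of a one-sided confidence
    interval as the new second duration threshold.
    z=1.645 (~95% one-sided) is an assumed confidence level."""
    mean = statistics.mean(durations)
    std = statistics.pstdev(durations)
    return mean - z * std


def adapt_threshold_rank(durations, position=0.1):
    """Per claim 6: sort the recording durations and pick the duration
    at a preset position in the sorted result as the new threshold.
    The 10th-percentile position is an assumed choice."""
    ordered = sorted(durations)
    index = int(position * (len(ordered) - 1))
    return ordered[index]
```

Either function yields a threshold slightly below the durations of typical valid commands, so that short interference sounds continue to be rejected while genuine short commands from the same speaker are not.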
CN202010737225.0A 2020-07-28 2020-07-28 Audio processing method and device, electronic equipment and storage medium Pending CN111899726A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010737225.0A CN111899726A (en) 2020-07-28 2020-07-28 Audio processing method and device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN111899726A (en) 2020-11-06

Family

ID=73190456

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010737225.0A Pending CN111899726A (en) 2020-07-28 2020-07-28 Audio processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111899726A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102523356A (en) * 2011-12-21 2012-06-27 上海会畅通讯科技发展有限公司 Method and device for detecting sound of meeting passages
CN105719642A (en) * 2016-02-29 2016-06-29 黄博 Continuous and long voice recognition method and system and hardware equipment
CN109817241A (en) * 2019-02-18 2019-05-28 腾讯音乐娱乐科技(深圳)有限公司 Audio-frequency processing method, device and storage medium
CN110364145A (en) * 2018-08-02 2019-10-22 腾讯科技(深圳)有限公司 A kind of method and device of the method for speech recognition, voice punctuate
CN111326154A (en) * 2020-03-02 2020-06-23 珠海格力电器股份有限公司 Voice interaction method and device, storage medium and electronic equipment



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination