CN117975983A

CN117975983A - Voice processing method, device, equipment and storage medium

Info

Publication number: CN117975983A
Application number: CN202211305355.2A
Authority: CN
Inventors: 周新权; 莫昌星
Original assignee: Douyin Vision Co Ltd
Current assignee: Douyin Vision Co Ltd
Priority date: 2022-10-24
Filing date: 2022-10-24
Publication date: 2024-05-03

Abstract

Embodiments of the present disclosure provide a voice processing method, apparatus, device, and storage medium. The method comprises: acquiring an index associated with multimedia, the multimedia comprising voice; adjusting, according to the index, a current voice processing configuration associated with the voice, wherein the voice processing configuration indicates the configuration used by a terminal device when processing the voice; and processing the voice based on the adjusted voice processing configuration. By adjusting the current voice processing configuration based on the multimedia-related index, the embodiments dynamically adjust the voice processing configuration so that the system resources required by voice processing and multimedia playback are allocated reasonably.

Description

Voice processing method, device, equipment and storage medium

Technical Field

The disclosure relates to the technical field of voice processing, and in particular relates to a voice processing method, a device, equipment and a storage medium.

Background

Performance is often the most critical indicator for in-game voice, and its most important aspect is the impact of in-game voice on game smoothness; specific indicators include frame rate, stutter rate, large and small stutters, etc. In the prior art, the voice processing parameters are fixed by hardware configuration according to the model of the terminal. For example, models are divided into high, medium, and low tiers with different configurations: the higher the tier, the higher the configuration, the better the call quality, and the greater the performance cost, and vice versa. However, it is not entirely reasonable to let in-game voice consume more performance resources on a high-end device, because a high-end device will generally run the game at high settings, and the game itself already consumes a great deal of resources. Static voice processing parameters are therefore not an entirely reasonable choice.

Disclosure of Invention

The embodiment of the disclosure provides a voice processing method, a device, equipment and a storage medium, which dynamically adjust voice processing configuration so as to reasonably allocate system resources required by voice processing and multimedia playing.

In a first aspect, an embodiment of the present disclosure provides a voice processing method, including: acquiring an index associated with multimedia, the multimedia comprising speech; according to the index, adjusting the current voice processing configuration associated with the voice, wherein the voice processing configuration indicates the processing configuration when the terminal equipment processes the voice; and processing the voice based on the adjusted voice processing configuration.

In a second aspect, an embodiment of the present disclosure further provides a voice processing apparatus, including: an index acquisition module, configured to acquire an index associated with multimedia, the multimedia comprising voice; a voice processing configuration adjustment module, configured to adjust a current voice processing configuration associated with the voice according to the index, where the voice processing configuration indicates the configuration used by the terminal device when processing the voice; and a first voice processing module, configured to process the voice based on the adjusted voice processing configuration.

In a third aspect, embodiments of the present disclosure further provide an electronic device, including:

one or more processors; and

a storage device configured to store one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the voice processing method described in the embodiments of the present disclosure.

In a fourth aspect, the disclosed embodiments also provide a storage medium containing computer-executable instructions, which when executed by a computer processor, are for performing a speech processing method as described in the disclosed embodiments.

According to the technical scheme, indexes associated with multimedia are acquired, wherein the multimedia comprises voice; according to the index, adjusting the current voice processing configuration associated with the voice, wherein the voice processing configuration indicates the processing configuration when the terminal equipment processes the voice; and processing the voice based on the adjusted voice processing configuration. According to the embodiment of the disclosure, the voice processing configuration can be dynamically adjusted by the scheme of adjusting the current voice processing configuration based on the index related to the multimedia, so that system resources required by voice processing and multimedia playing are reasonably distributed.

Drawings

The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale.

Fig. 1 is a schematic flow chart of a voice processing method according to an embodiment of the disclosure;

Fig. 2 is a schematic flow chart of another voice processing method according to an embodiment of the disclosure;

Fig. 3 is a schematic flow chart of another voice processing method according to an embodiment of the disclosure;

Fig. 4 is a schematic structural diagram of a voice processing apparatus according to an embodiment of the disclosure;

Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure.

Detailed Description

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure have been shown in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but are provided to provide a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.

It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.

The term "including" and variations thereof as used herein are open-ended, i.e., "including, but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Related definitions of other terms will be given in the description below.

It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.

It should be noted that the modifiers "a", "an", and "a plurality of" in this disclosure are illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.

The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.

It will be appreciated that, before the technical solutions disclosed in the embodiments of the present disclosure are used, the user should be informed, in an appropriate manner and in accordance with the relevant laws and regulations, of the type, scope of use, and usage scenarios of the personal information involved in the present disclosure, and the user's authorization should be obtained.

For example, in response to receiving an active request from a user, prompt information is sent to the user to state explicitly that the operation the user is requesting will require obtaining and using the user's personal information. The user can then decide, based on the prompt information, whether to provide personal information to the software or hardware, such as an electronic device, an application program, a server, or a storage medium, that executes the operations of the technical solution of the present disclosure.

As an alternative but non-limiting implementation, in response to receiving an active request from a user, the manner in which the prompt information is sent to the user may be, for example, a popup, in which the prompt information may be presented in a text manner. In addition, a selection control for the user to select to provide personal information to the electronic device in a 'consent' or 'disagreement' manner can be carried in the popup window.

It will be appreciated that the above-described notification and user authorization process is merely illustrative and not limiting of the implementations of the present disclosure, and that other ways of satisfying relevant legal regulations may be applied to the implementations of the present disclosure.

It will be appreciated that the data (including but not limited to the data itself, the acquisition or use of the data) involved in the present technical solution should comply with the corresponding legal regulations and the requirements of the relevant regulations.

Fig. 1 is a schematic flow chart of a voice processing method provided by an embodiment of the present disclosure. The embodiment is applicable to processing voice while a game is played during a voice call, or while a video is played during a voice call. The method may be performed by a voice processing apparatus, which may be implemented in software and/or hardware and, optionally, by an electronic device such as a mobile terminal, a PC, or a server. As shown in Fig. 1, the method includes:

S110, acquiring an index associated with multimedia, wherein the multimedia comprises voice.

Specifically, the index associated with the multimedia is acquired over a first set duration, which may be any value from 2 seconds to 10 seconds, for example 2 seconds, 3 seconds, or 5 seconds; this embodiment does not limit it. The index may be calculated by the voice processing module or obtained from the game service layer. Optionally, the index includes at least one of: the playback frame rate of the multimedia, the stutter rate during playback of the multimedia, the system resource occupancy rate of the terminal device, and the difference between the mic-on frame rate and the mic-off frame rate during playback of the multimedia.

The frame rate may be understood as the number of video frames per unit time. For example, if the first set duration is two seconds, the frame rate is the total number of video frames in those two seconds divided by two seconds. The stutter rate may be the stutter duration divided by the total duration. For example, if playback stutters twice within the first set duration (e.g., two seconds), for 0.1 seconds the first time and 0.5 seconds the second time, the stutter rate is (0.1 + 0.5)/2 = 0.3. The system resource occupancy rate may be the average, over a set period, of the central processing unit (CPU) occupancy and/or the memory or graphics processing unit (GPU) occupancy. For example, if the CPU occupancy is sampled once per second and the samples within the first set duration (e.g., two seconds) are 70% and 80%, the system resource occupancy rate is (70% + 80%)/2 = 75%.
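The three windowed statistics defined above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, and the 60-frame figure in the test is an assumed value; the stutter durations and occupancy samples are the ones given in the text.

```python
from typing import Sequence


def frame_rate(total_frames: int, window_s: float) -> float:
    """Frames rendered in the window divided by the window length in seconds."""
    return total_frames / window_s


def stutter_rate(stutter_durations_s: Sequence[float], window_s: float) -> float:
    """Total stutter time in the window divided by the window length."""
    return sum(stutter_durations_s) / window_s


def mean_occupancy(samples: Sequence[float]) -> float:
    """Average of the occupancy samples (e.g. one per second) taken in the window."""
    return sum(samples) / len(samples)


# Values from the text: a two-second window, stutters of 0.1 s and 0.5 s,
# and CPU occupancy samples of 70% and 80%.
window = 2.0
example_stutter = stutter_rate([0.1, 0.5], window)
example_occupancy = mean_occupancy([0.70, 0.80])
```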

The mic-on frame rate is the playback frame rate of the multimedia after the terminal device turns the microphone on, that is, the frame rate while the microphone is on. The mic-off frame rate is the playback frame rate of the multimedia after the terminal device turns the microphone off, that is, the frame rate while the microphone is off. The difference between the mic-on frame rate and the mic-off frame rate can be used to judge whether turning the microphone on affects the smoothness of game or video playback, and therefore serves as a basis for adjusting the voice processing configuration.

In this embodiment, the multimedia-related indexes, namely the frame rate, the stutter rate, the system resource occupancy rate, and the difference between the mic-on frame rate and the mic-off frame rate, serve as the concrete basis for subsequently deciding whether the current voice processing configuration needs to be adjusted.

S120, according to the index, the current voice processing configuration associated with the voice is adjusted.

The voice processing configuration indicates the configuration used by the terminal device when processing the voice. It consists of at least one item of voice processing configuration information, which comprises at least one of: artificial intelligence (AI) noise reduction on or off; an audio codec of high or low complexity; and echo cancellation on or off. A voice processing configuration may be understood as one of several configurations determined by this information. Taking these three items as an example: the highest configuration, for example the third voice processing configuration, has AI noise reduction on, a high-complexity audio codec, and echo cancellation on. The lowest configuration, for example the first voice processing configuration, has AI noise reduction off, a low-complexity audio codec, and echo cancellation off. All remaining combinations may be intermediate configurations, for example the second voice processing configuration. Turning AI noise reduction on gives better noise suppression, turning echo cancellation on gives better sound quality, and a higher-complexity audio codec gives better coded sound quality at the same bit rate, so a better voice call experience is obtained.

In this embodiment, the voice processing configuration information covers AI noise reduction on or off, a high- or low-complexity audio codec, and echo cancellation on or off, which facilitates the subsequent adjustment of the voice processing configuration.
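One way to picture the three switches and the resulting first, second, and third configurations is the sketch below. The `SpeechConfig` class, its field names, and the `tier` function are hypothetical names introduced only for illustration; the mapping itself follows the description above (all three on is the highest, all three off is the lowest, anything else is intermediate).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeechConfig:
    """Hypothetical container for the three configuration items in the text."""
    ai_noise_reduction: bool      # AI noise reduction on/off
    high_complexity_codec: bool   # True = high-complexity codec, False = low
    echo_cancellation: bool       # echo cancellation on/off


def tier(cfg: SpeechConfig) -> str:
    """Map a configuration to the first/second/third configuration of the text."""
    flags = (cfg.ai_noise_reduction, cfg.high_complexity_codec, cfg.echo_cancellation)
    if all(flags):
        return "third"    # highest configuration
    if not any(flags):
        return "first"    # lowest configuration
    return "second"       # every remaining combination is intermediate
```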

In this embodiment, the current voice processing configuration may be adjusted by determining whether each multimedia playback index falls within its corresponding range during game or video playback, so that the voice processing configuration is adjusted dynamically. It should be noted that, before the current voice processing configuration is adjusted according to the multimedia-related indexes, the indexes may be preprocessed, for example by computing mean and variance statistics and rejecting outliers, so that the subsequent decisions are based on accurate index values.
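The text names mean and variance statistics and outlier rejection but does not specify a method. The sketch below assumes one common choice: drop samples farther than `k` standard deviations from the mean, then average what remains. The function names and the default `k = 1.5` are assumptions made for illustration, not values from the source.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def reject_outliers(samples: Sequence[float], k: float = 1.5) -> List[float]:
    """Keep samples within k population standard deviations of the mean.

    The k-sigma rule and the default k are assumptions; the source only says
    that outliers may be rejected.
    """
    mu = mean(samples)
    sigma = pstdev(samples)
    if sigma == 0:
        return list(samples)  # all samples identical, nothing to reject
    kept = [x for x in samples if abs(x - mu) <= k * sigma]
    return kept if kept else list(samples)  # never return an empty window


def preprocess(samples: Sequence[float], k: float = 1.5) -> float:
    """Return the mean of the samples after outlier rejection."""
    return mean(reject_outliers(samples, k))
```

For instance, with per-second frame-rate samples of 30, 29, 31, 30, and a single dip to 5, the dip is discarded and the remaining samples are averaged.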

Optionally, the current voice processing configuration may be adjusted according to the index as follows: the voice processing configuration is adjusted if the index meets any one of the following: the difference between the playback frame rate of the multimedia and the target frame rate is greater than a first set threshold; the stutter rate during playback of the multimedia is greater than a second set threshold; the system resource occupancy rate of the terminal device exceeds a third set threshold; the difference between the mic-on frame rate and the mic-off frame rate during playback of the multimedia is greater than a fourth set threshold.

The target frame rate may be understood as a desired frame rate set in advance. The frame rate, by contrast, is the actual frame rate of the game or video during playback. For example, if the target frame rate is 30, the frame rate is 20, and the first set threshold is 5, the difference between the frame rate and the target frame rate is 10, which is greater than the first set threshold. A difference greater than the first set threshold indicates that the actual frame rate is far below the target frame rate, and the voice processing configuration needs to be lowered.

If the stutter rate is greater than the second set threshold, for example the second set threshold is 0.2 and the stutter rate is 0.4, the current game or video playback is not smooth enough, and the voice processing configuration needs to be lowered.

If the system resource occupancy rate exceeds the third set threshold, for example the system resource occupancy rate during voice processing is 80% and the third set threshold is 60%, the voice processing configuration needs to be lowered so that voice processing occupies fewer system resources and more system resources are available for game or video playback, thereby ensuring the smoothness of the game or video picture.

If the difference between the mic-on frame rate and the mic-off frame rate is greater than the fourth set threshold, for example the mic-on frame rate is 20, the mic-off frame rate is 40, the difference is 20, and the fourth set threshold is 5, turning the microphone on may be affecting the smoothness of game or video playback, and the voice processing configuration needs to be lowered.

Fig. 2 is a schematic flow chart of another voice processing method according to an embodiment of the disclosure. First, it is determined whether the difference between the playback frame rate of the multimedia and the target frame rate is greater than the first set threshold; if so, the voice processing configuration is adjusted. Otherwise, it is determined whether the difference between the mic-on frame rate and the mic-off frame rate during playback is greater than the fourth set threshold; if so, the voice processing configuration is adjusted. Otherwise, it is determined whether the stutter rate during playback is greater than the second set threshold; if so, the voice processing configuration is adjusted. Otherwise, it is determined whether the system resource occupancy rate of the terminal device exceeds the third set threshold; if so, the voice processing configuration is adjusted. Otherwise, the voice processing configuration is not adjusted.

In this embodiment, the voice processing configuration is lowered if the index satisfies any one of the following, so that smooth playback of the game or video picture can be ensured: the difference between the frame rate and the target frame rate is greater than the first set threshold; the stutter rate is greater than the second set threshold; the system resource occupancy rate exceeds the third set threshold; the difference between the mic-on frame rate and the mic-off frame rate is greater than the fourth set threshold.
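The four checks can be sketched as a short cascade in the order Fig. 2 describes them. The class and function names are hypothetical. The default thresholds are the example values from the text (5, 0.2, 60%, 5). The sign conventions (target minus actual frame rate, mic-off minus mic-on frame rate) are inferred from the worked examples and are an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    frame_rate: float          # actual playback frame rate
    mic_on_frame_rate: float   # frame rate while the microphone is on
    mic_off_frame_rate: float  # frame rate while the microphone is off
    stutter_rate: float        # stutter time / window length
    resource_occupancy: float  # e.g. averaged CPU occupancy, 0.0 to 1.0


@dataclass(frozen=True)
class DowngradeThresholds:
    frame_gap: float = 5.0     # first set threshold (example value)
    stutter: float = 0.2       # second set threshold (example value)
    occupancy: float = 0.60    # third set threshold (example value)
    mic_gap: float = 5.0       # fourth set threshold (example value)


def should_downgrade(m: Metrics, target_frame_rate: float,
                     t: DowngradeThresholds = DowngradeThresholds()) -> bool:
    """Return True if any one of the four conditions holds (Fig. 2 order)."""
    if target_frame_rate - m.frame_rate > t.frame_gap:
        return True
    if m.mic_off_frame_rate - m.mic_on_frame_rate > t.mic_gap:
        return True
    if m.stutter_rate > t.stutter:
        return True
    return m.resource_occupancy > t.occupancy
```

Because the outcome is a logical OR, the order of the checks does not change the result; it only determines which check short-circuits first.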

Optionally, the manner of adjusting the voice processing configuration may be: if the artificial intelligence AI noise reduction function of the voice is started, the AI noise reduction function is closed; and/or if the complexity of the audio codec of the speech is high, adjusting the complexity of the audio codec to low complexity; and/or if the echo cancellation function for speech is on, the echo cancellation function is off.

The AI noise reduction function is a function of performing noise reduction processing on the collected voice; the echo cancellation function is a function of performing echo cancellation on the collected voice.

Specifically, if AI noise reduction is on, AI noise reduction is turned off; and/or, if AI noise reduction is off, it is determined whether the audio codec is of high complexity, and if so, the audio codec is adjusted to low complexity; and/or, if the audio codec is of low complexity, it is determined whether echo cancellation is on, and if so, echo cancellation is turned off.

Specifically, if the voice processing configuration needs to be lowered, the adjustment is performed in the following order. First, it is determined whether AI noise reduction is on; if it is on, AI noise reduction is turned off. If AI noise reduction is already off, it is determined whether the audio codec is of high complexity; if so, the audio codec is adjusted to low complexity, which simplifies the audio processing flow by using a lighter-weight audio processing algorithm. If the audio codec is already of low complexity, it is finally determined whether echo cancellation is on; if it is on, echo cancellation is turned off.
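The ordered downgrade can be sketched as a function that lowers exactly one item per call. Changing one item at a time is an interpretation of the sequence described above (the "and/or" wording would also allow several changes at once). `SpeechConfig` and `downgrade_one_step` are hypothetical names.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SpeechConfig:
    ai_noise_reduction: bool
    high_complexity_codec: bool
    echo_cancellation: bool


def downgrade_one_step(cfg: SpeechConfig) -> SpeechConfig:
    """Lower the configuration by one step: AI NR, then codec, then echo cancellation."""
    if cfg.ai_noise_reduction:
        return replace(cfg, ai_noise_reduction=False)
    if cfg.high_complexity_codec:
        return replace(cfg, high_complexity_codec=False)
    if cfg.echo_cancellation:
        return replace(cfg, echo_cancellation=False)
    return cfg  # already at the lowest configuration, nothing left to turn off
```

Starting from the highest configuration, three successive calls reach the lowest configuration, and a fourth call leaves it unchanged.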

In this embodiment, if the AI noise reduction function for the voice is on, the AI noise reduction function is turned off; and/or, if the audio codec for the voice is of high complexity, it is adjusted to low complexity; and/or, if the echo cancellation function for the voice is on, the echo cancellation function is turned off. Through this downgrade decision, the voice processing configuration can be lowered effectively during game or video playback, that is, a lighter-weight audio processing configuration is used to ensure the smoothness of the game or video picture.

Optionally, the current voice processing configuration may be adjusted according to the index as follows: the voice processing configuration is adjusted if the index meets all of the following at the same time: the difference between the playback frame rate of the multimedia and the target frame rate is smaller than a fifth set threshold; the stutter rate during playback of the multimedia is smaller than a sixth set threshold; the system resource occupancy rate of the terminal device is lower than a seventh set threshold; and the difference between the mic-on frame rate and the mic-off frame rate during playback of the multimedia is smaller than an eighth set threshold.

The fifth set threshold is different from, and smaller than, the first set threshold; for example, the fifth set threshold is 3. The sixth set threshold is different from, and smaller than, the second set threshold; for example, the sixth set threshold is 0.2. The seventh set threshold is different from, and smaller than, the third set threshold; for example, the seventh set threshold is 40%. The eighth set threshold is different from, and smaller than, the fourth set threshold; for example, the eighth set threshold is 3.

In this embodiment, the voice processing configuration is raised if the indexes meet all of the following at the same time, so that call voice quality is improved while smooth playback of the game or video picture is ensured: the difference between the frame rate and the target frame rate is smaller than the fifth set threshold; the stutter rate is smaller than the sixth set threshold; the system resource occupancy rate is lower than the seventh set threshold; and the difference between the mic-on frame rate and the mic-off frame rate is smaller than the eighth set threshold.

It should be noted that the voice processing configuration does not need to be adjusted if the difference between the frame rate and the target frame rate lies between the fifth and first set thresholds, the stutter rate lies between the sixth and second set thresholds, the system resource occupancy rate lies between the seventh and third set thresholds, and the difference between the mic-on frame rate and the mic-off frame rate lies between the eighth and fourth set thresholds.
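Taken together, the two sets of thresholds form a hysteresis band per index: above the upper bound on any index, lower the configuration; below the lower bound on every index, raise it; otherwise leave it alone. The sketch below is one reading of that rule. The names `Band`, `decide`, and `EXAMPLE_BANDS` are hypothetical, and treating every case that is neither a downgrade nor an upgrade as "hold" is an assumption that is slightly broader than the text's wording. The band values follow the examples in the text, except the lower stutter bound of 0.1, which is an assumed value chosen to sit strictly below the upper bound of 0.2.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Band:
    lower: float  # upgrade threshold (fifth to eighth set thresholds)
    upper: float  # downgrade threshold (first to fourth set thresholds)


def decide(values: Dict[str, float], bands: Dict[str, Band]) -> str:
    """Return 'downgrade', 'upgrade', or 'hold' for one measurement window."""
    if any(values[name] > band.upper for name, band in bands.items()):
        return "downgrade"   # any single index over its upper bound
    if all(values[name] < band.lower for name, band in bands.items()):
        return "upgrade"     # every index under its lower bound
    return "hold"            # inside the band: leave the configuration alone


# Illustrative bands; the 0.1 lower stutter bound is an assumption.
EXAMPLE_BANDS = {
    "frame_gap": Band(lower=3.0, upper=5.0),
    "stutter": Band(lower=0.1, upper=0.2),
    "occupancy": Band(lower=0.40, upper=0.60),
    "mic_gap": Band(lower=3.0, upper=5.0),
}
```

Keeping the lower bounds strictly below the upper bounds leaves a dead zone between them, which prevents the configuration from flipping back and forth when an index hovers near a single threshold.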

Optionally, the manner of adjusting the voice processing configuration may be: if the echo cancellation function of the voice is closed, starting echo cancellation; and/or if the complexity of the audio codec of the speech is low, adjusting the complexity of the audio codec to high complexity; and/or if the AI noise reduction function of the voice is closed, starting AI noise reduction.

Specifically, if echo cancellation is off, echo cancellation is turned on; and/or, if echo cancellation is on, it is determined whether the audio codec is of low complexity, and if so, the audio codec is adjusted to high complexity; and/or, if the audio codec is of high complexity, it is determined whether AI noise reduction is off, and if so, AI noise reduction is turned on.

Specifically, if the voice processing configuration needs to be raised, the adjustment is performed in the following order. First, it is determined whether echo cancellation is off; if it is off, echo cancellation is turned on. If echo cancellation is already on, it is determined whether the audio codec is of low complexity; if so, the audio codec is adjusted to high complexity. If the audio codec is already of high complexity, it is finally determined whether AI noise reduction is off; if it is off, AI noise reduction is turned on.
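The upgrade order is the reverse of the downgrade order and can be sketched the same way. As before, raising one item per call is an interpretation, and `SpeechConfig` and `upgrade_one_step` are hypothetical names.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SpeechConfig:
    ai_noise_reduction: bool
    high_complexity_codec: bool
    echo_cancellation: bool


def upgrade_one_step(cfg: SpeechConfig) -> SpeechConfig:
    """Raise the configuration by one step: echo cancellation, then codec, then AI NR."""
    if not cfg.echo_cancellation:
        return replace(cfg, echo_cancellation=True)
    if not cfg.high_complexity_codec:
        return replace(cfg, high_complexity_codec=True)
    if not cfg.ai_noise_reduction:
        return replace(cfg, ai_noise_reduction=True)
    return cfg  # already at the highest configuration
```

Restoring echo cancellation first and AI noise reduction last mirrors the downgrade sequence, so the last feature removed is the first one restored.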

In this embodiment, if the echo cancellation function for voice is turned off, the echo cancellation is turned on; and/or if the complexity of the audio codec of the speech is low, adjusting the complexity of the audio codec to high complexity; and/or if the AI noise reduction function of the voice is closed, starting AI noise reduction. By the decision of upgrading the voice processing configuration, the voice processing configuration can be effectively upgraded in the playing process of the game or the video picture, so that the quality of voice communication is effectively improved.

It should be noted that if, in the course of these checks, echo cancellation is found to be already on, the audio codec already of high complexity, and AI noise reduction already on, then both the playback smoothness of the game or video picture and the voice quality of the call are in a good state.

S130, processing the voice based on the adjusted voice processing configuration.

In this embodiment, the voice may be processed by using a lighter-weight audio processing algorithm to simplify the audio processing procedure. According to the embodiment, the voice is processed based on the adjusted voice processing configuration, so that system resources required by voice processing and multimedia playing can be dynamically and reasonably allocated, and the user experience is improved.

Optionally, after processing the voice based on the adjusted voice processing configuration, the method further includes: acquiring the index again after a set time length from the beginning of processing the collected voice based on the adjusted voice processing configuration; and adjusting the voice processing configuration according to the re-acquired index.

The set duration is greater than the first set duration; for example, the set duration is 5 seconds and the first set duration is 2 seconds. In this embodiment, after the multimedia-related index has been acquired within the first set duration and the current voice processing configuration has been adjusted based on it, the game or video runs for the set duration with the adjusted voice processing configuration; the multimedia-related index is then acquired again, and the voice processing configuration is adjusted according to the re-acquired index. Allowing time for the adjusted configuration to take effect improves the accuracy of the next adjustment.

Fig. 3 is a schematic flow chart of another voice processing method according to an embodiment of the disclosure. The method comprises the following specific steps:

S210, acquiring indexes associated with the multimedia.

The index comprises the playback frame rate of the multimedia, the stutter rate during playback of the multimedia, the system resource occupancy rate of the terminal device, and the difference between the mic-on frame rate and the mic-off frame rate during playback of the multimedia.

S220, preprocessing the index related to the multimedia to obtain the index related to the preprocessed multimedia.

S230, judging whether the current voice processing configuration related to the voice needs to be adjusted according to the preprocessed multimedia related index, if so, executing S240, otherwise, executing S210 again after the set duration.

S240, judging whether the current voice processing configuration associated with the voice needs to be adjusted again, if so, executing S250, otherwise, executing S260.

S250, adjusting voice processing configuration.

S260, reminding that the voice processing configuration cannot be adjusted.
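Steps S210 to S260 can be sketched as one pass of a control loop. This is a hedged illustration: all names are hypothetical, the collaborators are passed in as callables so the sketch stays independent of any real audio stack, and S240 is read here as a check of whether the configuration can still be adjusted (for example, it is not already at its lowest setting), an interpretation suggested by S260 rather than stated in the text.

```python
from typing import Callable, Sequence


def control_step(
    read_index: Callable[[], Sequence[float]],
    preprocess: Callable[[Sequence[float]], float],
    needs_adjustment: Callable[[float], bool],
    can_adjust: Callable[[], bool],
    adjust: Callable[[], None],
    notify: Callable[[str], None],
) -> str:
    """Run one pass of the Fig. 3 flow and report which branch was taken."""
    raw = read_index()                    # S210: acquire the index
    index = preprocess(raw)               # S220: preprocess it
    if not needs_adjustment(index):       # S230: is an adjustment needed?
        return "recheck_later"            # caller repeats S210 after the set duration
    if can_adjust():                      # S240 (interpretation, see lead-in)
        adjust()                          # S250: adjust the configuration
        return "adjusted"
    notify("voice processing configuration cannot be adjusted")  # S260
    return "cannot_adjust"
```

A caller would invoke `control_step` repeatedly, waiting for the set duration between passes so that each measurement reflects the most recent configuration.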

According to the technical scheme, indexes associated with the multimedia are acquired, and the multimedia comprises voice; according to the index, adjusting the current voice processing configuration associated with the voice, wherein the voice processing configuration indicates the processing configuration of the terminal equipment when the voice is processed; and processing the voice based on the adjusted voice processing configuration. According to the embodiment of the disclosure, the voice processing configuration can be dynamically adjusted by the scheme of adjusting the current voice processing configuration based on the index related to the multimedia, so that system resources required by voice processing and multimedia playing are reasonably distributed.

Fig. 4 is a schematic structural diagram of a voice processing apparatus according to an embodiment of the disclosure. As shown in fig. 4, the apparatus includes: an index acquisition module 401, a speech processing configuration adjustment module 402, and a first speech processing module 403.

An index obtaining module 401, configured to obtain an index associated with multimedia, where the multimedia includes speech;

a voice processing configuration adjustment module 402, configured to adjust a current voice processing configuration associated with the voice according to the index, where the voice processing configuration indicates a processing configuration when the terminal device processes the voice;

the first speech processing module 403 is configured to process the speech based on the adjusted speech processing configuration.

According to the technical scheme provided by the embodiment of the disclosure, the index obtaining module is used for obtaining the index associated with the multimedia, wherein the multimedia comprises voice; adjusting current voice processing configuration related to the voice through a voice processing configuration adjusting module according to the index, wherein the voice processing configuration indicates processing configuration when terminal equipment processes the voice; and processing the voice through the first voice processing module based on the adjusted voice processing configuration. According to the embodiment of the disclosure, the voice processing configuration can be dynamically adjusted by the scheme of adjusting the current voice processing configuration based on the index related to the multimedia, so that system resources required by voice processing and multimedia playing are reasonably distributed.

Optionally, the index includes at least one of a playing frame rate of the multimedia, a stutter rate during playback of the multimedia, a system resource occupancy rate of the terminal device, and a difference between a mic-on frame rate and a mic-off frame rate during playback of the multimedia; the mic-on frame rate is the playing frame rate of the multimedia after the terminal device turns the microphone on; the mic-off frame rate is the playing frame rate of the multimedia after the terminal device turns the microphone off.
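The four indexes can be grouped into one record, as in the following minimal Python sketch. The class name `Indexes`, its field names, and the 0.0 to 1.0 ranges are assumptions made for illustration; the source does not say whether the frame rate difference is signed, so an absolute difference is used here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Indexes:
    play_fps: float        # playing frame rate of the multimedia
    stutter_rate: float    # stutter rate during playback (0.0-1.0 assumed)
    resource_usage: float  # system resource occupancy (0.0-1.0 assumed)
    mic_on_fps: float      # playing frame rate after the microphone is turned on
    mic_off_fps: float     # playing frame rate after the microphone is turned off

    @property
    def mic_fps_gap(self) -> float:
        """Absolute difference between the mic-on and mic-off frame rates."""
        return abs(self.mic_on_fps - self.mic_off_fps)
```

For example, a mic-off frame rate of 60 and a mic-on frame rate of 52 give a gap of 8 frames per second, which would suggest that voice processing is costing the multimedia noticeable performance.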

Optionally, the speech processing configuration adjustment module 402 is specifically configured to adjust the voice processing configuration if the index meets any one of the following conditions: the difference between the playing frame rate of the multimedia and the target frame rate is greater than a first set threshold; the stutter rate during playback of the multimedia is greater than a second set threshold; the system resource occupancy rate of the terminal device exceeds a third set threshold; the difference between the mic-on frame rate and the mic-off frame rate during playback of the multimedia is greater than a fourth set threshold.
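The "any one of" test can be sketched in Python as follows. The threshold values, the dictionary keys, and the function name `should_downgrade` are hypothetical; the source names the thresholds only as "first" to "fourth", and the frame rate difference is assumed here to be the target frame rate minus the playing frame rate.

```python
from typing import Mapping

# Hypothetical values; the source only names these "first" to "fourth" set thresholds.
DOWNGRADE_THRESHOLDS = {
    "fps_shortfall": 5.0,    # first set threshold (frames per second)
    "stutter_rate": 0.05,    # second set threshold
    "resource_usage": 0.90,  # third set threshold
    "mic_fps_gap": 3.0,      # fourth set threshold (frames per second)
}


def should_downgrade(indexes: Mapping[str, float], target_fps: float,
                     thresholds: Mapping[str, float] = DOWNGRADE_THRESHOLDS) -> bool:
    """Return True if any one of the four conditions holds."""
    return any((
        target_fps - indexes["play_fps"] > thresholds["fps_shortfall"],
        indexes["stutter_rate"] > thresholds["stutter_rate"],
        indexes["resource_usage"] > thresholds["resource_usage"],
        indexes["mic_fps_gap"] > thresholds["mic_fps_gap"],
    ))
```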

Optionally, the speech processing configuration adjustment module 402 is further configured to: if the artificial intelligence (AI) noise reduction function of the voice is enabled, disable the AI noise reduction function; and/or if the complexity of the audio codec of the voice is high, adjust the complexity of the audio codec to low; and/or if the echo cancellation function of the voice is enabled, disable the echo cancellation function. The AI noise reduction function performs noise reduction on the collected voice; the echo cancellation function performs echo cancellation on the collected voice.
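The three reductions can be sketched in Python as below. `VoiceConfig` and `downgrade` are hypothetical names, and modelling the codec as a single high/low flag is an assumption; a real codec may expose a numeric complexity setting instead.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VoiceConfig:
    """Hypothetical container for the three settings named in the text."""
    ai_noise_reduction: bool = True
    codec_high_complexity: bool = True
    echo_cancellation: bool = True


def downgrade(config: VoiceConfig, *, ai_nr: bool = True,
              codec: bool = True, aec: bool = True) -> VoiceConfig:
    """Return a copy with the selected features reduced.

    The keyword flags model the "and/or" wording: any subset may be applied.
    """
    changes = {}
    if ai_nr and config.ai_noise_reduction:
        changes["ai_noise_reduction"] = False      # turn AI noise reduction off
    if codec and config.codec_high_complexity:
        changes["codec_high_complexity"] = False   # switch codec to low complexity
    if aec and config.echo_cancellation:
        changes["echo_cancellation"] = False       # turn echo cancellation off
    return replace(config, **changes)
```

Returning a new immutable configuration rather than mutating the old one makes it easy to compare the settings before and after an adjustment.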

Optionally, the speech processing configuration adjustment module 402 is further configured to adjust the voice processing configuration if the index meets all of the following conditions simultaneously: the difference between the playing frame rate of the multimedia and the target frame rate is less than a fifth set threshold; the stutter rate during playback of the multimedia is less than a sixth set threshold; the system resource occupancy rate of the terminal device is lower than a seventh set threshold; and the difference between the mic-on frame rate and the mic-off frame rate during playback of the multimedia is less than an eighth set threshold.
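In contrast to the "any one of" test, this condition requires every index to be below its threshold. A minimal Python sketch, assuming hypothetical threshold values and key names and assuming the frame rate shortfall has already been computed as an index:

```python
from typing import Mapping, Sequence, Tuple

# (index name, hypothetical threshold): fifth to eighth set thresholds, in order.
UPGRADE_LIMITS: Sequence[Tuple[str, float]] = (
    ("fps_shortfall", 2.0),
    ("stutter_rate", 0.01),
    ("resource_usage", 0.60),
    ("mic_fps_gap", 1.0),
)


def should_upgrade(indexes: Mapping[str, float],
                   limits: Sequence[Tuple[str, float]] = UPGRADE_LIMITS) -> bool:
    """Return True only if every index is below its threshold."""
    return all(indexes[name] < limit for name, limit in limits)
```

The source does not relate the fifth to eighth thresholds to the first to fourth. If they are chosen stricter, as in the hypothetical values here, there is a band of index values in which neither adjustment fires, which may help avoid oscillating between configurations.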

Optionally, the speech processing configuration adjustment module 402 is further configured to: if the echo cancellation is disabled, enable the echo cancellation; if the echo cancellation is already enabled, judge whether the audio codec is at low complexity; if the audio codec is at low complexity, adjust the audio codec to high complexity; if the audio codec is already at high complexity, judge whether the AI noise reduction is disabled; and if the AI noise reduction is disabled, enable the AI noise reduction.
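The ordered checks above restore one feature per pass: echo cancellation first, then codec complexity, then AI noise reduction. A minimal Python sketch, in which the function name `upgrade_one_step` and the dictionary keys are hypothetical:

```python
from typing import Dict, Union

Config = Dict[str, Union[bool, str]]


def upgrade_one_step(config: Config) -> Config:
    """Return a copy of the configuration with at most one feature restored."""
    new = dict(config)
    if not new["echo_cancellation"]:
        new["echo_cancellation"] = True      # first priority: echo cancellation
    elif new["codec_complexity"] == "low":
        new["codec_complexity"] = "high"     # second: codec complexity
    elif not new["ai_noise_reduction"]:
        new["ai_noise_reduction"] = True     # last: AI noise reduction
    return new
```

Restoring a single feature per pass lets the indexes be measured again before the next, more expensive feature is re-enabled.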

Optionally, the above device further includes a second voice processing module, where the second voice processing module is specifically configured to: acquiring the index again after a set time length from the beginning of processing the collected voice based on the adjusted voice processing configuration; and adjusting the voice processing configuration according to the re-acquired index.
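The waiting period before the index is acquired again can be sketched with a small timer in Python. `ReacquireTimer` is a hypothetical name, timestamps are passed in as seconds so that the sketch does not depend on a real clock, and the default of 5 seconds follows the example value given earlier for the set duration.

```python
from typing import Optional


class ReacquireTimer:
    """Tracks whether the set duration has elapsed since the last adjustment."""

    def __init__(self, set_duration_s: float = 5.0) -> None:
        self.set_duration_s = set_duration_s
        self._started_at: Optional[float] = None

    def mark_adjusted(self, now_s: float) -> None:
        """Record the moment processing starts under the adjusted configuration."""
        self._started_at = now_s

    def due(self, now_s: float) -> bool:
        """Return True once the set duration has elapsed since the adjustment."""
        return (self._started_at is not None
                and now_s - self._started_at >= self.set_duration_s)
```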

The voice processing device provided by the embodiment of the disclosure can execute the voice processing method provided by any embodiment of the disclosure, and has the corresponding functional modules and beneficial effects of the execution method.

It should be noted that each unit and module included in the above apparatus are only divided according to the functional logic, but not limited to the above division, so long as the corresponding functions can be implemented; in addition, the specific names of the functional units are also only for convenience of distinguishing from each other, and are not used to limit the protection scope of the embodiments of the present disclosure.

Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure. Referring now to fig. 5, a schematic diagram of an electronic device (e.g., a terminal device or server in fig. 5) 500 suitable for use in implementing embodiments of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 5 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.

As shown in fig. 5, the electronic device 500 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 501, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage means 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the electronic apparatus 500 are also stored. The processing device 501, the ROM 502, and the RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.

In general, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 507 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 508 including, for example, magnetic tape, hard disk, etc.; and communication means 509. The communication means 509 may allow the electronic device 500 to communicate with other devices wirelessly or by wire to exchange data. While fig. 5 shows an electronic device 500 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.

In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 509, or from the storage means 508, or from the ROM 502. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 501.

The electronic device provided by the embodiment of the present disclosure and the voice processing method provided by the foregoing embodiment belong to the same inventive concept, and technical details not described in detail in the present embodiment may be referred to the foregoing embodiment, and the present embodiment has the same beneficial effects as the foregoing embodiment.

The embodiment of the present disclosure provides a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the speech processing method provided by the above embodiment.

It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.

In some embodiments, clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.

The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.

The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire an index associated with multimedia, the multimedia comprising speech; adjust, according to the index, the current voice processing configuration associated with the voice, wherein the voice processing configuration indicates the processing configuration used when the terminal equipment processes the voice; and process the voice based on the adjusted voice processing configuration.

Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including, but not limited to, object oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. The name of the unit does not in any way constitute a limitation of the unit itself, for example the first acquisition unit may also be described as "unit acquiring at least two internet protocol addresses".

The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

According to one or more embodiments of the present disclosure, the embodiments of the present disclosure disclose a voice processing method, including: acquiring an index associated with multimedia, the multimedia comprising speech; according to the index, adjusting the current voice processing configuration associated with the voice, wherein the voice processing configuration indicates the processing configuration when the terminal equipment processes the voice; and processing the voice based on the adjusted voice processing configuration.

Further, the index comprises at least one of a playing frame rate of the multimedia, a stutter rate during playback of the multimedia, a system resource occupancy rate of the terminal equipment, and a difference between a mic-on frame rate and a mic-off frame rate during playback of the multimedia; the mic-on frame rate is the playing frame rate of the multimedia after the terminal equipment turns the microphone on; the mic-off frame rate is the playing frame rate of the multimedia after the terminal equipment turns the microphone off.

Further, adjusting the current speech processing configuration according to the index includes: if the index meets any one of the following conditions, adjusting the voice processing configuration: the difference between the playing frame rate of the multimedia and the target frame rate is greater than a first set threshold; the stutter rate during playback of the multimedia is greater than a second set threshold; the system resource occupancy rate of the terminal equipment exceeds a third set threshold; the difference between the mic-on frame rate and the mic-off frame rate during playback of the multimedia is greater than a fourth set threshold.

Further, adjusting the speech processing configuration includes: if the artificial intelligence (AI) noise reduction function of the voice is enabled, disabling the AI noise reduction function; and/or if the complexity of the audio codec of the voice is high, adjusting the complexity of the audio codec to low; and/or if the echo cancellation function of the voice is enabled, disabling the echo cancellation function. The AI noise reduction function performs noise reduction on the collected voice; the echo cancellation function performs echo cancellation on the collected voice.

Further, adjusting the current speech processing configuration according to the index includes: if the index meets all of the following conditions simultaneously, adjusting the voice processing configuration: the difference between the playing frame rate of the multimedia and the target frame rate is less than a fifth set threshold; the stutter rate during playback of the multimedia is less than a sixth set threshold; the system resource occupancy rate of the terminal equipment is lower than a seventh set threshold; and the difference between the mic-on frame rate and the mic-off frame rate during playback of the multimedia is less than an eighth set threshold.

Further, adjusting the speech processing configuration includes: if the echo cancellation function of the voice is closed, starting the echo cancellation; and/or if the complexity of the audio codec of the speech is low, adjusting the complexity of the audio codec to high complexity; and/or if the AI noise reduction function of the voice is closed, starting the AI noise reduction.

Further, after processing the speech based on the adjusted speech processing configuration, the method further comprises: acquiring the index again after a set time length from the beginning of processing the collected voice based on the adjusted voice processing configuration; and adjusting the voice processing configuration according to the re-acquired index.

The foregoing description is only of the preferred embodiments of the present disclosure and of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the disclosure referred to herein is not limited to the specific combinations of features described above, but also covers other embodiments which may be formed by any combination of the features described above or their equivalents without departing from the spirit of the disclosure, for example, embodiments formed by mutually substituting the features described above with technical features having similar functions disclosed in (but not limited to) the present disclosure.

Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims

1. A method of speech processing, comprising:

acquiring an index associated with multimedia, the multimedia comprising speech;

adjusting, according to the index, the current voice processing configuration associated with the voice, wherein the voice processing configuration indicates the processing configuration when the terminal equipment processes the voice;

and processing the voice based on the adjusted voice processing configuration.

2. The method of claim 1, wherein the index comprises at least one of a playing frame rate of the multimedia, a stutter rate during playback of the multimedia, a system resource occupancy rate of the terminal equipment, and a difference between a mic-on frame rate and a mic-off frame rate during playback of the multimedia; the mic-on frame rate is the playing frame rate of the multimedia after the terminal equipment turns the microphone on; the mic-off frame rate is the playing frame rate of the multimedia after the terminal equipment turns the microphone off.

3. The method according to claim 1 or 2, wherein adjusting the current speech processing configuration according to the indicator comprises:

if the index meets any one of the following conditions, adjusting the voice processing configuration: the difference between the playing frame rate of the multimedia and the target frame rate is greater than a first set threshold; the stutter rate during playback of the multimedia is greater than a second set threshold; the system resource occupancy rate of the terminal equipment exceeds a third set threshold; the difference between the mic-on frame rate and the mic-off frame rate during playback of the multimedia is greater than a fourth set threshold.

4. A method according to claim 3, wherein adjusting the speech processing configuration comprises:

if the artificial intelligence (AI) noise reduction function of the voice is enabled, disabling the AI noise reduction function; and/or

if the complexity of the audio codec of the voice is high, adjusting the complexity of the audio codec to low; and/or

if the echo cancellation function of the voice is enabled, disabling the echo cancellation function; wherein the AI noise reduction function performs noise reduction on the collected voice, and the echo cancellation function performs echo cancellation on the collected voice.

5. The method of claim 2, wherein adjusting the current speech processing configuration according to the indicator comprises:

if the index meets all of the following conditions simultaneously, adjusting the voice processing configuration: the difference between the playing frame rate of the multimedia and the target frame rate is less than a fifth set threshold; the stutter rate during playback of the multimedia is less than a sixth set threshold; the system resource occupancy rate of the terminal equipment is lower than a seventh set threshold; and the difference between the mic-on frame rate and the mic-off frame rate during playback of the multimedia is less than an eighth set threshold.

6. The method of claim 5, wherein adjusting the speech processing configuration comprises:

if the echo cancellation function of the voice is disabled, enabling the echo cancellation function; and/or

if the complexity of the audio codec of the voice is low, adjusting the complexity of the audio codec to high; and/or

if the AI noise reduction function of the voice is disabled, enabling the AI noise reduction function.

7. The method of claim 1, further comprising, after processing the speech based on the adjusted speech processing configuration:

acquiring the index again after a set duration has elapsed from the start of processing the collected voice based on the adjusted voice processing configuration;

and adjusting the voice processing configuration according to the re-acquired index.

8. A speech processing apparatus, comprising:

an index acquisition module, configured to acquire an index associated with multimedia, wherein the multimedia comprises voice;

a voice processing configuration adjustment module, configured to adjust a current voice processing configuration associated with the voice according to the index, wherein the voice processing configuration indicates a processing configuration when the terminal device processes the voice;

and a first voice processing module, configured to process the voice based on the adjusted voice processing configuration.

9. An electronic device, the electronic device comprising:

one or more processors;

a storage means for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the speech processing method of any of claims 1-7.

10. A storage medium containing computer executable instructions for performing the speech processing method of any of claims 1-7 when executed by a computer processor.