Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only a part of the embodiments of the present application, not all of them. The components of the embodiments of the present application, as generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
As shown in FIG. 1, the present embodiment provides a video communication system 10. The video communication system 10 may include a first video communication terminal 20, a second video communication terminal 30, and a video communication server 40.
In detail, the video communication server 40 may be communicatively connected with the first video communication terminal 20 and the second video communication terminal 30. It is configured to process the audio and video data sent by the first video communication terminal 20 and send the processed data to the second video communication terminal 30, and to process the audio and video data sent by the second video communication terminal 30 and send the processed data to the first video communication terminal 20. In this way, audio and video data can be exchanged between the first video communication terminal 20 and the second video communication terminal 30, realizing video communication between the first user corresponding to the first video communication terminal 20 and the second user corresponding to the second video communication terminal 30.
The first video communication terminal 20 and the second video communication terminal 30 may be mobile devices such as mobile phones.
With reference to FIG. 2, an embodiment of the present application further provides an audio processing method in video communication, which can be applied to the video communication server 40. The method steps defined by the flow of the audio processing method can be implemented by the video communication server 40. The specific process shown in FIG. 2 is described in detail below.
In step S110, a first to-be-processed audio data packet sent by the first video communication terminal 20 at a first time is obtained.
In this embodiment, the video communication server 40 may first obtain the first to-be-processed audio data packet transmitted by the first video communication terminal 20 at the first time.
The first to-be-processed audio data packet may include first to-be-processed voice information and first timestamp information corresponding to the first to-be-processed voice information (e.g., the first timestamp information may be a formation time of the first to-be-processed voice information).
Step S120, obtaining the audio attribute information of at least one second processed audio data packet sent by the second video communication terminal 30 before the first time.
In this embodiment, after obtaining the first to-be-processed audio data packet based on step S110, in order to be able to adaptively process the first to-be-processed audio data packet, the video communication server 40 may further obtain audio attribute information of at least one second processed audio data packet.
The at least one second processed audio data packet may be obtained by processing a second to-be-processed audio data packet sent by a second video communication terminal 30 (a terminal performing video communication with the first video communication terminal 20) before the first time.
Step S130, determining, based on the audio attribute information, a target audio processing mode from a plurality of audio processing modes included in a preset audio processing mode set.
In this embodiment, after obtaining the audio attribute information based on step S120, the video communication server 40 may determine, based on the audio attribute information, a target audio processing mode from the plurality of audio processing modes included in the preset audio processing mode set.
Step S140, processing the first to-be-processed voice information in the first to-be-processed audio data packet based on the target audio processing mode, so as to obtain first processed voice information.
In this embodiment, after determining the target audio processing mode based on step S130, the video communication server 40 may process the first to-be-processed voice information in the first to-be-processed audio data packet based on the target audio processing mode to obtain the first processed voice information.
Step S150, obtaining a first processed audio data packet corresponding to the first to-be-processed audio data packet based on the first processed voice information and the first timestamp information.
In this embodiment, after obtaining the first processed voice information based on step S140, the video communication server 40 may obtain a first processed audio data packet corresponding to the first to-be-processed audio data packet based on the first processed voice information and the first timestamp information (e.g., packetizing, encoding and compressing, etc.).
Step S160, sending the first processed audio data packet to the second video communication terminal 30.
In this embodiment, after obtaining the first processed audio packet based on step S150, the video communication server 40 may send the first processed audio packet to the second video communication terminal 30. In this way, the second video communication terminal 30 can perform synchronous playing processing on the first processed voice information in the first processed audio data packet and the video information in the video data packet based on the first time stamp information in the first processed audio data packet and the second time stamp information in the acquired video data packet.
Based on this, video communication between the first user of the first video communication terminal 20 and the second user of the second video communication terminal 30 can be achieved.
The processing mode for the first to-be-processed audio data packet of the first video communication terminal 20 is determined based on the audio attribute information of the second processed audio data packets of the second video communication terminal 30. In this way, the target audio processing mode can be selected from the plurality of audio processing modes based on the attribute information of already-processed audio data, and the to-be-processed audio data is then processed accordingly. This ensures that the processing of the audio data adapts well to the current video communication, improving on the unreasonable processing of audio data in existing video communication.
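For illustration only, the flow of steps S110 to S160 on the video communication server 40 can be sketched as follows. All identifiers, the concrete processing modes, and the mode-selection rule are assumptions made for the sketch, not part of the disclosure:

```python
# Hypothetical sketch of steps S110-S160 on the video communication server.
# The AudioPacket fields, the mode table, and choose_mode are illustrative.
from dataclasses import dataclass

@dataclass
class AudioPacket:
    voice: str          # to-be-processed voice information
    timestamp: float    # timestamp information (formation time of the voice)

# Preset set of audio processing modes (used in S130); the strategies here
# are placeholders for real audio processing.
PROCESSING_MODES = {
    "boost": lambda v: v.upper(),   # stand-in for e.g. raising speech energy
    "default": lambda v: v,         # pass through unchanged
}

def choose_mode(peer_attribute_info: list) -> str:
    """S130: pick a target mode from the preset set based on the audio
    attribute information of the peer terminal's processed packets."""
    return "boost" if "low_energy" in peer_attribute_info else "default"

def handle_first_packet(packet: AudioPacket, peer_attribute_info: list) -> AudioPacket:
    """S110-S150: adaptively process the first terminal's packet."""
    mode = choose_mode(peer_attribute_info)                 # S130
    processed_voice = PROCESSING_MODES[mode](packet.voice)  # S140
    # S150: repacketize with the original timestamp so the second terminal
    # can synchronize audio and video playback (S160).
    return AudioPacket(processed_voice, packet.timestamp)
```

The essential point the sketch shows is that the timestamp is carried through unchanged while only the voice information is transformed.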
In the above steps, it should be noted that, in step S120, a specific manner of obtaining the audio attribute information is not limited, and may be selected according to actual application requirements.
For example, in an alternative example, the second video communication terminal 30 may generate audio attribute information of the second to-be-processed audio packet in response to an operation of a second user when obtaining the second to-be-processed audio packet, and send the audio attribute information and the second to-be-processed audio packet to the video communication server 40.
In this way, the video communication server 40 may use the audio attribute information received together with the second to-be-processed audio packet as the audio attribute information of the second processed audio packet corresponding to the second to-be-processed audio packet.
For another example, in another alternative example, step S120 may include sub-steps 11-15.
Substep 11, obtaining at least one second processed audio data packet obtained by processing based on at least one second to-be-processed audio data packet sent by the second video communication terminal 30 before the first time.
In this embodiment, at least one second to-be-processed audio data packet that has been sent by the second video communication terminal 30 before the first time (i.e., before the first to-be-processed audio data packet is sent by the first video communication terminal 20) may be determined, and then at least one second processed audio data packet obtained by processing the at least one second to-be-processed audio data packet may be obtained.
Substep 12, performing traversal processing on the target database based on the at least one second processed audio data packet to obtain a first traversal result corresponding to the at least one second processed audio data packet.
In this embodiment, after obtaining the at least one second processed audio data packet based on substep 11, a traversal may be performed in a target database (e.g., a database server communicatively coupled to the video communication server 40) based on the at least one second processed audio data packet to obtain a corresponding first traversal result.
The first traversal result carries target first audio attribute information. Each second processed audio data packet has its own first traversal result, so at least one first traversal result can be obtained.
Substep 13, performing association search processing in the audio attribute association relationship of the local database based on the target first audio attribute information in the first traversal result, so as to obtain target second audio attribute information corresponding to the target first audio attribute information.
In this embodiment, after obtaining the first traversal result based on substep 12, a correlation search may be performed based on the target first audio attribute information in the first traversal result in the audio attribute correlation included in the local database of the video communication server 40, so as to obtain target second audio attribute information corresponding to the target first audio attribute information.
The audio attribute association relationship includes a plurality of information association sub-relationships, and each information association sub-relationship includes a first audio attribute sub-information set and the corresponding second audio attribute information. In this way, a target first audio attribute sub-information set can be matched based on the target first audio attribute information, and the target second audio attribute information corresponding to that target first audio attribute sub-information set can then be obtained.
Substep 14, updating the target first audio attribute information in the first traversal result based on the target second audio attribute information, so as to obtain a second traversal result.
In this embodiment, after obtaining the target second audio attribute information based on substep 13, the target first audio attribute information in the first traversal result may be updated according to the target second audio attribute information to obtain a second traversal result. For example, the entire first traversal result may be directly replaced by the target second audio attribute information; alternatively, only the target first audio attribute information in the first traversal result may be replaced with the target second audio attribute information (in the latter case, the first traversal result may also carry other information, such as identification information of the second processed audio data packet).
Substep 15, using the second traversal result as the audio attribute information of the at least one second processed audio data packet.
In this embodiment, after obtaining the second traversal result (of which there is at least one) based on sub-step 14, the second traversal result may be used as the audio attribute information of the at least one second processed audio data packet.
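Sub-steps 11 to 15 can be sketched, purely for illustration, with the two databases modeled as in-memory structures. The packet identifiers, the contents of the target database, and the shape of the association relationship below are all assumptions:

```python
# Hypothetical sketch of sub-steps 11-15. The databases are plain Python
# structures; all field names and values are illustrative.

# Sub-step 12: "target database" mapping a second processed packet id to
# target first audio attribute information.
TARGET_DB = {"pkt-1": "attr-A", "pkt-2": "attr-B"}

# Sub-step 13: local audio attribute association relationship; each
# information association sub-relationship pairs a first audio attribute
# sub-information set with second audio attribute information.
LOCAL_ASSOCIATION = [
    ({"attr-A"}, "level-high"),
    ({"attr-B"}, "level-low"),
]

def audio_attribute_info(second_processed_ids):
    results = []
    for pkt_id in second_processed_ids:
        first_attr = TARGET_DB[pkt_id]              # sub-step 12: traversal
        for sub_info_set, second_attr in LOCAL_ASSOCIATION:
            if first_attr in sub_info_set:          # sub-step 13: assoc. search
                # sub-step 14: replace first attribute info with second
                results.append({"packet": pkt_id, "attribute": second_attr})
                break
    return results                                  # sub-step 15
```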
Optionally, the specific manner of obtaining the first traversal result in sub-step 12 is not limited, for example, in one example, sub-step 12 may include:
first, a plurality of sequentially decreasing sample values may be determined;
secondly, target second processed data packets may be obtained from the plurality of second processed audio data packets, front to back in time order, by sequentially applying each sampling value as an interval (for example, with sampling values 3, 2, and 1: first a target packet is taken after skipping 3 second processed audio data packets, which is performed 3 times; then a target packet is taken after skipping 2 packets, performed 2 times; then a target packet is taken after skipping 1 packet, performed 1 time; finally, any remaining second processed data packets are all used as target second processed data packets);
then, traversal processing may be performed on the target database based on the target second processed audio data packet, so as to obtain a first traversal result corresponding to the at least one second processed audio data packet.
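One possible reading of the sampling example above can be sketched as follows. The interpretation (for each sampling value k, repeat k times "skip k packets, take the next one") is an assumption drawn from the worked example in the text:

```python
def sample_targets(packets, sample_values=(3, 2, 1)):
    """Hypothetical decreasing-interval sampling of second processed packets,
    front to back in time order: for each sampling value k, repeat k times
    'skip k packets, take the next one as a target'; any packets left over
    afterwards all become targets."""
    targets, i = [], 0
    for k in sample_values:
        for _ in range(k):
            i += k                      # skip k packets
            if i >= len(packets):
                return targets          # ran out of packets early
            targets.append(packets[i])  # take the next one as a target
            i += 1
    targets.extend(packets[i:])         # remaining packets all become targets
    return targets
```

With 22 packets this selects sparse samples from the older packets and every one of the most recent packets, which matches the intuition that recent packets matter most.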
Further, considering that the association search in the above sub-step 13 needs to be performed based on the audio attribute association relationship, the audio processing method in video communication further includes a step of generating the audio attribute association relationship. In an alternative example, this step may comprise sub-steps 21-31.
Substep 21, obtaining at least one audio attribute correspondence.
In this embodiment, at least one audio attribute correspondence may be obtained first (e.g., generated in response to a user operation, or received from another device with which a communication connection is established).
Each of the audio attribute correspondence relationships may include first audio attribute information and corresponding second audio attribute information. In this way, at least one pair of first audio attribute information and second audio attribute information may be obtained.
Substep 22, performing format check processing on the information content included in each audio attribute correspondence, so as to determine, based on the result of the format check processing, whether the format of the information content included in the audio attribute correspondence is standard.
In this embodiment, after obtaining the at least one audio attribute correspondence based on substep 21, for each audio attribute correspondence, the information content included in that correspondence may be obtained first, and format check processing may then be performed on the information content, so as to determine whether its format is standard based on the result of the format check processing. For example, it may be checked whether the audio attribute correspondence includes two dimensions of information, and whether the two dimensions belong to different audio attribute information, namely the first audio attribute information and the second audio attribute information.
Substep 23, for each audio attribute correspondence, if the format of its information content is standard, comparing the audio attribute correspondence with those in the historical audio attribute correspondence set, so as to determine, based on the comparison result, whether it belongs to a repeated audio attribute correspondence.
In this embodiment, after determining based on sub-step 22 that the format of the information content of an obtained audio attribute correspondence is standard, the audio attribute correspondence may be compared with a historical audio attribute correspondence set (a set of historically obtained or generated audio attribute correspondences), so as to determine from the comparison whether it is a repeated audio attribute correspondence (if the historical set contains no correspondence identical to it, it is not a repeated audio attribute correspondence).
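Sub-steps 22 and 23 reduce to a structural check plus a set-membership test. A minimal sketch, assuming a correspondence is represented as a dictionary with the two dimensions named `first_attr` and `second_attr` (names invented here):

```python
def format_is_standard(corr):
    """Sub-step 22 sketch: the format is standard when the correspondence
    carries exactly the two dimensions of information named in the text,
    i.e. first and second audio attribute information."""
    return set(corr) == {"first_attr", "second_attr"}

def is_duplicate(corr, history):
    """Sub-step 23 sketch: a correspondence is repeated iff an identical one
    already exists in the historical audio attribute correspondence set."""
    return corr in history
```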
Substep 24, for each audio attribute correspondence, if it is not a repeated audio attribute correspondence with respect to the historical audio attribute correspondence set, extracting at least part of the audio attribute correspondences from the historical set based on a target number determined by the time information at which the audio attribute correspondence was obtained.
In this embodiment, after determining based on sub-step 23 whether the audio attribute correspondence is a repeated one, if it is not, the time information at which the audio attribute correspondence was obtained may be determined first, and a target number may then be determined based on that time information, so that at least part of the audio attribute correspondences may be extracted from the historical audio attribute correspondence set based on the target number (for example, exactly the target number of correspondences is extracted, or no fewer than the target number).
Wherein the later the time information is, the larger the target number may be.
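The only property the text fixes is monotonicity: a later acquisition time may yield a larger target number. One hypothetical rule satisfying this (the base value and growth rate are invented):

```python
def target_number(acquired_at, base=5, per_hour=2, start=0.0):
    """Hypothetical monotone rule for sub-step 24: the later the time
    information (seconds since `start`), the larger the target number of
    historical correspondences to extract."""
    hours_elapsed = max(0.0, (acquired_at - start) / 3600.0)
    return base + int(per_hour * hours_elapsed)
```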
Substep 25, for each audio attribute correspondence that is not a repeated one, performing first verification processing on the first audio attribute information in that correspondence against each part of the information content included in the first audio attribute information of the at least part of the extracted audio attribute correspondences.
In this embodiment, after obtaining the at least part of the audio attribute correspondences based on substep 24, for each audio attribute correspondence that is not a repeated one, its first audio attribute information may be obtained first, and first verification processing may then be performed on that first audio attribute information based on each part of the information content included in the first audio attribute information of the extracted correspondences (for example, verifying whether the pieces of first audio attribute information being compared are the same).
Substep 26, for each audio attribute correspondence that is not a repeated one, if the verification result of its first audio attribute information against each part of the information content included in the first audio attribute information of the extracted correspondences meets a preset condition, determining that the audio attribute correspondence passes the first verification.
In this embodiment, after performing the first verification processing based on sub-step 25 on the first audio attribute information of each audio attribute correspondence that is not a repeated one, the verification result corresponding to the first audio attribute information may be compared with a preset condition; when the verification result satisfies the preset condition (for example, the compared pieces of first audio attribute information are completely different or at least partially different), it may be determined that the audio attribute correspondence passes the first verification.
Substep 27, for each audio attribute correspondence that is not a repeated one, if it passes the first check, performing decomposition processing on its first audio attribute information to obtain a first audio attribute sub-information set.
In this embodiment, after determining based on sub-step 26 that the audio attribute correspondence passes the first check, the first audio attribute information may be decomposed into a plurality of pieces of first audio attribute sub-information, so as to form a first audio attribute sub-information set (the set includes at least a part of the plurality of pieces of first audio attribute sub-information).
Substep 28, for each audio attribute correspondence that is not a repeated one, generating an information association sub-relationship based on the first audio attribute sub-information set corresponding to that correspondence and the corresponding second audio attribute information.
In this embodiment, after obtaining the first audio attribute sub-information set based on substep 27, an information association sub-relationship may be generated based on the first audio attribute sub-information set and second audio attribute information corresponding to first audio attribute information corresponding to the first audio attribute sub-information set.
That is, one information association sub-relationship includes the corresponding first audio attribute sub-information set and second audio attribute information.
Substep 29, performing association relationship verification processing on each information association sub-relationship.
In this embodiment, after the information association sub-relationships are generated based on sub-step 28, an association relationship verification process may be performed on each of the information association sub-relationships.
Substep 30, for each information association sub-relationship, obtaining the association relationship verification result of that sub-relationship, and determining whether it passes the association relationship verification based on that result.
In this embodiment, after the association relationship verification processing is performed on the information association sub-relationship based on sub-step 29, the association relationship verification result of that processing may be obtained first, and it may then be determined based on that result whether the corresponding information association sub-relationship passes the association relationship verification.
Substep 31, for each information association sub-relationship, if it passes the association relationship verification, obtaining the audio attribute association relationship based on the information association sub-relationship.
In this embodiment, after the sub-step 30 is performed to determine whether the information association sub-relationship passes the association relationship verification, if the information association sub-relationship passes the association relationship verification, the audio attribute association relationship may be obtained based on the information association sub-relationship.
That is, each information association sub-relationship verified by the association relationship may be made a part of the audio attribute association relationship.
In the above example, the specific way of performing the association relation verification processing based on the sub-step 29 is not limited, and may be selected according to the actual application requirements.
For example, in one alternative example, sub-step 29 may include:
first, for each information association sub-relationship, at least part of the audio attribute correspondences are extracted from the historical audio attribute correspondence set to obtain a first audio attribute correspondence group corresponding to that sub-relationship (in this way, for a plurality of information association sub-relationships, a plurality of first audio attribute correspondence groups can be obtained), where each first audio attribute correspondence group includes a plurality of audio attribute correspondences;
second, for each information association sub-relationship, first association verification processing is performed on it based on each audio attribute correspondence in the first audio attribute correspondence group corresponding to that sub-relationship, to obtain a first association accuracy of the information association sub-relationship;
third, for each information association sub-relationship, it is determined whether the first association accuracy of the sub-relationship is greater than a first preset association accuracy (for example, if the first audio attribute correspondence group contains an audio attribute correspondence whose first audio attribute information includes all of the first audio attribute sub-information in the first audio attribute sub-information set of the information association sub-relationship, the first association accuracy is considered greater than the first preset association accuracy);
fourth, for each information association sub-relationship, if its first association accuracy is greater than the first preset association accuracy, at least part of the audio attribute correspondences whose acquisition time information differs from the time information of the audio attribute correspondence underlying the information association sub-relationship (in one example, by more than a preset duration) are extracted from the historical audio attribute correspondence set, so as to obtain a second audio attribute correspondence group corresponding to each information association sub-relationship (in this way, for a plurality of information association sub-relationships, a plurality of second audio attribute correspondence groups can be obtained), where each second audio attribute correspondence group includes a plurality of audio attribute correspondences (in one example, the number of audio attribute correspondences in the first audio attribute correspondence group of an information association sub-relationship is less than the number in its corresponding second audio attribute correspondence group);
fifth, for each information association sub-relationship, second association verification processing is performed on it based on each audio attribute correspondence in the second audio attribute correspondence group corresponding to that sub-relationship, to obtain a second association accuracy of the information association sub-relationship;
sixth, for each information association sub-relationship, it is determined whether the second association accuracy of the sub-relationship is greater than a second preset association accuracy (for example, if the second audio attribute correspondence group contains an audio attribute correspondence whose first audio attribute information includes all of the first audio attribute sub-information in the first audio attribute sub-information set of the information association sub-relationship, the second association accuracy is considered greater than the second preset association accuracy);
and seventh, for each information association sub-relationship, if its second association accuracy is greater than the second preset association accuracy, it is determined that the information association sub-relationship passes the verification.
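The seven-step verification above amounts to two containment checks against two differently-drawn groups of historical correspondences. A minimal sketch, assuming a correspondence is modeled as a `(first_attr_set, acquired_at)` pair and a sub-relationship by its first audio attribute sub-information set (all representations invented here):

```python
# Hypothetical sketch of the two-stage association verification.

def stage_passes(sub_info_set, group):
    """A stage passes when some correspondence in the group has first audio
    attribute information containing all sub-information in the set."""
    return any(sub_info_set <= first_attrs for first_attrs, _ in group)

def verify_sub_relation(sub_info_set, history, anchor_time, min_gap=3600):
    # first step: a partial extract forms the first correspondence group
    first_group = history[:4]
    if not stage_passes(sub_info_set, first_group):   # second/third steps
        return False
    # fourth step: second, larger group restricted to correspondences whose
    # acquisition time differs from the anchor by more than a preset duration
    second_group = [c for c in history if abs(c[1] - anchor_time) > min_gap]
    return stage_passes(sub_info_set, second_group)   # fifth-seventh steps
```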
In the above example, the specific way of performing the decomposition processing on the first audio attribute information based on the sub-step 27 is not limited, and may be selected according to the actual application requirement.
For example, in one alternative example, sub-step 27 may include:
for each audio attribute correspondence that is not a repeated one, if it passes the first verification, the first audio attribute information in that correspondence is obtained, where the first audio attribute information includes a plurality of types of audio attribute sub-information, including at least a voice energy range value, a speech speed range value, a speech duration range value, and target keyword coverage information (i.e., whether target keywords such as "speak faster" or "speak slower" are covered);
and acquiring partial audio attribute sub-information included in the first audio attribute information in the audio attribute corresponding relation aiming at each audio attribute corresponding relation which does not belong to the repeated audio attribute corresponding relation, and forming a first audio attribute sub-information set based on the partial audio attribute sub-information.
In the above example, the specific manner of obtaining the at least one audio attribute corresponding relationship based on the sub-step 21 is not limited, and may be selected according to the actual application requirement.
For example, in an alternative example, the substep 21 may comprise:
generating at least one first information group in response to at least one first operation by a target user, where each first information group includes multiple types of audio attribute sub-information;
generating at least one second information group in response to at least one second operation by the target user, where each second information group includes a plurality of pieces of audio attribute hierarchy information;
and in response to at least one third operation by the target user, performing one-to-one correspondence processing on the at least one first information group and the at least one second information group to obtain at least one audio attribute correspondence, where the plurality of pieces of audio attribute hierarchy information in each audio attribute correspondence are used to calculate the audio attribute information that determines a target audio processing mode.
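The one-to-one correspondence processing of the third operation can be sketched as a simple index-wise pairing. The function name and dictionary keys below are illustrative assumptions:

```python
def pair_groups(first_groups, second_groups):
    """Sketch of sub-step 21's third operation: pair each first information
    group (audio attribute sub-information) with the second information group
    (hierarchy information) at the same index, one to one."""
    if len(first_groups) != len(second_groups):
        raise ValueError("one-to-one correspondence requires equal group counts")
    return [{"sub_info": f, "levels": s}
            for f, s in zip(first_groups, second_groups)]
```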
In detail, in a specific application example, a first information group includes a speech energy range value A1, a speech speed range value B1, and a speech duration range value C1, and a second information group includes first audio attribute hierarchy information A2 (the hierarchy information corresponding to A1), second audio attribute hierarchy information B2 (the hierarchy information corresponding to B1), and third audio attribute hierarchy information C2 (the hierarchy information corresponding to C1). The higher the energy value, the faster the speech speed, and the longer the duration, the higher the corresponding hierarchy.
Then, according to actual requirements, a weighting coefficient may be assigned to the audio attribute hierarchy information corresponding to each type of audio attribute sub-information. For example, the weighting coefficient of the hierarchy information corresponding to the target keyword may be greater than that corresponding to the speech speed, which in turn may be greater than that corresponding to the speech duration, which may be greater than that corresponding to the speech energy. In this way, the audio attribute information used for determining the target audio processing mode can be calculated in a weighted manner.
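The weighted calculation can be sketched as follows. The concrete coefficient values are assumptions chosen only to honor the ordering stated in the text (keyword > speed > duration > energy):

```python
# Illustrative weighting coefficients; only their ordering comes from the text.
WEIGHTS = {"keyword": 0.4, "speed": 0.3, "duration": 0.2, "energy": 0.1}

def attribute_score(levels):
    """Weighted combination of the per-attribute hierarchy information,
    yielding the audio attribute information used to pick a processing mode."""
    return sum(WEIGHTS[name] * level for name, level in levels.items())
```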
It should be noted that, in the above steps, the specific manner of determining the target audio processing mode in step S130 is not limited, and may be selected according to actual application requirements.
For example, in one example, step S130 may include sub-steps 41-42.
Sub-step 41: obtaining audio processing efficiency information corresponding to the audio attribute information based on a preset first correspondence, where the first correspondence maps a plurality of pieces of audio attribute information to a plurality of pieces of audio processing efficiency information one to one (for example, when the audio attribute information is represented by the hierarchy information described above, higher hierarchy information corresponds to higher efficiency information);
and sub-step 42: determining, among the plurality of audio processing modes included in a preset audio processing method set, a target audio processing mode according to the efficiency with which each audio processing mode processes the voice information to be processed and the audio processing efficiency information obtained in sub-step 41.
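Sub-steps 41 and 42 can be sketched as a lookup followed by a selection. The selection rule below (prefer the mode meeting the required efficiency with the least surplus) is one plausible reading of an otherwise open choice, and all names are hypothetical:

```python
def choose_mode(attr_info, efficiency_map, modes):
    """Sketch of sub-steps 41-42: look up the efficiency information required
    by the audio attribute information, then pick the audio processing mode
    whose own processing efficiency best matches that requirement."""
    required = efficiency_map[attr_info]  # sub-step 41: first correspondence
    # sub-step 42 (one possible rule): among modes meeting the requirement,
    # take the closest; fall back to the closest overall if none qualifies
    eligible = [m for m in modes if m["efficiency"] >= required]
    pool = eligible or modes
    return min(pool, key=lambda m: abs(m["efficiency"] - required))["name"]
```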
It can be understood that an audio processing mode may process the to-be-processed voice information based on one or more preset neural network models, such as a speech enhancement model or a speech denoising model. In addition, different audio processing modes may include different numbers of neural network models, and neural network models of the same type in different audio processing modes may differ in the number of training samples and training iterations used. For example, the audio processing mode with the lowest processing efficiency may include the largest number of neural network models, each trained with the largest number of training samples and iterations.
In the above example, the specific manner of determining the target audio processing manner based on sub-step 42 is also not limited, and may be selected according to the actual application requirements.
For example, in one example, substep 42 may comprise:
first, semantic recognition processing may be performed on the second processed voice information included in the at least one second processed audio data packet, to determine whether the second user corresponding to the second video communication terminal 30 has expressed a request for the first user corresponding to the first video communication terminal 20 to restate what was said (for example, "sorry, could you say that again");
secondly, if the second user corresponding to the second video communication terminal 30 has expressed such a restatement request to the first user corresponding to the first video communication terminal 20, performing efficiency value reduction processing on the audio processing efficiency information to obtain new audio processing efficiency information;
then, among the plurality of audio processing modes included in the preset audio processing method set, a target audio processing mode may be determined according to the efficiency with which each audio processing mode processes the voice information to be processed and the new audio processing efficiency information.
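The restatement-detection step can be sketched as a simple phrase match over the recognized text; a real system would use a semantic recognition model, and the phrase list, function name, and penalty value here are all assumptions:

```python
# Hypothetical phrases signaling a restatement request by the second user.
RESTATE_PHRASES = ("say that again", "please repeat", "could you restate")

def adjust_efficiency(transcript, efficiency, penalty=1):
    """Sketch of sub-step 42's first two stages: if the remote user's
    recognized speech contains a restatement request, reduce the audio
    processing efficiency value (lower efficiency selects the more thorough
    processing mode, per the example above)."""
    text = transcript.lower()
    if any(p in text for p in RESTATE_PHRASES):
        return efficiency - penalty
    return efficiency
```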
With reference to fig. 3, an embodiment of the present application further provides an audio processing system 100 in video communication, which can be applied to the video communication server 40. The audio processing system 100 in video communication may include a to-be-processed audio packet obtaining module 110, an audio attribute information obtaining module 120, an audio processing mode determining module 130, a to-be-processed voice information processing module 140, a processed audio packet obtaining module 150, and a processed audio packet sending module 160.
The to-be-processed audio data packet obtaining module 110 is configured to obtain a first to-be-processed audio data packet sent by the first video communication terminal 20 at a first time, where the first to-be-processed audio data packet includes first to-be-processed voice information and first timestamp information corresponding to the first to-be-processed voice information. In this embodiment, the to-be-processed audio data packet obtaining module 110 may be configured to perform step S110 shown in fig. 2, and for the relevant content of the to-be-processed audio data packet obtaining module 110, reference may be made to the foregoing description of step S110.
The audio attribute information obtaining module 120 is configured to obtain audio attribute information of at least one second processed audio data packet sent by the second video communication terminal 30 before the first time. In this embodiment, the audio attribute information obtaining module 120 may be configured to perform step S120 shown in fig. 2, and reference may be made to the foregoing description of step S120 for relevant contents of the audio attribute information obtaining module 120.
The audio processing mode determining module 130 is configured to determine, based on the audio attribute information, a target audio processing mode among a plurality of audio processing modes included in a preset audio processing method set. In this embodiment, the audio processing manner determining module 130 may be configured to execute step S130 shown in fig. 2, and reference may be made to the foregoing description of step S130 for relevant contents of the audio processing manner determining module 130.
The to-be-processed voice information processing module 140 is configured to process the first to-be-processed voice information in the first to-be-processed audio data packet based on the target audio processing manner, so as to obtain first processed voice information. In this embodiment, the to-be-processed speech information processing module 140 can be configured to execute step S140 shown in fig. 2, and reference may be made to the foregoing description of step S140 for relevant contents of the to-be-processed speech information processing module 140.
The processed audio data packet obtaining module 150 is configured to obtain a first processed audio data packet corresponding to the first to-be-processed audio data packet based on the first processed voice information and the first timestamp information. In this embodiment, the processed audio packet obtaining module 150 may be configured to perform step S150 shown in fig. 2, and reference may be made to the foregoing description of step S150 regarding the related content of the processed audio packet obtaining module 150.
The processed audio data packet sending module 160 is configured to send the first processed audio data packet to the second video communication terminal 30, so that the second video communication terminal 30 performs synchronous playing processing on the first processed voice information in the first processed audio data packet and the video information in the video data packet based on the first timestamp information in the first processed audio data packet and the second timestamp information in the obtained video data packet. In this embodiment, the processed audio packet sending module 160 may be configured to execute step S160 shown in fig. 2, and reference may be made to the foregoing description of step S160 for relevant contents of the processed audio packet sending module 160.
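The module pipeline described above (modules 110 through 160) can be summarized in a minimal sketch; the class name, dictionary keys, and injected callables are illustrative assumptions, with the obtaining and sending modules reduced to the packet passed in and the packet returned:

```python
class AudioProcessingSystem:
    """Minimal sketch of the audio processing system 100: the mode-determining
    and voice-processing steps are injected as plain functions."""

    def __init__(self, choose_mode, process):
        self.choose_mode = choose_mode  # stands in for module 130
        self.process = process          # stands in for module 140

    def handle(self, packet, attr_info):
        mode = self.choose_mode(attr_info)               # module 130
        processed = self.process(packet["voice"], mode)  # module 140
        # module 150: rebuild the packet, preserving the timestamp so the
        # receiving terminal can play audio and video synchronously
        return {"voice": processed, "timestamp": packet["timestamp"]}
```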
In summary, according to the audio processing method and system in video communication provided by the present application, a target audio processing mode is selected from multiple audio processing modes based on the attribute information of previously processed audio data, and the to-be-processed audio data is processed accordingly, so that the audio data processing mode adapts better to the actual communication conditions, thereby alleviating the problem that audio data processing in existing video communication is poorly matched to those conditions.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.