CN110225285B - Audio and video communication method and device, computer device and readable storage medium

Info

Publication number
CN110225285B
Authority
CN
China
Prior art keywords
audio
video
video data
transmitted
processing
Prior art date
Legal status
Active
Application number
CN201910305621.3A
Other languages
Chinese (zh)
Other versions
CN110225285A (en)
Inventor
齐燕
Current Assignee
OneConnect Financial Technology Co Ltd Shanghai
Original Assignee
OneConnect Financial Technology Co Ltd Shanghai
Priority date
Filing date
Publication date
Application filed by OneConnect Financial Technology Co Ltd Shanghai filed Critical OneConnect Financial Technology Co Ltd Shanghai
Priority to CN201910305621.3A priority Critical patent/CN110225285B/en
Publication of CN110225285A publication Critical patent/CN110225285A/en
Application granted granted Critical
Publication of CN110225285B publication Critical patent/CN110225285B/en

Classifications

    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24 Classification techniques
    • G06V20/40 Scenes; Scene-specific elements in video content
    • H04N7/141 Systems for two-way working between two video terminals, e.g. videophone


Abstract

The invention provides an audio and video communication method, which comprises the following steps: when audio and video communication is carried out with external equipment, audio and video data to be transmitted are obtained, and audio and video related parameters are extracted from the audio and video data to be transmitted; calling a scene recognition model generated by pre-training, and recognizing the current scene of the user according to the acquired audio and video related parameters; determining a processing mode of the audio and video data to be transmitted according to the current scene of the user; and processing the audio and video data to be transmitted according to the determined processing mode, and transmitting the processed audio and video data to the external equipment. The invention also provides a device, a computer device and a readable storage medium for realizing the audio and video communication method. The invention can solve the technical problem of poor audio and video communication experience of users.

Description

Audio and video communication method and device, computer device and readable storage medium
Technical Field
The invention relates to the technical field of computers, in particular to an audio and video communication method, an audio and video communication device, a computer device and a readable storage medium.
Background
During audio and video communication, the user's current environment has a great influence on the communication experience. For example, a noisy environment may make it difficult for the other party to hear the user clearly.
Disclosure of Invention
In view of the above, it is necessary to provide an audio/video communication method, an audio/video communication device, a computer device, and a readable storage medium to solve the technical problem of poor experience of audio/video communication for users.
A first aspect of the present invention provides an audio-video communication method, including:
when audio and video communication is carried out with external equipment, audio and video data to be transmitted are obtained, and audio and video related parameters are extracted from the audio and video data to be transmitted;
calling a scene recognition model generated by pre-training, and recognizing the current scene of the user according to the acquired audio and video related parameters;
determining a processing mode of the audio and video data to be transmitted according to the current scene of the user; and
processing the audio and video data to be transmitted according to the determined processing mode, and transmitting the processed audio and video data to the external equipment.
Preferably, the method of training the scene recognition model comprises:
acquiring a preset number of audio and video related parameters respectively corresponding to different scenes, and labeling the category of the audio and video related parameters corresponding to each scene, so that the audio and video related parameters corresponding to each scene carry a category label;
respectively randomly dividing audio and video related parameters corresponding to different scenes into a training set with a first preset proportion and a verification set with a second preset proportion, training the scene recognition model by using the training set, and verifying the accuracy of the trained scene recognition model by using the verification set; and
if the accuracy is greater than or equal to the preset accuracy, ending the training; and if the accuracy is less than the preset accuracy, increasing the number of samples and retraining the scene recognition model until the accuracy is greater than or equal to the preset accuracy.
Preferably, the determining, according to the current scene of the user, a processing manner of the audio/video data to be transmitted includes:
when the current scene of the user is outdoor, determining that the processing mode of the audio and video data to be transmitted is a first mode, wherein the first mode means that the processing of the audio and video data to be transmitted includes at least noise reduction processing;
and when the current scene of the user is indoor, determining that the processing mode of the audio and video data to be transmitted is a second mode, wherein the second mode means processing the audio and video data to be transmitted according to the indoor area and the material of the indoor wall.
Preferably, the processing of the audio and video data to be transmitted according to the indoor area and the material of the indoor wall includes the steps of:
estimating the size of the indoor area;
intercepting a frame of image comprising a wall from the audio and video data to be transmitted;
matching the intercepted image of the wall with a plurality of pre-stored images of different materials by using an image recognition algorithm to determine the material of the wall; determining a sound absorption coefficient according to the material of the wall;
multiplying the indoor area by the determined sound absorption coefficient to estimate a sound absorption amount; and
processing the audio and video data to be transmitted according to the estimated sound absorption amount, wherein when the estimated sound absorption amount is greater than a preset sound absorption value, the processing of the audio and video data to be transmitted includes at least dereverberation processing, and when the estimated sound absorption amount is less than or equal to the preset sound absorption value, the processing of the audio and video data to be transmitted does not include dereverberation processing.
Preferably, the estimating of the size of the indoor area comprises:
intercepting, from the audio and video data, a frame of image that includes the user's head;
calculating a first total number of pixels included in the user's head and a second total number of pixels included in the intercepted image;
and estimating the size of the indoor area according to the ratio of the first total number of pixels to the second total number of pixels, wherein the size of the indoor area is equal to a preset value divided by that ratio.
Preferably, after the audio/video data to be transmitted is processed according to the determined processing manner, the method further includes:
determining whether a plurality of portraits exist in a video image included in the audio and video data to be transmitted;
when a plurality of portraits exist in the video image, identifying the portrait facing the lens in the video image, and when a plurality of portraits do not exist, skipping this identification; and
blurring the portraits in the video image other than the portrait facing the lens.
Preferably, after the audio/video data to be transmitted is processed according to the determined processing manner, the method further includes:
acquiring the average brightness of a video image in the audio and video data to be transmitted;
judging whether the average brightness of the video image is less than a preset brightness threshold; and
when the average brightness of the video image is less than the preset brightness threshold, performing brightness enhancement on the video image, and when the average brightness of the video image is greater than or equal to the preset brightness threshold, not performing brightness enhancement on the video image.
A second aspect of the present invention provides a computer apparatus comprising a memory for storing at least one instruction and a processor for implementing the audio-video communication method when the at least one instruction is executed.
A third aspect of the invention provides a computer-readable storage medium having stored thereon at least one instruction which, when executed by a processor, implements the audio-video communication method.
A fourth aspect of the present invention provides an audio-video communication device, the device comprising:
the acquisition module is used for acquiring audio and video data to be transmitted when audio and video communication is carried out with external equipment, and extracting audio and video related parameters from the audio and video data to be transmitted;
the execution module is used for calling a scene recognition model generated by pre-training and recognizing the current scene of the user according to the acquired audio and video related parameters;
the execution module is also used for determining a processing mode of the audio and video data to be transmitted according to the current scene of the user; and
the execution module is further configured to process the audio and video data to be transmitted according to the determined processing mode, and transmit the processed audio and video data to the external device.
According to the audio and video communication method and device, the computer device and the readable storage medium, when the computer device is in audio and video communication with external equipment, audio and video data to be transmitted are obtained, and audio and video related parameters are extracted from the audio and video data to be transmitted; calling a scene recognition model generated by pre-training, and recognizing the current scene of the user according to the acquired audio and video related parameters; determining a processing mode of the audio and video data to be transmitted according to the current scene of the user; and processing the audio and video data to be transmitted according to the determined processing mode, and transmitting the processed audio and video data to the external equipment, so that the audio and video communication experience of a user can be improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained from the provided drawings without creative effort.
Fig. 1 is a flowchart of an audio-video communication method according to an embodiment of the present invention.
Fig. 2 is a structural diagram of an audio/video communication apparatus according to a second embodiment of the present invention.
Fig. 3 is a schematic diagram of a computer device according to a third embodiment of the present invention.
The following detailed description will further illustrate the invention in conjunction with the above-described figures.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a detailed description of the present invention will be given below with reference to the accompanying drawings and specific embodiments. It should be noted that the embodiments of the present invention and features of the embodiments may be combined with each other without conflict.
In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention, and the described embodiments are merely a subset of the embodiments of the present invention, rather than a complete embodiment. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
Example one
Fig. 1 is a flowchart of an audio-video communication method according to an embodiment of the present invention.
In this embodiment, the audio and video communication method may be applied to a computer device. For a computer device that needs to perform audio and video communication, the functions provided by the method may be directly integrated on the computer device, or may run on the computer device in the form of a Software Development Kit (SDK).
As shown in fig. 1, the audio-video communication method specifically includes the following steps, and according to different requirements, the order of the steps in the flowchart may be changed, and some steps may be omitted.
Step S1, when the computer device is in audio and video communication with external equipment, acquiring audio and video data to be transmitted, and extracting audio and video related parameters from the audio and video data to be transmitted.
In this embodiment, the audio and video related parameters include, but are not limited to, the spectral features, volume, and frequency distribution of the audio, and, in the video image, the human figures and their number, the ground, and the background.
In one embodiment, the audio and video data refers to audio data collected by a microphone and video data captured by a camera in synchronization.
In one embodiment, the audio data may first be windowed and framed. For example, a Hanning window may be used to divide the audio data into frames of, for example, 10-30 ms (milliseconds), with a frame shift of 10 ms. After windowing and framing, a fast Fourier transform is performed on the windowed frames to obtain the frequency spectrum of the audio data, and the spectral features of the audio data are then extracted from this spectrum.
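As an illustration of the windowing, framing, and transform steps just described, the following is a minimal Python sketch; the 16 kHz sample rate and the 25 ms frame length are assumed values within the stated 10-30 ms range.

```python
import numpy as np

def extract_spectral_features(audio, sample_rate=16000,
                              frame_ms=25, shift_ms=10):
    """Split audio into overlapping Hanning-windowed frames and
    return the magnitude spectrum of each frame."""
    frame_len = int(sample_rate * frame_ms / 1000)    # e.g. 25 ms -> 400 samples
    frame_shift = int(sample_rate * shift_ms / 1000)  # 10 ms frame shift
    window = np.hanning(frame_len)

    spectra = []
    for start in range(0, len(audio) - frame_len + 1, frame_shift):
        frame = audio[start:start + frame_len] * window
        # Fast Fourier transform of the windowed frame; keep the
        # magnitudes of the non-negative frequencies only.
        spectra.append(np.abs(np.fft.rfft(frame)))
    return np.array(spectra)  # shape: (num_frames, frame_len // 2 + 1)
```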
In one embodiment, the volume level included in the audio-video related parameter may be an average value of the volume.
In one embodiment, the human figures included in the video image, together with their number, the ground, and the background, can be identified from the audio and video data by using an image recognition algorithm.
In one embodiment, the microphone and camera may be built into the computer device or be externally connected to the computer device in a wired/wireless manner.
For example, the microphone and camera may be communicatively coupled to the computer device using USB data cables.
In one embodiment, the computer apparatus and the external device may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart television, or the like.
In one embodiment, the computer apparatus and the external device may be communicatively coupled via any conventional wired and/or wireless network. The wired network may be any type of conventional wired communication, such as the Internet or a local area network. The wireless network may be any type of conventional wireless communication, such as radio, Wireless Fidelity (WiFi), cellular, satellite, or broadcast. The wireless communication technology may include, but is not limited to, Global System for Mobile Communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (W-CDMA), CDMA2000, IMT Single Carrier (IMT-SC), Enhanced Data Rates for GSM Evolution (EDGE), Long-Term Evolution (LTE), LTE-Advanced (LTE-A), Time-Division LTE (TD-LTE), High Performance Radio Local Area Network (HiperLAN), High Performance Radio Wide Area Network (HiperWAN), Local Multipoint Distribution Service (LMDS), Worldwide Interoperability for Microwave Access (WiMAX), the ZigBee protocol (ZigBee), Bluetooth, Flash Orthogonal Frequency Division Multiplexing (Flash-OFDM), High Capacity Spatial Division Multiple Access (HC-SDMA), Universal Mobile Telecommunications System (UMTS), UMTS Time-Division Duplex (UMTS-TDD), Evolved High Speed Packet Access (HSPA+), Time Division Synchronous Code Division Multiple Access (TD-SCDMA), Evolution-Data Optimized (EV-DO), and Digital Enhanced Cordless Telecommunications (DECT).
Step S2, calling a scene recognition model generated by pre-training, and recognizing the current scene of the user according to the acquired audio and video related parameters.
Specifically, the acquired audio and video related parameters are input to the scene recognition model generated by the pre-training, so that the current scene of the user is obtained.
In this embodiment, the scene may be divided into indoor and outdoor. Different scenes correspond to different audio and video related parameters.
Preferably, the method for training the scene recognition model comprises:
1) acquiring a preset number of audio and video related parameters respectively corresponding to the different scenes, and labeling the category of the audio and video related parameters corresponding to each scene, so that the audio and video related parameters corresponding to each scene carry a category label.
For example, 1000 records of audio and video related parameters corresponding to indoor scenes are selected and each marked as "1", that is, "1" is used as the label. Similarly, 1000 records of audio and video related parameters corresponding to outdoor scenes are selected and each marked as "2", that is, "2" is used as the label.
2) randomly dividing the audio and video related parameters corresponding to the different scenes into a training set with a first preset proportion and a verification set with a second preset proportion, training the scene recognition model with the training set, and verifying the accuracy of the trained scene recognition model with the verification set.
For example, training samples (i.e., audio-video related parameters) corresponding to different scenes may be distributed to different folders first. For example, training samples corresponding to indoors are distributed into a first folder, and training samples corresponding to outdoors are distributed into a second folder. Then, training samples with a first preset proportion (for example, 70%) are respectively extracted from different folders to serve as total training samples to perform training of the scene recognition model, and training samples with a remaining second preset proportion (for example, 30%) are respectively extracted from different folders to serve as total test samples to perform accuracy verification on the trained scene recognition model.
3) If the accuracy is greater than or equal to a preset accuracy, ending the training, the trained scene recognition model being used as a classifier to identify the user's current environment; if the accuracy is less than the preset accuracy, increasing the number of samples and retraining the scene recognition model until the accuracy is greater than or equal to the preset accuracy.
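Purely as a sketch of the procedure in steps 1)-3): the text does not name a classifier, so the support vector machine below, the 70%/30% split, and the 0.95 target accuracy are assumptions.

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def train_scene_recognizer(features, labels, target_accuracy=0.95):
    """features: (n_samples, n_params) audio/video related parameters;
    labels: 1 = indoor, 2 = outdoor (the category labels above)."""
    # First preset proportion (70%) for training, second preset
    # proportion (30%) for verification; stratifying on the labels
    # approximates the per-scene split described in the text.
    x_train, x_val, y_train, y_val = train_test_split(
        features, labels, train_size=0.7, stratify=labels)

    model = SVC()  # assumed model family; the text leaves it open
    model.fit(x_train, y_train)

    accuracy = accuracy_score(y_val, model.predict(x_val))
    if accuracy < target_accuracy:
        # The text prescribes enlarging the sample set and retraining.
        print(f"accuracy {accuracy:.2f} below target; add samples and retrain")
    return model, accuracy
```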
Step S3, determining the processing mode of the audio and video data to be transmitted according to the current scene of the user, wherein different scenes correspond to different processing modes.
In this embodiment, the determining, according to the current scene of the user, a processing manner of the audio and video data to be transmitted includes:
when the current scene of the user is outdoor, determining that the processing mode of the audio and video data to be transmitted is a first mode; and
when the current scene of the user is indoor, determining that the processing mode of the audio and video data to be transmitted is a second mode.
In one embodiment, the first mode means that the processing of the audio and video data to be transmitted includes at least noise reduction processing. In one embodiment, speech enhancement may further be included.
In one embodiment, the second mode is to process the audio/video data to be transmitted according to the indoor area and the material of the indoor wall.
In one embodiment, processing the audio and video data to be transmitted according to the indoor area and the material of the indoor wall comprises the following steps (a1)-(a4):
(a1) estimating the size of the indoor area.
In one embodiment, the estimating of the size of the indoor area comprises steps (a11)-(a13):
(a11) intercepting, from the audio and video data, a frame of image that includes the user's head;
(a12) calculating the total number of pixels included in the user's head (for convenience of description, the "first total number of pixels") and the total number of pixels included in the intercepted image (the "second total number of pixels");
(a13) estimating the size of the indoor area according to the ratio of the first total number of pixels to the second total number of pixels.
In one embodiment, the size of the indoor area is equal to a preset value divided by the ratio.
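A minimal sketch of steps (a11)-(a13), assuming the user's head region has already been segmented; the preset value of 25.0 is a purely illustrative calibration constant, not a value given in the text.

```python
def estimate_indoor_area(head_pixel_count, frame_pixel_count,
                         preset_value=25.0):
    """Estimate the indoor area from the proportion of the frame
    occupied by the user's head: the larger the head-to-frame pixel
    ratio (the closer the user), the smaller the estimated area."""
    ratio = head_pixel_count / frame_pixel_count  # first total / second total
    return preset_value / ratio                   # area = preset value / ratio
```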
(a2) determining the material of the indoor wall, and determining the sound absorption coefficient according to that material.
Specifically, determining the material of the indoor wall comprises steps (a21)-(a22):
(a21) intercepting a frame including an image of the wall from the audio and video data.
In one embodiment, the image including the wall may be intercepted from the audio and video data according to an operation of the user.
(a22) matching the intercepted image with a plurality of pre-stored images of different materials by using an image recognition algorithm to determine the material of the wall.
Specifically, when the similarity between the intercepted image and a pre-stored image of a certain material is greater than a preset similarity value, the material of the wall is determined to be that material.
Different materials correspond to different sound absorption coefficients; thus, once the material of the wall is determined, the sound absorption coefficient can be determined.
(a3) multiplying the indoor area by the determined sound absorption coefficient to estimate the sound absorption amount.
(a4) processing the audio and video data to be transmitted according to the estimated sound absorption amount.
In one embodiment, when the estimated sound absorption amount is greater than a preset sound absorption value, the processing of the audio and video data to be transmitted includes at least dereverberation processing. When the estimated sound absorption amount is less than or equal to the preset sound absorption value, the processing of the audio and video data to be transmitted may omit dereverberation processing.
In one embodiment, the processing of the audio and video data to be transmitted according to the estimated sound absorption amount may further include echo cancellation and speech enhancement.
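A sketch of steps (a2)-(a4). The material-to-coefficient table and the preset sound absorption value below are illustrative assumptions; an implementation would take its coefficients from acoustic reference data.

```python
# Illustrative sound absorption coefficients per wall material
# (assumed values; the text does not specify them).
ABSORPTION_COEFFICIENTS = {
    "concrete": 0.02,
    "brick": 0.03,
    "wood_panel": 0.10,
    "curtain": 0.35,
}

def needs_dereverberation(indoor_area, wall_material, preset_absorption=2.0):
    """Steps (a3)-(a4): estimate the sound absorption amount and decide
    whether the processing should include dereverberation."""
    coefficient = ABSORPTION_COEFFICIENTS[wall_material]
    absorption = indoor_area * coefficient  # indoor area x coefficient
    # Per the text, dereverberation is applied only when the estimated
    # sound absorption amount exceeds the preset sound absorption value.
    return absorption > preset_absorption
```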
Step S4, processing the audio and video data to be transmitted according to the determined processing mode, and transmitting the processed audio and video data to the external equipment.
For example, assuming that the determined processing mode is the first mode, at least noise reduction processing is performed on the audio/video data to be transmitted.
In one embodiment, whether the first mode or the second mode is adopted to process the audio and video data to be transmitted, the following processing, comprising steps (b1)-(b3), is further performed on the audio and video data to be transmitted, as illustrated in the sketch after these steps:
(b1) determining whether a plurality of portraits exist in the video image included in the audio and video data to be transmitted (for example, whether the number of portraits is greater than or equal to 2);
(b2) when a plurality of portraits exist in the video image, identifying the portrait facing the lens in the video image, and when a plurality of portraits do not exist, skipping this identification;
(b3) blurring the portraits in the video image other than the portrait facing the lens, so as to highlight the portrait facing the lens.
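A sketch of steps (b1)-(b3) using OpenCV. The Haar cascade frontal-face detector and the largest-detection heuristic below are stand-ins for the unspecified portrait recognition algorithm.

```python
import cv2

# Frontal-face detector: frontal detections approximate
# "portraits facing the lens".
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def blur_other_portraits(frame):
    """When several portraits are present, blur every portrait except
    the one facing the lens (steps b1-b3)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1,
                                          minNeighbors=5)
    if len(faces) < 2:  # (b1): no plurality of portraits, nothing to do
        return frame
    # Assumption: the largest frontal detection is the portrait
    # facing the lens (the text does not fix the criterion).
    main = max(range(len(faces)), key=lambda i: faces[i][2] * faces[i][3])
    for i, (x, y, w, h) in enumerate(faces):
        if i != main:  # (b3): blur the other portraits
            frame[y:y + h, x:x + w] = cv2.GaussianBlur(
                frame[y:y + h, x:x + w], (31, 31), 0)
    return frame
```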
In one embodiment, whether the first mode or the second mode is adopted to process the audio and video data to be transmitted, the following processing, comprising steps (c1)-(c3), is further performed on the audio and video data to be transmitted:
(c1) acquiring the average brightness of the video image in the audio and video data to be transmitted.
Specifically, the average brightness of the video image may be obtained by an image brightness detection algorithm.
Specifically, in the embodiment of the present application, acquiring the average brightness of the video image may include: obtaining the resolution of the video image, determining a corresponding sampling interval according to the resolution, and sampling the brightness of the pixel points in the video image at that interval to generate the average brightness.
In one embodiment, the image brightness detection algorithm may include an averaging algorithm, a histogram algorithm, or the like.
In one embodiment, the corresponding brightness detection algorithm may be selected according to the scene where the user is currently located to obtain the average brightness of the video image.
In one embodiment, taking an averaging algorithm as an example, the sampling calculation may be performed according to the resolution of the video image.
For example, the resolution of the video image may first be obtained, and the corresponding sampling interval determined according to its size. When the resolution of the video image is smaller than a preset resolution, the sampling interval is 1, that is, the whole video image is evaluated. When the resolution is 1-4 times the preset resolution, the sampling interval in the horizontal and vertical directions is 2, that is, one pixel point is selected out of every two. When the resolution is 4-8 times the preset resolution, the sampling interval is 4, that is, one pixel point is selected out of every four. When the resolution is more than 8 times the preset resolution, the sampling interval is 8, that is, one pixel point is selected out of every eight. Sampling intervals for even larger resolutions are determined in the same manner. After the sampling interval is determined, the brightness values of the pixel points sampled at that interval are added and averaged, and the result is taken as the average brightness of the whole video image.
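A sketch of this sampling scheme on a single grayscale frame; the 640x480 preset resolution is an assumed value.

```python
import numpy as np

def average_brightness(gray_image, preset_resolution=640 * 480):
    """Pick a sampling interval from the image resolution, as described
    above, and return the mean brightness of the sampled pixels."""
    resolution = gray_image.shape[0] * gray_image.shape[1]
    if resolution < preset_resolution:
        interval = 1   # evaluate the whole image
    elif resolution < 4 * preset_resolution:
        interval = 2   # one pixel in every two, both directions
    elif resolution < 8 * preset_resolution:
        interval = 4   # one pixel in every four
    else:
        interval = 8   # one pixel in every eight
    sampled = gray_image[::interval, ::interval]
    return float(np.mean(sampled))
```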
(c2) judging whether the average brightness of the video image is less than a preset brightness threshold.
The preset brightness threshold value can be selected according to the current scene of the user, that is, different scenes need different threshold values.
In one embodiment, the brightness threshold corresponding to the outdoor scene of the user is greater than the brightness threshold corresponding to the indoor scene of the user.
(c3) if the average brightness of the video image is less than the preset brightness threshold, performing brightness enhancement on the video image; if the average brightness of the video image is greater than or equal to the preset brightness threshold, not performing brightness enhancement on the video image.
In one embodiment, when the average brightness of the video image is less than the preset brightness threshold, a linear brightness enhancement algorithm may be used to enhance the brightness of the video image.
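A minimal sketch of such a linear enhancement; the gain and offset are assumed values.

```python
import numpy as np

def enhance_brightness(gray_image, gain=1.2, offset=10.0):
    """Linear brightness enhancement: out = gain * in + offset,
    clipped to the valid 8-bit range."""
    enhanced = gain * gray_image.astype(np.float32) + offset
    return np.clip(enhanced, 0, 255).astype(np.uint8)
```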
In summary, in the audio/video communication method in the embodiment of the present invention, when the computer device performs audio/video communication with an external device, audio/video data to be transmitted is obtained, and audio/video related parameters are extracted from the audio/video data to be transmitted; calling a scene recognition model generated by pre-training, and recognizing the current scene of the user according to the acquired audio and video related parameters; determining a processing mode of the audio and video data to be transmitted according to the current scene of the user; and processing the audio and video data to be transmitted according to the determined processing mode, and transmitting the processed audio and video data to the external equipment, so that the audio and video communication experience of a user can be improved.
The above fig. 1 describes the audio-video communication method of the present invention in detail, and the functional modules of the software device for implementing the audio-video communication method and the hardware device architecture for implementing the audio-video communication method are described below with reference to fig. 2 and 3.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
Example two
Fig. 2 is a structural diagram of an audio/video communication device according to a second embodiment of the present invention.
In some embodiments, the audio/video communication device 30 runs in a computer device. The computer apparatus is connected to an external device via a network. The audio/video communication device 30 may include a plurality of functional modules composed of program code segments. The program code of the various program segments in the audio/video communication device 30 may be stored in a memory of the computer device and executed by the at least one processor to implement the audio/video communication functions (described in detail in fig. 2).
In this embodiment, the audio/video communication device 30 may be divided into a plurality of functional modules according to the functions it performs. The functional modules may include: an acquisition module 301 and an execution module 302. A module, as referred to herein, is a series of computer program segments, stored in the memory, that can be executed by at least one processor to perform a fixed function. The functions of the modules will be described in detail in the following embodiments.
The obtaining module 301 obtains audio and video data to be transmitted when the computer device performs audio and video communication with an external device, and extracts audio and video related parameters from the audio and video data to be transmitted.
In this embodiment, the audio and video related parameters include, but are not limited to, the spectral features, volume, and frequency distribution of the audio, and, in the video image, the human figures and their number, the ground, and the background.
In one embodiment, the audio and video data refers to audio data collected by a microphone and video data captured by a camera in synchronization.
In one embodiment, the audio data may first be windowed and framed. For example, a Hanning window may be used to divide the audio data into frames of, for example, 10-30 ms (milliseconds), with a frame shift of 10 ms. After windowing and framing, a fast Fourier transform is performed on the windowed frames to obtain the frequency spectrum of the audio data, and the spectral features of the audio data are then extracted from this spectrum.
In one embodiment, the volume level included in the audio-video related parameter may be an average value of the volume.
In one embodiment, the human figures included in the video image, together with their number, the ground, and the background, can be identified from the audio and video data by using an image recognition algorithm.
In one embodiment, the microphone and camera may be built into the computer device or may be externally connected to the computer device in a wired/wireless manner.
For example, the microphone and camera may be communicatively coupled to the computer device using USB data cables.
In one embodiment, the computer apparatus and the external device may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart television, or the like.
In one embodiment, the computer apparatus and the external device may be communicatively coupled via any conventional wired and/or wireless network. The wired network may be any type of conventional wired communication, such as the Internet or a local area network. The wireless network may be any type of conventional wireless communication, such as radio, Wireless Fidelity (WiFi), cellular, satellite, or broadcast. The wireless communication technology may include, but is not limited to, Global System for Mobile Communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (W-CDMA), CDMA2000, IMT Single Carrier (IMT-SC), Enhanced Data Rates for GSM Evolution (EDGE), Long-Term Evolution (LTE), LTE-Advanced (LTE-A), Time-Division LTE (TD-LTE), High Performance Radio Local Area Network (HiperLAN), High Performance Radio Wide Area Network (HiperWAN), Local Multipoint Distribution Service (LMDS), Worldwide Interoperability for Microwave Access (WiMAX), the ZigBee protocol (ZigBee), Bluetooth, Flash Orthogonal Frequency Division Multiplexing (Flash-OFDM), High Capacity Spatial Division Multiple Access (HC-SDMA), Universal Mobile Telecommunications System (UMTS), UMTS Time-Division Duplex (UMTS-TDD), Evolved High Speed Packet Access (HSPA+), Time Division Synchronous Code Division Multiple Access (TD-SCDMA), Evolution-Data Optimized (EV-DO), and Digital Enhanced Cordless Telecommunications (DECT).
The execution module 302 calls a scene recognition model generated by pre-training and recognizes the current scene of the user according to the acquired audio and video related parameters.
Specifically, the execution module 302 inputs the acquired audio/video related parameters to the scene recognition model generated by the pre-training, so as to obtain the current scene of the user.
In this embodiment, the scene may be divided into indoor and outdoor. Different scenes correspond to different audio and video related parameters.
Preferably, the method for training the scene recognition model comprises:
1) acquiring a preset number of audio and video related parameters respectively corresponding to the different scenes, and labeling the category of the audio and video related parameters corresponding to each scene, so that the audio and video related parameters corresponding to each scene carry a category label.
For example, 1000 records of audio and video related parameters corresponding to indoor scenes are selected and each marked as "1", that is, "1" is used as the label. Similarly, 1000 records of audio and video related parameters corresponding to outdoor scenes are selected and each marked as "2", that is, "2" is used as the label.
2) randomly dividing the audio and video related parameters corresponding to the different scenes into a training set with a first preset proportion and a verification set with a second preset proportion, training the scene recognition model with the training set, and verifying the accuracy of the trained scene recognition model with the verification set.
For example, training samples (i.e., audio-video related parameters) corresponding to different scenes may be distributed to different folders first. For example, training samples corresponding to indoors are distributed into a first folder, and training samples corresponding to outdoors are distributed into a second folder. Then, training samples with a first preset proportion (for example, 70%) are respectively extracted from different folders to serve as total training samples to perform training of the scene recognition model, and training samples with a remaining second preset proportion (for example, 30%) are respectively extracted from different folders to serve as total test samples to perform accuracy verification on the trained scene recognition model.
3) If the accuracy is greater than or equal to a preset accuracy, ending the training, the trained scene recognition model being used as a classifier to identify the user's current environment; if the accuracy is less than the preset accuracy, increasing the number of samples and retraining the scene recognition model until the accuracy is greater than or equal to the preset accuracy.
The execution module 302 determines a processing mode of the audio/video data to be transmitted according to a current scene of a user, wherein different scenes correspond to different processing modes.
In this embodiment, the determining, according to the current scene of the user, a processing manner of the audio and video data to be transmitted includes:
when the current scene of the user is outdoor, determining that the processing mode of the audio and video data to be transmitted is a first mode; and
when the current scene of the user is indoor, determining that the processing mode of the audio and video data to be transmitted is a second mode.
In one embodiment, the first mode means that the processing of the audio and video data to be transmitted includes at least noise reduction processing. In one embodiment, speech enhancement may further be included.
In one embodiment, the second mode is to process the audio/video data to be transmitted according to the indoor area and the material of the indoor wall.
In one embodiment, processing the audio and video data to be transmitted according to the indoor area and the material of the indoor wall comprises the following steps (a1)-(a4):
(a1) estimating the size of the indoor area.
In one embodiment, the estimating of the size of the indoor area comprises steps (a11)-(a13):
(a11) intercepting, from the audio and video data, a frame of image that includes the user's head;
(a12) calculating the total number of pixels included in the user's head (for convenience of description, the "first total number of pixels") and the total number of pixels included in the intercepted image (the "second total number of pixels");
(a13) estimating the size of the indoor area according to the ratio of the first total number of pixels to the second total number of pixels.
In one embodiment, the size of the indoor area is equal to a preset value divided by the ratio.
(a2) determining the material of the indoor wall, and determining the sound absorption coefficient according to that material.
Specifically, determining the material of the indoor wall comprises steps (a21)-(a22):
(a21) intercepting a frame including an image of the wall from the audio and video data.
In one embodiment, the image including the wall may be intercepted from the audio and video data according to an operation of the user.
(a22) matching the intercepted image with a plurality of pre-stored images of different materials by using an image recognition algorithm to determine the material of the wall.
Specifically, when the similarity between the intercepted image and a pre-stored image of a certain material is greater than a preset similarity value, the material of the wall is determined to be that material.
Different materials correspond to different sound absorption coefficients; thus, once the material of the wall is determined, the sound absorption coefficient can be determined.
(a3) multiplying the indoor area by the determined sound absorption coefficient to estimate the sound absorption amount.
(a4) processing the audio and video data to be transmitted according to the estimated sound absorption amount.
In one embodiment, when the estimated sound absorption amount is greater than a preset sound absorption value, the processing of the audio and video data to be transmitted includes at least dereverberation processing. When the estimated sound absorption amount is less than or equal to the preset sound absorption value, the processing of the audio and video data to be transmitted may omit dereverberation processing.
In one embodiment, the processing of the audio and video data to be transmitted according to the estimated sound absorption amount may further include echo cancellation and speech enhancement.
The execution module 302 processes the audio and video data to be transmitted according to the determined processing mode, and transmits the processed audio and video data to the external device.
For example, assuming that the determined processing mode is the first mode, at least noise reduction processing is performed on the audio/video data to be transmitted.
In one embodiment, whether the first mode or the second mode is adopted to process the audio and video data to be transmitted, the following processing, comprising steps (b1)-(b3), is further performed on the audio and video data to be transmitted:
(b1) determining whether a plurality of portraits exist in the video image included in the audio and video data to be transmitted (for example, whether the number of portraits is greater than or equal to 2);
(b2) when a plurality of portraits exist in the video image, identifying the portrait facing the lens in the video image, and when a plurality of portraits do not exist, skipping this identification;
(b3) blurring the portraits in the video image other than the portrait facing the lens, so as to highlight the portrait facing the lens.
In one embodiment, whether the first mode or the second mode is adopted to process the audio and video data to be transmitted, the following processing, comprising steps (c1)-(c3), is further performed on the audio and video data to be transmitted:
(c1) acquiring the average brightness of the video image in the audio and video data to be transmitted.
Specifically, the average brightness of the video image may be obtained by an image brightness detection algorithm.
Specifically, in the embodiment of the present application, acquiring the average brightness of the video image may include: obtaining the resolution of the video image, determining a corresponding sampling interval according to the resolution, and sampling the brightness of the pixel points in the video image at that interval to generate the average brightness.
In one embodiment, the image brightness detection algorithm may include an averaging algorithm, a histogram algorithm, or the like.
In one embodiment, the corresponding brightness detection algorithm may be selected according to the scene where the user is currently located to obtain the average brightness of the video image.
In one embodiment, taking an averaging algorithm as an example, the sampling calculation may be performed according to the resolution of the video image.
For example, the resolution of the video image may first be obtained, and the corresponding sampling interval determined according to its size. When the resolution of the video image is smaller than a preset resolution, the sampling interval is 1, that is, the whole video image is evaluated. When the resolution is 1-4 times the preset resolution, the sampling interval in the horizontal and vertical directions is 2, that is, one pixel point is selected out of every two. When the resolution is 4-8 times the preset resolution, the sampling interval is 4, that is, one pixel point is selected out of every four. When the resolution is more than 8 times the preset resolution, the sampling interval is 8, that is, one pixel point is selected out of every eight. Sampling intervals for even larger resolutions are determined in the same manner. After the sampling interval is determined, the brightness values of the pixel points sampled at that interval are added and averaged, and the result is taken as the average brightness of the whole video image.
(c2) judging whether the average brightness of the video image is less than a preset brightness threshold.
The preset brightness threshold value can be selected according to the current scene of the user, that is, different scenes need different threshold values.
In one embodiment, the brightness threshold corresponding to the outdoor scene of the user is greater than the brightness threshold corresponding to the indoor scene of the user.
(c3) if the average brightness of the video image is less than the preset brightness threshold, performing brightness enhancement on the video image; if the average brightness of the video image is greater than or equal to the preset brightness threshold, not performing brightness enhancement on the video image.
In one embodiment, when the average brightness of the video image is less than the preset brightness threshold, a linear brightness enhancement algorithm may be used to enhance the brightness of the video image.
In summary, the audio-video communication device in the embodiment of the present invention obtains the audio-video data to be transmitted when the computer device performs audio-video communication with the external device, and extracts the audio-video related parameters from the audio-video data to be transmitted; calling a scene recognition model generated by pre-training, and recognizing the current scene of the user according to the acquired audio and video related parameters; determining a processing mode of the audio and video data to be transmitted according to a current scene of a user, wherein different scenes correspond to different processing modes; and processing the audio and video data to be transmitted according to the determined processing mode, and transmitting the processed audio and video data to the external equipment, so that the audio and video communication experience of a user can be improved.
EXAMPLE III
Fig. 3 is a schematic structural diagram of a computer device according to a third embodiment of the present invention. In the preferred embodiment of the present invention, the computer device 3 comprises a memory 31, at least one processor 32, and at least one communication bus 33. It will be appreciated by those skilled in the art that the configuration shown in fig. 3 does not constitute a limitation of the embodiments of the present invention; it may be a bus-type or a star-type configuration, and the computer device 3 may include more or fewer hardware or software components than shown, or a different arrangement of components.
In some embodiments, the computer device 3 includes a terminal capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and the hardware includes but is not limited to a microprocessor, an application specific integrated circuit, a programmable gate array, a digital processor, an embedded device, and the like.
It should be noted that the computer device 3 is only an example, and other electronic products that are currently available or may come into existence in the future, such as electronic products that can be adapted to the present invention, should also be included in the scope of the present invention, and are included herein by reference.
In some embodiments, the memory 31 is used to store program code and various data, such as the audio/video communication device 30 installed in the computer device 3, and provides high-speed, automatic access to programs and data during the operation of the computer device 3. The memory 31 may include a Read-Only Memory (ROM), a Random Access Memory (RAM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-Time Programmable Read-Only Memory (OTPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disc memory, magnetic disk memory, tape memory, or any other computer-readable storage medium that can be used to carry or store data.
In some embodiments, the at least one processor 32 may be composed of a single packaged integrated circuit or of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips. The at least one processor 32 is the control unit of the computer apparatus 3: it connects the various components of the entire computer apparatus 3 by various interfaces and lines, and executes the functions of the computer apparatus 3 and processes data, for example the audio and video communication functions, by running or executing the programs or modules stored in the memory 31 and calling the data stored in the memory 31.
In some embodiments, the at least one communication bus 33 is arranged to enable connection communication between the memory 31 and the at least one processor 32 or the like.
Although not shown, the computer device 3 may further include a power supply (such as a battery) for supplying power to each component. Preferably, the power supply may be logically connected to the at least one processor 32 through a power management device, so that charging, discharging, and power consumption management are implemented through the power management device. The power supply may also include one or more DC or AC power sources, recharging devices, power failure detection circuits, power converters or inverters, power status indicators, and the like. The computer device 3 may further include various sensors, a Bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The integrated unit implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes instructions for causing a computer device (which may be a server, a personal computer, etc.) or a processor (processor) to perform parts of the methods according to the embodiments of the present invention.
In a further embodiment, in conjunction with fig. 2, the at least one processor 32 may execute the operating system of the computer device 3 as well as various installed application programs (such as the audio-video communication device 30), program codes, and the like, for example, the modules described above.
The memory 31 has program code stored therein, and the at least one processor 32 can call the program code stored in the memory 31 to perform related functions. For example, the modules illustrated in fig. 2 are program codes stored in the memory 31 and executed by the at least one processor 32, so as to implement the functions of the modules for the purpose of audio-video communication.
In one embodiment of the invention, the memory 31 stores one or more instructions (i.e., at least one instruction), which are executed by the at least one processor 32 to implement audio-video communication.
Specifically, for the process by which the at least one processor 32 executes the above instructions, reference may be made to the description of the relevant steps in the embodiment corresponding to fig. 1, which is not repeated here.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; the division into modules is only one kind of logical functional division, and other divisions may be adopted in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from its spirit or essential attributes. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description; all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms "first", "second", and the like are used to denote names and do not denote any particular order.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from their spirit and scope.

Claims (9)

1. An audio-video communication method, characterized in that the method comprises:
when audio and video communication is carried out with an external device, acquiring audio and video data to be transmitted, and extracting audio and video related parameters from the audio and video data to be transmitted;
calling a scene recognition model generated by pre-training, and recognizing the current scene of the user according to the acquired audio and video related parameters;
determining a processing mode of the audio and video data to be transmitted according to the current scene of the user: when the current scene of the user is outdoor, determining that the processing mode of the audio and video data to be transmitted is a first mode, wherein in the first mode the processing of the audio and video data to be transmitted comprises at least noise reduction processing; and when the current scene of the user is indoor, determining that the processing mode of the audio and video data to be transmitted is a second mode, wherein in the second mode the audio and video data to be transmitted are processed according to the indoor area and the material of the indoor wall; and
processing the audio and video data to be transmitted according to the determined processing mode, and transmitting the processed audio and video data to the external device.
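For illustration only, the following Python sketch shows one possible shape of the scene-based dispatch recited in claim 1. Every name below (extract_av_params, denoise, process_for_room, scene_model) is a hypothetical helper invented for exposition; the claim itself does not prescribe any particular implementation.

# Minimal sketch of the claim-1 dispatch, assuming hypothetical helpers
# extract_av_params(), denoise(), and process_for_room().
def process_outgoing(av_data, scene_model):
    params = extract_av_params(av_data)      # audio/video related parameters
    scene = scene_model.predict(params)      # pre-trained scene recognition model
    if scene == "outdoor":
        # First mode: processing includes at least noise reduction.
        av_data = denoise(av_data)
    elif scene == "indoor":
        # Second mode: process according to indoor area and wall material.
        av_data = process_for_room(av_data)
    return av_data                           # then transmitted to the external device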
2. The audio-video communication method of claim 1, wherein the method of training the scene recognition model comprises:
acquiring a preset number of audio and video related parameters respectively corresponding to different scenes, and labeling the category of the audio and video related parameters corresponding to each scene, so that the audio and video related parameters corresponding to each scene carry a category label;
respectively randomly dividing audio and video related parameters corresponding to different scenes into a training set with a first preset proportion and a verification set with a second preset proportion, training the scene recognition model by using the training set, and verifying the accuracy of the trained scene recognition model by using the verification set; and
if the accuracy is greater than or equal to a preset accuracy, ending the training; if the accuracy is smaller than the preset accuracy, increasing the number of samples and retraining the scene recognition model until the accuracy is greater than or equal to the preset accuracy.
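A minimal sketch of the training loop in claim 2, assuming a generic scikit-learn-style classifier (the choice of RandomForestClassifier, the 0.95 target accuracy, and the add_more_samples helper are assumptions, not part of the claim):

# Sketch of claim 2: split the per-scene labeled samples into training and
# validation sets, train, verify accuracy, and enlarge the sample set until
# the target accuracy is reached.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def train_scene_model(samples, labels, target_acc=0.95, val_ratio=0.2):
    model = RandomForestClassifier()
    while True:
        x_train, x_val, y_train, y_val = train_test_split(
            samples, labels, test_size=val_ratio, stratify=labels)
        model.fit(x_train, y_train)                  # train on the training set
        if model.score(x_val, y_val) >= target_acc:  # verify on the validation set
            return model
        samples, labels = add_more_samples(samples, labels)  # hypothetical helper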
3. The audio-video communication method according to claim 1, wherein the processing of the audio-video data to be transmitted according to the indoor area and the material of the indoor wall comprises the steps of:
estimating the size of the indoor area;
intercepting a frame of image comprising a wall from the audio and video data to be transmitted;
matching the intercepted image of the wall with a plurality of pre-stored images with different materials by using an image recognition algorithm to determine the material of the wall; determining a sound absorption coefficient according to the material of the wall;
multiplying the indoor area by the determined sound absorption coefficient to estimate a sound absorption amount; and
processing the audio and video data to be transmitted according to the estimated sound absorption quantity, wherein when the estimated sound absorption quantity is larger than a preset sound absorption quantity value, the processing of the audio and video data to be transmitted comprises at least dereverberation processing, and when the estimated sound absorption quantity is smaller than or equal to the preset sound absorption quantity value, the processing of the audio and video data to be transmitted does not comprise dereverberation processing.
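As a rough illustration of claim 3's second mode: the absorption-coefficient table, the preset threshold, and every helper below are invented for exposition, and estimate_room_area is sketched after claim 4.

# Sketch of claim 3: sound absorption = indoor area x wall absorption coefficient.
ABSORPTION = {"concrete": 0.02, "brick": 0.03, "wood": 0.10, "curtain": 0.50}  # assumed values

def process_indoor(av_data, preset_absorption=5.0):
    area = estimate_room_area(grab_user_frame(av_data))  # see the claim-4 sketch
    wall_frame = grab_frame_with_wall(av_data)    # hypothetical frame capture
    material = match_material(wall_frame)         # match against stored material images
    absorption = area * ABSORPTION[material]      # estimated sound absorption quantity
    if absorption > preset_absorption:
        av_data = dereverberate(av_data)          # dereverberation only above threshold
    return av_data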
4. The audio-video communication method of claim 3, wherein the estimating of the size of the indoor area comprises:
intercepting, from the audio and video data, a frame of image comprising the head portrait of the user;
calculating the total number of first pixel points included in the head portrait of the user, and calculating the total number of second pixel points included in the intercepted image; and
estimating the size of the indoor area according to the ratio of the total number of the first pixel points to the total number of the second pixel points, wherein the size of the indoor area is equal to a preset value divided by the ratio.
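A toy rendering of claim 4's area estimate, with OpenCV face detection standing in for the unspecified head-portrait extraction (the cascade detector and the preset value of 40.0 are assumptions made here for illustration):

# Sketch of claim 4: indoor area = preset value / (head-portrait pixels / frame pixels).
import cv2

def estimate_room_area(frame, preset_value=40.0):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = cascade.detectMultiScale(gray)
    if len(faces) == 0:
        raise ValueError("no head portrait found in the intercepted frame")
    x, y, w, h = faces[0]
    first_total = w * h                             # pixel points in the head portrait
    second_total = frame.shape[0] * frame.shape[1]  # pixel points in the whole image
    return preset_value / (first_total / second_total)  # smaller portrait, larger room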
5. The audio-video communication method according to claim 1, wherein after processing the audio-video data to be transmitted according to the determined processing manner, the method further comprises:
determining whether a plurality of portraits exist in a video image included in the audio and video data to be transmitted;
when a plurality of portraits exist in the video image, identifying the portrait facing the lens in the video image, and when a plurality of portraits do not exist in the video image, not identifying the portrait facing the lens; and
blurring the portraits in the video image other than the portrait facing the lens.
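One way the selective blurring of claim 5 might look, again with hypothetical helpers (detect_portraits, is_facing_camera), since the claim does not fix a detection method:

# Sketch of claim 5: keep the portrait facing the lens sharp, blur all others.
import cv2

def blur_other_portraits(frame):
    portraits = detect_portraits(frame)            # hypothetical portrait detector
    if len(portraits) < 2:
        return frame                               # a single portrait is left untouched
    for (x, y, w, h) in portraits:
        roi = frame[y:y + h, x:x + w]
        if not is_facing_camera(roi):              # hypothetical pose/orientation check
            frame[y:y + h, x:x + w] = cv2.GaussianBlur(roi, (51, 51), 0)
    return frame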
6. The audio-video communication method according to claim 1, wherein after processing the audio-video data to be transmitted according to the determined processing manner, the method further comprises:
acquiring the average brightness of a video image in the audio and video data to be transmitted;
judging whether the average brightness of the video image is smaller than a preset brightness threshold value or not; and
when the average brightness of the video image is less than the preset brightness threshold, performing brightness enhancement on the video image, and when the average brightness of the video image is greater than or equal to the preset brightness threshold, not performing brightness enhancement on the video image.
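The brightness step of claim 6 maps naturally onto a mean-luminance test; the gamma correction below is one possible enhancement chosen for illustration, and the threshold of 80 is an assumed value:

# Sketch of claim 6: brighten only when average luminance falls below a threshold.
import cv2
import numpy as np

def enhance_if_dark(frame, threshold=80, gamma=1.5):
    luma = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    if luma.mean() >= threshold:
        return frame                   # bright enough: no enhancement
    # Gamma correction via a lookup table (gamma > 1 brightens dark regions).
    table = ((np.arange(256) / 255.0) ** (1.0 / gamma) * 255).astype("uint8")
    return cv2.LUT(frame, table)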
7. A computer device, characterized in that the computer device comprises a memory for storing at least one instruction and a processor for implementing the audio-video communication method according to any one of claims 1 to 6 when executing the at least one instruction.
8. A computer-readable storage medium, characterized in that it stores at least one instruction which, when executed by a processor, implements the audio-video communication method according to any one of claims 1 to 6.
9. An audio-video communication device, characterized in that the device comprises:
the acquisition module is used for acquiring audio and video data to be transmitted when audio and video communication is carried out with an external device, and extracting audio and video related parameters from the audio and video data to be transmitted;
the execution module is used for calling a scene recognition model generated by pre-training and recognizing the current scene of the user according to the acquired audio and video related parameters;
the execution module is further configured to determine a processing mode of the audio and video data to be transmitted according to the current scene of the user: when the current scene of the user is outdoor, determine that the processing mode of the audio and video data to be transmitted is a first mode, wherein in the first mode the processing of the audio and video data to be transmitted comprises at least noise reduction processing; and when the current scene of the user is indoor, determine that the processing mode of the audio and video data to be transmitted is a second mode, wherein in the second mode the audio and video data to be transmitted are processed according to the indoor area and the material of the indoor wall; and
the execution module is further configured to process the audio and video data to be transmitted according to the determined processing mode, and transmit the processed audio and video data to the external device.
CN201910305621.3A 2019-04-16 2019-04-16 Audio and video communication method and device, computer device and readable storage medium Active CN110225285B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910305621.3A CN110225285B (en) 2019-04-16 2019-04-16 Audio and video communication method and device, computer device and readable storage medium

Publications (2)

Publication Number Publication Date
CN110225285A (en) 2019-09-10
CN110225285B (en) 2022-09-02

Family

ID=67822570

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910305621.3A Active CN110225285B (en) 2019-04-16 2019-04-16 Audio and video communication method and device, computer device and readable storage medium

Country Status (1)

Country Link
CN (1) CN110225285B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110930155B (en) * 2019-10-17 2023-09-08 平安科技(深圳)有限公司 Risk management and control method, risk management and control device, computer device and storage medium
CN113129917A (en) * 2020-01-15 2021-07-16 荣耀终端有限公司 Speech processing method based on scene recognition, and apparatus, medium, and system thereof
CN113450289B (en) * 2021-08-31 2021-12-10 中运科技股份有限公司 Method for automatically enhancing low illumination of face image in passenger traffic scene

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120050570A1 (en) * 2010-08-26 2012-03-01 Jasinski David W Audio processing based on scene type

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103959762A (en) * 2011-11-30 2014-07-30 诺基亚公司 Quality enhancement in multimedia capturing
EP2854395A1 (en) * 2013-09-30 2015-04-01 Orange Method and device for transmitting at least one portion of a signal during a videoconferencing session
CN106412383A (en) * 2015-07-31 2017-02-15 阿里巴巴集团控股有限公司 Processing method and apparatus of video image
CN106297779A (en) * 2016-07-28 2017-01-04 块互动(北京)科技有限公司 A kind of background noise removing method based on positional information and device

Also Published As

Publication number Publication date
CN110225285A (en) 2019-09-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant