CN111931690A - Model training method, device, equipment and storage medium


Info

Publication number
CN111931690A
Authority
CN
China
Prior art keywords
sound
training
neural network
video
initial
Prior art date
Legal status
Pending
Application number
CN202010883864.8A
Other languages
Chinese (zh)
Inventor
崔志佳 (Cui Zhijia)
范泽华 (Fan Zehua)
Current Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202010883864.8A
Publication of CN111931690A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/40 - Scenes; scene-specific elements in video content
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/045 - Neural networks; architecture; combinations of networks
    • G06N 3/08 - Neural networks; learning methods


Abstract

The application discloses a model training method, a model training device, model training equipment and a storage medium, and belongs to the technical field of deep learning. The method comprises the following steps: acquiring a training video sample, wherein the training video sample comprises a training video and at least two real labels having an association relationship; respectively inputting the training video into an initial convolutional neural network and an initial recurrent neural network to obtain at least two training labels, output by the initial convolutional neural network and the initial recurrent neural network, that have an association relationship; and training the initial convolutional neural network and the initial recurrent neural network based on the difference between the at least two training labels and the at least two real labels to obtain a trained target convolutional neural network and a trained target recurrent neural network. The technical solution provided by the embodiments of the application thus provides a method for training a convolutional neural network and a recurrent neural network that can simultaneously identify associated features in a video.

Description

Model training method, device, equipment and storage medium
Technical Field
The present application relates to the field of deep learning technologies, and in particular, to a model training method, apparatus, device, and storage medium.
Background
In practical applications, some features in a video can be identified by a convolutional neural network, while other features in the video can be identified by a recurrent neural network. In the related art, depending on the features to be identified, either the convolutional neural network or the recurrent neural network is selected on its own to identify the video.
However, many features in a video are associated with one another, and identifying one feature alone while neglecting the other features associated with it causes certain negative effects. To avoid such negative effects, a convolutional neural network and a recurrent neural network can be used simultaneously to identify the associated features, and how to train the convolutional neural network and the recurrent neural network in such a scenario has become a problem to be solved.
Disclosure of Invention
Based on this, embodiments of the present application provide a model training method, apparatus, device, and storage medium, and provide a method for training a convolutional neural network and a recurrent neural network that can simultaneously recognize associated features in a video.
In a first aspect, a model training method is provided, which includes:
acquiring a training video sample, wherein the training video sample comprises a training video and at least two real labels that correspond to the training video and have an association relationship; inputting the training video into an initial convolutional neural network and an initial recurrent neural network respectively, to obtain at least two training labels, output by the initial convolutional neural network and the initial recurrent neural network, that have an association relationship; and training the initial convolutional neural network and the initial recurrent neural network based on the difference between the at least two training labels and the at least two real labels, to obtain a trained target convolutional neural network and a trained target recurrent neural network.
In a second aspect, there is provided a model training apparatus, the apparatus comprising:
a first acquisition module, configured to acquire a training video sample, wherein the training video sample comprises a training video and at least two real labels that correspond to the training video and have an association relationship;
a second acquisition module, configured to input the training video into an initial convolutional neural network and an initial recurrent neural network respectively, to obtain at least two training labels, output by the initial convolutional neural network and the initial recurrent neural network, that have an association relationship; and
a training module, configured to train the initial convolutional neural network and the initial recurrent neural network based on the difference between the at least two training labels and the at least two real labels, to obtain a trained target convolutional neural network and a trained target recurrent neural network.
In a third aspect, there is provided a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, implements the model training method as described in any one of the first aspects above.
In a fourth aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a model training method as described in any of the first aspects above.
The beneficial effects brought by the technical solutions provided by the embodiments of the application include at least the following:
A training video sample is obtained, wherein the training video sample comprises a training video and at least two real labels that correspond to the training video and have an association relationship. The training video is then input into an initial convolutional neural network and an initial recurrent neural network respectively, to obtain at least two training labels, output by the two networks, that have an association relationship. The initial convolutional neural network and the initial recurrent neural network are then trained based on the difference between the at least two training labels and the at least two real labels, to obtain a trained target convolutional neural network and a trained target recurrent neural network. In the embodiments of the application it is considered that, on the one hand, the trained target convolutional neural network and target recurrent neural network need to be able to accurately identify their respective required features from the video, and on the other hand, the association between the features they identify from the video needs to be accurate. Therefore, on the one hand, the training video sample includes at least two real labels, the initial convolutional neural network and the initial recurrent neural network output at least two training labels during training, and the two networks are trained based on the differences between the at least two training labels and the at least two real labels, which ensures that the trained networks can accurately identify the features each of them needs to identify from the video. On the other hand, the initial convolutional neural network and the initial recurrent neural network are also trained based on the difference between the at least two training labels, which ensures that the association between the features the trained networks respectively identify from the video is highly accurate. Through the technical solution provided by the embodiments of the application, a convolutional neural network and a recurrent neural network that can simultaneously identify associated features in a video can therefore be trained.
Drawings
FIG. 1 is a block diagram of a computer device provided by an embodiment of the present application;
FIG. 2 is a flow chart of a model training method provided in an embodiment of the present application;
fig. 3 is a flowchart of a method for inputting a training video into an initial convolutional neural network and an initial recurrent neural network, respectively, according to an embodiment of the present disclosure;
fig. 4 is a flowchart of a method for identifying associated features in a video by using a target convolutional neural network and a target recurrent neural network and optimizing the video based on the recognition result, according to an embodiment of the present application;
fig. 5 is a flowchart of a method for performing stereo processing on the various sounds included in the audio, respectively, based on the positions of the corresponding sound-generating objects in the video frame, according to an embodiment of the present application;
FIG. 6 is a block diagram of a model training apparatus according to an embodiment of the present disclosure;
fig. 7 is a block diagram of another model training apparatus according to an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
In practical applications, content in a video that does not have a serialization characteristic can be identified by a convolutional neural network; for example, a video frame in the video does not have a serialization characteristic, so its features can be identified by a convolutional neural network. Content in the video that does have a serialization characteristic typically has its features identified by a recurrent neural network; for example, the audio contained in the video has a serialization characteristic, and thus its features are typically identified by a recurrent neural network.
In the related art, depending on the features to be identified, either the convolutional neural network or the recurrent neural network is selected on its own to identify the video. For example, if features in a video frame need to be identified, a convolutional neural network alone can be used, and if features of the audio contained in the video need to be identified, a recurrent neural network alone can be used.
However, many features in a video are associated with one another, and identifying one feature alone while neglecting the other features associated with it has certain negative effects. For example, there is an association between the types of objects included in a video frame and the types of sound-generating objects corresponding to the various sounds in the audio included in the video; if only one of these features is identified on its own, the two features cannot be combined to perform corresponding optimization processing on the video.
In order to avoid such negative effects, in some application scenarios the convolutional neural network and the recurrent neural network need to be used simultaneously to identify the associated features, and how to train the convolutional neural network and the recurrent neural network in such a scenario has become a problem to be solved.
In view of this, an embodiment of the present application provides a model training method. In this method, a training video sample may be obtained, where the training video sample includes a training video and at least two real labels that correspond to the training video and have an association relationship. The training video is then input into an initial convolutional neural network and an initial recurrent neural network respectively, to obtain at least two training labels, output by the two networks, that have an association relationship. The initial convolutional neural network and the initial recurrent neural network are then trained based on the difference between the at least two training labels and the at least two real labels, to obtain a trained target convolutional neural network and a trained target recurrent neural network. In this embodiment of the application it is considered that, on the one hand, the trained target convolutional neural network and target recurrent neural network need to be able to accurately identify the features each needs to identify from the video, and on the other hand, the association between the features they identify from the video needs to be accurate. Therefore, on the one hand, the training video sample includes at least two real labels, the initial convolutional neural network and the initial recurrent neural network output at least two training labels during training, and the two networks are trained based on the differences between the at least two training labels and the at least two real labels, which ensures that the trained networks can accurately identify the features they need to identify from the video. On the other hand, the initial convolutional neural network and the initial recurrent neural network are also trained based on the difference between the at least two training labels, so that the association between the features the trained networks respectively identify from the video can be highly accurate.
The model training method provided by the embodiments of the present application may be applied to a computer device; for example, the computer device may be a server or a terminal. The embodiments of the present application do not limit the specific type of the execution subject of the model training method.
Referring to fig. 1, an internal structure diagram of a computer device provided in an embodiment of the present application is shown, and as shown in fig. 1, the computer device may include a processor and a memory connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The computer program is executed by a processor to implement a model training method provided by the embodiment of the application.
Those skilled in the art will appreciate that the architecture shown in fig. 1 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
Referring to fig. 2, a flowchart of a model training method provided by an embodiment of the present application is shown, and the model training method may be applied to the computer device described above. As shown in fig. 2, the model training method may include the steps of:
step 201, a computer device obtains a training video sample.
The training video sample comprises a training video and at least two real labels that correspond to the training video and have an association relationship, and the at least two real labels may respectively correspond to at least two real features in the training video that are associated with each other.
In an alternative embodiment of the present application, at least two authentic labels in the training video sample may be manually labeled by a technician.
Step 202, the computer device inputs the training video into the initial convolutional neural network and the initial recurrent neural network respectively, to obtain at least two training labels, output by the initial convolutional neural network and the initial recurrent neural network, that have an association relationship.
The at least two training labels may correspond to at least two recognized features, having an association relationship, that are recognized from the training video by the initial convolutional neural network and the initial recurrent neural network. It is noted that, in an alternative embodiment of the present application, the at least two training labels and the at least two real labels have a one-to-one correspondence.
Step 203, the computer device trains the initial convolutional neural network and the initial recurrent neural network based on the difference between the at least two training labels and the at least two real labels, to obtain a trained target convolutional neural network and a trained target recurrent neural network.
Since the initial convolutional neural network and the initial recurrent neural network are both untrained neural networks, the at least two training labels they output are likely to differ from the at least two real labels included in the training video sample.
Therefore, in step 203, the initial convolutional neural network and the initial recurrent neural network may be trained using the difference between the at least two training labels and the at least two real labels, so that after training, the labels output by the resulting target convolutional neural network and target recurrent neural network are close to the real labels. This ensures that the trained target convolutional neural network and target recurrent neural network can accurately identify the features they need to identify from the video.
In addition, considering that the trained target convolutional neural network and target recurrent neural network need to identify features in the video that have an association relationship, in other words, that the labels output by the trained target convolutional neural network and target recurrent neural network need to have an accurate association relationship, the initial convolutional neural network and the initial recurrent neural network can also be trained jointly during training based on the difference between the at least two training labels. This makes the association between the features that the trained target convolutional neural network and target recurrent neural network identify from the video highly accurate, so this training mode can improve the accuracy of the outputs of the target convolutional neural network and the target recurrent neural network.
It is noted that, in an alternative embodiment of the present application, the at least two real labels may include a first real label and a second real label, where the first real label is used to indicate the real type of an object included in a training video frame of the training video, and the second real label is used to indicate the real type of the sound-generating object corresponding to a sound in the training audio of the training video.
Typically, the training audio may include various sounds; for example, it may include the sound of a person speaking, the sound of a bird calling, the sound of a dog barking, the sound of a telephone ringing, and so on. In general, the sound-generating objects corresponding to the various sounds in the training audio differ from one another; the sound-generating object of a sound is the object that emits that sound. For example, the sound-generating object of the sound of a person speaking is a human, the sound-generating object of a bird call is a bird, the sound-generating object of a dog bark is a dog, and the sound-generating object of a telephone ring is a telephone. In practical applications, the sound-generating object of a sound included in the training audio is very likely to appear in the training video frame. Therefore, the type of the object included in the training video frame and the type of the sound-generating object corresponding to the sound in the training audio are two features of the training video that are associated with each other, and the first real label and the second real label, which respectively indicate the true values of these two features, are also two labels associated with each other.
Taking a training video in a training video sample as an example of a video in which a person speaks, the training video sample may include a first real label and a second real label, where the first real label may indicate that a real type of an object included in a training video frame of the training video is a human type, and the second real label may indicate that a real type of a sound-generating object corresponding to a sound in a training audio of the training video is a human type.
In an alternative embodiment of the present application, the at least two training labels may include a first training label and a second training label, where the first training label corresponds to the first real label and is used to indicate the type of the object included in the training video frame as recognized by the initial convolutional neural network, and the second training label corresponds to the second real label and is used to indicate the type of the sound-generating object corresponding to the sound in the training audio as recognized by the initial recurrent neural network.
Based on similar reasoning, since the type of the object included in the training video frame and the type of the sound-generating object corresponding to the sound in the training audio are two features of the training video that are associated with each other, the first training label and the second training label, which respectively indicate the recognition results for these two features, are also two labels associated with each other.
Taking the training video in the training video sample as a video of a person speaking as an example, in step 202 the training video may be input into the initial convolutional neural network and the initial recurrent neural network respectively, to obtain the first training label and the second training label output by the initial convolutional neural network and the initial recurrent neural network. Because the initial convolutional neural network and the initial recurrent neural network are untrained, the output first training label and second training label may differ from the first real label and the second real label.
For example, the first training label may indicate that the type of the object included in the training video frame, as recognized by the initial convolutional neural network, is a simian type, and the second training label may indicate that the type of the sound-generating object corresponding to the sound in the training audio, as recognized by the initial recurrent neural network, is a bear type.
Referring to fig. 3, in an alternative embodiment of the present application, the computer device may input the training video into the initial convolutional neural network and the initial recurrent neural network respectively according to the following exemplary technical process. As shown in fig. 3, the technical process may include the following steps:
step 301, the computer device extracts a training video frame and a training audio from the training video respectively.
Step 302, the computer device inputs the training video frame to the initial convolutional neural network to obtain a first training label output by the initial convolutional neural network.
Since convolutional neural networks are typically used to identify features of some content in a video that does not have serialization characteristics (e.g., video frames), in embodiments of the present application, a target convolutional neural network may be trained to identify features of video frames.
In order to achieve the purpose of training the target convolutional neural network, in the training process, a training video frame may be input to the initial convolutional neural network to obtain a first training label output by the initial convolutional neural network, and in the subsequent steps, the initial convolutional neural network is trained by using the first training label, so as to obtain the target convolutional neural network.
Step 303, inputting the training audio to the initial recurrent neural network by the computer device to obtain a second training label output by the initial recurrent neural network.
Since the recurrent neural network is typically used to identify features of some content (e.g., audio) in the video that has a serialization characteristic, in embodiments of the present application, the target recurrent neural network may be trained to identify features of the audio.
In order to achieve the purpose of training the target recurrent neural network, in the training process, the training audio may be input to the initial recurrent neural network to obtain a second training label output by the initial recurrent neural network, and in the subsequent steps, the initial recurrent neural network is trained by using the second training label, so as to obtain the target recurrent neural network.
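To make the data flow of steps 301 to 303 concrete, the following is a minimal PyTorch sketch of the forward pass, assuming a toy convolutional network for the training video frame and a GRU-based recurrent network for the training audio features; the network architectures, tensor shapes and class counts are illustrative placeholders rather than the architectures required by the embodiment.

```python
# Minimal PyTorch sketch of the forward pass in steps 301-303 (an illustration,
# not the patent's exact architecture): a small CNN classifies the extracted
# training video frame, and a GRU-based recurrent network classifies the
# extracted training audio features.
import torch
import torch.nn as nn

NUM_OBJECT_TYPES = 10   # hypothetical number of object types (e.g. human, bird, ...)
NUM_SOUND_TYPES = 10    # hypothetical number of sound-generating-object types

class InitialConvNet(nn.Module):
    """Stand-in for the initial convolutional neural network (video frames)."""
    def __init__(self, num_classes: int = NUM_OBJECT_TYPES):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(16, num_classes)

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        # frame: (batch, 3, H, W) -> logits over object types
        return self.classifier(self.features(frame).flatten(1))

class InitialRecurrentNet(nn.Module):
    """Stand-in for the initial recurrent neural network (audio features)."""
    def __init__(self, feat_dim: int = 40, num_classes: int = NUM_SOUND_TYPES):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, 64, batch_first=True)
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, time, feat_dim) -> logits over sound-object types
        _, hidden = self.rnn(audio_feats)
        return self.classifier(hidden[-1])

# Placeholder tensors standing in for the frame / audio features extracted in step 301.
training_frame = torch.randn(1, 3, 224, 224)   # one extracted training video frame
training_audio = torch.randn(1, 100, 40)       # e.g. 100 steps of 40-dim audio features

cnn, rnn = InitialConvNet(), InitialRecurrentNet()
first_training_label = cnn(training_frame)   # step 302: label from the CNN
second_training_label = rnn(training_audio)  # step 303: label from the RNN
```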
Based on the technical process shown in fig. 3, the computer device may train the initial convolutional neural network and the initial recurrent neural network based on the difference between the first training label and the second training label, the difference between the first training label and the first real label, and the difference between the second training label and the second real label, so as to obtain a trained target convolutional neural network and a trained target recurrent neural network.
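The following sketch, which reuses the networks from the previous example, shows one way such a joint training step could be written. The use of cross-entropy for the differences between each training label and its real label, and of a KL-divergence consistency term for the difference between the two training labels (assuming the two type vocabularies are aligned), is an assumption made for illustration only.

```python
# A hedged sketch of one joint training step as described above, reusing the
# networks and placeholder inputs from the previous sketch. The specific loss
# functions are assumptions; the embodiment does not fix them.
import torch
import torch.nn.functional as F

first_real_label = torch.tensor([1])   # hypothetical ground-truth object type index
second_real_label = torch.tensor([1])  # hypothetical ground-truth sound-object type index

optimizer = torch.optim.Adam(list(cnn.parameters()) + list(rnn.parameters()), lr=1e-4)

def joint_training_step() -> float:
    optimizer.zero_grad()
    cnn_logits = cnn(training_frame)
    rnn_logits = rnn(training_audio)

    # Differences between each training label and its corresponding real label.
    loss_cnn = F.cross_entropy(cnn_logits, first_real_label)
    loss_rnn = F.cross_entropy(rnn_logits, second_real_label)

    # Difference between the two training labels themselves: here the two type
    # vocabularies are assumed to be aligned, so the two predicted distributions
    # can be pushed towards agreement.
    consistency = F.kl_div(F.log_softmax(cnn_logits, dim=-1),
                           F.softmax(rnn_logits, dim=-1),
                           reduction="batchmean")

    loss = loss_cnn + loss_rnn + consistency
    loss.backward()
    optimizer.step()
    return loss.item()

joint_training_step()
```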
After the target convolutional neural network and the target recurrent neural network are obtained through training, they can be used to identify the mutually associated features in a video, so that the video can be optimized based on the recognition result.
For example, the target convolutional neural network may be used to identify the type of each object included in a video frame, and the target recurrent neural network may be used to identify the type of the sound-generating object corresponding to each sound in the audio included in the video, so that the video can be optimized based on the recognition results.
Referring to fig. 4, an exemplary technical process of identifying the associated features in a video by using the target convolutional neural network and the target recurrent neural network, and optimizing the video based on the recognition result, is briefly described below.
Step 401, the computer device extracts video frames and audio from the target video to be processed respectively.
In practical applications, the computer device may display the target video based on a target interface. For example, the target interface may be a video playing interface, in which the computer device plays the target video; as another example, the target interface may be a file management interface, in which the computer device displays an icon corresponding to the target video and based on which the computer device may delete, edit, or otherwise process the target video. Optionally, the computer device may receive a video processing instruction for the target video based on the target interface; after receiving the video processing instruction, the computer device may perform the technical process of step 401 and then perform the subsequent technical processes described below.
In one possible implementation, a video processing option may be provided in the target interface, and when a trigger operation on the video processing option is detected, the computer device receives the video processing instruction for the target video. In another possible implementation, when the computer device detects a preset type of touch operation in the target interface, it receives the video processing instruction for the target video, where the touch operation may be a double-click operation, a single-click operation, or a sliding operation.
In an alternative embodiment of the present application, the computer device may extract all of the audio included in the target video and divide the extracted audio into a plurality of audio segments. Then, for each audio segment, the computer device may extract, from the target video, one or more video frames that fall within the sampling period of that audio segment. In a subsequent step, the computer device may process each audio segment based on its corresponding one or more video frames.
For example, the audio extracted by the computer device from the target video may have a duration of 10 minutes and may be divided into 10 audio segments of 1 minute each. For each audio segment, the computer device may extract one or more video frames from the target video that fall within the 1-minute sampling period of that segment, and in subsequent steps the computer device may process each audio segment based on its corresponding video frames.
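A small sketch of this segmentation logic is given below; the segment length, frame rate and number of frames sampled per segment are illustrative values, not values fixed by the embodiment.

```python
# Sketch of splitting the extracted audio into fixed-length segments and, for
# each segment, collecting the indices of the video frames that fall within its
# sampling period.
from typing import List, Tuple

def split_into_segments(audio_duration_s: float,
                        segment_len_s: float,
                        video_fps: float,
                        frames_per_segment: int = 1) -> List[Tuple[Tuple[float, float], List[int]]]:
    """Return (segment time range, sampled video-frame indices) for each segment."""
    segments = []
    start = 0.0
    while start < audio_duration_s:
        end = min(start + segment_len_s, audio_duration_s)
        # Sample one or more frame indices evenly inside this segment's period.
        step = (end - start) / (frames_per_segment + 1)
        frame_indices = [int((start + (k + 1) * step) * video_fps)
                         for k in range(frames_per_segment)]
        segments.append(((start, end), frame_indices))
        start = end
    return segments

# Example: a 10-minute audio track split into 1-minute segments, video at 25 fps.
for (t0, t1), frames in split_into_segments(600.0, 60.0, video_fps=25.0):
    print(f"segment {t0:.0f}-{t1:.0f}s -> video frames {frames}")
```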
Step 402, the computer device inputs the extracted video frames into the target convolutional neural network and the extracted audio into the target recurrent neural network, and performs recognition on the extracted video frames and audio through the target convolutional neural network and the target recurrent neural network respectively, to obtain a recognition result.
The recognition result comprises a first type of each object included in the extracted video frame and a second type of the sound-generating object corresponding to each sound included in the extracted audio.
For example, the target video may be a documentary of a reporter visiting a primeval forest. The computer device may input the video frames extracted from the target video into the target convolutional neural network and, at the same time, input the audio extracted from the target video into the target recurrent neural network. After the video frames and the audio of the target video are respectively recognized by the target convolutional neural network and the target recurrent neural network, the labels output by the two networks can be obtained, and a recognition result can be obtained based on these labels. The recognition result may be: the video frame includes objects of the bird type, the human type, and the wolf type, and the sound-generating objects corresponding to the various sounds in the audio are of the bird type, the human type, and the wolf type.
Step 403, the computer device locates, in the video frame and according to the recognition result, the sound-generating objects corresponding to the various sounds included in the audio.
In an alternative embodiment of the present application, for each sound in the audio, the computer device may determine a candidate object from the objects comprised in the video frame based on the second type corresponding to the sound, wherein the first type of the candidate object matches the second type corresponding to the sound.
For example, suppose the audio includes sound A, sound B, and sound C, and the video frame includes object a, object b, and object c. After the video frame and the audio are recognized, the obtained recognition result may be: object a is of the bird type, object b is of the human type, object c is of the wolf type, the sound-generating object corresponding to sound A is of the bird type, the sound-generating object corresponding to sound B is of the human type, and the sound-generating object corresponding to sound C is of the wolf type. In step 403, the computer device may take object a as the candidate object for sound A, object b as the candidate object for sound B, and object c as the candidate object for sound C.
For each sound in the audio, if the number of candidate objects corresponding to the sound is 1, the computer device may directly take that candidate object as the sound-generating object corresponding to the sound; if the number of candidate objects corresponding to the sound is greater than 1, the computer device may screen the candidate objects for the sound-generating object corresponding to the sound, where the probability that the screened-out object emits the sound is greater than the probability that the other candidate objects emit it.
For example, suppose the audio includes sound A, sound B, and sound C, and the video frame includes object a, object b, object c, and object d. After the video frame and the audio are recognized, the obtained recognition result may be: object a is of the bird type, object b is of the human type, object c is of the wolf type, object d is of the human type, the sound-generating object corresponding to sound A is of the bird type, the sound-generating object corresponding to sound B is of the human type, and the sound-generating object corresponding to sound C is of the wolf type. Thus, for sound A the computer device may take object a as the candidate object, for sound B it may take both object b and object d as candidate objects, and for sound C it may take object c as the candidate object.
Since the number of candidate objects corresponding to sound A is 1, that is, the sound-generating object corresponding to sound A is of the bird type and only one object in the video frame is of the bird type, the computer device may consider that the only bird-type object in the video frame emits sound A (the bird call), and may directly take that bird-type object (candidate object a) as the sound-generating object corresponding to sound A. Similarly, the computer device may take candidate object c as the sound-generating object corresponding to sound C.
For sound B, however, the number of candidate objects is greater than 1: both object b and object d are of the human type. In this case, the computer device needs to screen the two human-type objects to determine which of them is more likely to have emitted sound B, and the computer device may take the screened-out candidate object as the sound-generating object corresponding to sound B.
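The candidate-matching and screening logic of step 403 can be sketched as follows, using the sound A/B/C and object a/b/c/d example above; the emission_score callback is a hypothetical stand-in for whichever screening method is applied.

```python
# Sketch of the candidate matching in step 403. `emission_score` stands in for
# a screening method (e.g. mouth-state detection or matching the voice's
# gender/age information); its implementation is not specified here.
from typing import Callable, Dict, Optional

def locate_sound_objects(object_types: Dict[str, str],
                         sound_types: Dict[str, str],
                         emission_score: Callable[[str, str], float]) -> Dict[str, Optional[str]]:
    """Map each sound to the object in the frame judged to have emitted it."""
    result = {}
    for sound, sound_type in sound_types.items():
        # Candidate objects: those whose recognized type matches the sound's type.
        candidates = [obj for obj, obj_type in object_types.items() if obj_type == sound_type]
        if len(candidates) == 1:
            result[sound] = candidates[0]          # unambiguous: take it directly
        elif len(candidates) > 1:
            # Screen: keep the candidate most likely to have emitted this sound.
            result[sound] = max(candidates, key=lambda obj: emission_score(sound, obj))
        else:
            result[sound] = None                   # no matching object in the frame
    return result

objects = {"a": "bird", "b": "human", "c": "wolf", "d": "human"}
sounds = {"A": "bird", "B": "human", "C": "wolf"}
# Hypothetical scores: object b's mouth is open while object d's is closed.
scores = {("B", "b"): 0.9, ("B", "d"): 0.1}
print(locate_sound_objects(objects, sounds, lambda s, o: scores.get((s, o), 0.0)))
# -> {'A': 'a', 'B': 'b', 'C': 'c'}
```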
In the following, two exemplary methods for screening out a sound-generating object corresponding to a sound from candidate objects are provided in the embodiments of the present application.
In the first method, the computer device determines whether the target part of each candidate object that is associated with sound production is in a sounding state, and screens out the candidate object whose target part is in the sounding state as the sound-generating object corresponding to the sound.
It is easy to understand that if the target part of a candidate object that is associated with sound production is in a sounding state, the candidate object is more likely to have emitted the sound; in this case, the computer device may screen out that candidate object as the sound-generating object corresponding to the sound.
For example, if the sound-generating object corresponding to the sound is of the human type, the candidate objects are also of the human type, and the target part may be the mouth. The computer device may determine whether the mouth of each candidate object is in an open state; if the mouth is in an open state, it is in a sounding state; if the mouth is not in an open state, it is not in a sounding state.
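As one possible realization of this check, the sketch below judges the mouth state from four hypothetical lip landmarks; the landmark layout and the opening-ratio threshold are assumptions for illustration, since the embodiment only requires some way of deciding whether the target part is in a sounding state.

```python
# Hedged sketch of the first screening method: deciding whether a candidate's
# mouth is open (a sounding state) from its lip landmark coordinates.
from typing import Tuple

Point = Tuple[float, float]

def mouth_is_open(upper_lip: Point, lower_lip: Point,
                  left_corner: Point, right_corner: Point,
                  threshold: float = 0.35) -> bool:
    """Return True if the mouth opening ratio (height / width) exceeds the threshold."""
    height = abs(lower_lip[1] - upper_lip[1])
    width = abs(right_corner[0] - left_corner[0]) or 1e-6  # avoid division by zero
    return (height / width) > threshold

# Candidate b: lips clearly apart -> treated as being in a sounding state.
print(mouth_is_open((50, 40), (50, 58), (38, 50), (62, 50)))   # True
# Candidate d: lips nearly closed -> not in a sounding state.
print(mouth_is_open((50, 49), (50, 51), (38, 50), (62, 50)))   # False
```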
In the second method, if the sound-generating object corresponding to the sound is of the human type and the candidate objects are also of the human type, the computer device may determine type information of the human who emits the sound according to the characteristics of the sound; for example, the type information may be gender information, age information, and the like. The computer device can then screen out the sound-generating object corresponding to the sound from the candidate objects according to this type information.
Step 404, the computer device performs preset-type processing on the various sounds included in the audio, respectively, based on the positions of the corresponding sound-generating objects in the video frame, to obtain processed audio.
Optionally, for each sound included in the audio, the computer device may perform the preset-type processing on the sound according to the position of the sound-generating object corresponding to that sound in the video frame. Optionally, the preset-type processing may be stereo processing.
In practical applications, stereo processing can affect the direction from which a user perceives a sound. For example, after a certain sound is stereo processed, the user may perceive it as coming from the user's right side; after another sound is stereo processed, the user may perceive it as coming from the user's left side.
In the embodiment of the application, the computer device can perform stereo processing on a sound based on the position of its corresponding sound-generating object in the video frame, so that the direction from which the user perceives the sound is consistent with the position of the sound-generating object in the video frame. This gives the user a sensory experience in which vision and hearing are consistent, which enhances the sense of presence of the target video and improves its display effect.
For example, if the position of the bird in the video frame is on the right side of the frame, then in step 404 the computer device may perform stereo processing on the bird call based on the bird's position in the video frame, so that the user perceives the sound as coming from the user's right side. In this way, the perceived direction of the bird call is consistent with the position of the bird in the video frame, giving the user a sensory experience in which vision and hearing are consistent and enhancing the sense of presence of the target video.
Optionally, for each audio segment, the computer device may perform stereo processing on the various sounds included in the audio segment according to the positions of the sound-generating objects in the one or more video frames corresponding to that audio segment.
Step 405, the computer device generates a processed target video based on the processed audio.
After obtaining the processed target video, the computer device may play the processed target video.
It should be noted that, in one possible implementation, the trained target convolutional neural network and target recurrent neural network may be stored in a server; in this case the terminal may upload the target video to the server so that the server performs the technical processes of steps 401 to 405. In another possible implementation, the trained target convolutional neural network and target recurrent neural network may be stored in the terminal; in this case the terminal may acquire the target video and perform the technical processes of steps 401 to 405 itself.
Referring to fig. 5, on the basis of the above embodiments, an embodiment of the present application provides a method for performing stereo processing on the various sounds included in the audio, respectively, based on the positions of the corresponding sound-generating objects in the video frame. The method may include the following steps:
Step 501, for each sound included in the audio, the computer device determines the auditory bias position, relative to the user's right-ear auditory range and left-ear auditory range, of the sound-generating object corresponding to the sound in the video frame.
The auditory bias position may be either close to the left-ear auditory range and far from the right-ear auditory range, or close to the right-ear auditory range and far from the left-ear auditory range. Generally, if the sound-generating object corresponding to the sound is on the left side of the video frame, the auditory bias position is close to the left-ear auditory range and far from the right-ear auditory range; if the sound-generating object is on the right side of the video frame, the auditory bias position is close to the right-ear auditory range and far from the left-ear auditory range.
Step 502, for each sound included in the audio, the computer device processes at least one of the sound intensity and the playback delay of a target channel of the sound according to the auditory bias position of the sound-generating object of that sound, where the target channel is at least one of the left channel and the right channel.
In general, the left channel may simulate sound in the human left ear hearing range and the right channel may simulate sound in the human right ear hearing range, and thus, the left and right channels can influence the direction from which sound is audibly perceived by the user.
For example, if the sound intensity of the left channel is higher and that of the right channel is lower, the sound simulated in the user's left-ear auditory range is louder than that simulated in the user's right-ear auditory range, and the user perceives the sound as coming from the left. Conversely, if the sound intensity of the left channel is lower and that of the right channel is higher, the user perceives the sound as coming from the right. Similarly, if the playback delay of the left channel is longer than that of the right channel, the sound simulated in the user's left-ear auditory range reaches the ear later than the sound simulated in the right-ear auditory range, and the user perceives the sound as coming from the right; if the playback delay of the left channel is shorter than that of the right channel, the user perceives the sound as coming from the left.
Accordingly, stereo processing of a sound can be achieved by processing at least one of the sound intensity and the playback delay of the target channel of the sound.
Optionally, when the auditory bias position is close to the left-ear auditory range and far from the right-ear auditory range, the computer device may enhance the sound intensity of the left channel of the sound and/or attenuate the sound intensity of the right channel; when the auditory bias position is close to the right-ear auditory range and far from the left-ear auditory range, the computer device may enhance the sound intensity of the right channel and/or attenuate the sound intensity of the left channel.
Furthermore, when the auditory bias position is close to the left-ear auditory range and far from the right-ear auditory range, the computer device may advance the playback of the left channel of the sound and/or delay the playback of the right channel; when the auditory bias position is close to the right-ear auditory range and far from the left-ear auditory range, the computer device may advance the playback of the right channel and/or delay the playback of the left channel.
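Putting steps 501 and 502 together, the following sketch derives the auditory bias position from the horizontal position of the sound-generating object in the video frame and adjusts the intensity and playback delay of the left and right channels accordingly; the equal-power pan law, the gain amounts and the maximum inter-channel delay are illustrative assumptions, as is the use of NumPy arrays for the channel samples.

```python
# Sketch of steps 501-502: bias the left/right channels of a sound towards the
# side of the frame where its sound-generating object was located.
import numpy as np

def stereo_process(mono: np.ndarray, sample_rate: int,
                   object_x: float, frame_width: float,
                   max_delay_s: float = 0.0006) -> np.ndarray:
    """Return a (num_samples, 2) stereo signal biased towards the object's side."""
    # pan in [-1, 1]: -1 = far left of the frame, +1 = far right.
    pan = 2.0 * (object_x / frame_width) - 1.0

    # Intensity: enhance the channel on the object's side, attenuate the other
    # (a simple equal-power pan law as one possible realization).
    left_gain = np.cos((pan + 1.0) * np.pi / 4.0)
    right_gain = np.sin((pan + 1.0) * np.pi / 4.0)

    # Playback delay: delay the channel on the far side so its sound arrives later.
    delay_samples = int(abs(pan) * max_delay_s * sample_rate)
    left = np.pad(mono * left_gain, (delay_samples if pan > 0 else 0, 0))
    right = np.pad(mono * right_gain, (delay_samples if pan < 0 else 0, 0))

    # Equalize lengths and interleave into a stereo signal.
    n = max(len(left), len(right))
    left = np.pad(left, (0, n - len(left)))
    right = np.pad(right, (0, n - len(right)))
    return np.stack([left, right], axis=1)

# Example: a bird call whose sound-generating object sits on the right side of the frame.
sr = 16000
bird_call = np.sin(2 * np.pi * 880 * np.arange(sr) / sr)  # 1 s placeholder tone
stereo = stereo_process(bird_call, sr, object_x=1600, frame_width=1920)
print(stereo.shape)  # (samples, 2); the right channel is louder, the left channel is delayed
```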
Referring to fig. 6, a block diagram of a model training apparatus 600 according to an embodiment of the present application is shown, where the model training apparatus 600 may be configured in a computer device. As shown in fig. 6, the model training apparatus 600 may include: a first acquisition module 601, a second acquisition module 602, and a training module 603.
The first acquisition module 601 is configured to acquire a training video sample, where the training video sample includes a training video and at least two real labels that correspond to the training video and have an association relationship;
the second acquisition module 602 is configured to input the training video into an initial convolutional neural network and an initial recurrent neural network respectively, to obtain at least two training labels, output by the initial convolutional neural network and the initial recurrent neural network, that have an association relationship;
the training module 603 is configured to train the initial convolutional neural network and the initial recurrent neural network based on the difference between the at least two training labels and the at least two real labels, to obtain a trained target convolutional neural network and a trained target recurrent neural network.
In an alternative embodiment of the present application, the at least two real labels include a first real label and a second real label, the first real label being used to indicate the real type of an object included in a training video frame of the training video, and the second real label being used to indicate the real type of the sound-generating object corresponding to a sound in the training audio of the training video;
the at least two training labels include a first training label and a second training label, the first training label being used to indicate the type of the object included in the training video frame as recognized by the initial convolutional neural network, and the second training label being used to indicate the type of the sound-generating object corresponding to the sound in the training audio as recognized by the initial recurrent neural network.
In an optional embodiment of the present application, the second acquisition module 602 is specifically configured to: extract the training video frame and the training audio from the training video respectively; input the training video frame into the initial convolutional neural network to obtain the first training label output by the initial convolutional neural network; and input the training audio into the initial recurrent neural network to obtain the second training label output by the initial recurrent neural network.
In an optional embodiment of the present application, the training module 603 is specifically configured to: train the initial convolutional neural network and the initial recurrent neural network based on the difference between the first training label and the second training label, the difference between the first training label and the first real label, and the difference between the second training label and the second real label.
Referring to fig. 7, in an alternative embodiment of the present application, another model training apparatus 700 is provided, where the model training apparatus 700 includes, in addition to the modules included in the model training apparatus 600, an extraction module 604, a third obtaining module 605, a positioning module 606, a processing module 607, and a generation module 608.
The extracting module 604 is configured to extract video frames and audio from the target video to be processed, respectively.
The third obtaining module 605 is configured to input the video frame into the target convolutional neural network and the audio into the target recurrent neural network, and to perform recognition on the video frame and the audio through the target convolutional neural network and the target recurrent neural network respectively, to obtain a recognition result, where the recognition result includes a first type of each object included in the video frame and a second type of the sound-generating object corresponding to each sound included in the audio.
The positioning module 606 is configured to locate, in the video frame and according to the recognition result, the sound-generating objects corresponding to the various sounds included in the audio.
The processing module 607 is configured to perform the preset-type processing on the various sounds included in the audio, respectively, based on the position of each sound-generating object in the video frame, to obtain processed audio.
The generating module 608 is configured to generate a processed target video based on the processed audio.
In an alternative embodiment of the present application, the positioning module 606 is specifically configured to: for each sound in the audio, determine a candidate object from the objects included in the video frame based on the second type corresponding to the sound, where the first type of the candidate object matches the second type corresponding to the sound; if the number of candidate objects is 1, take the candidate object as the sound-generating object corresponding to the sound; and if the number of candidate objects is greater than 1, screen out the sound-generating object corresponding to the sound from the candidate objects, where the probability that the screened-out object emits the sound is greater than the probability that the other candidate objects emit it.
In an alternative embodiment of the present application, the positioning module 606 is specifically configured to: determine whether the target part of each candidate object that is associated with sound production is in a sounding state; and screen out the candidate object whose target part is in the sounding state as the sound-generating object corresponding to the sound.
In an optional embodiment of the application, the second type corresponding to the sound is a human type, the first type of the candidate object is a human type, the target part is the mouth, and the positioning module 606 is specifically configured to: determine whether the mouth of each candidate object is in an open state; if the mouth is in an open state, determine that it is in a sounding state; and if the mouth is not in an open state, determine that it is not in a sounding state.
In an optional embodiment of the application, the second type corresponding to the sound is a human type, the first type of the candidate object is a human type, and the positioning module 606 is specifically configured to: determine type information of the human who emits the sound according to the characteristics of the sound; and screen out the sound-generating object corresponding to the sound from the candidate objects according to the type information.
In an alternative embodiment of the present application, the preset type of processing includes stereo processing, and the processing module 607 is specifically configured to: for each sound included in the audio, determining an auditory deviation position of a sound production object corresponding to the sound in the video frame relative to a right ear auditory range and a left ear auditory range of a user, and processing at least one of sound intensity and playing time delay of a target channel of the sound according to the auditory deviation position, wherein the target channel is at least one of a left channel and a right channel.
In an optional embodiment of the present application, the processing module 607 is specifically configured to: when the hearing deviation position is close to the hearing range of the left ear and far away from the hearing range of the right ear, enhancing the sound intensity of the left channel of the sound, and/or weakening the sound intensity of the right channel of the sound; when the hearing deviation position is close to the right ear hearing range and far away from the left ear hearing range, the sound intensity of the right channel of the sound is enhanced, and/or the sound intensity of the left channel of the sound is weakened.
In an optional embodiment of the present application, the processing module 607 is specifically configured to: when the hearing deviation position is close to the hearing range of the left ear and far away from the hearing range of the right ear, the left sound channel of the sound is subjected to early play processing, and/or the right sound channel of the sound is subjected to delayed play processing; and when the hearing deviation position is close to the hearing range of the right ear and far away from the hearing range of the left ear, performing advanced playing processing on the right channel of the sound, and/or performing delayed playing processing on the left channel of the sound.
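By way of illustration and not limitation, the sketch below applies both the sound-intensity and the playing-delay adjustments described in the preceding paragraphs to a stereo pair, given the normalized horizontal position of the sounding object in the video frame; the linear gain and delay mappings and their constants are assumed for illustration only.

```python
import numpy as np

# Sketch of the stereo adjustment described above; the linear gain/delay
# mapping and its constants are assumed for illustration only.

def apply_stereo_bias(left, right, x_norm, sample_rate,
                      max_gain=0.5, max_delay_ms=0.6):
    """Bias a stereo pair toward the sounding object's horizontal position.

    x_norm: position of the sounding object in the frame,
            -1.0 (far left) .. +1.0 (far right).
    """
    # Intensity: boost the near-side channel and attenuate the far side.
    left_gain = 1.0 + max_gain * max(-x_norm, 0.0) - max_gain * max(x_norm, 0.0)
    right_gain = 1.0 + max_gain * max(x_norm, 0.0) - max_gain * max(-x_norm, 0.0)
    left, right = left * left_gain, right * right_gain

    # Delay: the far-side channel starts slightly later than the near side.
    delay = int(abs(x_norm) * max_delay_ms / 1000.0 * sample_rate)
    if delay:
        if x_norm < 0:   # object toward the left ear: delay the right channel
            right = np.concatenate([np.zeros(delay), right[:-delay]])
        else:            # object toward the right ear: delay the left channel
            left = np.concatenate([np.zeros(delay), left[:-delay]])
    return left, right
```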
The model training device provided by the embodiments of the present application can implement the method embodiments described above; the implementation principle and technical effect are similar and are not repeated here.
For specific limitations of the model training device, reference may be made to the limitations of the model training method described above, which are not repeated here. The modules in the model training device can be implemented wholly or partially by software, hardware, or a combination thereof. In hardware form, the modules can be embedded in, or independent of, a processor in the terminal; in software form, they can be stored in a memory in the terminal so that the processor can call and execute the operations corresponding to each module.
In one embodiment of the present application, there is provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the following steps when executing the computer program:
acquiring a training video sample, wherein the training video sample comprises a training video and at least two real labels which correspond to the training video and have an incidence relation; inputting the training video into an initial convolutional neural network and an initial cyclic neural network respectively to obtain at least two training labels which are output by the initial convolutional neural network and the initial cyclic neural network and have an incidence relation; and training the initial convolutional neural network and the initial cyclic neural network based on the difference between the at least two training labels and the at least two real labels to obtain a trained target convolutional neural network and a trained target cyclic neural network.
In one embodiment of the present application, the at least two real labels include a first real label and a second real label, the first real label is used for indicating a real type of an object included in a training video frame of the training video, and the second real label is used for indicating a real type of a sound-producing object corresponding to a sound in a training audio of the training video; the at least two training labels include a first training label and a second training label, the first training label is used for indicating the type of the object included in the training video frame as recognized by the initial convolutional neural network, and the second training label is used for indicating the type of the sound-producing object corresponding to the sound in the training audio as recognized by the initial cyclic neural network.
In one embodiment of the application, the processor when executing the computer program further performs the steps of: extracting the training video frame and the training audio from the training video respectively; inputting the training video frame to the initial convolutional neural network to obtain the first training label output by the initial convolutional neural network; and inputting the training audio to the initial recurrent neural network to obtain the second training label output by the initial recurrent neural network.
In one embodiment of the application, the processor when executing the computer program further performs the steps of: training the initial convolutional neural network and the initial cyclic neural network based on a difference between the first training label and the second training label, a difference between the first training label and the first real label, and a difference between the second training label and the second real label.
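By way of illustration and not limitation, the following PyTorch sketch shows one way to realize such joint training; the network architectures, the mel-feature audio input, and the use of cross-entropy plus a KL-divergence consistency term for the three differences are assumptions of this sketch, not requirements of this application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 10                                  # illustrative label space

cnn = nn.Sequential(                              # initial convolutional neural network
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, NUM_CLASSES))
rnn_body = nn.GRU(input_size=40, hidden_size=64, batch_first=True)  # initial cyclic neural network
rnn_head = nn.Linear(64, NUM_CLASSES)

params = list(cnn.parameters()) + list(rnn_body.parameters()) + list(rnn_head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
ce = nn.CrossEntropyLoss()

def train_step(frames, audio_feats, first_real, second_real):
    # frames: (B, 3, H, W) training video frames; audio_feats: (B, T, 40) audio features.
    first_logits = cnn(frames)                    # first training label
    seq_out, _ = rnn_body(audio_feats)
    second_logits = rnn_head(seq_out[:, -1])      # second training label
    # Difference between the two training labels (their association),
    # plus each training label's difference from its real label.
    consistency = F.kl_div(first_logits.log_softmax(-1),
                           second_logits.softmax(-1), reduction="batchmean")
    loss = ce(first_logits, first_real) + ce(second_logits, second_real) + consistency
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```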
In one embodiment of the application, the processor when executing the computer program further performs the steps of: respectively extracting a video frame and an audio from a target video to be processed; inputting the video frame into the target convolutional neural network, inputting the audio into the target cyclic neural network, and respectively identifying and processing the video frame and the audio through the target convolutional neural network and the target cyclic neural network to obtain an identification result, wherein the identification result comprises a first type of each object included in the video frame and a second type of a sound-producing object corresponding to each sound included in the audio; positioning, in the video frame, the sound-producing objects respectively corresponding to the various sounds included in the audio according to the identification result; performing preset-type processing on the various sounds included in the audio respectively based on the position of each sound-producing object in the video frame to obtain processed audio; and generating a processed target video based on the processed audio.
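By way of illustration and not limitation, the frame and audio extraction step can be performed with the ffmpeg command-line tool, as sketched below; ffmpeg is assumed to be installed, and the file paths are illustrative.

```python
import os
import subprocess

# One way to split a target video into frames and an audio track using the
# ffmpeg command-line tool (assumed to be installed); paths are illustrative.

def split_video(path, frame_dir="frames", audio_path="audio.wav"):
    os.makedirs(frame_dir, exist_ok=True)
    # Dump the video frames as numbered PNG images.
    subprocess.run(["ffmpeg", "-i", path, os.path.join(frame_dir, "%05d.png")],
                   check=True)
    # Extract the audio track as 16-bit PCM WAV (no video stream).
    subprocess.run(["ffmpeg", "-i", path, "-vn", "-acodec", "pcm_s16le", audio_path],
                   check=True)
    return frame_dir, audio_path
```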
In one embodiment of the application, the processor when executing the computer program further performs the steps of: for each sound in the audio, determining a candidate object from the objects included in the video frame based on the second type corresponding to the sound, wherein the first type of the candidate object is matched with the second type corresponding to the sound; if the number of the candidate objects is 1, taking the candidate objects as the sound-producing objects corresponding to the sound; and if the number of the candidate objects is greater than 1, screening the sound-producing objects corresponding to the sound from the candidate objects, wherein the probability of the screened sound-producing objects to produce the sound is greater than the probability of other candidate objects to produce the sound.
In one embodiment of the application, the processor when executing the computer program further performs the steps of: determining whether a target part of each candidate object, which is associated with the vocalization, is in a vocalization state; and screening the candidate object of the target part in the sounding state as a sounding object corresponding to the sound.
In an embodiment of the application, the second type corresponding to the sound is a human type, the first type of the candidate object is a human type, the target portion is a mouth, and the processor executes the computer program to further implement the following steps: determining whether the mouth of each candidate object is in an open state; if the mouth is in an open state, determining that the mouth is in a sounding state; and if the mouth part is not in the open state, determining that the mouth part is not in the sounding state.
In an embodiment of the application, the second type corresponding to the sound is a human type, the first type of the candidate object is a human type, and the processor executes the computer program to further implement the following steps: determining type information of a human who utters the sound according to the characteristics of the sound; and screening the sound-producing object corresponding to the sound from the candidate object according to the type information.
In an embodiment of the application, the preset type of processing comprises stereo processing, the processor when executing the computer program further realizing the steps of: for each sound included in the audio, determining an auditory deviation position of a sound production object corresponding to the sound in the video frame relative to a right ear auditory range and a left ear auditory range of a user, and processing at least one of sound intensity and playing time delay of a target channel of the sound according to the auditory deviation position, wherein the target channel is at least one of a left channel and a right channel.
In one embodiment of the application, the processor when executing the computer program further performs the steps of: when the hearing deviation position is close to the hearing range of the left ear and far away from the hearing range of the right ear, enhancing the sound intensity of the left channel of the sound, and/or weakening the sound intensity of the right channel of the sound; when the hearing deviation position is close to the right ear hearing range and far away from the left ear hearing range, the sound intensity of the right channel of the sound is enhanced, and/or the sound intensity of the left channel of the sound is weakened.
In one embodiment of the application, the processor when executing the computer program further performs the steps of: when the hearing deviation position is close to the hearing range of the left ear and far away from the hearing range of the right ear, the left sound channel of the sound is subjected to early play processing, and/or the right sound channel of the sound is subjected to delayed play processing; and when the hearing deviation position is close to the hearing range of the right ear and far away from the hearing range of the left ear, performing advanced playing processing on the right channel of the sound, and/or performing delayed playing processing on the left channel of the sound.
The implementation principle and technical effect of the computer device provided by the embodiment of the present application are similar to those of the method embodiment described above, and are not described herein again.
In an embodiment of the application, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of:
acquiring a training video sample, wherein the training video sample comprises a training video and at least two real labels which correspond to the training video and have an incidence relation; inputting the training video into an initial convolutional neural network and an initial cyclic neural network respectively to obtain at least two training labels which are output by the initial convolutional neural network and the initial cyclic neural network and have an incidence relation; and training the initial convolutional neural network and the initial cyclic neural network based on the difference between the at least two training labels and the at least two real labels to obtain a trained target convolutional neural network and a trained target cyclic neural network.
In one embodiment of the present application, the at least two real labels include a first real label and a second real label, the first real label is used for indicating a real type of an object included in a training video frame of the training video, and the second real label is used for indicating a real type of a sound-producing object corresponding to a sound in a training audio of the training video; the at least two training labels include a first training label and a second training label, the first training label is used for indicating the type of the object included in the training video frame as recognized by the initial convolutional neural network, and the second training label is used for indicating the type of the sound-producing object corresponding to the sound in the training audio as recognized by the initial cyclic neural network.
In one embodiment of the application, the computer program when executed by the processor further performs the steps of: extracting the training video frame and the training audio from the training video respectively; inputting the training video frame to the initial convolutional neural network to obtain the first training label output by the initial convolutional neural network; and inputting the training audio to the initial recurrent neural network to obtain the second training label output by the initial recurrent neural network.
In one embodiment of the application, the computer program when executed by the processor further performs the steps of: training the initial convolutional neural network and the initial cyclic neural network based on a difference between the first training label and the second training label, a difference between the first training label and the first real label, and a difference between the second training label and the second real label.
In one embodiment of the application, the computer program when executed by the processor further performs the steps of: respectively extracting a video frame and an audio from a target video to be processed; inputting the video frame into the target convolutional neural network, inputting the audio into the target cyclic neural network, and respectively identifying and processing the video frame and the audio through the target convolutional neural network and the target cyclic neural network to obtain an identification result, wherein the identification result comprises a first type of each object included in the video frame and a second type of a sound-producing object corresponding to each sound included in the audio; positioning, in the video frame, the sound-producing objects respectively corresponding to the various sounds included in the audio according to the identification result; performing preset-type processing on the various sounds included in the audio respectively based on the position of each sound-producing object in the video frame to obtain processed audio; and generating a processed target video based on the processed audio.
In one embodiment of the application, the computer program when executed by the processor further performs the steps of: for each sound in the audio, determining a candidate object from the objects included in the video frame based on the second type corresponding to the sound, wherein the first type of the candidate object is matched with the second type corresponding to the sound; if the number of the candidate objects is 1, taking the candidate objects as the sound-producing objects corresponding to the sound; and if the number of the candidate objects is greater than 1, screening the sound-producing objects corresponding to the sound from the candidate objects, wherein the probability of the screened sound-producing objects to produce the sound is greater than the probability of other candidate objects to produce the sound.
In one embodiment of the application, the computer program when executed by the processor further performs the steps of: determining whether a target part of each candidate object, which is associated with the vocalization, is in a vocalization state; and screening the candidate object of the target part in the sounding state as a sounding object corresponding to the sound.
In an embodiment of the application, the second type corresponding to the sound is a human type, the first type of the candidate object is a human type, the target portion is a mouth, and the computer program, when executed by the processor, further performs the steps of: determining whether the mouth of each candidate object is in an open state; if the mouth is in an open state, determining that the mouth is in a sounding state; and if the mouth part is not in the open state, determining that the mouth part is not in the sounding state.
In an embodiment of the application, the second type corresponding to the sound is a human type, the first type of the candidate object is a human type, and the computer program, when executed by the processor, further performs the steps of: determining type information of a human who utters the sound according to the characteristics of the sound; and screening the sound-producing object corresponding to the sound from the candidate object according to the type information.
In an embodiment of the application, the preset type of processing comprises stereo processing, the computer program when executed by the processor further realizing the steps of: for each sound included in the audio, determining an auditory deviation position of a sound production object corresponding to the sound in the video frame relative to a right ear auditory range and a left ear auditory range of a user, and processing at least one of sound intensity and playing time delay of a target channel of the sound according to the auditory deviation position, wherein the target channel is at least one of a left channel and a right channel.
In one embodiment of the application, the computer program when executed by the processor further performs the steps of: when the hearing deviation position is close to the hearing range of the left ear and far away from the hearing range of the right ear, enhancing the sound intensity of the left channel of the sound, and/or weakening the sound intensity of the right channel of the sound; when the hearing deviation position is close to the right ear hearing range and far away from the left ear hearing range, the sound intensity of the right channel of the sound is enhanced, and/or the sound intensity of the left channel of the sound is weakened.
In one embodiment of the application, the computer program when executed by the processor further performs the steps of: when the hearing deviation position is close to the hearing range of the left ear and far away from the hearing range of the right ear, the left sound channel of the sound is subjected to early play processing, and/or the right sound channel of the sound is subjected to delayed play processing; and when the hearing deviation position is close to the hearing range of the right ear and far away from the hearing range of the left ear, performing advanced playing processing on the right channel of the sound, and/or performing delayed playing processing on the left channel of the sound.
The implementation principle and technical effect of the computer-readable storage medium provided by this embodiment are similar to those of the above-described method embodiment, and are not described herein again.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments express only several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the scope of the claims. It should be noted that, for a person skilled in the art, several variations and improvements can be made without departing from the concept of the present application, and these all fall within the scope of protection of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (15)

1. A method of model training, the method comprising:
acquiring a training video sample, wherein the training video sample comprises a training video and at least two real labels which correspond to the training video and have an incidence relation;
inputting the training video into an initial convolutional neural network and an initial cyclic neural network respectively to obtain at least two training labels which are output by the initial convolutional neural network and the initial cyclic neural network and have an incidence relation;
and training the initial convolutional neural network and the initial cyclic neural network based on the difference between the at least two training labels and the at least two real labels to obtain a trained target convolutional neural network and a trained target cyclic neural network.
2. The method of claim 1, wherein the at least two real labels comprise a first real label indicating a real type of an object included in a training video frame of the training video and a second real label indicating a real type of a sound object corresponding to a sound in a training audio of the training video;
the at least two training labels comprise a first training label and a second training label, the first training label is used for indicating the type of the object included in the training video frame identified by the initial convolutional neural network, and the second training label is used for indicating the type of the sound-emitting object corresponding to the sound in the training audio identified by the initial cyclic neural network.
3. The method according to claim 2, wherein the inputting the training video into an initial convolutional neural network and an initial cyclic neural network respectively to obtain at least two training labels with associated relationship output by the initial convolutional neural network and the initial cyclic neural network comprises:
extracting the training video frame and the training audio from the training video respectively;
inputting the training video frame to the initial convolutional neural network to obtain the first training label output by the initial convolutional neural network;
and inputting the training audio to the initial cyclic neural network to obtain the second training label output by the initial cyclic neural network.
4. The method of claim 3, wherein training the initial convolutional neural network and the initial circular neural network based on differences between the at least two training labels and the at least two real labels comprises:
training the initial convolutional neural network and the initial cyclic neural network based on a difference between the first training label and the second training label, a difference between the first training label and the first real label, and a difference between the second training label and the second real label.
5. The method of any of claims 1 to 4, further comprising:
respectively extracting a video frame and an audio from a target video to be processed;
inputting the video frame into the target convolutional neural network, inputting the audio into the target cyclic neural network, and respectively identifying and processing the video frame and the audio through the target convolutional neural network and the target cyclic neural network to obtain an identification result, wherein the identification result comprises a first type of each object included in the video frame and a second type of a sound-producing object corresponding to each sound included in the audio;
positioning, in the video frame, sound-producing objects respectively corresponding to the various sounds included in the audio according to the identification result;
performing preset-type processing on the various sounds included in the audio respectively based on the position of each sound-producing object in the video frame to obtain processed audio;
generating a processed target video based on the processed audio.
6. The method according to claim 5, wherein the positioning, in the video frame, sound-producing objects respectively corresponding to the various sounds included in the audio according to the identification result comprises:
for each sound in the audio, determining a candidate object from the objects included in the video frame based on the second type corresponding to the sound, wherein the first type of the candidate object is matched with the second type corresponding to the sound; if the number of the candidate objects is 1, taking the candidate objects as sound production objects corresponding to the sound; and if the number of the candidate objects is larger than 1, screening the sound-producing objects corresponding to the sound from the candidate objects, wherein the probability of the screened sound-producing objects to emit the sound is larger than the probability of other candidate objects to emit the sound.
7. The method according to claim 6, wherein the screening out, from the candidate objects, the sound-producing object corresponding to the sound comprises:
determining whether a target portion of each of the candidate objects associated with an utterance is in an utterance state;
and screening the candidate object of the target part in the sounding state as a sounding object corresponding to the sound.
8. The method of claim 7, wherein the second type corresponding to the sound is a human type, wherein the first type of the candidate objects is a human type, wherein the target portion is a mouth, and wherein determining whether the target portion associated with the utterance of each of the candidate objects is in an utterance state comprises:
determining whether a mouth of each of the candidate objects is in an open state;
if the mouth is in an open state, determining that the mouth is in a sounding state;
and if the mouth part is not in the open state, determining that the mouth part is not in the sounding state.
9. The method according to claim 6, wherein the second type corresponding to the sound is a human type, the first type of the candidate objects is a human type, and the screening out, from the candidate objects, the sound-producing object corresponding to the sound comprises:
determining type information of a human emitting the sound according to the characteristics of the sound;
and screening the sound-producing object corresponding to the sound from the candidate objects according to the type information.
10. The method according to claim 5, wherein the preset-type processing comprises stereo processing, and the performing preset-type processing on the various sounds included in the audio respectively based on the position of each sound-producing object in the video frame comprises:
for each sound included in the audio, determining an auditory deviation position of the sound-producing object corresponding to the sound in the video frame relative to a right ear auditory range and a left ear auditory range of a user, and processing at least one of a sound intensity and a playing time delay of a target channel of the sound according to the auditory deviation position, wherein the target channel is at least one of a left channel and a right channel.
11. The method according to claim 10, wherein the processing at least one of a sound intensity and a playing time delay of a target channel of the sound according to the auditory deviation position comprises:
when the auditory deviation position is close to the left ear auditory range and far away from the right ear auditory range, enhancing the sound intensity of the left channel of the sound, and/or weakening the sound intensity of the right channel of the sound;
and when the auditory deviation position is close to the right ear auditory range and far away from the left ear auditory range, enhancing the sound intensity of the right channel of the sound, and/or weakening the sound intensity of the left channel of the sound.
12. The method according to claim 10, wherein the processing at least one of a sound intensity and a playing time delay of a target channel of the sound according to the auditory deviation position comprises:
when the auditory deviation position is close to the left ear auditory range and far away from the right ear auditory range, performing advanced playing processing on the left channel of the sound, and/or performing delayed playing processing on the right channel of the sound;
and when the auditory deviation position is close to the right ear auditory range and far away from the left ear auditory range, performing advanced playing processing on the right channel of the sound, and/or performing delayed playing processing on the left channel of the sound.
13. A model training apparatus, the apparatus comprising:
the system comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring a training video sample, and the training video sample comprises a training video and at least two real labels which correspond to the training video and have an incidence relation;
the second acquisition module is used for respectively inputting the training video into an initial convolutional neural network and an initial cyclic neural network to obtain at least two training labels which are output by the initial convolutional neural network and the initial cyclic neural network and have an incidence relation;
and the training module is used for training the initial convolutional neural network and the initial cyclic neural network based on the difference between the at least two training labels and the at least two real labels to obtain a trained target convolutional neural network and a trained target cyclic neural network.
14. A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, implements a model training method as claimed in any one of claims 1 to 12.
15. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the model training method according to any one of claims 1 to 12.
CN202010883864.8A 2020-08-28 2020-08-28 Model training method, device, equipment and storage medium Pending CN111931690A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010883864.8A CN111931690A (en) 2020-08-28 2020-08-28 Model training method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111931690A true CN111931690A (en) 2020-11-13

Family

ID=73309786

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010883864.8A Pending CN111931690A (en) 2020-08-28 2020-08-28 Model training method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111931690A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105279495A (en) * 2015-10-23 2016-01-27 天津大学 Video description method based on deep learning and text summarization
US20180107182A1 (en) * 2016-10-13 2018-04-19 Farrokh Mohamadi Detection of drones
CN108460427A (en) * 2018-03-29 2018-08-28 国信优易数据有限公司 A kind of disaggregated model training method, device and sorting technique and device
WO2018166457A1 (en) * 2017-03-15 2018-09-20 阿里巴巴集团控股有限公司 Neural network model training method and device, transaction behavior risk identification method and device
CN108648746A (en) * 2018-05-15 2018-10-12 南京航空航天大学 A kind of open field video natural language description generation method based on multi-modal Fusion Features
CN110782872A (en) * 2019-11-11 2020-02-11 复旦大学 Language identification method and device based on deep convolutional recurrent neural network
CN110796240A (en) * 2019-10-31 2020-02-14 支付宝(杭州)信息技术有限公司 Training method, feature extraction method, device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination