CN109841225B - Sound replacement method, electronic device, and storage medium

Sound replacement method, electronic device, and storage medium

Info

Publication number
CN109841225B
CN109841225B
Authority
CN
China
Prior art keywords
audio
person
resource
determining
character
Prior art date
Legal status
Active
Application number
CN201910082625.XA
Other languages
Chinese (zh)
Other versions
CN109841225A (en)
Inventor
许栋刚
邢丽
张延良
王伟
李林
王静
王娜
刘大鹏
张玲玲
Current Assignee
Beijing Yijiesheng Technology Co ltd
Original Assignee
Beijing Yijiesheng Technology Co ltd
Application filed by Beijing Yijiesheng Technology Co ltd
Priority to CN201910082625.XA
Publication of CN109841225A
Application granted
Publication of CN109841225B


Abstract

The invention relates to a sound replacement method, an electronic device and a storage medium. The method determines a first video resource; determines a first person in the first video resource; determines a first audio feature of the first person; determines a second person corresponding to the first person, the second person being different from the first person; determines a second audio feature of the second person; determines a replacement audio feature from the second audio feature and the first audio feature; and adjusts the sound of the first person according to the replacement audio feature. The audio features include pitch, loudness, timbre, speech rate and language style. The method adjusts the voice of the first person in the first video resource according to the audio features of the second person, enabling a character's voice to be changed after the video resource has been produced and improving participation and interactivity.

Description

Sound replacement method, electronic device, and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a sound replacement method, an electronic device, and a storage medium.
Background
At present, in video resources such as movies, television, animations, cartoons and games, character presentation is fixed: once a video resource has been produced, a character's voice can only be the voice recorded during production and cannot be changed.
This unchangeable presentation of character voices reduces the appeal of a video resource and leaves the participation and interactivity between the video resource and the user insufficient.
Disclosure of Invention
Technical problem to be solved
In order to improve interactivity of video resources, the invention provides a sound replacement method, an electronic device and a storage medium.
(II) technical scheme
In order to achieve the purpose, the invention adopts the main technical scheme that:
a sound replacement method includes:
s101, determining a first video resource;
s102, determining a first person in the first video resource;
s103, determining a first audio characteristic of the first person;
s104, determining a second person corresponding to the first person, wherein the second person is different from the first person;
s105, determining a second audio characteristic of the second person;
s106, determining a replacement audio feature according to the second audio feature and the first audio feature;
s107, adjusting the sound of the first person according to the replaced audio features;
the audio features include: pitch, loudness, timbre, speech rate, language style.
Optionally, the S102 includes:
s102-1, determining the total occurrence duration of each character and the total audio duration of each character in the first video resource;
s102-2, determining the ranking value of each person according to the following formula:
C_e = T_e2 / T_e1
wherein e is any person in the first video resource, C_e is the ranking value of person e in the first video resource, T_e1 is the total appearance duration of person e in the first video resource, and T_e2 is the total audio duration of person e in the first video resource;
s102-3, sequencing all the people in the first video resource from large to small according to the sequencing value;
s102-4, determining the preset number of people ranked in the front as first people;
when the number of the first people is 1, the number of the second people is 1;
when the number of the first people is multiple, the number of the second people is the same as that of the first people, each second person corresponds to one unique first person, and the second people are different from the corresponding first people.
Optionally, the S104 includes:
monitoring whether at least one replacement resource is triggered;
when at least one replacement resource is triggered, determining a second person from the triggered replacement resource;
wherein the at least one replacement resource is triggered, comprising:
at least one stored audio is selected; or,
at least one stored second video resource is selected; or,
at least one stored audio is clicked; or,
at least one stored second video resource is clicked; or,
at least one audio is uploaded; or,
at least one second video resource is uploaded; or,
at least one audio is recorded instantly; or,
at least one second video resource is shot instantly;
the second video resource is different from the first video resource.
Optionally, the first video resource is a dynamic image resource containing audio, and the dynamic image is a movie, a television, an animation, a game, a self-portrait video, an advertisement video, or a small video;
the second video resource is a dynamic image resource containing audio, and the dynamic image is a movie, a television, an animation, a game, a self-portrait video, an advertisement video or a small video.
Optionally, the determining the second person from the triggered replacement resource includes:
determining the person selected by the user in the triggered replacement resource as the second person; or,
when the triggered replacement resource is audio, identifying all persons in the triggered replacement resource, calculating the audio duration of each person, calculating the ratio of each person's audio duration to the total audio duration, and determining the second person according to each person's ratio; or,
when the triggered replacement resource is a second video resource, identifying all persons in the triggered replacement resource and determining the second person according to the importance degree of each person.
Optionally, the importance degree of each person is determined by:
for any person i, determining all frames in which person i appears;
determining the importance degree of person i according to the following formula:
W_i = f(n_i, N, T_i1, T_1, T_i2, T_2, s_1, s_2)   [formula image not reproducible in text]
wherein W_i is the importance degree of person i, n_i is the total number of frames in which person i appears, N is the total number of frames of the second video resource, T_i1 is the total appearance duration of person i, T_1 is the total video duration of the triggered replacement resource, T_i2 is the total audio duration of person i, T_2 is the total audio duration of the triggered replacement resource, s_1 is the total effective video duration of the persons of the triggered replacement resource, and s_2 is the total effective audio duration of the persons of the triggered replacement resource.
Optionally, the language style in the first audio feature of the first person is determined by:
s301-1, acquiring all audio of a first person in a first video resource;
s301-2, performing voice recognition on the audio obtained in S301-1, and determining a first sound characteristic;
s301-3, converting the audio obtained in S301-1 into a first text;
s301-4, performing semantic analysis on the first text to determine first word characteristics;
s301-5, taking the first sound characteristic and the first word characteristic as language styles in the first audio characteristic of the first character;
the linguistic style in the second audio feature of the second persona is determined by:
s302-1, acquiring the audio of a second person;
s302-2, performing voice recognition on the audio obtained in the S302-1, and determining a second voice characteristic;
s302-3, converting the audio obtained in the S302-1 into a second text;
s302-4, performing semantic analysis on the second text to determine second word characteristics;
s302-5, taking the second sound feature and the second word feature as the language style in the second audio feature of the second person;
the sound features include: word pronunciation tone, inter-word pauses, sentence pronunciation tone, accent position and pronunciation rhythm;
the accents include: parallel stress, contrast stress, responsiveness stress, progressive stress, turning stress, positive stress, emphatic stress, metaphorical stress, onomatopoeic stress and ironic stress;
the word features include: spoken words, modifiers, word combinations, ellipses.
Optionally, the S106 includes:
s106-1, acquiring a first pitch, a first loudness, a first timbre, a first speech rate and a first language style in the first audio feature;
s106-2, acquiring a second pitch, a second loudness, a second timbre, a second speech rate and a second language style in the second audio feature;
s106-3, determining the average value of the first pitch and the second pitch as the pitch in the replacement audio feature;
s106-4, determining the first loudness as the loudness in the replacement audio feature;
s106-5, determining the second timbre as the timbre in the replacement audio feature;
s106-6, determining the value A_3 of the following formula as the speech rate in the replacement audio feature:
A_3 = f(A_1, A_2, B_1, B_2)   [formula image not reproducible in text]
wherein A_3 is the speech rate in the replacement audio feature, A_1 is the first speech rate, A_2 is the second speech rate, B_1 is the inter-word pause in the first language style, and B_2 is the inter-word pause in the second language style;
s106-7, determining the sum of the word characteristics in the first language style and the word characteristics in the second language style as the word characteristics of the language style in the replacement audio characteristics;
and S106-8, determining the sound features in the second language style as the sound features of the language style in the replacement audio features.
In order to achieve the above purpose, the main technical solution adopted by the present invention further comprises:
an electronic device comprising a memory, a processor, a bus and a computer program stored on the memory and executable on the processor, the processor implementing a method as claimed in any one of the above methods when executing the program.
In order to achieve the above purpose, the main technical solution adopted by the present invention further comprises:
a computer storage medium having stored thereon a computer program which, when executed by a processor, implements a method as in any one of the above methods.
(III) advantageous effects
The invention has the beneficial effects that: a first video resource is determined; a first person in the first video resource is determined; a first audio feature of the first person is determined; a second person corresponding to the first person is determined, the second person being different from the first person; a second audio feature of the second person is determined; a replacement audio feature is determined from the second audio feature and the first audio feature; and the sound of the first person is adjusted according to the replacement audio feature. The audio features include pitch, loudness, timbre, speech rate and language style, so that a character's voice can be changed after the video resource has been produced, improving participation and interactivity.
Drawings
Fig. 1 is a schematic flow chart of a sound replacement method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to improve the interactivity of video resources, this proposal provides a sound replacement method, an electronic device and a storage medium: a first video resource is determined; a first person in the first video resource is determined; a first audio feature of the first person is determined; a second person corresponding to the first person is determined, the second person being different from the first person; a second audio feature of the second person is determined; a replacement audio feature is determined from the second audio feature and the first audio feature; and the sound of the first person is adjusted according to the replacement audio feature. The audio features include pitch, loudness, timbre, speech rate and language style, so that a character's voice can be changed after the video resource has been produced, improving participation and interactivity.
Referring to fig. 1, the implementation flow of the sound replacement method provided in this embodiment is as follows:
s101, determining a first video resource.
The first video resource is a dynamic image resource containing audio.
For example, the moving image is a movie, or a television, or an animation, or a game, or a self-timer video, or an advertisement video, or a small video.
Namely a movie with sound, or a television with sound, or an animation with sound, or a game with sound, or a self-timer video with sound, or an advertisement video with sound, or a small video with sound.
For convenience of description, this embodiment and the following embodiments take animation A with sound as the first video resource. Other forms of the first video resource are not separately illustrated in this embodiment.
S102, determining a first person in the first video resource.
The number of first persons in this step may be one or more. The number of first persons is not limited in this embodiment.
In this step, there are various ways to determine the first person, for example, if the user clicks one person, the person clicked by the user is determined as the first person.
For another example, if the user clicks a plurality of characters, all the characters clicked by the user are determined as the first character.
As another example, the first person is determined by:
s102-1, determining the total occurrence time length of each character and the total audio time length of each character in the first video resource.
S102-2, determining the ranking value of each person according to the following formula:
C_e = T_e2 / T_e1
wherein e is any person in the first video resource, C_e is the ranking value of person e in the first video resource, T_e1 is the total appearance duration of person e in the first video resource, and T_e2 is the total audio duration of person e in the first video resource.
S102-3, sequencing all the people in the first video resource from large to small according to the sequencing value.
S102-4, determining the preset number of people ranked at the top as the first people.
For example, suppose the preset number is 2 and there are 4 persons in animation A: person 1, person 2, person 3 and person 4. Determine the total appearance duration of each person in animation A (T11, T21, T31 and T41 respectively) and the total audio duration of each person in animation A (T12, T22, T32 and T42 respectively). Then determine the ranking values: C1 = T12/T11 for person 1, C2 = T22/T21 for person 2, C3 = T32/T31 for person 3 and C4 = T42/T41 for person 4. If C4 > C2 > C1 = C3, sorting all persons in animation A by ranking value from large to small gives: person 4, person 2, person 1, person 3. The top 2 persons (person 4 and person 2) are each determined as a first person.
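As an illustration of this ranking rule, the following Python sketch selects the first persons; the person names and durations are invented for illustration and are not taken from the patent:

```python
from typing import Dict, List

def select_first_persons(appearance: Dict[str, float],
                         audio: Dict[str, float],
                         preset_number: int) -> List[str]:
    # Ranking value C_e = T_e2 / T_e1: audio duration over appearance duration.
    ranking = {e: audio[e] / appearance[e] for e in appearance if appearance[e] > 0}
    # Sort from large to small and keep the preset number of top-ranked persons.
    return sorted(ranking, key=ranking.get, reverse=True)[:preset_number]

# Illustrative durations (seconds) reproducing the ordering C4 > C2 > C1 = C3:
appearance = {"person1": 100.0, "person2": 80.0, "person3": 50.0, "person4": 60.0}
audio = {"person1": 40.0, "person2": 48.0, "person3": 20.0, "person4": 45.0}
print(select_first_persons(appearance, audio, preset_number=2))
# -> ['person4', 'person2']
```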
S103, determining a first audio characteristic of the first person.
And if the number of the first persons is 1, determining the first audio characteristics of the first persons. If the number of the first persons is 2, the first audio characteristics of each first person are determined.
The 'first' in this step only distinguishes these features from the audio features of the subsequent second person and has no practical meaning.
The audio features include: pitch, loudness, timbre, speech rate, language style.
The pitch is represented by the frequency of the sound wave.
Loudness is represented by the vibration amplitude of the sound wave.
The timbre is represented by the vibration waveform of the sound wave.
The speech rate is expressed in words per minute.
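To make these four measurable features concrete, the following is a rough Python sketch; librosa is an assumed signal-analysis dependency and `transcribe` is a hypothetical ASR helper, since the patent names no specific libraries:

```python
import numpy as np
import librosa

def basic_audio_features(path: str, transcribe) -> dict:
    y, sr = librosa.load(path, sr=None)
    # Pitch: frequency of the sound wave (median F0 over voiced frames).
    f0, _, _ = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr)
    # Loudness: vibration amplitude, approximated by mean RMS energy.
    loudness = float(librosa.feature.rms(y=y).mean())
    # Timbre: vibration waveform character, approximated by mean MFCCs.
    timbre = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
    # Speech rate: words per minute; `transcribe` is a hypothetical ASR callable.
    minutes = len(y) / sr / 60.0
    speech_rate = len(transcribe(path).split()) / minutes
    return {"pitch": float(np.nanmedian(f0)), "loudness": loudness,
            "timbre": timbre, "speech_rate": speech_rate}
```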
The language style is determined by:
s301-1, in the first video resource, all the audios of the first person are obtained.
S301-2, performing voice recognition on the audio obtained in S301-1, and determining a first sound characteristic.
Wherein the sound features include: word pronunciation tone, pause between words, sentence pronunciation tone, accent position, pronunciation rhythm.
The accents include: parallel stress, contrast stress, responsiveness stress, progressive stress, turning stress, positive stress, emphatic stress, metaphorical stress, onomatopoeic stress and ironic stress.
Parallel stress means that parallel words or phrases in a sentence are each marked by linguistic stress. For example: talk about life, talk about ideals, talk about the future.
Contrast stress means that a paragraph or sentence contains words or phrases that are compared or contrasted to make the characteristics of things more prominent and vivid, and the comparative relationship between these words or phrases is marked by linguistic stress. For example: the elephant is large, the mouse is small.
Responsiveness stress means that a corresponding relationship in the context is marked by linguistic stress. For example: the big ones are like black beans, the small ones like millet grains.
Progressive stress means that linguistic stress marks a relationship that develops forward step by step and deepens step by step. For example: first the manager's attitude changes, then the employees' attitudes change.
Turning stress uses linguistic stress to mark a turn of the content in the opposite direction. For example: at first there were no roads in the world, but when enough people walk, a road is made.
Positive stress uses linguistic stress to express an affirmative attitude. For example: this problem really was not done by me.
Emphatic stress uses linguistic stress to express a particular emotion and emphasize a particular meaning, with the aim of drawing the listener's attention to the part the speaker stresses. For example: it was I who went to the classroom.
Metaphorical stress means that a paragraph or sentence contains words or phrases that turn the abstract into the concrete, or deepen or lighten an image, making the language vivid and hard for listeners to forget; such words or phrases are marked by linguistic stress. For example: spring is like a newborn baby, new from head to foot.
Onomatopoeic stress uses linguistic stress to mark sound-imitating words. For example: stressing an onomatopoeic word such as 'whoosh'.
Ironic stress refers to words or phrases in a paragraph or sentence that say the opposite of what is meant, in order to reveal the nature of the matter; such words or phrases are marked by linguistic stress. For example: how 'clever' you are!
S301-3, the audio obtained in S301-1 is converted into a first text.
S301-4, performing semantic analysis on the first text, and determining first word features.
Wherein, the word characteristics include: spoken words, modifiers, word combinations, ellipses.
S301-5, the first sound characteristic and the first word characteristic are taken as the language style in the first audio characteristic of the first character.
The sound features embody the first person's speaking characteristics, and the word features embody the first person's wording characteristics. The combination of the sound features and the word features can accurately describe the language style of the first person.
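The S301 pipeline can be summarized in the following sketch; `recognize_prosody`, `speech_to_text` and `extract_word_features` are hypothetical stand-ins for the unspecified speech-recognition and semantic-analysis components:

```python
def language_style(audio_clips, recognize_prosody, speech_to_text, extract_word_features):
    sound_features = []   # S301-2: stress positions, inter-word pauses, rhythm, ...
    word_features = set() # S301-4: spoken words, modifiers, word combinations, ...
    for clip in audio_clips:                              # S301-1: all audio of the person
        sound_features.append(recognize_prosody(clip))    # speech recognition
        text = speech_to_text(clip)                       # S301-3: audio -> text
        word_features |= set(extract_word_features(text)) # semantic analysis
    # S301-5: the language style is the pair (sound features, word features).
    return {"sound_features": sound_features, "word_features": word_features}
```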
And S104, determining a second person corresponding to the first person.
Wherein the second person is different from the first person.
That is, when the number of the first person is 1, the number of the second person is 1, and the second person is different from the first person. When the number of the first people is multiple, the number of the second people is the same as that of the first people, each second person corresponds to one unique first person, and the second people are different from the corresponding first people.
For example, when the number of first persons is 2 (e.g., A and B), the number of second persons is also 2 (e.g., C and D); each second person corresponds to a unique first person (e.g., C corresponds to A and D corresponds to B), and each second person is different from its corresponding first person (C is different from A, and D is different from B). This embodiment only requires that C differ from A and that D differ from B; it does not limit whether C is the same as B, nor whether A is the same as D.
The specific implementation of this step is as follows: monitoring whether at least one replacement resource is triggered; and when at least one replacement resource is triggered, determining a second person from the triggered replacement resource.
The state of the replacement resource may be a stored replacement resource, an uploaded replacement resource, or an instantly shot replacement resource. In addition, the replacement resource may be either an audio or a second video resource. (The second video resource is also a dynamic image resource containing audio; for example, the dynamic image is a movie, a television, an animation, a game, a self-timer video, an advertisement video or a small video, i.e., a movie with sound, a television with sound, an animation with sound, a game with sound, a self-timer video with sound, an advertisement video with sound or a small video with sound. 'Second' only distinguishes this resource from the first video resource in S101; that is, 'second' and 'first' merely denote resources at different stages and have no other meaning. The first video resource is the resource containing the person being replaced; the second video resource is the resource containing the replacing person.)
Therefore, the at least one replacement resource in this embodiment may be at least one stored audio, or at least one stored second video resource, or at least one uploaded audio, or at least one uploaded second video resource, or at least one instantly recorded audio, or at least one instantly shot second video resource.
Based thereon, it may be determined that at least one replacement resource is triggered when the following events are monitored to occur, including:
at least one stored audio is selected by a user, or at least one stored second video resource is selected by a user, or at least one stored audio is clicked by a user, or at least one stored second video resource is clicked by a user, or at least one audio is uploaded, or at least one second video resource is uploaded, or at least one audio is recorded instantly, or at least one second video resource is shot instantly.
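A minimal sketch of this monitoring step follows; the event model (`event.kind`, `event.resource`) is an assumption, since the patent does not define one:

```python
from enum import Enum, auto

class Trigger(Enum):
    STORED_AUDIO_SELECTED = auto()
    STORED_VIDEO_SELECTED = auto()
    STORED_AUDIO_CLICKED = auto()
    STORED_VIDEO_CLICKED = auto()
    AUDIO_UPLOADED = auto()
    VIDEO_UPLOADED = auto()
    AUDIO_RECORDED = auto()
    VIDEO_SHOT = auto()

def monitor(events):
    """Return the first triggered replacement resource, or None (S104)."""
    for event in events:  # `event.kind` / `event.resource` are assumed fields
        if isinstance(event.kind, Trigger):
            return event.resource
    return None
```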
Furthermore, one implementation of determining the second person from the triggered replacement resource is: determining the person selected by the user in the triggered replacement resource as the second person.
Or, when the triggered replacement resource is audio, determining the second person from the triggered replacement resource may be implemented as: identifying all persons in the triggered replacement resource, calculating the audio duration of each person, calculating the ratio of each person's audio duration to the total audio duration, and determining the second person according to each person's ratio. For example, a preset number of persons with the largest ratios are determined as the second persons.
The preset number here is the same as the preset number when the first person is determined in S102.
For example, if the preset number is 2, the 2 persons with the largest ratios are determined as the second persons.
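A sketch of this audio branch, assuming per-person audio durations (in seconds) in the triggered replacement resource have already been measured:

```python
def second_persons_from_audio(person_audio: dict, preset_number: int) -> list:
    total = sum(person_audio.values())
    # Ratio of each person's audio duration to the total audio duration.
    ratios = {p: d / total for p, d in person_audio.items()}
    # The preset number of persons with the largest ratios become second persons.
    return sorted(ratios, key=ratios.get, reverse=True)[:preset_number]
```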
In addition, when the triggered replacement resource is a second video resource, determining the second person from the triggered replacement resource may be implemented as: identifying all persons in the triggered replacement resource and determining the second person according to the importance degree of each person.
For example, ranking the persons by importance degree from high to low, the preset number of top-ranked persons are determined as the second persons.
The preset number here is the same as the preset number when the first person is determined in S102.
For example, 2 persons having a higher degree of importance are set as the second persons.
The calculation method for the degree of importance includes but is not limited to:
For any person i, determine all frames in which person i appears.
The importance degree of person i is determined according to the following formula:
W_i = f(n_i, N, T_i1, T_1, T_i2, T_2, s_1, s_2)   [formula image not reproducible in text]
wherein W_i is the importance degree of person i, n_i is the total number of frames in which person i appears, N is the total number of frames of the second video resource, T_i1 is the total appearance duration of person i, T_1 is the total video duration of the triggered replacement resource, T_i2 is the total audio duration of person i, T_2 is the total audio duration of the triggered replacement resource, s_1 is the total effective video duration of the persons of the triggered replacement resource, and s_2 is the total effective audio duration of the persons of the triggered replacement resource.
The total effective video duration is the duration in which persons actually appear in the triggered replacement resource; scenery-only shots and the opening and closing credits are outside this duration. The total effective audio duration is the duration of person audio in the triggered replacement resource; scenery-only shots, the opening and closing credits, and periods in which no person speaks are outside this duration.
Take a video with 5 frames in total, a total duration of 3 seconds and an audio duration of 2 seconds as an example. For person i, determine all frames in which person i appears (e.g., frame 1 and frame 3). The importance degree of person i is then computed with n_i = 2 (frames in which person i appears), N = 5 (total frames of the second video resource), T_1 = 3 seconds (total video duration of the triggered replacement resource) and T_2 = 2 seconds (total audio duration of the triggered replacement resource), together with T_i1, T_i2, s_1 and s_2 as defined above.
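Because the granted W_i formula survives only as an image, the sketch below uses an assumed equal-weight combination of the defined ratios; it is not the patented formula, and only the variable definitions come from the text:

```python
def importance(n_i, N, T_i1, T_1, T_i2, T_2, s_1, s_2):
    # Assumed stand-in for the unrecoverable W_i formula (not the granted one).
    frame_share = n_i / N     # share of frames in which person i appears
    video_share = T_i1 / T_1  # share of the resource's total video duration
    audio_share = T_i2 / T_2  # share of the resource's total audio duration
    effective = (s_1 / T_1 + s_2 / T_2) / 2  # effective video/audio content shares
    return (frame_share + video_share + audio_share) / 3 * effective

# Worked example from the text: 5 frames, 3 s of video, 2 s of audio,
# person i present in frames 1 and 3 (n_i = 2, N = 5, T_1 = 3, T_2 = 2);
# T_i1, T_i2, s_1 and s_2 would be measured from the replacement resource.
```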
In addition, when there are a plurality of first persons, the determination method of the corresponding relationship between the second person and the first person is not limited in this embodiment. The person may be designated manually, or the second person ranked first may be associated with the first person ranked first.
S105, determining a second audio characteristic of the second person.
The content of the audio feature here is the same as the audio feature in S103.
The "second" in this step is only to distinguish from the audio feature of the first person in S103, and does not have any practical meaning.
The audio features include: pitch, loudness, timbre, speech rate and language style.
The pitch is represented by the frequency of the sound wave.
Loudness is represented by the vibration amplitude of the sound wave.
The timbre is represented by the vibration waveform of the sound wave.
The speech rate is expressed in words per minute.
The language style is determined by:
s302-1, the audio of the second person is obtained.
S302-2, performing voice recognition on the audio obtained in S302-1, and determining a second sound characteristic.
The sound features include: word pronunciation tone, pause between words, sentence pronunciation tone, accent position, pronunciation rhythm.
The accents include: parallel stress, contrast stress, responsiveness stress, progressive stress, turning stress, positive stress, emphatic stress, metaphorical stress, onomatopoeic stress and ironic stress.
S302-3, converting the audio obtained in S302-1 into a second text.
S302-4, performing semantic analysis on the second text, and determining second word characteristics.
The word characteristics include: spoken words, modifiers, word combinations, ellipses.
And S302-5, taking the second sound characteristic and the second word characteristic as the language style in the second audio characteristic of the second character.
In addition, when there are a plurality of first persons, a second person corresponding to each first person is determined in S104, and a second audio feature of each second person is determined in this step.
And S106, determining a replacement audio characteristic according to the second audio characteristic and the first audio characteristic.
When the number of the first persons is multiple, the number of the second persons is multiple, and the step determines the replacement audio characteristics of each first person according to the first audio characteristics of the first person and the second audio characteristics of the second person corresponding to the first person.
That is, when the first person is p and q, and the second person corresponding to p is p ', and the second person corresponding to q is q ', this step determines the replacement audio feature for p according to the audio feature of p and the audio feature of p ' corresponding to p. And determining the replacement audio characteristic aiming at q according to the audio characteristic of q and the audio characteristic of q' corresponding to q.
The implementation for determining the replacement audio feature from the second audio feature and the first audio feature is as follows:
S106-1, acquiring the first pitch, the first loudness, the first timbre, the first speech rate and the first language style in the first audio feature.
S106-2, acquiring the second pitch, the second loudness, the second timbre, the second speech rate and the second language style in the second audio feature.
S106-3, determining the average of the first pitch and the second pitch as the pitch in the replacement audio feature.
Since pitch is represented by frequency, the frequency representing the pitch in the replacement audio feature = (frequency representing the first pitch + frequency representing the second pitch) / 2.
S106-4, determining the first loudness as the loudness in the replacement audio feature.
S106-5, determining the second timbre as the timbre in the replacement audio feature.
S106-6, determining the value A_3 of the following formula as the speech rate in the replacement audio feature.
A_3 = f(A_1, A_2, B_1, B_2)   [formula image not reproducible in text]
wherein A_3 is the speech rate in the replacement audio feature, A_1 is the first speech rate, A_2 is the second speech rate, B_1 is the inter-word pause in the first language style, and B_2 is the inter-word pause in the second language style.
And S106-7, determining the sum of the word characteristics in the first language style and the word characteristics in the second language style as the word characteristics of the language style in the replacement audio characteristics.
Word features include words such as spoken words, modifiers, word combinations, ellipses, and the like. Combining a word set formed by the word characteristics in the first language style with a word set formed by the word characteristics in the second language style, and determining the combined word set as the word characteristics replacing the language style in the audio characteristics.
And S106-8, determining the sound features in the second language style as the sound features of the language style in the replacement audio features.
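Collecting S106-1 through S106-8 into one sketch; the speech-rate rule is likewise an image in the source, so averaging the two speech rates is an assumed placeholder for A_3:

```python
def replacement_features(first: dict, second: dict) -> dict:
    return {
        "pitch": (first["pitch"] + second["pitch"]) / 2,  # S106-3: average pitch
        "loudness": first["loudness"],                    # S106-4: keep the first loudness
        "timbre": second["timbre"],                       # S106-5: take the second timbre
        # S106-6: assumed placeholder for the unrecoverable A_3 formula.
        "speech_rate": (first["speech_rate"] + second["speech_rate"]) / 2,
        # S106-7: union (sum) of the two word-feature sets.
        "word_features": first["word_features"] | second["word_features"],
        "sound_features": second["sound_features"],       # S106-8: take the second style
    }
```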
S107, adjusting the sound of the first person according to the replaced audio features.
Taking the pitch, loudness, timbre, speech rate and language style in the replacement audio feature as the audio features of the first person, the first person's speech is re-pronounced according to them, so that the voice of a person in the first video resource is replaced with a voice having the replacement audio feature. Because the replacement audio feature is obtained based on the second person, the method provided by this embodiment can replace the voice of a person in the first video with the user's voice, realize changing a character's voice after the video resource has been produced, and improve participation and interactivity.
In addition, to avoid abrupt and uncoordinated sound caused by the replaced pitch, loudness, timbre, speech rate, language style and the like, this embodiment fuses the user's audio features with the original ones instead of using them directly during replacement, forming the final audio features for pronunciation and improving the replacement effect.
It should be noted that "first" and "second" in this embodiment and subsequent embodiments are only serial numbers, and are used to distinguish different characters, audio features, video resources, texts, and the like, and have no other meaning.
The method provided by the invention determines a first video resource; determines a first person in the first video resource; determines a first audio feature of the first person; determines a second person corresponding to the first person, the second person being different from the first person; determines a second audio feature of the second person; determines a replacement audio feature from the second audio feature and the first audio feature; and adjusts the sound of the first person according to the replacement audio feature. The audio features include pitch, loudness, timbre, speech rate and language style, so that a character's voice can be changed after the video resource has been produced, improving participation and interactivity.
Referring to fig. 2, the present embodiment provides an electronic apparatus including: memory 201, processor 202, bus 203, and computer programs stored on memory 201 and executable on processor 202.
The processor 202 implements the following method when executing the program:
s101, determining a first video resource;
s102, determining a first person in a first video resource;
s103, determining a first audio characteristic of a first person;
s104, determining a second person corresponding to the first person, wherein the second person is different from the first person;
s105, determining a second audio characteristic of a second person;
s106, determining a replacement audio characteristic according to the second audio characteristic and the first audio characteristic;
s107, adjusting the sound of the first person according to the replaced audio features;
the audio features include: pitch, loudness, timbre, speech rate, language style.
Optionally, S102 includes:
s102-1, determining the total occurrence duration of each character and the total audio duration of each character in the first video resource;
s102-2, determining the ranking value of each person according to the following formula:
C_e = T_e2 / T_e1
wherein e is any person in the first video resource, C_e is the ranking value of person e in the first video resource, T_e1 is the total appearance duration of person e in the first video resource, and T_e2 is the total audio duration of person e in the first video resource;
s102-3, sequencing all the people in the first video resource from large to small according to the sequencing value;
s102-4, determining the preset number of people ranked in the front as first people;
when the number of the first people is 1, the number of the second people is 1;
when the number of the first people is multiple, the number of the second people is the same as that of the first people, each second person corresponds to one unique first person, and the second people are different from the corresponding first people.
Optionally, S104 includes:
monitoring whether at least one replacement resource is triggered;
when at least one replacement resource is triggered, determining a second person from the triggered replacement resource;
wherein the at least one replacement resource is triggered, comprising:
at least one stored audio is selected; or,
at least one stored second video resource is selected; or,
at least one stored audio is clicked; or,
at least one stored second video resource is clicked; or,
at least one audio is uploaded; or,
at least one second video resource is uploaded; or,
at least one audio is recorded instantly; or,
at least one second video resource is shot instantly;
the second video resource is different from the first video resource.
Optionally, the first video resource is a dynamic image resource containing audio, and the dynamic image is a movie, a television, an animation, a game, a self-portrait video, an advertisement video, or a small video;
the second video resource is a dynamic image resource containing audio, and the dynamic image is a movie, a television, an animation, a game, a self-portrait video, an advertisement video or a small video.
Optionally, determining the second person from the triggered replacement resource includes:
determining the person selected by the user in the triggered replacement resource as the second person; or,
when the triggered replacement resource is audio, identifying all persons in the triggered replacement resource, calculating the audio duration of each person, calculating the ratio of each person's audio duration to the total audio duration, and determining the second person according to each person's ratio; or,
when the triggered replacement resource is a second video resource, identifying all persons in the triggered replacement resource and determining the second person according to the importance degree of each person.
Optionally, the importance degree of each person is determined by:
for any person i, determining all frames in which person i appears;
determining the importance degree of person i according to the following formula:
W_i = f(n_i, N, T_i1, T_1, T_i2, T_2, s_1, s_2)   [formula image not reproducible in text]
wherein W_i is the importance degree of person i, n_i is the total number of frames in which person i appears, N is the total number of frames of the second video resource, T_i1 is the total appearance duration of person i, T_1 is the total video duration of the triggered replacement resource, T_i2 is the total audio duration of person i, T_2 is the total audio duration of the triggered replacement resource, s_1 is the total effective video duration of the persons of the triggered replacement resource, and s_2 is the total effective audio duration of the persons of the triggered replacement resource.
Optionally, the linguistic style in the first audio feature of the first person is determined by:
s301-1, acquiring all audio of a first person in a first video resource;
s301-2, performing voice recognition on the audio obtained in S301-1, and determining a first sound characteristic;
s301-3, converting the audio obtained in S301-1 into a first text;
s301-4, performing semantic analysis on the first text to determine first word characteristics;
s301-5, taking the first sound characteristic and the first word characteristic as the language style in the first audio characteristic of the first character;
the linguistic style in the second audio feature of the second persona is determined by:
s302-1, acquiring the audio of a second person;
s302-2, performing voice recognition on the audio obtained in the S302-1, and determining a second voice characteristic;
s302-3, converting the audio obtained in the S302-1 into a second text;
s302-4, performing semantic analysis on the second text to determine second word characteristics;
s302-5, taking the second sound characteristic and the second word characteristic as the language style in the second audio characteristic of the second character;
the sound features include: word pronunciation tone, inter-word pauses, sentence pronunciation tone, accent position and pronunciation rhythm;
the accents include: parallel stress, contrast stress, responsiveness stress, progressive stress, turning stress, positive stress, emphatic stress, metaphorical stress, onomatopoeic stress and ironic stress;
the word characteristics include: spoken words, modifiers, word combinations, ellipses.
Optionally, S106 includes:
s106-1, acquiring a first pitch, a first loudness, a first timbre, a first speech rate and a first language style in the first audio feature;
s106-2, acquiring a second pitch, a second loudness, a second timbre, a second speech rate and a second language style in the second audio feature;
s106-3, determining the average value of the first pitch and the second pitch as the pitch in the replacement audio feature;
s106-4, determining the first loudness as the loudness in the replacement audio feature;
s106-5, determining the second timbre as the timbre in the replacement audio feature;
s106-6, determining the value A_3 of the following formula as the speech rate in the replacement audio feature:
A_3 = f(A_1, A_2, B_1, B_2)   [formula image not reproducible in text]
wherein A_3 is the speech rate in the replacement audio feature, A_1 is the first speech rate, A_2 is the second speech rate, B_1 is the inter-word pause in the first language style, and B_2 is the inter-word pause in the second language style;
s106-7, determining the sum of the word characteristics in the first language style and the word characteristics in the second language style as the word characteristics of the language style in the replacement audio characteristics;
and S106-8, determining the sound features in the second language style as the sound features of the language style in the replacement audio features.
The electronic device provided by this embodiment determines a first video resource; determines a first person in the first video resource; determines a first audio feature of the first person; determines a second person corresponding to the first person, the second person being different from the first person; determines a second audio feature of the second person; determines a replacement audio feature from the second audio feature and the first audio feature; and adjusts the sound of the first person according to the replacement audio feature. The audio features include pitch, loudness, timbre, speech rate and language style, so that a character's voice can be changed after the video resource has been produced, improving participation and interactivity.
This embodiment provides a computer storage medium storing a computer program which, when executed by a processor, performs the following operations:
s101, determining a first video resource;
s102, determining a first person in a first video resource;
s103, determining a first audio characteristic of a first person;
s104, determining a second person corresponding to the first person, wherein the second person is different from the first person;
s105, determining a second audio characteristic of a second person;
s106, determining a replacement audio characteristic according to the second audio characteristic and the first audio characteristic;
s107, adjusting the sound of the first person according to the replaced audio features;
the audio features include: pitch, loudness, timbre, speech rate, language style.
Optionally, S102 includes:
s102-1, determining the total occurrence duration of each character and the total audio duration of each character in the first video resource;
s102-2, determining the ranking value of each person according to the following formula:
C_e = T_e2 / T_e1
wherein e is any person in the first video resource, C_e is the ranking value of person e in the first video resource, T_e1 is the total appearance duration of person e in the first video resource, and T_e2 is the total audio duration of person e in the first video resource;
s102-3, sequencing all the people in the first video resource from large to small according to the sequencing value;
s102-4, determining the preset number of people ranked in the front as first people;
when the number of the first people is 1, the number of the second people is 1;
when the number of the first people is multiple, the number of the second people is the same as that of the first people, each second person corresponds to one unique first person, and the second people are different from the corresponding first people.
Optionally, S104 includes:
monitoring whether at least one replacement resource is triggered;
when at least one replacement resource is triggered, determining a second person from the triggered replacement resource;
wherein the at least one replacement resource is triggered, comprising:
at least one stored audio is selected; or,
at least one stored second video resource is selected; or,
at least one stored audio is clicked; or,
at least one stored second video resource is clicked; or,
at least one audio is uploaded; or,
at least one second video resource is uploaded; or,
at least one audio is recorded instantly; or,
at least one second video resource is shot instantly;
the second video resource is different from the first video resource.
Optionally, the first video resource is a dynamic image resource containing audio, and the dynamic image is a movie, a television, an animation, a game, a self-portrait video, an advertisement video, or a small video;
the second video resource is a dynamic image resource containing audio, and the dynamic image is a movie, a television, an animation, a game, a self-portrait video, an advertisement video or a small video.
Optionally, determining the second person from the triggered replacement resource includes:
determining the person selected by the user in the triggered replacement resource as the second person; or,
when the triggered replacement resource is audio, identifying all persons in the triggered replacement resource, calculating the audio duration of each person, calculating the ratio of each person's audio duration to the total audio duration, and determining the second person according to each person's ratio; or,
when the triggered replacement resource is a second video resource, identifying all persons in the triggered replacement resource and determining the second person according to the importance degree of each person.
Optionally, the importance degree of each person is determined by:
for any person i, determining all frames in which person i appears;
determining the importance degree of person i according to the following formula:
W_i = f(n_i, N, T_i1, T_1, T_i2, T_2, s_1, s_2)   [formula image not reproducible in text]
wherein W_i is the importance degree of person i, n_i is the total number of frames in which person i appears, N is the total number of frames of the second video resource, T_i1 is the total appearance duration of person i, T_1 is the total video duration of the triggered replacement resource, T_i2 is the total audio duration of person i, T_2 is the total audio duration of the triggered replacement resource, s_1 is the total effective video duration of the persons of the triggered replacement resource, and s_2 is the total effective audio duration of the persons of the triggered replacement resource.
Optionally, the linguistic style in the first audio feature of the first person is determined by:
s301-1, acquiring all audio of a first person in a first video resource;
s301-2, performing voice recognition on the audio obtained in S301-1, and determining a first sound characteristic;
s301-3, converting the audio obtained in S301-1 into a first text;
s301-4, performing semantic analysis on the first text to determine first word characteristics;
s301-5, taking the first sound characteristic and the first word characteristic as the language style in the first audio characteristic of the first character;
the linguistic style in the second audio feature of the second persona is determined by:
s302-1, acquiring the audio of a second person;
s302-2, performing voice recognition on the audio obtained in the S302-1, and determining a second voice characteristic;
s302-3, converting the audio obtained in the S302-1 into a second text;
s302-4, performing semantic analysis on the second text to determine second word characteristics;
s302-5, taking the second sound characteristic and the second word characteristic as the language style in the second audio characteristic of the second character;
the sound features include: word pronunciation tone, inter-word pauses, sentence pronunciation tone, accent position and pronunciation rhythm;
the accents include: parallel stress, contrast stress, responsiveness stress, progressive stress, turning stress, positive stress, emphatic stress, metaphorical stress, onomatopoeic stress and ironic stress;
the word characteristics include: spoken words, modifiers, word combinations, ellipses.
Optionally, S106 includes:
s106-1, acquiring a first pitch, a first loudness, a first timbre, a first speech rate and a first language style in the first audio feature;
s106-2, acquiring a second pitch, a second loudness, a second timbre, a second speech rate and a second language style in the second audio feature;
s106-3, determining the average value of the first pitch and the second pitch as the pitch in the replacement audio feature;
s106-4, determining the first loudness as the loudness in the replacement audio feature;
s106-5, determining the second timbre as the timbre in the replacement audio feature;
s106-6, determining the value A_3 of the following formula as the speech rate in the replacement audio feature:
A_3 = f(A_1, A_2, B_1, B_2)   [formula image not reproducible in text]
wherein A_3 is the speech rate in the replacement audio feature, A_1 is the first speech rate, A_2 is the second speech rate, B_1 is the inter-word pause in the first language style, and B_2 is the inter-word pause in the second language style;
s106-7, determining the sum of the word characteristics in the first language style and the word characteristics in the second language style as the word characteristics of the language style in the replacement audio characteristics;
and S106-8, determining the sound features in the second language style as the sound features of the language style in the replacement audio features.
The computer storage medium provided by this embodiment determines a first video resource; determines a first person in the first video resource; determines a first audio feature of the first person; determines a second person corresponding to the first person, the second person being different from the first person; determines a second audio feature of the second person; determines a replacement audio feature from the second audio feature and the first audio feature; and adjusts the sound of the first person according to the replacement audio feature. The audio features include pitch, loudness, timbre, speech rate and language style, so that a character's voice can be changed after the video resource has been produced, improving participation and interactivity.
It is to be understood that the invention is not limited to the specific arrangements and instrumentality described above and shown in the drawings. A detailed description of known methods is omitted herein for the sake of brevity. In the above embodiments, several specific steps are described and shown as examples. However, the method processes of the present invention are not limited to the specific steps described and illustrated, and those skilled in the art can make various changes, modifications and additions or change the order between the steps after comprehending the spirit of the present invention.
It should also be noted that the exemplary embodiments mentioned in this patent describe some methods or systems based on a series of steps or devices. However, the present invention is not limited to the order of the above-described steps, that is, the steps may be performed in the order mentioned in the embodiments, may be performed in an order different from the order in the embodiments, or may be performed simultaneously.
Finally, it should be noted that: the above-mentioned embodiments are only used for illustrating the technical solution of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A sound replacement method, characterized in that the method comprises:
S101, determining a first video resource;
S102, determining a first person in the first video resource;
S103, determining a first audio feature of the first person;
S104, determining a second person corresponding to the first person, wherein the second person is different from the first person;
S105, determining a second audio feature of the second person;
S106, determining a replacement audio feature according to the second audio feature and the first audio feature;
S107, adjusting the sound of the first person according to the replacement audio feature;
the audio features include: pitch, loudness, timbre, speech rate, and language style;
wherein the speech rate is the number of words per minute in the audio;
the S102 includes:
S102-1, determining the total appearance duration and the total audio duration of each character in the first video resource;
S102-2, determining the ranking value of each character according to the following formula:
Ce = Te2 / Te1
where e is any character in the first video resource, Ce is the ranking value of the character e in the first video resource, Te1 is the total appearance duration of the character e in the first video resource, and Te2 is the total audio duration of the character e in the first video resource;
S102-3, sorting all the characters in the first video resource in descending order of ranking value;
S102-4, determining a preset number of top-ranked characters as the first persons;
when the number of first persons is 1, the number of second persons is 1;
when there are multiple first persons, the number of second persons is the same as the number of first persons, each second person corresponds to one unique first person, and each second person is different from its corresponding first person.
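For illustration only (not part of the claims): a minimal Python sketch of the ranking in steps S102-1 to S102-4 of claim 1 above, assuming the per-character durations have already been measured; the function and variable names are illustrative.

def select_first_persons(durations, preset_count):
    """durations maps each character e to (Te1, Te2): total appearance
    duration and total audio duration in the first video resource."""
    # S102-2: ranking value Ce = Te2 / Te1
    ranking = {e: te2 / te1 for e, (te1, te2) in durations.items() if te1 > 0}
    # S102-3: sort characters by ranking value, from large to small
    ordered = sorted(ranking, key=ranking.get, reverse=True)
    # S102-4: the preset number of top-ranked characters are the first persons
    return ordered[:preset_count]

# A character who speaks through most of their screen time ranks highest.
print(select_first_persons({"lead": (120.0, 100.0), "extra": (60.0, 5.0)}, 1))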
2. The method of claim 1, wherein the S104 comprises:
monitoring whether at least one replacement resource is triggered;
when at least one replacement resource is triggered, determining the second person from the triggered replacement resource;
wherein the at least one replacement resource is triggered when:
at least one stored audio is selected; or
at least one stored second video resource is selected; or
at least one stored audio is clicked; or
at least one stored second video resource is clicked; or
at least one audio is uploaded; or
at least one second video resource is uploaded; or
at least one audio is recorded in real time; or
at least one second video resource is shot in real time;
the second video resource is different from the first video resource.
3. The method of claim 2, wherein the first video resource is a dynamic image resource containing audio, the dynamic image being a movie, a TV series, an animation, a game, a self-shot video, an advertisement video, or a short video;
the second video resource is a dynamic image resource containing audio, the dynamic image being a movie, a TV series, an animation, a game, a self-shot video, an advertisement video, or a short video.
4. The method of claim 3, wherein determining the second person from the triggered replacement resource comprises:
determining the person selected by the user in the triggered replacement resource as the second person; or
when the triggered replacement resource is audio, identifying all characters in the triggered replacement resource, calculating the audio duration of each character, calculating the ratio of each character's audio duration to the total audio duration, and determining the second person according to each character's ratio; or
when the triggered replacement resource is a second video resource, identifying all characters in the triggered replacement resource and determining the second person according to the importance degree of each character.
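For illustration only (not part of the claims): a minimal Python sketch of the audio branch of claim 4 above. The claim does not state how the ratios select the second person; choosing the largest ratio is one plausible reading, and the names are illustrative.

def second_person_by_audio_ratio(audio_durations):
    """audio_durations maps each identified character to their audio
    duration; each ratio is taken against the total audio duration."""
    total = sum(audio_durations.values())
    ratios = {person: d / total for person, d in audio_durations.items()}
    return max(ratios, key=ratios.get)  # ASSUMPTION: highest ratio wins

print(second_person_by_audio_ratio({"narrator": 50.0, "guest": 30.0}))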
5. The method of claim 4, wherein the importance degree of each character is determined by:
for any character i, determining all frames in which the character i appears;
determining the importance degree of any character i according to the following formula:
[Formula image: Figure FDA0002736418940000031]
where Wi is the importance degree of any character i, ni is the total number of frames in which the character i appears, N is the total number of frames of the second video resource, Ti1 is the total appearance duration of the character i, T1 is the total video duration of the triggered replacement resource, Ti2 is the total audio duration of the character i, T2 is the total audio duration of the triggered replacement resource, s1 is the total effective character-video duration of the triggered replacement resource, and s2 is the total effective character-audio duration of the triggered replacement resource;
the total video duration of the triggered replacement resource is the total duration of the N frames of the second video resource;
the total appearance duration of any character i is the total duration of the ni frames in which the character i appears;
the total audio duration of the triggered replacement resource is the total duration of the audio in the second video resource;
the total audio duration of any character i is the total duration of the audio of the character i in the second video resource;
the total effective character-video duration of the triggered replacement resource is the total duration of the frames in which any character appears in the triggered replacement resource;
the total effective character-audio duration of the triggered replacement resource is the total duration of character audio in the triggered replacement resource.
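For illustration only (not part of the claims): a minimal Python sketch of the quantities named in claim 5 above. The actual combination formula survives only as an image in the original (Figure FDA0002736418940000031), so the equal-weight average of the three recoverable ratios is an assumption, and s1/s2 are omitted because their role in that formula cannot be recovered.

def importance_degree(n_i, big_n, t_i1, t_1, t_i2, t_2):
    """n_i/big_n: share of frames in which character i appears;
    t_i1/t_1: share of the resource's total video duration;
    t_i2/t_2: share of the resource's total audio duration."""
    return (n_i / big_n + t_i1 / t_1 + t_i2 / t_2) / 3  # ASSUMED weighting

# The character with the highest value would be chosen as the second person.
print(importance_degree(n_i=900, big_n=1800, t_i1=36.0, t_1=72.0, t_i2=30.0, t_2=60.0))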
6. The method according to claim 1, wherein the S106 comprises:
S106-1, acquiring a first pitch, a first loudness, a first timbre, a first speech rate, and a first language style from the first audio feature;
a first language style in the first audio feature of the first person is determined by:
S301-1, acquiring all audio of the first person in the first video resource;
S301-2, performing speech recognition on the audio obtained in S301-1 to determine a first sound feature;
S301-3, converting the audio obtained in S301-1 into a first text;
S301-4, performing semantic analysis on the first text to determine a first word feature;
S301-5, taking the first sound feature and the first word feature as the first language style in the first audio feature of the first person;
the sound features include: pronunciation rhythm, inter-word pauses, sentence intonation, and stress position;
the inter-word pause is the pause duration between words;
the stress includes: parallel stress, contrastive stress, echoing stress, progressive stress, transitional stress, affirmative stress, emphatic stress, metaphorical stress, onomatopoeic stress, and ironic stress;
the word features include: habitual spoken words, modifiers, word combinations, and ellipses;
S106-2, acquiring a second pitch, a second loudness, a second timbre, a second speech rate, and a second language style from the second audio feature;
a second language style in the second audio feature of the second person is determined by:
S302-1, acquiring the audio of the second person;
S302-2, performing speech recognition on the audio obtained in S302-1 to determine a second sound feature;
S302-3, converting the audio obtained in S302-1 into a second text;
S302-4, performing semantic analysis on the second text to determine a second word feature;
S302-5, taking the second sound feature and the second word feature as the second language style in the second audio feature of the second person;
S106-3, determining the average value of the first pitch and the second pitch as the pitch in the replacement audio feature;
S106-4, determining the first loudness as the loudness in the replacement audio feature;
S106-5, determining the second timbre as the timbre in the replacement audio feature;
S106-6, determining the value A3 given by the following formula as the speech rate in the replacement audio feature:
[Formula image: Figure FDA0002736418940000051]
where A3 is the speech rate in the replacement audio feature, A1 is the first speech rate, A2 is the second speech rate, B1 is the inter-word pause in the first language style, and B2 is the inter-word pause in the second language style;
S106-7, determining the sum of the word features in the first language style and the word features in the second language style as the word features of the language style in the replacement audio feature;
and S106-8, determining the sound features in the second language style as the sound features of the language style in the replacement audio feature.
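For illustration only (not part of the claims): a minimal Python sketch of the language-style extraction in steps S301-1 to S301-5 (and, symmetrically, S302-1 to S302-5) of claim 6 above. The patent names no concrete speech-recognition or semantic-analysis component, so transcribe, analyze_sound, and analyze_words are hypothetical callables supplied by the caller.

def extract_language_style(audio_clips, transcribe, analyze_sound, analyze_words):
    """audio_clips: all audio segments of one character.
    transcribe(clip) -> str; analyze_sound(clip) -> dict of rhythm,
    inter-word pauses, intonation, stress; analyze_words(text) -> dict of
    habitual spoken words, modifiers, word combinations, ellipses."""
    sound_features = [analyze_sound(clip) for clip in audio_clips]  # S301-2
    text = " ".join(transcribe(clip) for clip in audio_clips)       # S301-3
    word_features = analyze_words(text)                             # S301-4
    # S301-5: the language style is the pair (sound features, word features)
    return sound_features, word_features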
7. An electronic device comprising a memory, a processor, a bus, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the method of any one of claims 1-6.
8. A computer storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method of any one of claims 1-6.
CN201910082625.XA 2019-01-28 2019-01-28 Sound replacement method, electronic device, and storage medium Active CN109841225B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910082625.XA CN109841225B (en) 2019-01-28 2019-01-28 Sound replacement method, electronic device, and storage medium

Publications (2)

Publication Number Publication Date
CN109841225A CN109841225A (en) 2019-06-04
CN109841225B true CN109841225B (en) 2021-04-30

Family

ID=66884289

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910082625.XA Active CN109841225B (en) 2019-01-28 2019-01-28 Sound replacement method, electronic device, and storage medium

Country Status (1)

Country Link
CN (1) CN109841225B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110475157A (en) * 2019-07-19 2019-11-19 平安科技(深圳)有限公司 Multimedia messages methods of exhibiting, device, computer equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006079813A1 (en) * 2005-01-27 2006-08-03 Synchro Arts Limited Methods and apparatus for use in sound modification
CN101563698A (en) * 2005-09-16 2009-10-21 富利克索尔股份有限公司 Personalizing a video
WO2013152453A1 (en) * 2012-04-09 2013-10-17 Intel Corporation Communication using interactive avatars
CN107333071A (en) * 2017-06-30 2017-11-07 北京金山安全软件有限公司 Video processing method and device, electronic equipment and storage medium
CN108305636A (en) * 2017-11-06 2018-07-20 腾讯科技(深圳)有限公司 A kind of audio file processing method and processing device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant