CN112562733A - Media data processing method and device, storage medium and computer equipment - Google Patents

Media data processing method and device, storage medium and computer equipment Download PDF

Info

Publication number
CN112562733A
CN112562733A (application CN202011434920.6A)
Authority
CN
China
Prior art keywords
data
playing
translation
video
language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011434920.6A
Other languages
Chinese (zh)
Inventor
张乐雨
张慧敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Puhui Enterprise Management Co Ltd
Original Assignee
Ping An Puhui Enterprise Management Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Puhui Enterprise Management Co Ltd filed Critical Ping An Puhui Enterprise Management Co Ltd
Priority to CN202011434920.6A priority Critical patent/CN112562733A/en
Publication of CN112562733A publication Critical patent/CN112562733A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00 Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/60 Network streaming of media packets
    • H04L65/75 Media network packet handling

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Theoretical Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a media data processing method and device, a storage medium and computer equipment, wherein the method comprises the following steps: receiving source media data, wherein the source media data comprises video data and source audio data; performing speech recognition on the source audio data to obtain transcribed text data, and translating the transcribed text data to obtain translated text data in a target language; acquiring text semantic parameters corresponding to the translated text data, and adjusting preset sound synthesis parameters based on the text semantic parameters; performing speech synthesis on the translated text data according to the adjusted sound synthesis parameters to obtain audio data corresponding to the target language; and synthesizing the audio data corresponding to the target language and the video data to obtain synthesized media data. The method and the device make the media data suitable for viewers with different language habits, preserve sound characteristics that better match the emotion of the source media data, and improve the viewing experience of the user.

Description

Media data processing method and device, storage medium and computer equipment
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a media data processing method and apparatus, a storage medium, and a computer device.
Background
With the continuous development of communication technology, users no longer use intelligent terminal devices such as mobile phones, tablet computers and desktop computers only for calls or information queries; these devices now serve a far wider range of functions.
In the current video watching process, a video producer sends recorded audio and video data to a video server, and the video server forwards the video to the terminals of video watchers for playing. However, the watchers may be users from around the world who cannot fully understand the language in the audio and video uploaded by the producer, so the viewing experience is poor and the play count of the video platform is difficult to grow.
Disclosure of Invention
In view of this, the present application provides a media data processing method and apparatus, a storage medium, and a computer device.
According to an aspect of the present application, there is provided a media data processing method including:
receiving source media data, wherein the source media data comprises video data and source audio data;
performing speech recognition on the source audio data to obtain transcribed text data, and translating the transcribed text data to obtain translated text data in a target language;
acquiring text semantic parameters corresponding to the translated text data, and adjusting preset sound synthesis parameters based on the text semantic parameters;
performing speech synthesis on the translated text data according to the adjusted sound synthesis parameters to obtain audio data corresponding to the target language;
and synthesizing the audio data corresponding to the target language and the video data to obtain synthesized media data.
Optionally, the translating of the transcribed text data to obtain the translated text data in the target language specifically includes:
assembling the transcribed text data according to an input parameter assembly rule corresponding to a preset translation line to obtain translation input data corresponding to the transcribed text data;
calling the preset translation line, and inputting the translation input data into the preset translation line for translation to obtain translation output data;
and parsing the translation output data according to an output parameter parsing rule corresponding to the preset translation line to obtain the translated text data.
Optionally, before the calling the preset translation line, the method further includes:
obtaining a verification seed corresponding to the preset translation line, and generating a verification token corresponding to the verification seed according to a token generation rule;
and verifying the preset translation line by using the verification token, and determining that the preset translation line is in a callable state if the verification passes.
Optionally, the obtaining of the text semantic parameters corresponding to the translated text data specifically includes:
dividing the translated text data according to a text structure corresponding to the translated text data to obtain a plurality of sentences corresponding to the translated text data;
and respectively acquiring the semantic parameters corresponding to each sentence, and determining the text semantic parameters corresponding to the translated text data according to the semantic parameters corresponding to each sentence.
Optionally, the receiving source media data specifically includes:
receiving the source media data sent by a video publishing terminal;
the synthesizing the audio data corresponding to the target language and the video data to obtain synthesized media data specifically includes:
acquiring a playing language corresponding to a video playing terminal, and acquiring audio data corresponding to the playing language from audio data corresponding to the target language;
synthesizing the audio data corresponding to the playing language and the video data to obtain playing media data;
and sending the playing media data to the video playing terminal.
Optionally, the synthesizing the audio data corresponding to the playing language and the video data to obtain playing media data specifically includes:
acquiring translated text data corresponding to the playing language;
and synthesizing the translated text data and the audio data corresponding to the playing language with the video data to obtain the playing media data.
Optionally, the acquiring of the playing language corresponding to the video playing terminal specifically includes:
determining the playing language of the video playing terminal according to the geographical position of the video playing terminal; or,
determining the playing language of the video playing terminal according to a common language corresponding to the video playing terminal; or,
and parsing, from the playing instruction sent by the video playing terminal, the playing language indicated by the playing instruction.
According to another aspect of the present application, there is provided a media data processing apparatus including:
the source data receiving module is used for receiving source media data, wherein the source media data comprise video data and source audio data;
the audio data translation module is used for performing speech recognition on the source audio data to obtain transcribed text data and translating the transcribed text data to obtain translated text data in a target language;
the sound parameter adjusting module is used for acquiring text semantic parameters corresponding to the translated text data and adjusting preset sound synthesis parameters based on the text semantic parameters;
the speech synthesis module is used for performing speech synthesis on the translated text data according to the adjusted sound synthesis parameters to obtain audio data corresponding to the target language;
and the media data synthesis module is used for synthesizing the audio data corresponding to the target language and the video data to obtain synthesized media data.
Optionally, the audio data translation module specifically includes:
the input data assembling unit is used for assembling the transcribed text data according to the input parameter assembly rule corresponding to a preset translation line to obtain translation input data corresponding to the transcribed text data;
the translation data output unit is used for calling the preset translation line, inputting the translation input data into the preset translation line for translation, and obtaining translation output data;
and the translated text parsing unit is used for parsing the translation output data according to the output parameter parsing rule corresponding to the preset translation line to obtain the translated text data.
Optionally, the apparatus further comprises:
the verification token generation module is used for acquiring a verification seed corresponding to the preset translation line before the preset translation line is called, and generating a verification token corresponding to the verification seed according to a token generation rule;
and the line verification module is used for verifying the preset translation line by using the verification token and determining that the preset translation line is in a callable state if the verification passes.
Optionally, the sound parameter adjusting module specifically includes:
the sentence dividing unit is used for dividing the translated text data according to the text structure corresponding to the translated text data to obtain a plurality of sentences corresponding to the translated text data;
and the semantic parameter determining unit is used for respectively acquiring the semantic parameters corresponding to each sentence and determining the text semantic parameters corresponding to the translated text data according to the semantic parameters corresponding to each sentence.
Optionally, the source data receiving module is specifically configured to: receiving the source media data sent by a video publishing terminal;
the media data synthesis module specifically comprises:
the playing language acquisition unit is used for acquiring a playing language corresponding to the video playing terminal and acquiring audio data corresponding to the playing language from the audio data corresponding to the target language;
the playing data synthesis unit is used for synthesizing the audio data corresponding to the playing language and the video data to obtain playing media data;
and the playing data sending unit is used for sending the playing media data to the video playing terminal.
Optionally, the playing data synthesizing unit specifically includes:
a played text acquisition subunit, configured to acquire the translated text data corresponding to the playing language;
and the playing data synthesizing subunit is used for synthesizing the translated text data and the audio data corresponding to the playing language with the video data to obtain the playing media data.
Optionally, the playing language obtaining unit specifically includes:
the first language acquisition subunit is used for determining the playing language of the video playing terminal according to the geographical position of the video playing terminal; or,
the second language acquisition subunit is configured to determine the playing language of the video playing terminal according to a common language corresponding to the video playing terminal; or,
and the third language acquisition subunit is used for parsing, from the playing instruction sent by the video playing terminal, the playing language indicated by the playing instruction.
According to yet another aspect of the present application, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described media data processing method.
According to yet another aspect of the present application, there is provided a computer device comprising a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor, the processor implementing the above media data processing method when executing the program.
By means of the above technical scheme, according to the media data processing method and device, the storage medium and the computer device, after source media data are received, speech recognition is first performed on the source audio data contained in the source media data to obtain the corresponding transcribed text data; the transcribed text is then translated from the source language into translated text data in a target language, and the sound synthesis parameters are adjusted according to the text semantic parameters corresponding to the translated text, so that the translated text data are synthesized into audio data in the corresponding target language based on the adjusted sound synthesis parameters; finally, the audio data in the target language are assembled with the video data contained in the source media data to obtain synthesized media data. Compared with directly playing the live video as in the prior art, the method and the device can convert source media data into media data in multiple languages, which is convenient for users with different language habits; they can also obtain the text semantic parameters corresponding to the translated text and use them to determine the sound synthesis parameters for sound synthesis, so that the synthesized sound better matches the emotion expressed by the source audio data. This improves the audiovisual similarity between the synthesized media data and the source media data, improves the user's video watching experience, and helps increase the play count of the video platform.
The foregoing description is only an overview of the technical solutions of the present application, and the present application can be implemented according to the content of the description in order to make the technical means of the present application more clearly understood, and the following detailed description of the present application is given in order to make the above and other objects, features, and advantages of the present application more clearly understandable.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a schematic flow chart illustrating a media data processing method according to an embodiment of the present application;
fig. 2 is a schematic flow chart illustrating another media data processing method provided by an embodiment of the present application;
fig. 3 is a schematic structural diagram of a media data processing apparatus according to an embodiment of the present application;
fig. 4 shows a schematic structural diagram of another media data processing device provided in the embodiment of the present application.
Detailed Description
The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
In this embodiment, a media data processing method is provided, as shown in fig. 1, the method including:
step 101, receiving source media data, wherein the source media data comprises video data and source audio data;
the media data processing method provided by the embodiment of the application can be used for processing media data recorded by a main broadcast in a live broadcast terminal device in a live broadcast platform and can also be used for processing media data uploaded by a video uploading party in a video platform. In the above embodiment, the live broadcast platform server receives source media data, where the source media data includes video data and audio data, and the language type corresponding to the audio data is a language used by a main broadcast.
Step 102, performing voice translation on source audio data to obtain translated text data, and translating the translated text data to obtain translated text data of a target language;
in this embodiment, after receiving the source audio data, the source audio data is first subjected to voice translation to obtain translated text data corresponding to the source audio data, that is, the source audio data is subjected to voice recognition to translate the voice data into text data, and further, in order to realize language conversion of the media data, the translated text data obtained by voice translation is translated, and the translated text data is translated into a target language to obtain translated text data, for example, the translated text data may be translated from a chinese language into an english language, a japanese language, or the like.
Step 103, acquiring text semantic parameters corresponding to the translated text data, and adjusting preset sound synthesis parameters based on the text semantic parameters;
In this embodiment, to ensure that the processed media data exhibit a natural voice effect and to avoid overly stiff-sounding speech, the text semantic parameters corresponding to the translated text data are acquired after the translated text data are obtained. The text semantic parameters describe the semantic information expressed by the source media data; for example, if the source media data express the author's happiness, that happy emotion can be captured by the text semantic parameters of the translated text. The preset sound synthesis parameters can then be adjusted based on the text semantic parameters, so that the adjusted parameters reflect the text semantics through characteristics of the sound. The sound synthesis parameters specifically include the fluctuation amplitude of the sound, the fundamental frequency, the speech rate, the volume, the inter-sentence pause duration and the like; for example, for happy content the speech rate is faster and the pauses between sentences are shorter.
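To make the adjustment concrete, the following Python sketch maps a sentence-level semantic score onto the preset parameters; the score range, field names and scaling factors are assumptions of this sketch rather than values fixed by the method.

```python
# Sketch only: the semantic score is assumed to lie roughly in [-10, +10],
# matching the scoring example given later in this description.

def adjust_synthesis_params(semantic_score: float, presets: dict) -> dict:
    """Shift preset sound synthesis parameters by a semantic (emotion) score."""
    params = dict(presets)  # leave the presets untouched
    positivity = max(-1.0, min(1.0, semantic_score / 10.0))
    params["rate"] = presets["rate"] * (1.0 + 0.15 * positivity)     # happier -> faster speech
    params["pitch"] = presets["pitch"] * (1.0 + 0.10 * positivity)   # happier -> higher pitch
    params["volume"] = min(1.0, presets["volume"] + 0.05 * positivity)
    params["pause_s"] = max(0.1, presets["pause_s"] - 0.2 * positivity)  # happier -> shorter gaps
    return params

presets = {"rate": 1.0, "pitch": 1.0, "volume": 0.8, "pause_s": 0.6}
print(adjust_synthesis_params(7.0, presets))  # a fairly happy sentence speeds up slightly
```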
Step 104, performing speech synthesis on the translated text data according to the adjusted sound synthesis parameters to obtain audio data corresponding to the target language;
In this embodiment, the translated text data are synthesized into sound according to the adjusted sound synthesis parameters; that is, text-to-speech processing is applied to the translated text data using the synthesis parameters imbued with the text semantic information, yielding the audio data corresponding to the target language. The source audio data in the source language contained in the source media data are thereby converted into audio data in the target language.
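As one possible realization of this step, the sketch below drives the off-the-shelf pyttsx3 engine with the adjusted parameters; the choice of engine is an assumption (the method names none), and only the rate and volume are applied because those are the properties pyttsx3 exposes.

```python
import pyttsx3

def synthesize_speech(translated_text: str, params: dict, out_path: str) -> None:
    """Render target-language text to an audio file with adjusted parameters."""
    engine = pyttsx3.init()
    engine.setProperty("rate", int(180 * params["rate"]))  # words per minute
    engine.setProperty("volume", params["volume"])         # 0.0 .. 1.0
    engine.save_to_file(translated_text, out_path)
    engine.runAndWait()                                    # flush the queued job

synthesize_speech("Hello, and welcome to the stream!",
                  {"rate": 1.15, "volume": 0.85}, "target_lang.wav")
```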
Step 105, synthesizing the audio data corresponding to the target language and the video data to obtain synthesized media data.
In this embodiment, after the audio data corresponding to the target language are generated, they are assembled with the video data contained in the source media data to obtain the synthesized media data; the source media data are thus finally converted from the source language into synthesized media data in the target language, so that users with different language habits can understand the content of the video, their viewing experience improves, and the play count of the video platform increases.
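One plausible way to assemble the two streams is to remux them with the ffmpeg command-line tool, as sketched below; using ffmpeg is an implementation choice of this sketch, not something the method mandates.

```python
import subprocess

def mux_media(video_path: str, audio_path: str, out_path: str) -> None:
    """Replace the source audio track with the synthesized target-language track."""
    subprocess.run([
        "ffmpeg", "-y",
        "-i", video_path,               # input 0: source media (video kept)
        "-i", audio_path,               # input 1: synthesized audio
        "-map", "0:v", "-map", "1:a",   # video from input 0, audio from input 1
        "-c:v", "copy",                 # do not re-encode the video stream
        "-shortest",                    # stop at the shorter of the two streams
        out_path,
    ], check=True)

mux_media("source.mp4", "target_lang.wav", "synthesized.mp4")
```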
By applying the technical scheme of this embodiment, after source media data are received, speech recognition is performed on the source audio data contained in them to obtain the corresponding transcribed text data; the transcribed text is then translated from the source language into translated text data in a target language, and the sound synthesis parameters are adjusted according to the text semantic parameters corresponding to the translated text, so that the translated text data are synthesized into audio data in the corresponding target language based on the adjusted parameters; the target-language audio data are assembled with the video data contained in the source media data to obtain synthesized media data. Compared with directly playing the live video as in the prior art, this scheme can convert source media data into media data in multiple languages for the convenience of users with different language habits, and it determines the sound synthesis parameters from the text semantic parameters of the translated text, so that the synthesized sound better matches the emotion expressed by the source audio data, the audiovisual similarity between the synthesized and source media data is higher, the user's viewing experience improves, and the play count of the video platform benefits.
Further, as a refinement and extension of the specific implementation of the foregoing embodiment, and to fully illustrate the specific implementation process, another media data processing method is provided, as shown in fig. 2. The method includes:
Step 201, receiving source media data sent by a video publishing terminal, wherein the source media data comprises video data and source audio data;
In this embodiment, when the anchor goes live, the anchor terminal records the content to obtain source media data comprising video data and audio data, and sends the source media data to the live broadcast server, which receives them.
Step 202, performing speech recognition on the source audio data to obtain transcribed text data, and translating the transcribed text data to obtain translated text data in a target language;
Step 203, acquiring text semantic parameters corresponding to the translated text data, and adjusting preset sound synthesis parameters based on the text semantic parameters;
Step 204, performing speech synthesis on the translated text data according to the adjusted sound synthesis parameters to obtain audio data corresponding to the target language;
For the descriptions of steps 202 to 204, refer to the corresponding descriptions of steps 102 to 104, which are not repeated here. Specifically, text-to-speech (TTS) technology can be used for the synthesis, converting text information generated by a computer or input from outside into intelligible, fluent speech output.
Step 205, acquiring a playing language corresponding to the video playing terminal, and acquiring audio data corresponding to the playing language from the audio data corresponding to the target language;
in step 205, since the live broadcast server needs to process the source media data sent by the live video broadcast end and forward the processed source media data to the video broadcast terminal, in order to determine which language the source media data is converted into, in this embodiment, the target language may include multiple languages, the broadcast language corresponding to the video broadcast terminal is obtained, and the audio data corresponding to the broadcast language is found from the audio data corresponding to the multiple target languages, so that the audio data is used to synthesize the media data, which is convenient for people with habits of different languages to watch live video.
In the above embodiment, specifically, the playing language of the video playing terminal is determined according to the geographic location of the video playing terminal; or, determining the playing language of the video playing terminal according to the common language corresponding to the video playing terminal; or analyzing the playing language indicated by the playing instruction according to the playing instruction sent by the video playing terminal.
In this embodiment, the playing language may be determined according to the location of the video playing terminal; for example, if the terminal is located in Japan and the common language of that region is Japanese, the playing language may be determined to be Japanese. Alternatively, the playing language carried in the live-viewing request (the playing instruction) that the live broadcast server receives from the video playing terminal may be parsed out. Or the playing language may be determined directly from the common language corresponding to the video playing terminal, for example the language selected the last time a video was watched.
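The sketch below tries the three strategies in one plausible priority order: explicit playing instruction first, then the terminal's common language, then geography. Both the ordering and the lookup table are assumptions; the text presents the strategies only as alternatives.

```python
from typing import Optional

GEO_DEFAULTS = {"JP": "ja", "CN": "zh", "US": "en"}  # illustrative region table

def resolve_playing_language(play_instruction: Optional[dict],
                             common_language: Optional[str],
                             country_code: Optional[str]) -> str:
    if play_instruction and play_instruction.get("language"):
        return play_instruction["language"]              # viewer asked explicitly
    if common_language:
        return common_language                           # e.g. last session's choice
    return GEO_DEFAULTS.get(country_code or "", "en")    # fall back to the region

print(resolve_playing_language(None, None, "JP"))  # -> "ja"
```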
Step 206, synthesizing the audio data corresponding to the playing language and the video data to obtain playing media data;
Specifically, the translated text data corresponding to the playing language are acquired, and the translated text data, the audio data corresponding to the playing language and the video data are synthesized to obtain the playing media data.
In the above embodiment, the translated text data corresponding to the playing language are used as subtitle data and synthesized together with the audio data and the video data into the playing media data, so that both the sound and the subtitles of the synthesized playing media data match the viewing user's language habits, further improving the viewing experience.
Step 207, sending the playing media data to the video playing terminal.
In the above embodiment, after the playing media data is synthesized, the playing media data is sent to the video playing terminal for the user to watch.
It should be noted that, in a live broadcast scenario, to ensure playing quality the live broadcast server generally caches the video for a period of time before sending it to the video playing terminal. For example, the video may be cached for 30 seconds and then segmented every 15 seconds to obtain source media data, with each segment undergoing playing-language conversion separately, so that the video received by the video playing terminal does not stall repeatedly and playing quality is ensured.
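A toy sketch of that buffering scheme follows; the 30-second cache and 15-second segments come from the example figures above, while the frame-based bookkeeping is an assumption of this sketch.

```python
BUFFER_S, SEGMENT_S = 30, 15  # cache depth and segment length from the example

def segment_stream(frames, fps=30):
    """Yield SEGMENT_S-second frame batches once BUFFER_S seconds have accumulated."""
    per_segment = SEGMENT_S * fps
    backlog = []
    for frame in frames:
        backlog.append(frame)
        # release a segment only while the backlog still covers the full cache
        while len(backlog) >= BUFFER_S * fps:
            yield backlog[:per_segment]
            backlog = backlog[per_segment:]
    if backlog:
        yield backlog  # flush whatever remains when the stream ends
```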
In any embodiment of the present application, the translating of the transcribed text data in step 102 and step 202 to obtain the translated text data specifically includes:
Step 102-1, assembling the transcribed text data according to the input parameter assembly rule corresponding to a preset translation line to obtain translation input data corresponding to the transcribed text data;
Step 102-2, calling the preset translation line, and inputting the translation input data into the preset translation line for translation to obtain translation output data;
Step 102-3, parsing the translation output data according to the output parameter parsing rule corresponding to the preset translation line to obtain the translated text data.
In the above embodiment, the input parameter assembly rule corresponding to the preset translation line is obtained first, and the transcribed text data to be translated are assembled according to that rule into translation input data, which serve as the input parameters of the preset translation line. The preset translation line is then called, and the translation input data are fed into it for translation, producing the output parameters, namely the translation output data. Further, to obtain translated text data that the computer can work with, the translation output data must be parsed according to the output parameter parsing rule corresponding to the preset translation line, finally yielding the translated text data. In this way the transcribed text data are translated into the translated text data through the translation line, and the text is converted from the source language into the target language. The preset translation line may be a translation interface of various terminals or browsers, such as a Baidu translation interface or a Google translation interface, or a preset translation database interface.
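A minimal assemble-call-parse sketch against a generic HTTP translation endpoint follows; the URL, request fields and response shape are hypothetical stand-ins for whatever interface the preset translation line actually exposes.

```python
import requests

ENDPOINT = "https://translation.example.com/v1/translate"  # hypothetical line

def assemble_input(transcribed_text: str, src: str, dst: str) -> dict:
    # input parameter assembly rule assumed by this sketch's endpoint
    return {"q": transcribed_text, "from": src, "to": dst}

def parse_output(payload: dict) -> str:
    # output parameter parsing rule assumed by this sketch's endpoint
    return payload["result"]["translated_text"]

def translate(transcribed_text: str, src: str, dst: str) -> str:
    resp = requests.post(ENDPOINT, json=assemble_input(transcribed_text, src, dst),
                         timeout=10)
    resp.raise_for_status()
    return parse_output(resp.json())
```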
In some application scenarios, translation interfaces define a call verification rule in advance; to avoid wasting resources on malicious calls, verification must be performed before the interface is called. In the above embodiment, before step 102-2, the method further includes:
Step 102-4, acquiring a verification seed corresponding to the preset translation line, and generating a verification token corresponding to the verification seed according to a token generation rule;
Step 102-5, verifying the preset translation line by using the verification token, and if the verification passes, determining that the preset translation line is in a callable state.
In the above embodiment, the verification seed corresponding to the preset translation line is obtained, and then, according to the token generation rule agreed in advance with the preset translation line, an encryption process is applied to the seed to generate the verification token. Before the preset translation line is called, verification is performed with this token; only after the verification passes is the preset translation line determined to be in a callable state, and only in that state can it be called. Otherwise it cannot be called, which prevents the translation line from being invoked maliciously and its resources from being wasted, and improves translation efficiency. For example, a Google translation interface is called to obtain a verification seed, and the verification token is generated from the seed and the timestamp of the current time according to a preset encryption algorithm, so that the translation interface request can be verified.
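A sketch of one such token scheme follows; the text fixes neither the encryption algorithm nor the token layout, so HMAC-SHA256 over the seed and a current timestamp, under a shared key, is purely this sketch's assumption.

```python
import hashlib
import hmac
import time

SHARED_RULE_KEY = b"rule-agreed-with-the-translation-line"  # hypothetical secret

def generate_token(verification_seed: str) -> str:
    """Derive a verification token from the seed and the current timestamp."""
    ts = str(int(time.time()))
    mac = hmac.new(SHARED_RULE_KEY, f"{verification_seed}:{ts}".encode(),
                   hashlib.sha256).hexdigest()
    return f"{ts}.{mac}"  # the timestamp travels along so the line can recompute the MAC
```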
In any embodiment of the present application, the acquiring of the text semantic parameters corresponding to the translated text data in step 103 and step 203 specifically includes:
Step 103-1, segmenting the translated text data according to the text structure corresponding to the translated text data to obtain a plurality of sentences corresponding to the translated text data;
Step 103-2, respectively obtaining the semantic parameters corresponding to each sentence, and determining the text semantic parameters corresponding to the translated text data according to the semantic parameters corresponding to each sentence.
In the foregoing embodiment, the translated text may be segmented according to its text structure, for example at punctuation marks (periods, question marks, exclamation marks and the like), so as to split it into a plurality of sentences. After the sentences have been extracted, feature words are extracted from each one; these feature words characterize the emotion implied by the sentence and may include, for example, conjunctions and negation words. Each sentence is then syntactically analyzed: the weights of the segmented words before and after conjunctions are determined, and polarity inversion or double-negation recognition is applied to negation words. A score is determined for each sentence from its emotional vocabulary and the result of the syntactic analysis, and this score represents the sentence's semantic parameter: the lower the score, the more negative the emotion the sentence expresses; the higher the score, the more positive. For example, a score of -10 indicates an extremely negative emotion (such as violence or anger); a score of -2 indicates a mildly negative emotion (such as a low mood); a score of 0 indicates neutrality; and a score of +7 indicates a rather positive emotion (such as great happiness). The text semantic parameters corresponding to the translated text data are then determined from the per-sentence semantic parameters, for example by taking their average, which prevents the semantic parameter of a single outlier sentence from causing excessive emotional swings in the finally synthesized sound.
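A toy sketch of the split-score-average flow follows; the keyword sets and integer scores are placeholders for the emotion lexicon and syntactic analysis described above.

```python
import re

POSITIVE = {"happy", "great", "love"}      # placeholder emotion lexicon
NEGATIVE = {"sad", "angry", "terrible"}
NEGATORS = {"not", "never", "no"}

def sentence_score(sentence: str) -> int:
    words = re.findall(r"[a-z']+", sentence.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if any(w in NEGATORS for w in words):
        score = -score  # crude stand-in for polarity inversion
    return score

def text_semantic_parameter(translated_text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", translated_text) if s.strip()]
    scores = [sentence_score(s) for s in sentences]
    return sum(scores) / len(scores) if scores else 0.0  # averaging damps outliers

print(text_semantic_parameter("I am so happy today! Not terrible at all."))  # -> 1.0
```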
Further, as a specific implementation of the method in fig. 1, an embodiment of the present application provides a media data processing apparatus, as shown in fig. 3, which includes:
a source data receiving module 31, configured to receive source media data, where the source media data includes video data and source audio data;
the audio data translation module 32 is configured to perform speech recognition on the source audio data to obtain transcribed text data, and to translate the transcribed text data to obtain translated text data in a target language;
the sound parameter adjusting module 33 is configured to acquire a text semantic parameter corresponding to the translated text data, and adjust a preset sound synthesis parameter based on the text semantic parameter;
the speech synthesis module 34 is configured to perform speech synthesis on the translated text data according to the adjusted sound synthesis parameters to obtain audio data corresponding to the target language;
and the media data synthesizing module 35 is configured to synthesize the audio data corresponding to the target language and the video data to obtain synthesized media data.
In a specific application scenario, as shown in fig. 4, optionally, the audio data translation module 32 specifically includes:
an input data assembling unit 321, configured to assemble the transcribed text data according to the input parameter assembly rule corresponding to a preset translation line to obtain translation input data corresponding to the transcribed text data;
a translation data output unit 322, configured to call the preset translation line, and input the translation input data into the preset translation line for translation to obtain translation output data;
and a translated text parsing unit 323, configured to parse the translation output data according to the output parameter parsing rule corresponding to the preset translation line to obtain the translated text data.
In a specific application scenario, as shown in fig. 4, optionally, the apparatus further includes:
the verification token generation module 36 is configured to, before calling the preset translation line, obtain a verification seed corresponding to the preset translation line, and generate a verification token corresponding to the verification seed according to a token generation rule;
and the line verification module 37 is configured to verify the preset translation line by using the verification token, and if the verification passes, determine that the preset translation line is in a callable state.
In a specific application scenario, as shown in fig. 4, optionally, the sound parameter adjusting module 33 specifically includes:
a sentence dividing unit 331, configured to divide the translated text data according to a text structure corresponding to the translated text data, so as to obtain a plurality of sentences corresponding to the translated text data;
the semantic parameter determining unit 332 is configured to obtain a semantic parameter corresponding to each sentence, and determine a text semantic parameter corresponding to the translated text data according to the semantic parameter corresponding to each sentence.
In a specific application scenario, as shown in fig. 4, optionally, the source data receiving module 31 is specifically configured to: receiving source media data sent by a video publishing terminal;
the media data synthesizing module 35 specifically includes:
a playing language obtaining unit 351, configured to obtain a playing language corresponding to the video playing terminal, and obtain audio data corresponding to the playing language from the audio data corresponding to the target language;
a playing data synthesizing unit 352, configured to synthesize the audio data corresponding to the playing language and the video data to obtain playing media data;
and the play data sending unit 353 is configured to send the play media data to the video play terminal.
Optionally, the playing data synthesizing unit 352 specifically includes:
a played text acquiring subunit 3521, configured to acquire the translated text data corresponding to the playing language;
and a playing data synthesizing subunit 3522, configured to synthesize the translated text data corresponding to the playing language, the audio data and the video data to obtain the playing media data.
Optionally, the playing language obtaining unit 351 specifically includes:
the first language obtaining subunit 3511 is configured to determine the playing language of the video playing terminal according to the geographic location of the video playing terminal; or,
the second language obtaining subunit 3512 is configured to determine the playing language of the video playing terminal according to a common language corresponding to the video playing terminal; or,
the third language obtaining subunit 3513 is configured to parse, from the playing instruction sent by the video playing terminal, the playing language indicated by the playing instruction.
It should be noted that other corresponding descriptions of the functional units related to the media data processing apparatus provided in the embodiment of the present application may refer to the corresponding descriptions in the methods in fig. 1 to fig. 2, and are not described herein again.
Based on the methods shown in fig. 1 to 2, correspondingly, the present application further provides a storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the media data processing method shown in fig. 1 to 2.
Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the implementation scenarios of the present application.
Based on the method shown in fig. 1 to fig. 2 and the virtual device embodiment shown in fig. 3 to fig. 4, in order to achieve the above object, the present application further provides a computer device, which may specifically be a personal computer, a server, a network device, and the like, where the computer device includes a storage medium and a processor; a storage medium for storing a computer program; a processor for executing a computer program to implement the media data processing method as described above with reference to fig. 1 to 2.
Optionally, the computer device may also include a user interface, a network interface, a camera, Radio Frequency (RF) circuitry, sensors, audio circuitry, a WI-FI module, and so forth. The user interface may include a Display screen (Display), an input unit such as a keypad (Keyboard), etc., and the optional user interface may also include a USB interface, a card reader interface, etc. The network interface may optionally include a standard wired interface, a wireless interface (e.g., a bluetooth interface, WI-FI interface), etc.
It will be appreciated by those skilled in the art that the present embodiment provides a computer device architecture that is not limiting of the computer device, and that may include more or fewer components, or some components in combination, or a different arrangement of components.
The storage medium may further include an operating system and a network communication module. An operating system is a program that manages and maintains the hardware and software resources of a computer device, supporting the operation of information handling programs, as well as other software and/or programs. The network communication module is used for realizing communication among components in the storage medium and other hardware and software in the entity device.
Through the above description of the embodiments, those skilled in the art can clearly understand that the present application may be implemented by software plus a necessary general hardware platform, or by hardware. Either way, source media data are received; speech recognition is performed on the source audio data contained in them to obtain the corresponding transcribed text data; the transcribed text is translated from the source language into translated text data in a target language; the sound synthesis parameters are adjusted according to the text semantic parameters corresponding to the translated text, so that the translated text data are synthesized into audio data in the corresponding target language based on the adjusted parameters; and the target-language audio data are assembled with the video data contained in the source media data to obtain synthesized media data. Compared with directly playing the live video as in the prior art, this scheme can convert source media data into media data in multiple languages for the convenience of users with different language habits; it determines the sound synthesis parameters from the text semantic parameters of the translated text and uses them for sound synthesis, so that the synthesized sound better matches the emotion expressed by the source audio data, the audiovisual similarity between the synthesized and source media data is improved, the user's viewing experience is better, and the play count of the video platform benefits.
Those skilled in the art will appreciate that the figures are merely schematic representations of one preferred implementation scenario and that the blocks or flow diagrams in the figures are not necessarily required to practice the present application. Those skilled in the art will appreciate that the modules in the devices in the implementation scenario may be distributed in the devices in the implementation scenario according to the description of the implementation scenario, or may be located in one or more devices different from the present implementation scenario with corresponding changes. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.
The above application serial numbers are for description purposes only and do not represent the superiority or inferiority of the implementation scenarios. The above disclosure is only a few specific implementation scenarios of the present application, but the present application is not limited thereto, and any variations that can be made by those skilled in the art are intended to fall within the scope of the present application.

Claims (10)

1. A method for media data processing, comprising:
receiving source media data, wherein the source media data comprises video data and source audio data;
performing speech recognition on the source audio data to obtain transcribed text data, and translating the transcribed text data to obtain translated text data in a target language;
acquiring text semantic parameters corresponding to the translated text data, and adjusting preset sound synthesis parameters based on the text semantic parameters;
performing speech synthesis on the translated text data according to the adjusted sound synthesis parameters to obtain audio data corresponding to the target language;
and synthesizing the audio data corresponding to the target language and the video data to obtain synthesized media data.
2. The method according to claim 1, wherein the translating of the transcribed text data to obtain the translated text data in the target language specifically comprises:
assembling the transcribed text data according to an input parameter assembly rule corresponding to a preset translation line to obtain translation input data corresponding to the transcribed text data;
calling the preset translation line, and inputting the translation input data into the preset translation line for translation to obtain translation output data;
and parsing the translation output data according to an output parameter parsing rule corresponding to the preset translation line to obtain the translated text data.
3. The method of claim 2, wherein prior to said invoking said preset translation line, said method further comprises:
obtaining a verification seed corresponding to the preset translation line, and generating a verification token corresponding to the verification seed according to a token generation rule;
and verifying the preset translation line by using the verification token, and determining that the preset translation line is in a callable state if the verification passes.
4. The method according to claim 1, wherein the obtaining of the text semantic parameters corresponding to the translated text data specifically comprises:
dividing the translated text data according to a text structure corresponding to the translated text data to obtain a plurality of sentences corresponding to the translated text data;
and respectively acquiring the semantic parameters corresponding to each sentence, and determining the text semantic parameters corresponding to the translated text data according to the semantic parameters corresponding to each sentence.
5. The method according to claim 1, wherein the receiving source media data specifically comprises:
receiving the source media data sent by a video publishing terminal;
the synthesizing the audio data corresponding to the target language and the video data to obtain synthesized media data specifically includes:
acquiring a playing language corresponding to a video playing terminal, and acquiring audio data corresponding to the playing language from audio data corresponding to the target language;
synthesizing the audio data corresponding to the playing language and the video data to obtain playing media data;
and sending the playing media data to the video playing terminal.
6. The method according to claim 5, wherein the synthesizing the audio data corresponding to the playing language and the video data to obtain playing media data specifically comprises:
acquiring translated text data corresponding to the playing language;
and synthesizing the translated text data and the audio data corresponding to the playing language with the video data to obtain the playing media data.
7. The method according to claim 5 or 6, wherein the acquiring of the playing language corresponding to the video playing terminal specifically comprises:
determining the playing language of the video playing terminal according to the geographical position of the video playing terminal; or,
determining the playing language of the video playing terminal according to a common language corresponding to the video playing terminal; or,
and parsing, from the playing instruction sent by the video playing terminal, the playing language indicated by the playing instruction.
8. A media data processing apparatus, comprising:
the source data receiving module is used for receiving source media data, wherein the source media data comprise video data and source audio data;
the audio data translation module is used for performing speech recognition on the source audio data to obtain transcribed text data and translating the transcribed text data to obtain translated text data in a target language;
the sound parameter adjusting module is used for acquiring text semantic parameters corresponding to the translated text data and adjusting preset sound synthesis parameters based on the text semantic parameters;
the speech synthesis module is used for performing speech synthesis on the translated text data according to the adjusted sound synthesis parameters to obtain audio data corresponding to the target language;
and the media data synthesis module is used for synthesizing the audio data corresponding to the target language and the video data to obtain synthesized media data.
9. A storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the media data processing method of any one of claims 1 to 7.
10. A computer device comprising a storage medium, a processor and a computer program stored on the storage medium and executable on the processor, characterized in that the processor implements the media data processing method of any one of claims 1 to 7 when executing the computer program.
CN202011434920.6A 2020-12-10 2020-12-10 Media data processing method and device, storage medium and computer equipment Pending CN112562733A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011434920.6A CN112562733A (en) 2020-12-10 2020-12-10 Media data processing method and device, storage medium and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011434920.6A CN112562733A (en) 2020-12-10 2020-12-10 Media data processing method and device, storage medium and computer equipment

Publications (1)

Publication Number Publication Date
CN112562733A 2021-03-26

Family

ID=75060473

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011434920.6A Pending CN112562733A (en) 2020-12-10 2020-12-10 Media data processing method and device, storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN112562733A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117727306A (en) * 2023-12-21 2024-03-19 青岛润恒益科技有限公司 Pickup translation method, device and storage medium based on original voiceprint features


Similar Documents

Publication Publication Date Title
US11917344B2 (en) Interactive information processing method, device and medium
CN110085244B (en) Live broadcast interaction method and device, electronic equipment and readable storage medium
CN107040452B (en) Information processing method and device and computer readable storage medium
KR101628050B1 (en) Animation system for reproducing text base data by animation
CN112037792B (en) Voice recognition method and device, electronic equipment and storage medium
CN107705782B (en) Method and device for determining phoneme pronunciation duration
CN113257218B (en) Speech synthesis method, device, electronic equipment and storage medium
CN109582825B (en) Method and apparatus for generating information
CN107808007A (en) Information processing method and device
CN113035199B (en) Audio processing method, device, equipment and readable storage medium
EP2747464A1 (en) Sent message playing method, system and related device
CN112116903A (en) Method and device for generating speech synthesis model, storage medium and electronic equipment
CN115691544A (en) Training of virtual image mouth shape driving model and driving method, device and equipment thereof
US20140129228A1 (en) Method, System, and Relevant Devices for Playing Sent Message
CN110379406A (en) Voice remark conversion method, system, medium and electronic equipment
CN112562733A (en) Media data processing method and device, storage medium and computer equipment
US20230326369A1 (en) Method and apparatus for generating sign language video, computer device, and storage medium
CN111862933A (en) Method, apparatus, device and medium for generating synthesized speech
CN115967833A Video generation method, device, equipment and storage medium
CN113312928A (en) Text translation method and device, electronic equipment and storage medium
CN113221514A (en) Text processing method and device, electronic equipment and storage medium
CN111027332A (en) Method and device for generating translation model
CN113132789B (en) Multimedia interaction method, device, equipment and medium
CN112383722B (en) Method and apparatus for generating video
CN115604535A (en) Video data processing method and device, storage medium and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination