WO2024051760A1 - Video processing method and electronic device - Google Patents

Video processing method and electronic device

Info

Publication number
WO2024051760A1
WO2024051760A1 PCT/CN2023/117369 CN2023117369W WO2024051760A1 WO 2024051760 A1 WO2024051760 A1 WO 2024051760A1 CN 2023117369 W CN2023117369 W CN 2023117369W WO 2024051760 A1 WO2024051760 A1 WO 2024051760A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
target
content
original
optimized
Prior art date
Application number
PCT/CN2023/117369
Inventor
陈羽飞
刘奎龙
杨昌源
包季真
沈琳
李可娜
吴燕晶
严文俊
缪瑜
Original Assignee
杭州阿里巴巴海外互联网产业有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 杭州阿里巴巴海外互联网产业有限公司
Publication of WO2024051760A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81 Monomedia components thereof

Definitions

  • This application relates to the field of video processing technology, and in particular to video processing methods and electronic devices.
  • Live streaming is a commonly used way to "bring goods" (sell products). This method seems simple, but in fact it places relatively high demands on the host and on operations. Moreover, it uses consumers' time inefficiently, and the return rate caused by impulse purchases is high. Therefore, many merchants choose to "bring goods" through short videos. For example, they can refine the selling points of the goods and then express the main selling points more vividly by having real people appear on camera and explain the physical products. Alternatively, some larger brands can "tell stories" through short videos, transforming the brand's sentiment and value proposition into stories expressed in short videos, and so on. In contrast, "bringing goods" through short videos is more stable, can reduce the return rate caused by impulse purchases, and is suitable for expressing product selling points, brand concepts or value propositions in a calmer way.
  • This application provides a video processing method and electronic equipment, which can save costs while improving the information expression ability of the processed video in a specific target language.
  • a video processing method including:
  • the content to be optimized is optimized to generate a target video, and the target video has the characteristics of the video content produced in the target region.
  • the content determined to be optimized includes:
  • the optimization process of the content to be optimized includes:
  • the human body image is processed according to the facial features, skin color and/or hair color characteristic information corresponding to the target region, so that the processed human body image has the facial features, skin color and/or hair color characteristics corresponding to the target region.
  • Lip shape matching is performed on the processed human body image according to the target speech content.
  • the target voice content is processed so that its duration is consistent with the pronunciation duration of the original voice content.
  • the content determined to be optimized includes:
  • the optimization process of the content to be optimized includes:
  • the content determined to be optimized includes:
  • the optimization process of the content to be optimized includes:
  • the translation is rewritten based on the idioms used in the corresponding scene of the video content produced locally in the target region.
  • the processing requirement information also includes: the speech type during speech synthesis, the speech type includes the standard pronunciation method corresponding to the target language, or the pronunciation method with a local accent;
  • the optimization process of the content to be optimized includes:
  • the speech synthesis process is performed on the translation according to the speech type to generate the target speech content in the target video.
  • the content determined to be optimized includes:
  • the optimization process of the content to be optimized includes:
  • the subtitle content corresponding to the original voice content is erased, and the text content corresponding to the target voice content translated according to the target language is backfilled into the subtitle display area in the image.
  • optimizing the content to be optimized includes:
  • the text content corresponding to the target voice content translated according to the target language is filled into the target display area in the image screen, so as to add subtitles to the target video.
  • the subtitles are formatted according to the subtitle formatting method in the video content produced locally in the target region.
  • optimizing the content to be optimized includes:
  • the original video to be processed and the processing requirement information include:
  • the user is provided with operation options for uploading videos and submitting processing requirement information through the target interface, so as to determine the original video based on the video uploaded by the user, and determine the processing requirement information based on the requirement information submitted by the user.
  • the original video to be processed and the processing requirement information include:
  • the explanation video is determined as the video to be explained, and the operation option for submitting the processing requirement is provided.
  • a video processing device including:
  • the original video determination unit is used to determine the original video to be processed, and the processing requirement information, where the processing requirement information includes the required target language;
  • a content to be optimized determining unit configured to determine the content to be optimized during the process of translating the original video according to the target language;
  • an optimization processing unit configured to optimize the content to be optimized according to the characteristic information of the video content produced in the target region corresponding to the target language, so as to generate a target video, and make the target video have the characteristics of the video content produced in the target region.
  • An electronic device including:
  • a memory associated with the one or more processors, the memory being used to store program instructions that, when read and executed by the one or more processors, perform any of the foregoing methods.
  • In the process of translating the original video according to the target language, the content to be optimized can be determined, and the content to be optimized can be optimized based on the characteristic information of the video content produced in the target region corresponding to the target language, so that the generated target video has the characteristics of video content produced locally in the target region.
  • videos suitable for users in various regions can be obtained at a lower cost, making the video content look more real and friendly, thereby improving the information expression ability of the processed video in specific target languages.
  • Merchants only need to produce the original video in one language, and can use the tools provided by the embodiments of this application to translate it and obtain target videos in other target languages, with the content of the target video localized according to the target language. Therefore, the video production cost of merchants can be reduced and the video production efficiency can be improved.
  • Figure 1 is a schematic diagram of the system architecture provided by the embodiment of the present application.
  • Figure 2 is a flow chart of the method provided by the embodiment of the present application.
  • Figure 3 is a schematic diagram of comparison before and after video optimization processing provided by the embodiment of the present application.
  • Figure 4 is a schematic diagram of a device provided by an embodiment of the present application.
  • Figure 5 is a schematic diagram of an electronic device provided by an embodiment of the present application.
  • Another possible way is to perform speech recognition on the original video, translate the recognized text into multiple languages, synthesize speech in each of those languages through speech synthesis, and replace the original speech with each synthesized track, generating multiple videos that each carry the voice explanation in a different language.
  • This method allows users in other regions to understand the explanation in the video without relying on subtitles. However, since short videos often feature real people (for example, models or anchors), users in other regions may still have a poor experience with this kind of video. For example, in a video seen by users in a European or American country, an anchor with an Asian face may be explaining in English.
  • In that case the sense of closeness and realism will be relatively low, which can also affect the viewer's trust in the explanation.
  • In addition, the converted voice content may sound stilted, and some translations may even be inaccurate.
  • For example, when anchors record short videos, they usually use a colloquial style of introduction and often use Internet slang, such as calling "sisters" "Jimei", or abbreviations formed from the first letters of Pinyin, such as "YYDS". If such content is translated directly, errors may occur, or it may not be translatable at all.
  • Other regions may also have their own Internet terms, or customary usage of certain words in certain scenarios.
  • For example, the word "good" can be expressed in many ways in English, and in the "bringing goods" scenario abroad, more exaggerated wording may be customary. If these factors are ignored and the content is translated directly, the processed speech content may not be vivid or attractive enough.
  • video processing related tools can be provided for users such as businesses.
  • Video content produced locally in multiple regions (for example, videos recorded by models or anchors in Europe and the United States, videos recorded by models or anchors in Southeast Asia, etc.) can be obtained in advance.
  • this kind of characteristic information can include the model's facial features, skin color, hair color and other characteristics, and can also include some idiomatic expressions, Internet slang characteristics, subtitle layout characteristics, local popular music type characteristics, etc.
  • users can upload locally saved pre-recorded short videos as the original videos to be processed, and can submit specific processing requirements, such as the required target language, etc.
  • The tool can then perform translation processing on the original video (for example, speech recognition, text translation, speech synthesis, and synthesis of speech and images) to generate the target video, and during this process can identify the content that needs to be optimized. For example, it can determine whether the video contains a character's head image, voice content, subtitles, etc.; if so, these can be determined as content to be optimized.
  • The content to be optimized can be optimized based on the characteristic information of locally produced video content in the target region corresponding to the above target language. For example, the facial features, skin color, hair color, etc. of the person's head image are optimized so that the processed head image looks like a local of the above-mentioned target region.
  • The speech recognition result of the original speech content can also be optimized, for example by copywriting rewriting, to ensure the accuracy of the translation.
  • The translated text content can also be rewritten to better match the conventions used in the corresponding scenario when video content is produced locally in the above target region.
  • Through such localization processing, the generated target video can have the characteristics of video content produced locally in the target region, which improves the viewing experience of users in the target region, improves the information expression ability of the processed video in the specific target language, and reduces costs for merchants.
  • Manual review can also be carried out on the above automated processing results, and any inappropriate points can be corrected manually.
  • manual processing options can be provided for head image processing, translation, copywriting rewriting and other aspects, and users can choose if they need it.
  • embodiments of the present application can provide video processing tools for merchants and other users.
  • The tools can exist in the form of Web, H5 and other pages, in the form of applications (for example, a mobile terminal App), in the form of small programs, light applications, etc., or as a functional module within some applications.
  • a typical application scenario is that a merchant has recorded a short video and needs to publish it for users in other regions to watch.
  • The merchant can upload the short video through the tool client provided by the embodiment of this application and submit specific requirements, such as the required language. The tool can then generate a video with the characteristics of video content produced locally in the region corresponding to that language, and the merchant can publish this video to the corresponding region.
  • the specific processing process can be completed on the server side. After the server side processing is completed, it is returned to the user through the client side.
  • the video material of the product can be subsequently displayed to users through the product details page and other pages.
  • a download entrance can also be provided, and the user can download the processed video locally, and then publish it to other channels (for example, some content service systems, etc.), and so on.
  • the embodiment of this application provides a video processing method from the perspective of the aforementioned server. See Figure 2.
  • the method may include:
  • S201 Determine the original video to be processed and the processing requirement information, where the processing requirement information includes the required target language.
  • the original video to be processed can be a short video uploaded by the user.
  • the user can be provided with an interface for specifically initiating a video processing request.
  • operation options for uploading local videos can be provided.
  • Operation options for submitting processing requirements, and so on, can also be provided.
  • users can upload specific original videos to be processed and specific processing requirement information through this interface.
  • the processing requirement information can at least include the required target language, or more detailed requirements can also be put forward.
  • the specific processing process involves speech recognition, translation, and speech synthesis of speech content. During speech synthesis, you can choose a standard pronunciation method or a pronunciation method with some local characteristics. If the user needs a certain characteristic pronunciation mode, you can choose it in this interface, otherwise it can default to the standard pronunciation mode, etc.
  • a product information service system can provide a service that automatically intercepts and publishes product explanation videos based on live broadcast content.
  • The playback interface of such explanation videos can provide merchant users with an operation entrance for processing product explanation videos.
  • Merchant users can initiate a request to process the explanation video by clicking this entrance. They can also submit the required target language and other requirement information.
  • The video to be processed can also be a video stream during a live broadcast; that is, the video content can be processed during the live broadcast, so that, in addition to recording the original video, target videos corresponding to different languages can be generated and released to users in various regions.
  • S202 In the process of translating the original video according to the target language, determine the content to be optimized.
  • the specific content to be optimized may include image content and may also include audio content.
  • The character head image content can be detected from the image content of the original video. If it exists, it can be determined as content to be optimized. It can also be determined whether the image content of the original video contains subtitles, "flower characters" and other content. If it does, that content can also be determined as content to be optimized.
  • For "flower character" content, its area ratio within the frame can also be calculated. If the area ratio is greater than a certain threshold, no processing is required.
  • the voice content can be mainly used as the content to be optimized.
  • background music can also be included, and so on.
  • videos to be processed will correspond to various situations. For example, some videos may have real models appearing for explanations, and some may not have real models appearing, etc. Therefore, when specifically determining the content to be optimized from the video to be processed, the video to be processed can be judged from multiple dimensions, and then the content to be optimized included in the current video to be processed can be determined, and so on.
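  • The multi-dimensional judgment described above can be sketched as follows. This is only an illustrative sketch: the `VideoAnalysis` fields and the `contents_to_optimize` function are hypothetical names, and the detection results are assumed to come from upstream detectors that the application does not specify.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VideoAnalysis:
    """Hypothetical per-video detection results (field names are assumptions)."""
    has_head_image: bool = False
    has_subtitles: bool = False
    has_flower_characters: bool = False
    has_speech: bool = False
    has_background_music: bool = False


def contents_to_optimize(analysis: VideoAnalysis) -> List[str]:
    """Return the kinds of content present in the video, in a fixed order."""
    checks = [
        ("head_image", analysis.has_head_image),
        ("subtitles", analysis.has_subtitles),
        ("flower_characters", analysis.has_flower_characters),
        ("speech", analysis.has_speech),
        ("background_music", analysis.has_background_music),
    ]
    # Only content that was actually detected becomes content to be optimized.
    return [name for name, present in checks if present]
```

  • A video without a real model on screen would simply yield a list without `head_image`, so the later appearance and lip shape steps would be skipped for it.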
  • S203 According to the characteristic information of the video content produced in the target region corresponding to the target language, optimize the content to be optimized to generate a target video, and make the target video have the characteristics of the video content produced in the target region.
  • The characteristic information of video content produced in the target region corresponding to the target language can be determined based on pre-saved information. This characteristic information can be referred to as localized features.
  • specific feature information may include: facial features, skin color, hair color and other appearance features of local models in the target area, some customary terminology features in scenarios such as product explanations, layout features when adding subtitles to videos, Features of popular music that locals prefer, etc. These feature information can be saved in advance through configuration files, etc.
  • The localized features corresponding to the currently required target language can be obtained from this saved information, and the content to be optimized can then be processed based on these features, so that the generated target video has the characteristics of video content produced locally in the target region.
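  • As a minimal sketch of reading pre-saved localized features from a configuration file, assuming a JSON layout keyed by region: the layout, the key names and every value in `EXAMPLE_CONFIG` are placeholders invented for illustration, and the application does not prescribe a format.

```python
import json
from typing import Any, Dict

# Placeholder configuration; keys and values are invented for illustration only.
EXAMPLE_CONFIG = """
{
  "region_a": {
    "appearance": {"skin_color": "placeholder", "hair_color": "placeholder"},
    "scene_idioms": {"product_explanation": ["placeholder phrase"]},
    "subtitle_layout": "placeholder",
    "popular_music": ["placeholder genre"]
  }
}
"""


def load_localized_features(config_text: str, region: str) -> Dict[str, Any]:
    """Look up the saved localized features for one target region."""
    config = json.loads(config_text)
    if region not in config:
        raise KeyError(f"no localized features saved for region {region!r}")
    return config[region]
```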
  • If a human body image is detected, it can be optimized based on the facial features, skin color, and/or hair color characteristic information of the target region corresponding to the specific target language, so that the processed human body image has the facial features, skin color and/or hair color characteristics corresponding to the target region.
  • The correspondence between a specific language and a region can be determined based on the official language of the region. For example, if the target language is English, the corresponding target region can be Europe and the United States by default. Some countries have more than one official language; for example, India may have both English and Hindi as official languages, so English can also be used as the target language for Indian users. However, since the appearance of people in India obviously differs from that of people in Europe and the United States, a merchant that needs to publish videos to a country such as India can specify the country or region when submitting the processing requirements.
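  • The default mapping with an explicit override can be sketched as below. The function name, the language code `en` and the region identifiers are assumptions made for illustration; only the English default and the India override come from the description.

```python
from typing import Optional

# Default region per target language; only the English default is taken from
# the description, and the identifiers themselves are hypothetical.
DEFAULT_REGION_BY_LANGUAGE = {
    "en": "europe_and_united_states",
}


def resolve_target_region(target_language: str,
                          specified_region: Optional[str] = None) -> str:
    """Prefer the region named in the processing requirements, else the default."""
    if specified_region:
        return specified_region
    if target_language not in DEFAULT_REGION_BY_LANGUAGE:
        raise ValueError(f"no default region for language {target_language!r}")
    return DEFAULT_REGION_BY_LANGUAGE[target_language]
```

  • With this sketch, `resolve_target_region("en")` falls back to the default region, while `resolve_target_region("en", "india")` keeps English as the target language but selects the localized features of India.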
  • one of the main tasks is to translate the voice content, and then generate the voice content corresponding to the target language through speech synthesis and other methods, and then re-synthesize it with the image content to generate the target video.
  • the process of re-synthesizing the image content in order to make it appear that the characters in the video are explaining the speech content of the specific target language, it also involves re-matching the lips of the characters in the video.
  • Lip shape matching can be performed on the processed human body image according to the synthesized target speech content after the above processing of the human body image is completed.
  • For example, assume that Figure 3(A) shows a certain frame in the original video.
  • The character in the original video is Chinese, and the short video introducing the product was recorded in Chinese.
  • Assume that the target language is a Southeast Asian language.
  • After processing, as shown in Figure 3(B), the appearance of the character in the video has the characteristics of Southeast Asian people.
  • In addition, the lip shape of the character corresponds to the speech of the translated target language.
  • Similarly, if the target region is a European or American country/region, the target language is English.
  • In that case, the facial features and skin color of the character in the video are processed according to the characteristics of European and American countries/regions, and the lip shape of the character is matched accordingly.
  • In this way, the target video in English can be generated.
  • the specific lip shape matching processing can also be completed by the corresponding algorithm model, which will not be described in detail here.
  • the target speech content can also be processed so that its duration is consistent with the pronunciation duration of the original speech content. For example, if the pronunciation duration of the target voice content is longer than the original voice content, the length of the copy can be shortened by summarizing the text content corresponding to the target voice content, thereby shortening the pronunciation duration of the target voice content.
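  • Conversely, if the pronunciation duration of the target voice content is shorter than that of the original voice content, a copy generation algorithm can be used to increase the length of the copy, thereby increasing the pronunciation duration of the target voice content.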
  • the specific copy summary algorithm and copy generation algorithm will not be described in detail here.
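  • The choice between summarizing and expanding the copy can be sketched as a simple comparison of durations. The function name, the returned labels and the `tolerance` parameter are assumptions; the description only says that the two durations should be made consistent and leaves the summary and generation algorithms unspecified.

```python
def copy_adjustment(original_seconds: float, target_seconds: float,
                    tolerance: float = 0.5) -> str:
    """Decide how to change the translated copy so the durations match.

    `tolerance` (seconds) is an assumed slack within which no change is made.
    """
    if original_seconds <= 0 or target_seconds <= 0:
        raise ValueError("durations must be positive")
    difference = target_seconds - original_seconds
    if difference > tolerance:
        return "summarize"  # target speech is too long: shorten the copy
    if difference < -tolerance:
        return "expand"     # target speech is too short: lengthen the copy
    return "keep"
```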
  • the process of processing voice content may also involve copywriting rewriting and other processing.
  • For example, Internet terms that may cause errors or be untranslatable during translation can be identified in the original text content.
  • This kind of content can also be treated as content to be optimized.
  • The translation can also be rewritten according to the idioms used in the corresponding scenario in video content produced locally in the target region. For example, when the same meaning can be expressed through multiple different words or phrases, the words or phrases better suited to the "carrying goods" scenario can be chosen to ensure the expression effect.
  • the specific rewriting process can also be implemented based on information such as "dictionaries" that are pre-configured for various regions in specific scenarios.
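  • A dictionary-based rewrite of this kind might look like the sketch below. The function name and `EXAMPLE_DICTIONARY` are hypothetical: the "Jimei" entry follows the example given earlier in the description, the gloss for "YYDS" is an illustrative rendering, and a real dictionary would be configured per region and scenario.

```python
import re
from typing import Dict

# Illustrative entries only; a real dictionary would be pre-configured.
EXAMPLE_DICTIONARY = {
    "Jimei": "sisters",
    "YYDS": "the greatest of all time",
}


def rewrite_terms(text: str, dictionary: Dict[str, str]) -> str:
    """Replace every dictionary term found in the text with its rewrite."""
    if not dictionary:
        return text
    # Try longer terms first so that overlapping entries prefer the longer match.
    terms = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(term) for term in terms))
    return pattern.sub(lambda match: dictionary[match.group(0)], text)
```

  • The same function could be applied either to the recognized source text before translation or to the translation itself, depending on which dictionary is passed in.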
  • When performing speech synthesis, the dialect, accent and other characteristics of the specific region can also be considered. For example, in some countries or regions with two official languages, although English is an official language, people there may speak English with an obvious local accent. Therefore, multiple voice packages with different accents can be prepared in advance for the same language, for example standard English, Indian English, American English, and so on. In this way, when the user submits specific processing requirements, a variety of optional voice types can be offered.
  • the specific voice types include the standard pronunciation method corresponding to the target language, or the pronunciation method with a local accent.
  • The translation can be speech synthesized according to the speech type selected by the user to generate the target speech content. For example, if the target language selected by the user is English and the speech type is Indian English, then after the English translation is obtained, the Indian English voice package can be used for speech synthesis, so that the voice content in the generated video is English speech with Indian accent characteristics, which can feel more familiar to users in India.
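  • Selecting a voice package from the user's requirements can be sketched as a lookup that falls back to the standard pronunciation. The package identifiers and the function name are hypothetical; the three English variants mirror the examples named in the description.

```python
from typing import Optional

# Hypothetical voice package identifiers keyed by (language, speech type).
VOICE_PACKAGES = {
    ("en", "standard"): "en_standard",
    ("en", "indian"): "en_indian",
    ("en", "american"): "en_american",
}


def select_voice_package(language: str, speech_type: Optional[str] = None) -> str:
    """Pick the voice package; default to the standard pronunciation."""
    key = (language, speech_type or "standard")
    if key not in VOICE_PACKAGES:
        raise KeyError(f"no voice package for {key!r}")
    return VOICE_PACKAGES[key]
```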
  • Subtitle content can also be used as content to be optimized. Subtitle optimization can be divided into two situations. In one case, if there are subtitles in the original video, the subtitle content corresponding to the original voice content can be determined from the original video and erased from the original image (after erasing, damaged image areas can also be repaired), and then the text content corresponding to the target voice content obtained by translating according to the target language can be backfilled into the subtitle display area in the image.
  • In the other case, if there are no subtitles in the original video, the text content corresponding to the target voice content translated according to the target language can be filled into a target display area in the image, so as to add subtitles to the target video. In either case, if the text content corresponding to the target speech content has been rewritten, the rewritten text content is backfilled or added to the subtitle display area.
  • Backfilling or adding subtitles involves aligning the new subtitles with the timeline of the synthesized speech.
  • The specific alignment method, as well as the aforementioned subtitle erasure and the repair of image content after erasure, can all be implemented using relevant algorithms, which will not be described in detail here.
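  • One simple way to align new subtitles with the synthesized speech, shown here only as a sketch, is to give each subtitle the time span of its synthesized speech segment. It assumes that the segments are contiguous and that each segment's duration is known from the synthesizer; the function names are hypothetical, and the timestamp layout follows the common SubRip (SRT) convention, which the application does not mandate.

```python
from typing import List, Tuple


def build_cues(segments: List[Tuple[str, float]],
               start: float = 0.0) -> List[Tuple[float, float, str]]:
    """Turn (text, duration_seconds) segments into (start, end, text) cues."""
    cues = []
    cursor = start
    for text, duration in segments:
        if duration <= 0:
            raise ValueError("segment duration must be positive")
        cues.append((cursor, cursor + duration, text))
        cursor += duration  # the next subtitle starts where this speech ends
    return cues


def format_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (SubRip-style timestamp)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"
```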
  • The subtitles may also be formatted according to the subtitle formatting methods popular in video content produced locally in the target region. For example, for Japanese, Japanese typesetting conventions can be used, and so on.
  • The original image may also include some "flower character" content.
  • For example, the main selling points of a product and other information are expressed in text, and the text is added to one or several frames of the video.
  • This kind of "flower character" content can be erased.
  • After erasing, the image can also be repaired through picture repair technology.
  • If the area occupied by the "flower characters" in the picture is too large, the picture may be difficult to repair after erasing. In this case, the erasing operation need not be performed, that is, the "flower character" content is retained; for foreign users, this kind of content usually does not affect viewing.
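  • The area-based decision can be sketched as follows. The box format, the function names and the default threshold of 0.3 are assumptions made for illustration; the description only refers to "a certain threshold". The sketch also assumes that the detected text boxes do not overlap.

```python
from typing import Iterable, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in pixels


def flower_text_area_ratio(boxes: Iterable[Box],
                           frame_width: int, frame_height: int) -> float:
    """Fraction of the frame covered by the detected text boxes."""
    if frame_width <= 0 or frame_height <= 0:
        raise ValueError("frame dimensions must be positive")
    covered = sum(width * height for _, _, width, height in boxes)
    return min(covered / (frame_width * frame_height), 1.0)


def should_erase(boxes: Iterable[Box], frame_width: int, frame_height: int,
                 max_ratio: float = 0.3) -> bool:
    """Erase only when the covered area is small enough to repair afterwards."""
    return flower_text_area_ratio(boxes, frame_width, frame_height) <= max_ratio
```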
  • background music can be added to a specific video or the original background music can be replaced according to the music preference information of target users in the target region.
  • copyrighted music materials can be used to add or replace, or artificial intelligence can be used to generate smart music that meets copyright conditions, etc.
  • the image content, voice content, background music, etc. in the original video can be processed from multiple aspects to make it have the characteristics of video content produced locally in the target region.
  • the target video can be generated.
  • This target video is a video with the characteristics of video content produced locally in the target region.
  • the target video can be returned to the current merchant and other users.
  • the download link of the target video can be returned specifically, so that the merchant can download the target video locally and select a specific delivery channel for delivery according to actual needs.
  • an operation option for publishing the target video to the product information service system may also be provided on the page.
  • The target video can also be bound to the published products in the product information service system, together with the associated language information, required target region, etc. In this way, the product information service system can save the target video as material for the product.
  • The target video can then be placed on a specific product details page to display to specific users, etc.
  • In the process of translating the original video according to the target language, the content to be optimized can be determined, and the content to be optimized can be optimized based on the characteristic information of the video content produced locally in the target region corresponding to the target language, so that the generated target video has the characteristics of video content produced locally in the target region. In this way, videos suitable for users in various regions can be obtained at a lower cost, making the video content look more real and friendly, thereby improving the information expression ability of the processed video in specific target languages.
  • Embodiments of this application may involve the use of user data. In actual applications, user-specific personal data can be used in the scenarios described herein to the extent permitted by, and in compliance with, the applicable laws and regulations of the country where the user is located (for example, with the user's explicit consent and effective notification of the user).
  • the device may include:
  • the original video determination unit 401 is used to determine the original video to be processed and processing requirement information, where the processing requirement information includes the required target language;
  • the content to be optimized determining unit 402 is used to determine the content to be optimized in the process of translating the original video according to the target language;
  • the optimization processing unit 403 is configured to optimize the content to be optimized according to the characteristic information of the video content produced in the target region corresponding to the target language, so as to generate a target video, and make the target video have the characteristics of the video content produced in the target region.
  • the to-be-optimized content determination unit can be specifically used for:
  • the optimization processing unit can be specifically used for:
  • the human body image is processed according to the facial features, skin color and/or hair color characteristic information corresponding to the target region, so that the processed human body image has the facial features, skin color and/or hair color characteristics corresponding to the target region.
  • the optimization processing unit can also be used to:
  • optimization processing unit can also be used for:
  • the target voice content is processed so that its duration is consistent with the pronunciation duration of the original voice content.
  • the content to be optimized determining unit may be specifically used to:
  • the optimization processing unit can be specifically used to:
  • the content to be optimized determination unit can be specifically used for:
  • the optimization processing unit can also be used to:
  • the translation is rewritten based on the idioms used in the corresponding scene of the video content produced locally in the target region.
  • the processing requirement information may also include: a speech type during speech synthesis, where the speech type includes a standard pronunciation method corresponding to the target language, or a pronunciation method with a local accent;
  • the optimization processing unit can be specifically used to:
  • the speech synthesis process is performed on the translation according to the speech type to generate the The target speech content in the target video.
  • the content to be optimized determination unit can be specifically used for:
  • the optimization processing unit can be specifically used for:
  • the subtitle content corresponding to the original voice content is erased, and the text content corresponding to the target voice content translated according to the target language is backfilled into the subtitle display area in the image.
  • optimization processing unit can be specifically used for:
  • the text content corresponding to the target voice content translated according to the target language is filled into the target display area in the image screen, so as to provide the Add subtitles to the target video.
  • optimization processing unit can also be used for:
  • the subtitles are formatted according to the subtitle formatting method in the video content produced locally in the target region.
  • the optimization processing unit can also be used for:
  • the original video determination unit can be used to:
  • the user is provided with operation options of uploading a video and submitting processing requirement information through the target interface, so as to determine the original video based on the video uploaded by the user, and determine the processing requirement information based on the requirement information submitted by the user.
  • the original video determination unit may be specifically used to:
  • the explanation video is determined as the video to be explained, and the operation option for submitting the processing requirement is provided.
  • the device may also include:
  • a returning unit configured to return the target video for publishing the target video to target users in the target region.
  • the device may include:
  • a publishing option providing unit is used to provide operation options for publishing the target video to the target product information service system
  • a publishing unit configured to provide operation options for configuring the associated target product after receiving the publishing request through the operation option, so as to publish the target video as video material associated with the target product, and send the target video to the target product after receiving the publishing request through the operation option.
  • the target video is displayed.
  • In addition, the embodiments of this application also provide a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the steps of the method described in any one of the foregoing method embodiments are implemented.
  • The embodiments also provide an electronic device, including:
  • one or more processors; and
  • a memory associated with the one or more processors, the memory being used to store program instructions that, when read and executed by the one or more processors, perform the steps of the method described in any one of the foregoing method embodiments.
  • FIG. 5 shows an example architecture of the electronic device, which may include a processor 510, a video display adapter 511, a disk drive 512, an input/output interface 513, a network interface 514, and a memory 520.
  • The processor 510, video display adapter 511, disk drive 512, input/output interface 513, network interface 514, and memory 520 can be connected through a communication bus 530.
  • The processor 510 can be implemented as a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is used to execute relevant programs to implement the technical solutions provided by this application.
  • The memory 520 can be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory), a static storage device, a dynamic storage device, etc.
  • The memory 520 may store an operating system 521 for controlling the operation of the electronic device 500 and a basic input output system (BIOS) for controlling low-level operations of the electronic device 500.
  • In addition, a web browser 523, a data storage management system 524, a video processing system 525, etc. can also be stored.
  • The video processing system 525 can be the application program that implements the aforementioned steps in the embodiments of this application.
  • In short, when the technical solutions provided by this application are implemented in software or firmware, the relevant program code is stored in the memory 520 and called and executed by the processor 510.
  • The input/output interface 513 is used to connect input/output modules to realize information input and output.
  • The input/output module can be configured in the device as a component (not shown in the figure), or can be externally connected to the device to provide the corresponding functions.
  • Input devices can include keyboards, mice, touch screens, microphones, various sensors, etc., and output devices can include monitors, speakers, vibrators, indicator lights, etc.
  • The network interface 514 is used to connect a communication module (not shown in the figure) to realize communication between this device and other devices.
  • The communication module can communicate through wired methods (such as USB or network cables) or wirelessly (such as mobile networks, Wi-Fi, or Bluetooth).
  • The bus 530 includes a path that carries information between the components of the device (for example, the processor 510, video display adapter 511, disk drive 512, input/output interface 513, network interface 514, and memory 520).
  • It should be noted that although the device above shows only the processor 510, video display adapter 511, disk drive 512, input/output interface 513, network interface 514, memory 520, and bus 530, in a specific implementation the device may also include other components necessary for normal operation.
  • In addition, those skilled in the art will understand that the device may also include only the components necessary to implement the solution of this application, and does not necessarily include all the components shown in the drawings.
  • From the description of the above embodiments, those skilled in the art can clearly understand that this application can be implemented by means of software plus the necessary general-purpose hardware platform. Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product.
  • The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments, or in certain parts of the embodiments, of this application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

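The embodiments of this application disclose a video processing method and an electronic device. The method includes: determining an original video to be processed and processing requirement information, where the processing requirement information includes the required target language; determining content to be optimized in the process of translating the original video according to the target language; and optimizing the content to be optimized according to characteristic information of video content produced in the target region corresponding to the target language, so as to generate a target video that has the characteristics of video content produced in the target region. The embodiments of this application can improve the information expression ability of the processed video in the specific target language while saving cost.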

Description

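Video processing method and electronic device
This application claims priority to Chinese patent application No. 202211102445.1, entitled "Video processing method and electronic device", filed with the China Patent Office on September 9, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of video processing technology, and in particular to a video processing method and an electronic device.
Background
Live streaming is a commonly used way of "bringing goods" (selling products through content). It looks simple, but it actually places high demands on the host and on operations; moreover, it makes poor use of consumers' time, and the return rate caused by impulse purchases is high. Many merchants therefore choose to sell through short videos. For example, they can distill the selling points of a product and then present the main selling points more vividly by having a real person appear on camera and explain the physical product. Alternatively, larger brands can "tell stories" through short videos, turning the brand's sentiment and value proposition into a story expressed in short-video form, and so on. By comparison, selling through short videos is steadier, can reduce the return rate caused by impulse purchases, and is suited to conveying product selling points, brand ideas, or value propositions in a calmer, more gradual way.
However, short-video selling has a drawback: constrained by language and by the immaturity of overseas content creation and service agencies, merchants encounter many difficulties when selling to overseas users through short videos. For example, overseas users may find it hard to understand the spoken commentary in a short video. A merchant could of course hire models or hosts in several countries to record short videos in different languages for release in each region. Although this solves the problem of users in different regions understanding the video content, the production cost is high and the efficiency is low, and for some small and medium-sized merchants it may even be unaffordable.
Summary
This application provides a video processing method and an electronic device that can improve the information expression ability of the processed video in the specific target language while saving cost.
This application provides the following solutions: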
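A video processing method, including:
determining an original video to be processed and processing requirement information, where the processing requirement information includes the required target language;
determining content to be optimized in the process of translating the original video according to the target language;
optimizing the content to be optimized according to characteristic information of video content produced in the target region corresponding to the target language, so as to generate a target video that has the characteristics of video content produced in the target region.
Wherein, determining the content to be optimized includes:
performing human body image detection on the image content of the original video;
and optimizing the content to be optimized includes:
processing the human body image according to the facial feature, skin color and/or hair color characteristic information corresponding to the target region, so that the processed human body image has the facial feature, skin color and/or hair color characteristics corresponding to the target region.
Wherein, the method further includes:
obtaining target voice content produced by translating, according to the target language, the original voice content contained in the original video;
performing lip-shape matching on the processed human body image according to the target voice content.
Wherein, the method further includes:
if the difference between the pronunciation duration of the target voice content and that of the original voice content is greater than a threshold, processing the target voice content so that its duration is consistent with the pronunciation duration of the original voice content.
Wherein, determining the content to be optimized includes:
recognizing the original voice content included in the original video to obtain the corresponding original text content, and identifying internet slang in the original text content;
and optimizing the content to be optimized includes:
rewriting the internet slang so that translation is performed on the rewritten text content to obtain the translation corresponding to the target language.
Wherein, determining the content to be optimized includes:
after speech recognition is performed on the original voice content included in the original video and the resulting text content is translated into the target language, determining the content to be optimized from the translation;
and optimizing the content to be optimized includes:
rewriting the translation according to the idioms used in the corresponding scenario in video content produced locally in the target region.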
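Wherein, the processing requirement information further includes a voice type for speech synthesis, the voice type including a standard pronunciation of the target language or a pronunciation with a local accent;
and optimizing the content to be optimized includes:
after speech recognition is performed on the original voice content included in the original video and the recognized text is translated according to the target language to obtain a translation, performing speech synthesis on the translation according to the voice type to generate the target voice content in the target video.
Wherein, determining the content to be optimized includes:
determining, from the original video, the subtitle content corresponding to the original voice content;
and optimizing the content to be optimized includes:
erasing the subtitle content corresponding to the original voice content, and backfilling the text corresponding to the target voice content obtained by translation into the target language into the subtitle display area of the image.
Wherein, optimizing the content to be optimized includes:
if the original video does not include subtitle content corresponding to the original voice content, filling the text corresponding to the target voice content obtained by translation into the target language into a target display area of the image, so as to add subtitles to the target video.
Wherein, the method further includes:
when backfilling or adding subtitles, laying out the subtitles according to the subtitle layout style used in video content produced locally in the target region.
Wherein, optimizing the content to be optimized includes:
adding background music, or replacing the original background music, according to the music preferences of target users in the target region.
Wherein, determining the original video to be processed and the processing requirement information includes:
providing the user, through a target interface, with operation options for uploading a video and submitting processing requirement information, so as to determine the original video from the uploaded video and the processing requirement information from the submitted requirements.
Wherein, determining the original video to be processed and the processing requirement information includes:
providing, in the playback interface of an explanation video of a target product, an operation option for initiating a processing request;
after a processing request is received through that option, determining the explanation video as the video to be processed, and providing an operation option for submitting the processing requirements.
Wherein, the method further includes:
returning the target video so that it can be published to target users in the target region.
Wherein, the method further includes:
providing an operation option for publishing the target video to a target product information service system;
after a publishing request is received through that option, providing an operation option for configuring the associated target product, so that the target video is published as video material associated with the target product and is displayed when information about the target product is shown to target users in the target region.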
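A video processing device, including:
an original video determination unit, used to determine an original video to be processed and processing requirement information, where the processing requirement information includes the required target language;
a content-to-be-optimized determination unit, used to determine content to be optimized in the process of translating the original video according to the target language;
an optimization processing unit, used to optimize the content to be optimized according to characteristic information of video content produced in the target region corresponding to the target language, so as to generate a target video that has the characteristics of video content produced in the target region.
A computer-readable storage medium on which a computer program is stored, where the program, when executed by a processor, implements the steps of any one of the foregoing methods.
An electronic device, including:
one or more processors; and
a memory associated with the one or more processors, the memory being used to store program instructions that, when read and executed by the one or more processors, perform the steps of any one of the foregoing methods.
According to the specific embodiments provided by this application, this application discloses the following technical effects:
Through the embodiments of this application, content to be optimized can be determined in the process of translating the original video according to the target language, and can be optimized according to the characteristic information of video content produced in the target region corresponding to the target language, so that the generated target video has the characteristics of video content produced locally in the target region. In this way, videos suitable for users in many different regions can be obtained at lower cost, and the video content looks more authentic and approachable, which improves the information expression ability of the processed video in the specific target language. A merchant user only needs to produce an original video in one language; the tool provided by the embodiments of this application can then translate it into target videos in other languages and localize the content of the target video according to the target language. This reduces the merchant's video production cost and improves production efficiency.
Of course, a product implementing this application does not necessarily need to achieve all of the above advantages at the same time.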
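Brief Description of the Drawings
To explain the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Figure 1 is a schematic diagram of the system architecture provided by an embodiment of this application;
Figure 2 is a flowchart of the method provided by an embodiment of this application;
Figure 3 is a schematic comparison of a video before and after optimization as provided by an embodiment of this application;
Figure 4 is a schematic diagram of the device provided by an embodiment of this application;
Figure 5 is a schematic diagram of the electronic device provided by an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this application fall within the scope of protection of this application.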
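It should first be noted that, in the course of implementing the embodiments of this application, the inventors found that, to solve at lower cost the language barrier and similar problems encountered when selling to overseas users through short videos, one approach is for the merchant to produce a video, make several copies, and add subtitles in a different language to each copy for release to users in different regions (different countries or territories). With this approach, however, users in other regions can understand the video only through the subtitles, which may weaken the information expression ability of the short video itself.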
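Another feasible approach is to perform speech recognition on the original video, translate the text content into several different languages, synthesize speech in each language, and replace the original speech to generate several videos, each with spoken commentary in a different language. Although this lets users in other regions understand the commentary without subtitles, short videos often feature real people (for example, models or hosts) on camera, so users in other regions may still have a poor experience when watching such videos. For example, a user in a European or American country may see a host with an Asian face speaking English, so the sense of familiarity and authenticity is low, and the viewer's trust in the commentary is also affected. In addition, if the voice content in one language is simply converted into another language, the converted voice content may sound stiff and may even contain inaccurate translations. In selling scenarios in particular, hosts usually speak colloquially when recording short videos and often use internet slang, for example writing "姐妹" (sisters) as "集美", as well as abbreviations built from pinyin initials such as "YYDS". If such content is translated directly, errors may occur, or the content may be untranslatable. Other regions may also have their own internet slang, or habitual usages of certain words in certain scenarios. For example, the word "good" can be expressed in many ways in English, and in selling scenarios overseas hosts may be used to more exaggerated words. If translation is done directly without considering these factors, the processed voice content may not be vivid or attractive enough.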
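Based on the above, the embodiments of this application can provide merchants and other users with a video processing tool. Specifically, characteristic information of video content produced locally in multiple regions (for example, videos recorded by models or hosts in Europe and America, or by models or hosts in Southeast Asia) can be obtained in advance and stored. Such characteristic information can include the facial features, skin color, and hair color of models, and can also include idiomatic expressions and internet slang, subtitle layout styles, locally popular music genres, and so on. When using the tool, a user can upload a locally saved, pre-recorded short video as the original video to be processed and submit specific processing requirements, such as the required target language. Then, while translating the original video (including, for example, speech recognition, text translation, speech synthesis, and compositing the speech with the image content) to generate the target video, the tool can determine which content needs to be optimized. For example, it can check whether the video contains a person's head image, voice content, subtitles, and so on; any that are present can be determined as content to be optimized. That content can then be optimized according to the characteristic information of video content produced locally in the target region corresponding to the target language. For example, the facial features, skin color, and hair color of a person's head image can be adjusted so that the processed head image looks like a local of the target region. The speech recognition result of the original voice content can also be rewritten to ensure translation accuracy, and the translated text can be rewritten to better match the wording habits of videos produced in the target region for the corresponding scenario. Subtitles can be replaced, or added if the original video had none. "Huazi" (decorative text overlays composited into the picture, such as text describing product selling points) can be erased, or translated and backfilled. Background music can be replaced with locally popular music. In short, localization processing can give the generated target video the characteristics of video content produced locally in the target region, which improves the viewing experience of users in the target region, improves the information expression ability of the processed video in the specific target language, and reduces the merchant's cost.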
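It should be noted that, in practical applications, manual review can be performed on top of the automated processing results, and anything unsuitable or inappropriate can be corrected manually. In addition, options for manual processing can be provided for steps such as head image processing, translation, and copy rewriting, and users can select them if needed.
From the perspective of system architecture, referring to Figure 1, the embodiments of this application can provide merchants and other users with a video processing tool. The tool can exist as a Web or H5 page, as an application (for example, a mobile app), as a mini program or light app, or as a functional module within another application, and so on. In a typical scenario, a merchant has recorded a short video and needs to release it to users in other regions. The merchant can upload the short video through the tool's client and submit specific requirements, such as the required language. The tool can then generate a video that has the characteristics of video content produced locally in the region corresponding to that language, and the merchant can release this video to that region. The processing itself can be done on the server, which returns the result to the user through the client once processing is complete. If the video processing tool is associated with a product information service system, the video can also be published directly to that system through a publishing link provided in the tool and bound to a specific product, so that the processed video becomes video material for that product and can later be shown to users on pages such as the product detail page. Alternatively, a download entry can be provided so that users can download the processed video and publish it through other channels (for example, content service systems).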
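The specific implementation provided by the embodiments of this application is described in detail below.
First, from the perspective of the aforementioned server, an embodiment of this application provides a video processing method. Referring to Figure 2, the method may include:
S201: Determine an original video to be processed and processing requirement information, where the processing requirement information includes the required target language.
The original video to be processed can be a short video uploaded by the user. Specifically, the user can be provided with an interface for initiating a video processing request, which offers an operation option for uploading a local video and an operation option for submitting processing requirements. Through this interface the user can upload the original video to be processed and the processing requirement information. The processing requirement information includes at least the required target language and may also include more detailed requirements. For example, the processing involves speech recognition, translation, and speech synthesis of the voice content; for speech synthesis, either a standard pronunciation or a pronunciation with regional characteristics can be used. If the user needs a particular pronunciation style, it can be selected in the interface; otherwise the standard pronunciation can be used by default.
It should be noted that, besides having the user upload a local video, there can be other ways to determine the original video to be processed. For example, suppose a product information service system offers a service that automatically clips product explanation videos from live-stream content and publishes them. In that case, an operation entry for processing the product explanation video can be provided to merchant users in the playback interface of the explanation video. A merchant user can initiate a request to process the explanation video by clicking that entry, and can likewise submit requirement information such as the required target language. In addition to pre-recorded videos, the video to be processed can also be a live video stream; that is, the video content can be processed during the live stream, so that not only is the original video recorded, but videos in the target languages are also generated for release to users in different regions.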
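S202: Determine content to be optimized in the process of translating the original video according to the target language.
After the original video to be processed is determined, speech recognition, text translation, speech synthesis, and other processing can be performed on it to produce the target video, and content to be optimized can be identified during this process. The content to be optimized can include image content and audio content. For image content, for example, detection of a person's head image can be performed on the image content of the original video; if one is present, it can be determined as content to be optimized. It can also be determined whether the image content contains subtitles, huazi (decorative text overlays), and the like; if so, these can also be determined as content to be optimized. For huazi, the proportion of the screen area it occupies can be calculated, and if the proportion exceeds a threshold, it can be left unprocessed. For audio content, the voice content is the main content to be optimized, and background music can also be included.
It should be noted that different videos to be processed correspond to different situations: some feature a real model explaining on camera, some do not, and so on. Therefore, when determining the content to be optimized, the video to be processed can be examined along multiple dimensions to determine which kinds of content to be optimized it contains.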
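S203: Optimize the content to be optimized according to characteristic information of video content produced in the target region corresponding to the target language, so as to generate a target video that has the characteristics of video content produced in the target region.
After the content to be optimized is determined, the required target language is known from the processing requirement information submitted by the user, so the characteristic information of video content produced in the target region corresponding to the target language (localization features for short) can be determined from the pre-stored information. For example, the characteristic information can include appearance features of local models in the target region such as facial features, skin color, and hair color; idiomatic expressions used in scenarios such as product explanation; subtitle layout styles used in videos; popular music favored by locals; and so on. This characteristic information can be stored in advance, for example in configuration files. When optimizing specific content, the localization features corresponding to the required target language can be retrieved and used to process the content to be optimized, so that the generated target video has the characteristics of video content produced locally in the target region.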
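The pre-stored localization features described above could be organized roughly as follows. This is a minimal sketch under stated assumptions, not the patent's implementation: the `LocalizationProfile` type, its field names, the region keys, and all sample values (including the "good" to "amazing" idiom) are hypothetical illustrations.

```python
from dataclasses import dataclass, field


@dataclass
class LocalizationProfile:
    """Hypothetical record of the features of videos produced locally in one region."""
    region: str
    appearance_style: str                        # label handed to a face-editing model
    idioms: dict = field(default_factory=dict)   # preferred wording per scenario
    subtitle_layout: str = "bottom-center"
    music_genres: tuple = ()


# Illustrative values only; real profiles would be loaded from configuration files.
PROFILES = {
    "southeast_asia": LocalizationProfile(
        "southeast_asia", "southeast_asia_default", music_genres=("pop",)
    ),
    "europe_america": LocalizationProfile(
        "europe_america", "europe_america_default", idioms={"good": "amazing"}
    ),
}


def load_profile(region: str) -> LocalizationProfile:
    """Return the stored profile for a region, failing loudly if none is configured."""
    if region not in PROFILES:
        raise ValueError(f"no localization profile configured for {region!r}")
    return PROFILES[region]
```

Keeping one record per region means each optimization step (image, text, subtitles, music) can read only the field it needs.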
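If a human body image is detected in the image content of the original video, the human body image can be optimized according to the facial feature, skin color and/or hair color characteristic information of the target region corresponding to the target language, so that the processed human body image has the facial feature, skin color and/or hair color characteristics corresponding to the target region.
The correspondence between a language and a region can be determined from, for example, the official language of the region. For example, if the target language is English, the corresponding target region can default to Europe and America. Some countries have more than one official language; in India, for example, both English and Hindi may be official languages, so English can also be used as the target language when targeting users in India. Because the appearance of people in India differs noticeably from that of people in Europe and America, a merchant who needs to release videos to a country such as India can specify the country or region when submitting the processing requirements.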
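The language-to-region rule described above (a default region per language, with an explicit region in the processing requirements taking precedence) can be sketched as below. The mapping table, the language codes, and the function name are assumptions made for illustration only.

```python
# Hypothetical default mapping from target language code to target region.
DEFAULT_REGION = {
    "en": "europe_america",   # the text defaults English to Europe and America
    "ja": "japan",
}


def resolve_region(target_language, region_override=None):
    """Pick the target region: an explicit region from the user wins over the default."""
    if region_override:
        return region_override
    if target_language not in DEFAULT_REGION:
        raise ValueError(f"no default region for language {target_language!r}")
    return DEFAULT_REGION[target_language]
```

For example, `resolve_region("en")` falls back to the default region, while `resolve_region("en", "india")` honors a merchant who targets India with English-language videos.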
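Specifically, the human body image processing can be implemented by algorithm models. For example, adjustments to eye size, eye socket depth, face shape, skin color, and hair color can all be performed by pre-trained algorithm models. The specific algorithm models are not the focus of the embodiments of this application and are not described in detail here.
In addition, a main task in processing the original video is to translate the voice content, generate voice content in the target language through speech synthesis, and recombine it with the image content to generate the target video. During this recombination, to make it appear that the person in the video is speaking the target-language voice content, lip-shape matching must be performed again for the person in the video. Because the embodiments of this application also process the human body image as described above, lip-shape matching can be performed on the processed human body image, according to the synthesized target voice content, after the human body image processing is complete. For example, suppose Figure 3(A) shows a frame of the original video, in which the person is Chinese and the product introduction was recorded in Chinese, and suppose the target language is a Southeast Asian language. After the person's facial features, skin color, and so on are processed, the result can be as shown in Figure 3(B): the person's appearance has the characteristics of people in Southeast Asia, and the person's lip shape corresponds to the translated target-language speech. Likewise, if the target region is a European or American country or region, the target language is English; after the person's facial features, skin color, and so on are processed according to the characteristics of that region and the lip shape is processed accordingly, an English-language target video can be generated. Lip-shape matching can also be performed by a corresponding algorithm model and is not described in detail here.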
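When recombining the translated target voice content with the image content, the pronunciation duration of the target voice content may differ from that of the original voice content, and a large difference may affect the result of compositing with the image content. In that case, the target voice content can be processed so that its duration is consistent with the pronunciation duration of the original voice content. For example, if the target voice content is longer than the original voice content, the text corresponding to the target voice content can be summarized to shorten the copy and thus the pronunciation duration. If the target voice content is shorter than the original voice content, a copy generation algorithm can lengthen the copy and thus the pronunciation duration. The specific summarization and copy generation algorithms are not described in detail here.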
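The duration check described above reduces to a three-way decision. The following is a minimal sketch; the function name, the action labels, and the 0.5-second default threshold are assumptions, since the text does not specify a threshold value or the summarization and generation algorithms themselves.

```python
def plan_duration_fix(original_s: float, target_s: float, threshold_s: float = 0.5) -> str:
    """Decide how to bring the target speech duration in line with the original.

    Returns "keep" when the gap is within the threshold, "shorten_text" when the
    target speech is too long (summarize the copy), and "lengthen_text" when it
    is too short (expand the copy).
    """
    gap = target_s - original_s
    if abs(gap) <= threshold_s:
        return "keep"
    return "shorten_text" if gap > 0 else "lengthen_text"
```

The chosen action would then be handed to a text summarization or copy generation step, followed by another round of speech synthesis.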
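As mentioned above, processing the voice content can also involve copy rewriting. Specifically, after the original voice content in the original video is recognized to obtain the original text content, content that may be mistranslated or untranslatable, such as internet slang, can be identified in the original text content and treated as content to be processed. Such content can first be rewritten using pre-built information such as a "dictionary", for example rewriting "集美" as "姐妹" (sisters). Translation can then be performed on the rewritten text content to obtain the translation corresponding to the target language.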
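The dictionary-based slang rewriting described above can be sketched as a single-pass substitution before translation. The dictionary contents and function name here are illustrative assumptions: the "集美" entry comes from the text, while the expansion of "YYDS" to "永远的神" is the commonly understood meaning of that abbreviation and is not given in the source.

```python
import re

# Hypothetical pre-built slang dictionary: slang term -> standard wording.
SLANG_DICTIONARY = {
    "集美": "姐妹",
    "YYDS": "永远的神",
}


def rewrite_slang(text: str, dictionary: dict = SLANG_DICTIONARY) -> str:
    """Replace known slang terms with standard wording before translation."""
    if not dictionary:
        return text
    # Try longer terms first so a short term never splits a longer one.
    terms = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(term) for term in terms))
    return pattern.sub(lambda match: dictionary[match.group(0)], text)
```

A single compiled pattern keeps the rewrite to one pass over the text, so a replacement is never itself rewritten by a later dictionary entry.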
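Furthermore, after speech recognition is performed on the original voice content and the resulting text is translated into the target language, the translation can be rewritten according to the idioms used in the corresponding scenario in video content produced locally in the target region. For example, when the same meaning can be expressed by several different words or phrases, the word or phrase better suited to the selling scenario can be chosen to ensure the expressive effect. This rewriting can also be implemented using "dictionaries" configured in advance for different regions and scenarios.
As mentioned above, speech synthesis can also take into account the dialects and accents of a region. For example, in some countries or regions with two official languages, English is an official language, but people there may speak it with a distinct local accent. Voice packs with different accents can therefore be prepared in advance for the same language, for example standard English, Indian English, American English, and so on. When the user submits processing requirements, several voice types can be offered, including a standard pronunciation of the target language or a pronunciation with a local accent. If the user requests one of these voice types, then after speech recognition is performed on the original voice content and the recognized text is translated into the target language, the translation can be synthesized using the voice type selected by the user to generate the target voice content. For example, if the user selects English as the target language and Indian English as the voice type, the voice pack for Indian English is used for synthesis after the English translation is obtained, so that the voice content in the produced video is English with an Indian accent, which feels more familiar to users in India.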
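Selecting a voice pack from the user's requested voice type, with the standard pronunciation as the default, might look like the sketch below. The pack identifiers, the key scheme, and the function name are hypothetical; the actual speech synthesis engine is outside the scope of this example.

```python
# Hypothetical registry of voice packs keyed by (language, voice type).
VOICE_PACKS = {
    ("en", "standard"): "en_standard_pack",
    ("en", "indian"): "en_indian_pack",
    ("en", "american"): "en_american_pack",
}


def select_voice_pack(language: str, voice_type: str = None) -> str:
    """Return the voice pack for the request, defaulting to standard pronunciation."""
    key = (language, voice_type or "standard")
    if key not in VOICE_PACKS:
        raise ValueError(f"no voice pack for language={language!r}, type={voice_type!r}")
    return VOICE_PACKS[key]
```

With this lookup, a request for English with the Indian English voice type resolves to the accent-specific pack, and a request that omits the voice type falls back to the standard one.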
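In addition, subtitle content can be treated as content to be optimized. There are two cases. In the first, the original video contains subtitles: the subtitle content corresponding to the original voice content can be determined from the original video and erased from the original image (after erasure, any damaged image areas can be repaired), and the text corresponding to the target voice content obtained by translation into the target language can then be backfilled into the subtitle display area of the image. In the second, the original video does not contain subtitle content corresponding to the original voice content: the text corresponding to the translated target voice content can be filled into a target display area of the image to add subtitles to the target video. If the text corresponding to the target voice content has been rewritten, it is the rewritten text that is backfilled or added to the subtitle display area. Backfilling or adding subtitles involves aligning the new subtitles with the timeline of the synthesized speech. This alignment, the subtitle erasure, and the repair of image content after erasure can all be implemented with relevant algorithms and are not described in detail here.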
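The two subtitle cases and the timeline alignment described above can be sketched as follows. This is an illustration under assumptions: the `Cue` type, the operation labels, and the idea that the synthesizer reports `(start, end, text)` segments are all hypothetical, and the erasure and image repair steps are represented only by name.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    start: float  # seconds on the synthesized-speech timeline
    end: float
    text: str


def plan_subtitle_ops(has_original_subtitles: bool) -> list:
    """List the subtitle steps for a video, depending on whether it already has subtitles."""
    if has_original_subtitles:
        return ["erase_original", "repair_frames", "backfill_translated"]
    return ["add_translated"]


def build_cues(speech_segments, rewritten=None) -> list:
    """Align subtitle cues with synthesized speech segments.

    speech_segments: iterable of (start, end, translated_text) tuples.
    rewritten: optional {segment_index: rewritten_text}; rewritten text wins.
    """
    rewritten = rewritten or {}
    return [
        Cue(start, end, rewritten.get(index, text))
        for index, (start, end, text) in enumerate(speech_segments)
    ]
```

Taking cue timestamps directly from the synthesized speech segments keeps the new subtitles aligned with the audio the viewer actually hears, rather than with the original-language timing.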
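When backfilling or adding subtitles, considering that subtitle layout may differ between regions, the subtitles can be laid out according to the subtitle layout style popular in video content produced locally in the target region. For example, Japanese subtitles can be laid out in the Japanese typesetting style.
Besides subtitles, the original image may also contain "huazi" content (decorative text overlays), in which information such as a product's main selling points is expressed as text added to one or several frames of the video. Such huazi can be erased; because it may occupy a relatively large share of the picture, the image can be repaired with image restoration techniques after erasure. Alternatively, if the huazi occupies so much of the picture that the image would be hard to repair after erasure, the erasure can be skipped and the huazi retained; for overseas users, such content usually does not interfere with viewing.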
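The area-ratio rule for decorative text overlays described above can be sketched as a simple geometric check. The bounding-box format, the function name, and the 0.3 default threshold are assumptions for illustration; the text only states that overlays occupying too much of the picture can be left in place.

```python
def should_erase_overlay(box, frame_size, max_area_ratio: float = 0.3) -> bool:
    """Decide whether a text overlay is small enough to erase and repair.

    box: (x, y, width, height) of the detected overlay, in pixels.
    frame_size: (frame_width, frame_height) in pixels.
    """
    _, _, width, height = box
    frame_width, frame_height = frame_size
    ratio = (width * height) / (frame_width * frame_height)
    return ratio <= max_area_ratio
```

Overlays above the threshold are kept as they are, since repairing a large erased area tends to be harder than leaving the original text visible.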
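In addition, background music can be added to the video, or the original background music replaced, according to the music preferences of target users in the target region. Copyright-cleared music material can be used for this, or intelligent music that satisfies copyright conditions can be generated by artificial intelligence.
In short, the image content, voice content, background music, and so on of the original video can be processed in multiple respects so that the video has the characteristics of video content produced locally in the target region.
After the translation of the original video and the optimization of the content to be optimized are complete, the target video can be generated; this target video has the characteristics of video content produced locally in the target region. The target video can then be returned to the merchant or other user, for example as a download link, so that the merchant can download the target video and choose a distribution channel according to actual needs. Alternatively, where a product information service system is associated, an operation option for publishing the target video to that system can be provided on the page. When publishing through this option, the target video can be bound to a product already published in the product information service system, the associated language information, the target region to be addressed, and so on. The system can then save the target video as material for that product and, when showing the product's information to users in the target region, place the target video on pages such as the product detail page.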
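In short, through the embodiments of this application, content to be optimized can be determined in the process of translating the original video according to the target language, and can be optimized according to the characteristic information of video content produced locally in the target region corresponding to the target language, so that the generated target video has the characteristics of video content produced locally in the target region. In this way, videos suitable for users in many different regions can be obtained at lower cost, and the video content looks more authentic and approachable, which improves the information expression ability of the processed video in the specific target language. A merchant user only needs to produce an original video in one language; the tool provided by the embodiments of this application can then translate it into target videos in other languages and localize the content of the target video according to the target language. This reduces the merchant's video production cost and improves production efficiency.
It should be noted that the embodiments of this application may involve the use of user data. In practical applications, user-specific personal data may be used in the solutions described herein only to the extent permitted by, and in compliance with, the applicable laws and regulations of the country concerned (for example, with the user's explicit consent and after the user has been effectively notified).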
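Corresponding to the foregoing method embodiments, the embodiments of this application also provide a video processing device. Referring to Figure 4, the device may include:
an original video determination unit 401, used to determine an original video to be processed and processing requirement information, where the processing requirement information includes the required target language;
a content-to-be-optimized determination unit 402, used to determine content to be optimized in the process of translating the original video according to the target language;
an optimization processing unit 403, used to optimize the content to be optimized according to characteristic information of video content produced in the target region corresponding to the target language, so as to generate a target video that has the characteristics of video content produced in the target region.
The content-to-be-optimized determination unit may be specifically used to:
perform human body image detection on the image content of the original video;
in this case, the optimization processing unit may be specifically used to:
process the human body image according to the facial feature, skin color and/or hair color characteristic information corresponding to the target region, so that the processed human body image has the facial feature, skin color and/or hair color characteristics corresponding to the target region.
In this case, the optimization processing unit may also be used to:
obtain target voice content produced by translating, according to the target language, the original voice content contained in the original video, and perform lip-shape matching on the processed human body image according to the target voice content.
In addition, the optimization processing unit may also be used to:
process the target voice content, if the difference between its pronunciation duration and that of the original voice content is greater than a threshold, so that its duration is consistent with the pronunciation duration of the original voice content.
Alternatively, the content-to-be-optimized determination unit may be specifically used to:
recognize the original voice content included in the original video to obtain the corresponding original text content, and identify internet slang in the original text content;
in this case, the optimization processing unit may be specifically used to:
rewrite the internet slang so that translation is performed on the rewritten text content to obtain the translation corresponding to the target language.
In addition, the content-to-be-optimized determination unit may be specifically used to:
determine the content to be optimized from the translation, after speech recognition is performed on the original voice content included in the original video and the resulting text content is translated into the target language;
in this case, the optimization processing unit may also be used to:
rewrite the translation according to the idioms used in the corresponding scenario in video content produced locally in the target region.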
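In a specific implementation, the processing requirement information may also include a voice type for speech synthesis, the voice type including a standard pronunciation of the target language or a pronunciation with a local accent;
in this case, the optimization processing unit may be specifically used to:
perform speech synthesis on the translation according to the voice type, after speech recognition is performed on the original voice content included in the original video and the recognized text is translated according to the target language, so as to generate the target voice content in the target video.
In addition, the content-to-be-optimized determination unit may be specifically used to:
determine, from the original video, the subtitle content corresponding to the original voice content;
in this case, the optimization processing unit may be specifically used to:
erase the subtitle content corresponding to the original voice content, and backfill the text corresponding to the target voice content obtained by translation into the target language into the subtitle display area of the image.
Furthermore, the optimization processing unit may be specifically used to:
fill the text corresponding to the target voice content obtained by translation into the target language into a target display area of the image, if the original video does not include subtitle content corresponding to the original voice content, so as to add subtitles to the target video.
In addition, the optimization processing unit may also be used to:
lay out the subtitles, when backfilling or adding them, according to the subtitle layout style used in video content produced locally in the target region.
The optimization processing unit may also be used to:
add background music, or replace the original background music, according to the music preferences of target users in the target region.
In a specific implementation, the original video determination unit may be specifically used to:
provide the user, through a target interface, with operation options for uploading a video and submitting processing requirement information, so as to determine the original video from the uploaded video and the processing requirement information from the submitted requirements.
Alternatively, the original video determination unit may be specifically used to:
provide, in the playback interface of an explanation video of a target product, an operation option for initiating a processing request;
determine the explanation video as the video to be processed after a processing request is received through that option, and provide an operation option for submitting the processing requirements.
In a specific implementation, the device may also include:
a returning unit, used to return the target video so that it can be published to target users in the target region.
In addition, the device may also include:
a publishing option providing unit, used to provide an operation option for publishing the target video to a target product information service system;
a publishing unit, used to provide, after a publishing request is received through that option, an operation option for configuring the associated target product, so that the target video is published as video material associated with the target product and is displayed when information about the target product is shown to target users in the target region.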
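In addition, the embodiments of this application also provide a computer-readable storage medium on which a computer program is stored, where the program, when executed by a processor, implements the steps of the method described in any one of the foregoing method embodiments.
The embodiments also provide an electronic device, including:
one or more processors; and
a memory associated with the one or more processors, the memory being used to store program instructions that, when read and executed by the one or more processors, perform the steps of the method described in any one of the foregoing method embodiments.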
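Figure 5 shows an example architecture of the electronic device, which may include a processor 510, a video display adapter 511, a disk drive 512, an input/output interface 513, a network interface 514, and a memory 520. The processor 510, video display adapter 511, disk drive 512, input/output interface 513, network interface 514, and memory 520 can be communicatively connected through a communication bus 530.
The processor 510 can be implemented as a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is used to execute relevant programs to implement the technical solutions provided by this application.
The memory 520 can be implemented as ROM (Read Only Memory), RAM (Random Access Memory), a static storage device, a dynamic storage device, and so on. The memory 520 can store an operating system 521 for controlling the operation of the electronic device 500 and a basic input output system (BIOS) for controlling low-level operations of the electronic device 500. It can also store a web browser 523, a data storage management system 524, a video processing system 525, and so on. The video processing system 525 can be the application that implements the steps described in the embodiments of this application. In short, when the technical solutions provided by this application are implemented in software or firmware, the relevant program code is stored in the memory 520 and called and executed by the processor 510.
The input/output interface 513 is used to connect input/output modules to realize information input and output. The input/output module can be configured in the device as a component (not shown in the figure) or connected externally to the device to provide the corresponding functions. Input devices can include keyboards, mice, touch screens, microphones, and various sensors; output devices can include displays, speakers, vibrators, and indicator lights.
The network interface 514 is used to connect a communication module (not shown in the figure) to realize communication between this device and other devices. The communication module can communicate through wired means (such as USB or a network cable) or wireless means (such as a mobile network, Wi-Fi, or Bluetooth).
The bus 530 includes a path that carries information between the components of the device (for example, the processor 510, video display adapter 511, disk drive 512, input/output interface 513, network interface 514, and memory 520).
It should be noted that although the device above shows only the processor 510, video display adapter 511, disk drive 512, input/output interface 513, network interface 514, memory 520, and bus 530, in a specific implementation the device may also include other components necessary for normal operation. In addition, those skilled in the art will understand that the device may also include only the components necessary to implement the solution of this application, and does not necessarily include all the components shown in the figure.
From the description of the above embodiments, those skilled in the art can clearly understand that this application can be implemented by means of software plus the necessary general-purpose hardware platform. Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments, or in certain parts of the embodiments, of this application.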
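The embodiments in this specification are described progressively; for parts that are the same or similar, the embodiments can be referred to one another, and each embodiment focuses on its differences from the others. In particular, the system or system embodiments are described relatively briefly because they are basically similar to the method embodiments; for relevant details, refer to the description of the method embodiments. The system and system embodiments described above are only illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of the embodiment. A person of ordinary skill in the art can understand and implement this without creative effort.
The video processing method and electronic device provided by this application have been described in detail above. Specific examples are used herein to explain the principles and implementations of this application, and the description of the above embodiments is only intended to help in understanding the method of this application and its core idea. A person of ordinary skill in the art may, following the idea of this application, make changes to the specific implementations and scope of application. In summary, the content of this specification should not be construed as limiting this application.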

Claims (14)

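  1. A video processing method, characterized by including:
    determining an original video to be processed and processing requirement information, where the processing requirement information includes the required target language;
    determining content to be optimized in the process of translating the original video according to the target language;
    optimizing the content to be optimized according to characteristic information of video content produced in the target region corresponding to the target language, so as to generate a target video that has the characteristics of video content produced in the target region.
  2. The method according to claim 1, characterized in that
    determining the content to be optimized includes:
    performing human body image detection on the image content of the original video;
    and optimizing the content to be optimized includes:
    processing the human body image according to the facial feature, skin color and/or hair color characteristic information corresponding to the target region, so that the processed human body image has the facial feature, skin color and/or hair color characteristics corresponding to the target region.
  3. The method according to claim 2, characterized by further including:
    obtaining target voice content produced by translating, according to the target language, the original voice content contained in the original video;
    performing lip-shape matching on the processed human body image according to the target voice content.
  4. The method according to claim 3, characterized by further including:
    if the difference between the pronunciation duration of the target voice content and that of the original voice content is greater than a threshold, processing the target voice content so that its duration is consistent with the pronunciation duration of the original voice content.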
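  5. The method according to claim 1, characterized in that
    determining the content to be optimized includes:
    recognizing the original voice content included in the original video to obtain the corresponding original text content, and identifying internet slang in the original text content;
    and optimizing the content to be optimized includes:
    rewriting the internet slang so that translation is performed on the rewritten text content to obtain the translation corresponding to the target language.
  6. The method according to claim 1, characterized in that
    determining the content to be optimized includes:
    after speech recognition is performed on the original voice content included in the original video and the resulting text content is translated into the target language, determining the content to be optimized from the translation;
    and optimizing the content to be optimized includes:
    rewriting the translation according to the idioms used in the corresponding scenario in video content produced locally in the target region.
  7. The method according to claim 1, characterized in that
    the processing requirement information further includes a voice type for speech synthesis, the voice type including a standard pronunciation of the target language or a pronunciation with a local accent;
    and optimizing the content to be optimized includes:
    after speech recognition is performed on the original voice content included in the original video and the recognized text is translated according to the target language to obtain a translation, performing speech synthesis on the translation according to the voice type to generate the target voice content in the target video.
  8. The method according to claim 1, characterized in that
    optimizing the content to be optimized includes:
    adding background music, or replacing the original background music, according to the music preferences of target users in the target region.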
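  9. The method according to any one of claims 1 to 8, characterized in that
    determining the original video to be processed and the processing requirement information includes:
    providing the user, through a target interface, with operation options for uploading a video and submitting processing requirement information, so as to determine the original video from the uploaded video and the processing requirement information from the submitted requirements.
  10. The method according to any one of claims 1 to 8, characterized in that
    determining the original video to be processed and the processing requirement information includes:
    providing, in the playback interface of an explanation video of a target product, an operation option for initiating a processing request;
    after a processing request is received through that option, determining the explanation video as the video to be explained, and providing an operation option for submitting the processing requirements.
  11. The method according to any one of claims 1 to 8, characterized by further including:
    returning the target video so that it can be published to target users in the target region.
  12. The method according to any one of claims 1 to 8, characterized by further including:
    providing an operation option for publishing the target video to a target product information service system;
    after a publishing request is received through that option, providing an operation option for configuring the associated target product, so that the target video is published as video material associated with the target product and is displayed when information about the target product is shown to target users in the target region.
  13. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 12.
  14. An electronic device, characterized by including:
    one or more processors; and
    a memory associated with the one or more processors, the memory being used to store program instructions that, when read and executed by the one or more processors, perform the steps of the method according to any one of claims 1 to 12.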
PCT/CN2023/117369 2022-09-09 2023-09-07 Video processing method and electronic device WO2024051760A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211102445.1 2022-09-09
CN202211102445.1A CN115643466A (zh) 2022-09-09 2022-09-09 Video processing method and electronic device

Publications (1)

Publication Number Publication Date
WO2024051760A1 true WO2024051760A1 (zh) 2024-03-14

Family

ID=84942688

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/117369 WO2024051760A1 (zh) 2022-09-09 2023-09-07 Video processing method and electronic device

Country Status (2)

Country Link
CN (1) CN115643466A (zh)
WO (1) WO2024051760A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
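CN115643466A (zh) * 2022-09-09 2023-01-24 阿里巴巴(中国)有限公司 Video processing method and electronic device
CN117253486A (zh) * 2023-09-22 2023-12-19 北京中科金财科技股份有限公司 Live streaming method and system with real-time multilingual processing based on deep learning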

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
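CN109119063A (zh) * 2018-08-31 2019-01-01 腾讯科技(深圳)有限公司 Video dubbing generation method, apparatus, device, and storage medium
CN112562721A (zh) * 2020-11-30 2021-03-26 清华珠三角研究院 Video translation method, system, apparatus, and storage medium
US20210352380A1 (en) * 2018-10-18 2021-11-11 Warner Bros. Entertainment Inc. Characterizing content for audio-video dubbing and other transformations
CN113763236A (zh) * 2021-09-13 2021-12-07 秒影工场(北京)科技有限公司 Method for dynamically adjusting facial features in commercial short videos according to region
CN114501042A (zh) * 2021-12-20 2022-05-13 阿里巴巴(中国)有限公司 Cross-border live streaming processing method and electronic device
CN115643466A (zh) * 2022-09-09 2023-01-24 阿里巴巴(中国)有限公司 Video processing method and electronic device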

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
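CN109119063A (zh) * 2018-08-31 2019-01-01 腾讯科技(深圳)有限公司 Video dubbing generation method, apparatus, device, and storage medium
US20210352380A1 (en) * 2018-10-18 2021-11-11 Warner Bros. Entertainment Inc. Characterizing content for audio-video dubbing and other transformations
CN112562721A (zh) * 2020-11-30 2021-03-26 清华珠三角研究院 Video translation method, system, apparatus, and storage medium
CN113763236A (zh) * 2021-09-13 2021-12-07 秒影工场(北京)科技有限公司 Method for dynamically adjusting facial features in commercial short videos according to region
CN114501042A (zh) * 2021-12-20 2022-05-13 阿里巴巴(中国)有限公司 Cross-border live streaming processing method and electronic device
CN115643466A (zh) * 2022-09-09 2023-01-24 阿里巴巴(中国)有限公司 Video processing method and electronic device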

Also Published As

Publication number Publication date
CN115643466A (zh) 2023-01-24


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23862454

Country of ref document: EP

Kind code of ref document: A1