CN113630620A - Multimedia file playing system, related method, device and equipment - Google Patents


Info

Publication number
CN113630620A
CN113630620A (application CN202010376043.5A)
Authority
CN
China
Prior art keywords
audio stream
multimedia file
playing
voice
client
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010376043.5A
Other languages
Chinese (zh)
Inventor
周明智
龙舟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN202010376043.5A
Publication of CN113630620A
Legal status: Pending

Classifications

    • H04N21/233: Processing of audio elementary streams (server side)
    • G06F40/58: Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G06Q10/06395: Quality analysis or management
    • H04N21/27: Server based end-user applications
    • H04N21/4307: Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • H04N21/439: Processing of audio elementary streams (client side)
    • H04N21/44213: Monitoring of end-user related data
    • H04N5/278: Subtitling

Abstract

The application discloses a system, methods, apparatus, and devices related to multimedia file playing. For the multimedia file currently played in the client's player, the client extracts the audio stream corresponding to the playback progress, sends it to the server, and displays in the player the speech translation text returned by the server; the server determines the speech translation text through a speech translation model and returns it to the client. Because the speech translation service is invoked on the audio stream the user is currently playing, speech is translated on the fly; the user can therefore watch a new file with subtitles displayed in sync, achieving a "see what you hear" real-time subtitle effect and meeting the subtitle-viewing needs of users of different languages.

Description

Multimedia file playing system, related method, device and equipment
Technical Field
The application relates to the technical field of speech processing, and in particular to a multimedia file playing system, method, and apparatus, a speech translation model quality evaluation system and method, and an electronic device.
Background
With the continuous development of internet technology, video websites are ever more widely used. When a user watches an audio or video file, the video website can accurately match the file's current playback progress and display multi-language subtitles in real time, so that the user can better understand the audio and video content.
At present, video websites mainly adopt an offline speech translation scheme and generate multi-language subtitles from the video file. Specifically, the scheme takes the entire speech file provided by the user and calls speech recognition and translation services to recognize the whole file; the user can see real-time subtitles synchronized with the sound and picture only after the entire speech file has been translated.
However, in implementing the invention, the inventors found at least the following problems with this scheme: 1) for newly added audio and video, the speech translation subtitles are generated offline, so the user must wait: the synchronized translated subtitles of a newly added file become visible only after the system has finished recognizing and translating the whole file, and until then the user can only watch the file without subtitles, so the real-time "see what you hear" subtitle effect cannot be achieved; 2) offline speech translation usually generates subtitles in only one common language, and cannot meet the subtitle-viewing needs of users of different languages. In summary, how to translate speech in real time, so that sound, picture, and subtitles stay synchronized and the viewing needs of users of different languages are met, is a technical problem urgently needing to be solved by those skilled in the art.
Disclosure of Invention
The application provides a multimedia file playing system to solve the prior-art problem that no subtitles can be displayed while a new file is being watched. The application further provides a multimedia file playing method and apparatus, a speech translation model quality evaluation system and method, and an electronic device.
The application provides a multimedia file playing system, comprising:
the client, configured to extract, for the multimedia file currently played by the player, the audio stream corresponding to the playback progress; send the audio stream to a server; and display, in the player, the speech translation text of the audio stream returned by the server;
and the server, configured to determine the speech translation text through a speech translation model and return it to the client.
The application also provides a multimedia file playing method, which comprises the following steps:
extracting, for the multimedia file currently played by the player, the audio stream corresponding to the playback progress;
sending the audio stream to a server;
and displaying, in the player, the speech translation text of the audio stream returned by the server.
Optionally, the player comprises a browser player;
the extracting of the audio stream corresponding to the playback progress comprises:
acquiring the audio stream through a data stream capture module of the browser player.
Optionally, the audio stream comprises an audio stream with a duration on the order of milliseconds.
Optionally, the method further comprises:
compressing the audio stream;
the sending of the audio stream to a server comprises:
sending the compressed audio stream to the server.
Optionally, the audio stream is compressed in at least one of the following ways:
down-sampling the audio stream;
reducing the gain of the audio stream according to its volume data.
Optionally, the down-sampling of the audio stream comprises:
determining a down-sampling rate;
and down-sampling the audio stream at that rate.
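As an illustrative sketch (not part of the patent text), the two compression approaches above, down-sampling and volume-based gain reduction, could operate on raw PCM samples as follows; the function names, the decimation approach, and the peak threshold are assumptions:

```javascript
// Illustrative sketch of the two compression steps described above.
// Names and thresholds are assumptions, not details from the patent.

// Down-sample a Float32Array of PCM samples by simple decimation:
// keep one sample for every (srcRate / dstRate) input samples.
function downsample(samples, srcRate, dstRate) {
  if (dstRate >= srcRate) return samples;
  const ratio = srcRate / dstRate;
  const outLen = Math.floor(samples.length / ratio);
  const out = new Float32Array(outLen);
  for (let i = 0; i < outLen; i++) {
    out[i] = samples[Math.floor(i * ratio)];
  }
  return out;
}

// Reduce gain when the chunk is loud: scale samples so the peak
// amplitude does not exceed maxPeak (a crude volume-based gain cut).
function reduceGain(samples, maxPeak = 0.5) {
  let peak = 0;
  for (const s of samples) peak = Math.max(peak, Math.abs(s));
  if (peak <= maxPeak || peak === 0) return samples;
  const scale = maxPeak / peak;
  return samples.map((s) => s * scale);
}
```

Decimation without a low-pass filter can alias; a production implementation would filter first, but the sketch keeps only the step the text names.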
Optionally, the player comprises a browser player;
the compressing of the audio stream comprises:
creating an audio input node from the audio stream;
creating an audio processing program for the audio stream from the audio input node;
and compressing the audio stream with the audio processing program.
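The node wiring above maps naturally onto the Web Audio API. Below is a hedged, browser-only sketch: the wiring function is shown but never called here, ScriptProcessorNode is one plausible reading of the "audio processing program" (modern code would use an AudioWorklet), and the 16 kHz target rate is an assumption:

```javascript
// Browser-only wiring sketch for the steps above. Uncalled in this file;
// names and the ScriptProcessorNode choice are illustrative assumptions.
function attachAudioProcessor(mediaStream, onChunk) {
  const ctx = new AudioContext();
  // 1. Create an audio input node from the captured audio stream.
  const source = ctx.createMediaStreamSource(mediaStream);
  // 2. Create an audio processing program (here, a ScriptProcessorNode).
  const processor = ctx.createScriptProcessor(4096, 1, 1);
  processor.onaudioprocess = (e) => {
    // 3. Compress each buffer before it is handed off for upload.
    const input = e.inputBuffer.getChannelData(0);
    onChunk(compressChunk(input, ctx.sampleRate, 16000));
  };
  source.connect(processor);
  processor.connect(ctx.destination);
}

// The per-buffer compression step, factored out so it can run anywhere:
// plain decimation from srcRate to dstRate (an illustrative choice).
function compressChunk(samples, srcRate, dstRate) {
  const ratio = srcRate / dstRate;
  const out = new Float32Array(Math.floor(samples.length / ratio));
  for (let i = 0; i < out.length; i++) {
    out[i] = samples[Math.floor(i * ratio)];
  }
  return out;
}
```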
Optionally, the extracting of the audio stream corresponding to the playback progress comprises:
extracting an audio stream that is yet to be played;
and, after the audio stream is sent to the server, playing that audio stream through the player, so that its speech translation text is displayed while it plays.
Optionally, the method further comprises:
sending target language information to the server, so that the server translates the audio stream into text in the target language.
The application also provides a multimedia file playing method, which comprises the following steps:
receiving, from a client, the audio stream corresponding to the playback progress of the multimedia file currently being played;
determining the speech translation text of the audio stream through a speech translation model;
and returning the speech translation text to the client, so that the client displays it while playing the audio stream.
The present application further provides a multimedia file playing apparatus, comprising:
an audio stream extraction unit, configured to extract, for the multimedia file currently played by the player, the audio stream corresponding to the playback progress;
an audio stream sending unit, configured to send the audio stream to a server;
and a text display unit, configured to display, in the player, the speech translation text of the audio stream returned by the server.
The present application further provides an electronic device, comprising:
a processor; and
a memory for storing a program implementing a multimedia file playing method; after power-on, the device runs the program through the processor and performs the following steps: extracting, for the multimedia file currently played by the player, the audio stream corresponding to the playback progress; sending the audio stream to a server; and displaying, in the player, the speech translation text of the audio stream returned by the server.
The present application further provides a multimedia file playing apparatus, including:
a data receiving unit, configured to receive, from a client, the audio stream corresponding to the playback progress of the multimedia file currently being played;
a translation unit, configured to determine the speech translation text of the audio stream through a speech translation model;
and a text returning unit, configured to return the speech translation text to the client, so that the client displays it while playing the audio stream.
The present application further provides an electronic device, comprising:
a processor; and
a memory for storing a program implementing a multimedia file playing method; after power-on, the device runs the program through the processor and performs the following steps: receiving, from a client, the audio stream corresponding to the playback progress of the multimedia file currently being played; determining the speech translation text of the audio stream through a speech translation model; and returning the speech translation text to the client, so that the client displays it while playing the audio stream.
The present application further provides a speech translation model quality evaluation system, including:
the server, configured to collect at least one multimedia file for evaluating the quality of a real-time speech translation model and send it to the client; receive the audio stream, sent by the client, corresponding to the playback progress of the multimedia file; determine the speech translation text of the audio stream through the translation model and return it to the client; receive the speech translation quality information for the multimedia file sent by the client; and determine the quality of the translation model from the quality information of the at least one multimedia file;
and the client, configured to play the multimedia file through a browser and extract the audio stream; display the speech translation text in a player; and determine the speech translation quality information from the speech translation text.
The application also provides a method for evaluating the quality of the voice translation model, which comprises the following steps:
collecting at least one multimedia file for evaluating the quality of a real-time speech translation model, and sending it to a client;
receiving the audio stream, sent by the client, corresponding to the playback progress of the multimedia file;
determining the speech translation text of the audio stream through the translation model, and returning it to the client;
receiving the speech translation quality information for the multimedia file sent by the client;
and determining the quality of the translation model from the quality information of the at least one multimedia file.
The application also provides a method for evaluating the quality of the voice translation model, which comprises the following steps:
playing, through a browser, a multimedia file for evaluating the quality of a real-time speech translation model;
extracting the audio stream corresponding to the playback progress of the multimedia file, and sending it to a server;
displaying, in a player, the speech translation text of the audio stream returned by the server;
and determining, from the speech translation text, the speech translation quality information for the multimedia file, and sending that quality information to the server.
The application also provides a multimedia file playing control method, which comprises the following steps:
extracting, for the multimedia file currently played by a player, the audio stream corresponding to the playback progress, and sending the audio stream to a server;
determining display delay duration information for the speech translation text;
and displaying, in the player and according to that duration information, the speech translation text of the audio stream returned by the server.
Optionally, the determining of the display delay duration information for the speech translation text comprises:
determining the duration information according to the user's listening level information.
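As an illustrative sketch (not from the patent), the rule above could map a listening-level value to a subtitle delay, so that stronger listeners hear the speech before seeing the text; the level scale and the millisecond values are assumptions:

```javascript
// Map a user's listening level to a subtitle display delay.
// The 1..5 scale and 500 ms step are illustrative assumptions.
function subtitleDelayMs(listeningLevel) {
  // Clamp to the assumed 1 (beginner) .. 5 (advanced) range.
  const level = Math.min(5, Math.max(1, listeningLevel));
  // Beginners see subtitles immediately; advanced learners wait up to 2 s.
  return (level - 1) * 500;
}
```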
Optionally, the method further comprises:
if the listening difficulty of the speech exceeds the user's listening level, pausing the multimedia file and replaying the segment just played;
and adjusting the duration information according to the number of replays.
Optionally, the method further includes:
and, if the original text of the audio stream contains words not included in the user's source-language vocabulary, replaying the audio stream.
Optionally, the replaying of the audio stream comprises:
determining follow-read duration information;
and determining the interval between two adjacent plays of the audio stream according to the follow-read duration information.
Optionally, the method further comprises:
collecting the user's follow-read speech data;
determining a follow-read score from the follow-read speech data;
and determining the number of replays of the audio stream according to the follow-read score.
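A minimal sketch of the two rules above: the gap between two plays leaves room for the user to read along, and the replay count drops as the follow-read score improves. The padding value and score thresholds are illustrative assumptions, not values from the patent:

```javascript
// Interval between two adjacent plays of the same audio stream:
// long enough for the user to read along, plus a small pause.
function replayIntervalMs(followReadDurationMs, paddingMs = 300) {
  return followReadDurationMs + paddingMs;
}

// Number of replays as a function of the follow-read score (0..100).
// Thresholds are assumptions for illustration only.
function replayCount(followReadScore) {
  if (followReadScore >= 90) return 0; // good enough, no replay needed
  if (followReadScore >= 60) return 1;
  return 3;
}
```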
Optionally, the method further comprises:
extracting a segment of the multimedia file whose listening difficulty exceeds the user's listening level;
and storing the segment so that it can be replayed.
The present application further provides a multimedia file playing system, comprising:
the client, configured to extract, for the multimedia file currently played by the player, the audio stream corresponding to the playback progress; send the audio stream to a server; and play, in the player, the target-language speech data of the audio stream returned by the server;
and the server, configured to determine the speech translation text through a speech translation model; determine the target-language speech data through a speech synthesis model; and return the target-language speech data to the client.
The application also provides a multimedia file playing method, which comprises the following steps:
extracting, for the multimedia file currently played by the player, the audio stream corresponding to the playback progress;
sending the audio stream to a server;
and playing, in the player, the target-language speech data of the audio stream returned by the server.
The application also provides a multimedia file playing method, which comprises the following steps:
receiving, from a client, the audio stream corresponding to the playback progress of the multimedia file currently being played;
determining the speech translation text through a speech translation model;
determining the target-language speech data through a speech synthesis model;
and returning the target-language speech data to the client.
The present application also provides a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to perform the various methods described above.
The present application also provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the various methods described above.
Compared with the prior art, the application has the following advantages:
With the multimedia file playing system provided by the embodiments of the application, for the multimedia file currently played by the client player, the client extracts the audio stream corresponding to the playback progress, sends it to the server, and displays in the player the speech translation text returned by the server; the server determines the speech translation text through a speech translation model and returns it to the client. Because the speech translation service is invoked on the audio stream the user is currently playing, speech is translated instantly; the user can therefore watch a new file with subtitles displayed in sync, achieving a "see what you hear" real-time subtitle effect and meeting the subtitle-viewing needs of users of different languages.
With the speech translation model quality evaluation system provided by the embodiments of the application, the server collects multiple multimedia files for evaluating the quality of a real-time speech translation model and sends them to the client; receives the audio stream, sent by the client, corresponding to the playback progress of a file; determines the speech translation text of the audio stream through the translation model and returns it to the client; receives the speech translation quality information for the file sent back by the client; and determines the quality of the translation model from the quality information of the multiple files. The client plays the multimedia files through a browser, extracts the audio streams, displays the speech translation text in the player, and determines the speech translation quality information from that text. This processing makes it easy to collect content-rich multimedia files as model evaluation data, without dedicating personnel to generate real-time speech data in meetings; it therefore improves evaluation efficiency, reduces evaluation cost, and shortens the time needed to bring a model online.
With the multimedia file playing control method provided by the embodiments of the application, for the multimedia file currently played by the player, the audio stream corresponding to the playback progress is extracted and sent to the server; display delay duration information for the speech translation text is determined; and the speech translation text returned by the server is displayed in the player according to that duration information. This processing controls when the translated subtitles appear: they can be shown with a delay, or with a "see what you hear" effect. It therefore improves the user experience, meets the user's language learning needs, and improves the user's language learning results.
With the multimedia file playing system provided by the embodiments of the application, the client extracts, for the multimedia file currently played by the player, the audio stream corresponding to the playback progress; sends it to the server; and plays, in the player, the target-language speech data of the audio stream returned by the server. The server determines the speech translation text through a speech translation model, determines the target-language speech data through a speech synthesis model, and returns that speech data to the client. In this way, source-language speech is converted into target-language speech and played for the user, effectively meeting the user's listening needs and improving the user experience.
Drawings
FIG. 1 is a schematic diagram of a multimedia file playing system according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an application scenario of an embodiment of a multimedia file playing system provided in the present application;
FIG. 3 is a schematic diagram of an apparatus interaction of an embodiment of a multimedia file playing system provided in the present application;
FIG. 4 is a schematic interaction diagram of an embodiment of a multimedia file playing system provided by the present application;
FIG. 5 is a schematic processing flow diagram illustrating an embodiment of a multimedia file playback system provided by the present application;
FIG. 6 is a schematic diagram illustrating an application scenario of an embodiment of a speech translation model quality assessment system provided by the present application;
FIG. 7 is a schematic diagram illustrating device interaction of an embodiment of a speech translation model quality assessment system provided by the present application.
Detailed Description
In the following description, numerous specific details are set forth to provide a thorough understanding of the application. The application can, however, be implemented in many ways other than those described here, and those skilled in the art can make similar generalizations without departing from its spirit; the application is therefore not limited to the specific implementations disclosed below.
In the application, a multimedia file playing system, a multimedia file playing method and a multimedia file playing device, a voice translation model quality evaluation system and a voice translation model quality evaluation method and electronic equipment are provided. Each of the schemes is described in detail in the following examples.
First embodiment
Please refer to fig. 1, which is a block diagram of an embodiment of a multimedia file playing system according to the present application. The system comprises: a server 1 and a client 2.
The server 1 may be deployed on a cloud server, or it may be a server dedicated to multimedia file playing deployed in a data center; it may be a server cluster or a single server.
The client 2 includes, but is not limited to, mobile communication devices (mobile phones or smartphones) as well as terminal devices such as personal computers and tablets (e.g., iPads).
Please refer to fig. 2, which is a schematic view of a multimedia file playing system according to the present application. The server and client can be connected through a network; for example, the client can go online via Wi-Fi. A user plays a multimedia file provided by the server through the client. The file has no subtitles; as the user watches, the client sends the audio stream corresponding to the current viewing progress to the server, the server's speech translation model determines the target-language text of the audio stream and sends it back, and the client displays the text. The user can thus watch a subtitle-free multimedia file with translated subtitles shown in sync, achieving a "see what you hear" real-time subtitle effect and better understanding the audio and video content.
Please refer to fig. 3, which is a schematic device interaction diagram of an embodiment of a multimedia file playing system according to the present application. In one example, for the multimedia file currently played by the client player, the client extracts the audio stream corresponding to the playback progress; sends the audio stream to the server; and displays, in the player, the speech translation text of the audio stream returned by the server. The server determines the speech translation text through a speech translation model and returns it to the client.
The multimedia file can be an audio file, such as the audio of an English speech, or a video file, such as a movie or a television drama.
The client player may be a browser (e.g., Internet Explorer), a desktop player (e.g., Windows Media Player), or a player in a mobile application installed on a smartphone (e.g., the Xiami Music app), and so on.
In one example, the client player is a browser, through which a user opens a video website and finds a multimedia file of interest to watch. While the multimedia file is playing, the audio stream can be acquired through a data stream capture module of the browser player. For example, HTMLMediaElement.captureStream() acquires the audio stream originating from an <audio> or <video> element in the web page.
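The capture step just described can be sketched with standard web APIs. This is browser-only code, shown as uncalled functions; the 250 ms timeslice, the callback names, and the helper are illustrative assumptions rather than details from the patent:

```javascript
// Browser-only sketch: capture the audio of a playing <video>/<audio>
// element and hand short chunks to a sender callback. Uncalled here.
function captureAndStream(mediaElement, sendChunk) {
  // HTMLMediaElement.captureStream() yields a MediaStream whose audio
  // tracks follow the element's current playback progress.
  const stream = mediaElement.captureStream();
  const audioOnly = new MediaStream(stream.getAudioTracks());
  // A MediaRecorder with a short timeslice approximates the
  // short audio chunks the text describes.
  const recorder = new MediaRecorder(audioOnly);
  recorder.ondataavailable = (e) => sendChunk(e.data);
  recorder.start(250); // emit a chunk every 250 ms (illustrative value)
}

// Helper: how many PCM samples a chunk of `ms` milliseconds holds
// at a given sample rate (useful when sizing millisecond-scale chunks).
function samplesPerChunk(sampleRate, ms) {
  return Math.round((sampleRate * ms) / 1000);
}
```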
The speech translation model deployed at the server is a speech processing model that can transcribe speech into source-language text and translate it into target-language text. The model can call speech recognition and translation services in real time to generate a speech-translated subtitle stream from the audio the user is currently playing, and the subtitle stream is provided for the user to watch.
In practical applications, different users may want subtitles in different languages, so in specific implementation the user can specify a target language, and the client can send the target language information to the server so that the server translates the audio stream into text in that language. In specific implementation, the server may include speech translation models for multiple languages and select the model corresponding to the target language to determine the translation text of the audio stream.
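The per-language model selection described above can be sketched as follows. This is a minimal illustrative sketch, not the patent's implementation: the class, the language codes, and the behavior of translate() are all assumptions; a real model would transcribe and translate the audio.

```python
# Hypothetical sketch: the server keeps one speech translation model per
# target language and picks the one matching the client's request.
class SpeechTranslationModel:
    def __init__(self, target_lang):
        self.target_lang = target_lang

    def translate(self, audio_stream):
        # Placeholder: a real model would transcribe the speech and
        # translate it into the target language.
        return f"[{self.target_lang} subtitle for {len(audio_stream)} samples]"

# Illustrative language codes; the server may deploy any set of languages.
MODELS = {lang: SpeechTranslationModel(lang) for lang in ("zh", "en", "ja")}

def translate_for_client(audio_stream, target_lang):
    """Select the model for the client's target language and translate."""
    model = MODELS.get(target_lang)
    if model is None:
        raise ValueError(f"no model deployed for language {target_lang!r}")
    return model.translate(audio_stream)
```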
It should be noted that although the client sends the audio stream to the server for translation while playing it, and the server returns the translated text to the client for display, the audio stream can have a duration on the order of milliseconds (for example, 10 milliseconds), so the user perceives the sound and the caption as synchronized.
Please refer to fig. 4, which is a schematic device interaction diagram of an embodiment of a multimedia file playing system according to the present application. In this embodiment, for the multimedia file currently played by the client player, the client extracts an audio stream corresponding to the playing progress; performs compression processing on the audio stream; sends the compressed audio stream to a server; and displays in the player the speech translation text of the audio stream returned by the server. The server determines the speech translation text through a speech translation model and returns it to the client. This processing mode reduces the network transmission volume, lowers network resource consumption, effectively improves the speed at which translated subtitles are displayed synchronously, and improves the user experience.
In one example, the compression processing on the audio stream may be performed as follows: down-sampling the audio stream. In specific implementation, the down-sampling may include the following steps: determining a down-sampling rate; and down-sampling the audio stream according to that rate. For example, down-sampling a 48 kHz audio stream to 16 kHz reduces the transmitted data volume to about one third.
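The down-sampling step can be sketched as naive decimation. This is an illustrative simplification under stated assumptions, not the patent's implementation: a production down-sampler would apply a low-pass filter first to avoid aliasing.

```python
def downsample(samples, src_rate, dst_rate):
    """Naive decimation: keep every (src_rate // dst_rate)-th sample.

    Assumes src_rate is an integer multiple of dst_rate. A real
    implementation would low-pass filter before decimating to
    prevent aliasing.
    """
    if src_rate % dst_rate != 0:
        raise ValueError("src_rate must be a multiple of dst_rate")
    step = src_rate // dst_rate
    return samples[::step]

# Down-sampling 48 kHz audio to 16 kHz keeps one sample in three,
# i.e. the transmitted data volume drops to about one third.
```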
In another example, the performing of the compression process on the audio stream may further adopt the following manner: and performing gain reduction processing on the audio stream according to the volume data of the audio stream. For example, if the volume of the audio stream is large and exceeds a certain threshold, the volume of the audio stream can be reduced by performing a gain reduction process on the audio stream, so as to achieve a compression effect.
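The gain-reduction variant can be sketched as follows. The threshold and gain values are illustrative assumptions; the text only says that loud audio exceeding some threshold is attenuated.

```python
def reduce_gain(samples, threshold=0.5, gain=0.5):
    """If the peak amplitude exceeds `threshold`, scale every sample by
    `gain`; otherwise return the stream unchanged. Samples are floats
    in [-1.0, 1.0]; threshold and gain are illustrative values."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak > threshold:
        return [s * gain for s in samples]
    return samples
```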
In this embodiment, the player comprises a browser player. In specific implementation, the compression processing on the audio stream includes: creating an audio input node from the audio stream; creating an audio processing program for the audio stream from the audio input node; and compressing the audio stream in the audio processing program.
Please refer to fig. 5, which is a schematic processing flow diagram of an embodiment of a multimedia file playing system according to the present application. In this embodiment, a user plays a multimedia file through a browser, and the subtitle synchronization processing process includes the following steps:
1. The browser plays the audio/video file through an <audio> or <video> tag.
2. The audio stream emitted by the <audio> or <video> element in step 1 is acquired with HTMLMediaElement.captureStream().
3. An audio input node is created from the captured audio stream.
4. An audio processing program is created with BaseAudioContext.createScriptProcessor(), and the node created in step 3 is used as its input.
5. In the audio processing program, the audio stream is down-sampled as needed to reduce the transmission volume and then transmitted to the cloud speech recognition and translation service; the translation result returned in real time is presented to the user.
6. While recognition is in progress, the audio stream is also sent to the user's default playing device (such as the browser), so that sound, picture, and subtitles stay synchronized.
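The six steps above can be sketched end to end, with the browser capture and the cloud recognition/translation service replaced by stubs. The 10 ms chunk duration and 48 kHz sampling rate follow the figures in this description; the function names and stub behavior are illustrative assumptions, not real APIs.

```python
def capture_audio_chunk(progress_ms, rate=48000, chunk_ms=10):
    """Stub for the browser's HTMLMediaElement.captureStream() capture:
    returns chunk_ms worth of silent samples at the given rate."""
    return [0.0] * (rate * chunk_ms // 1000)

def cloud_translate(chunk):
    """Stub for the cloud speech recognition + translation service."""
    return f"subtitle for {len(chunk)}-sample chunk"

def process_chunk(progress_ms):
    chunk = capture_audio_chunk(progress_ms)  # steps 1-3: play and capture
    compressed = chunk[::3]                   # step 5: down-sample 48 kHz -> 16 kHz
    subtitle = cloud_translate(compressed)    # step 5: recognize and translate
    return subtitle                           # step 6: display alongside playback
```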
In one example, extracting the audio stream corresponding to the playing progress may include the following steps: extracting an audio stream that is about to be played; and, after that audio stream is sent to the server, playing it through the player, so that when the audio stream is played its speech translation text is already available for display. With this processing, the audio stream is translated in advance, which effectively improves the synchronization of sound, picture, and translated subtitles.
As can be seen from the foregoing embodiments, in the multimedia file playing system provided in the embodiments of the present application, for the multimedia file currently played by the client player, the client extracts an audio stream corresponding to the playing progress; sends the audio stream to a server; and displays in the player the speech translation text of the audio stream returned by the server; the server determines the speech translation text through a speech translation model and returns it to the client. This processing mode calls the speech translation service on the audio stream generated by the current playback, translating the speech instantly; it thus ensures that the user watching a file without subtitles sees the subtitles displayed synchronously, achieving the real-time subtitle effect of "what you hear is what you see" and meeting the subtitle viewing requirements of users of different languages.
Second embodiment
Corresponding to the multimedia file playing system, the application also provides a multimedia file playing method; the execution subject of the method includes but is not limited to a client, and may be other terminal equipment. Parts of this embodiment that are the same as the first embodiment are not described again; please refer to corresponding parts in the first embodiment.
In this embodiment, the method includes the steps of:
Step 1: extracting, for the multimedia file currently played by the player, an audio stream corresponding to the playing progress;
Step 2: sending the audio stream to a server;
Step 3: displaying, in the player, the speech translation text of the audio stream returned by the server.
The player includes but is not limited to: a browser player; the audio stream corresponding to the playing progress is extracted in the following manner: and acquiring the audio stream through a data stream capturing module of the browser player.
The audio stream comprises an audio stream of millisecond duration.
In one example, the method may further comprise the steps of: performing compression processing on the audio stream; correspondingly, the sending of the audio stream to the server may be performed as follows: and sending the compressed audio stream to the server.
In one example, the performing of the compression process on the audio stream may be at least one of: 1) performing a down-sampling process on the audio stream; 2) and performing gain reduction processing on the audio stream according to the volume data of the audio stream.
In one example, the performing down-sampling processing on the audio stream may include the sub-steps of: determining a down-sampling rate; and performing down-sampling processing on the audio stream according to the down-sampling rate.
In one example, the player comprises a browser player; the performing of the compression process on the audio stream may include the following sub-steps: creating an audio input node according to the audio stream; creating an audio handler for the audio stream according to the audio input node; performing compression processing on the audio stream by an audio processing program.
In one example, the extracting the audio stream corresponding to the playing progress may include the following sub-steps: extracting an audio stream to be played; after the audio stream is sent to the server, the audio stream to be played is played through the player, so that when the audio stream to be played is played, the speech translation text of the audio stream to be played is displayed.
In one example, the method may further comprise the steps of: and sending target language information to the server so that the server translates the audio stream into a text of the target language.
Third embodiment
In the foregoing embodiment, a multimedia file playing method is provided, and correspondingly, a multimedia file playing apparatus is also provided in the present application. The apparatus corresponds to an embodiment of the method described above.
Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment. The application provides a multimedia file playing device, including:
the audio stream extraction unit is used for extracting an audio stream corresponding to the playing progress aiming at the multimedia file currently played by the player;
the audio stream sending unit is used for sending the audio stream to a server;
and the text display unit is used for displaying the voice translation text of the audio stream returned by the server side in the player.
Fourth embodiment
The application also provides an electronic device. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An electronic device of the present embodiment includes: a processor and a memory; a memory for storing a program for implementing a multimedia file playing method, the device executing the following steps after being powered on and running the program of the method through the processor: extracting an audio stream corresponding to the playing progress aiming at the multimedia file currently played by the player; sending the audio stream to a server; and displaying the voice translation text of the audio stream returned by the server side in the player.
Fifth embodiment
Corresponding to the multimedia file playing system, the application also provides a multimedia file playing method, and an execution subject of the method includes but is not limited to a server side, and the method can also be any device capable of implementing the method. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.
In this embodiment, the method includes the steps of:
Step 1: receiving an audio stream, sent by a client, corresponding to the playing progress of a currently played multimedia file;
Step 2: determining a speech translation text of the audio stream through a speech translation model;
Step 3: returning the speech translation text to the client so that the client displays it when playing the audio stream.
Sixth embodiment
In the foregoing embodiment, a multimedia file playing method is provided, and correspondingly, a multimedia file playing apparatus is also provided in the present application. The apparatus corresponds to an embodiment of the method described above.
Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment. The application provides a multimedia file playing device, including:
the data receiving unit is used for receiving an audio stream which is sent by the client and corresponds to the playing progress of the currently played multimedia file;
the translation unit is used for determining a voice translation text of the audio stream through a voice translation model;
and the text returning unit is used for returning the voice translation text to the client so that the client displays the voice translation text when playing the audio stream.
Seventh embodiment
The application also provides an electronic device embodiment. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An electronic device of the present embodiment includes: a processor and a memory; a memory for storing a program for implementing a multimedia file playing method, the device executing the following steps after being powered on and running the program of the method through the processor: receiving an audio stream corresponding to the playing progress of a currently played multimedia file sent by a client; determining a speech translation text of the audio stream through a speech translation model; and returning the voice translation text to the client so that the client displays the voice translation text when playing the audio stream.
Eighth embodiment
On the basis of the foregoing embodiments, the present application also provides a speech translation model quality evaluation system. Parts of this embodiment that are the same as the first embodiment are not described again; please refer to corresponding parts in the first embodiment.
In practical applications, a common speech translation scenario is machine simultaneous interpretation of live conference content, with the translation result displayed on a large screen for the audience. Machine simultaneous interpretation is mainly achieved by combining real-time speech recognition with machine translation, that is, by a real-time speech translation model. To ensure a good translation effect, the speech translation model needs to be evaluated before going online.
At present, a typical speech translation model quality assessment method is to simulate a live conference, collect real-time speech data of a person, translate the data in real time through a real-time speech translation system, and perform quality labeling on a translation result by a technician to determine the translation quality of the model.
However, in the process of implementing the invention, the inventor finds that the technical scheme has at least the following problems: the field conference is simulated, and real-time voice data of people are collected, so that more human resources and equipment resources are consumed.
As shown in fig. 6, an application scenario diagram of a speech translation model quality evaluation system provided by the present application. In this embodiment, the system includes: a speech translation server, a multimedia file source server, and a client. The servers and the client can be connected through a network; for example, the client can access the network through Wi-Fi or a mobile network. The speech translation server can collect multimedia files from various multimedia source servers and send them to the user client. The user plays, through the client, a multimedia file provided by the speech translation server, and the file has no subtitles. While the user watches the file, the client sends the audio stream corresponding to the current viewing progress to the server; a target-language text of the audio stream is determined by the to-be-evaluated speech translation model at the server, returned to the client, and displayed by the client. In this way, the user can watch translated subtitles synchronously while watching a multimedia file without subtitles, evaluate the translation effect in combination with the sound, and upload the evaluation information to the speech translation server. The speech translation server integrates the users' translation quality evaluation information over the plurality of multimedia files, determines the quality of the speech translation model, and decides whether to put the model into use.
Please refer to fig. 7, which is a schematic diagram illustrating device interaction of an embodiment of a speech translation model quality assessment system according to the present application. The server collects a plurality of multimedia files for evaluating the quality of a real-time speech translation model and sends them to the client; receives an audio stream, sent by the client, corresponding to the playing progress of a multimedia file; determines the speech translation text of the audio stream through the translation model and returns it to the client; receives the speech translation quality information, sent by the client, corresponding to the multimedia file; and determines quality information of the translation model according to the quality information of the plurality of multimedia files. The client plays the multimedia file through a browser and extracts the audio stream; displays the speech translation text in the player; and determines the speech translation quality information according to the speech translation text.
Table 1 shows the model evaluation data of the present example.
Multimedia file identification    Speech translation quality information
File A                            60
File B                            80
File C                            76

Table 1 Model evaluation data
The server stores the speech translation quality information of each multimedia file; the average of the quality scores can be used as the model score, and whether the model is put into use is decided according to this score.
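Applied to the data in Table 1, the averaging rule above can be sketched as follows (the function name is illustrative):

```python
def model_score(quality_by_file):
    """Average the per-file speech translation quality scores to obtain
    an overall model score, as described for Table 1."""
    return sum(quality_by_file.values()) / len(quality_by_file)

# Table 1 data: (60 + 80 + 76) / 3 = 72.0
table_1 = {"File A": 60, "File B": 80, "File C": 76}
```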
In one example, the speech translation server fetches multimedia files of different languages from multimedia source servers of different languages to evaluate speech translation models of the various languages.
As can be seen from the foregoing embodiments, the speech translation model quality evaluation system provided in the embodiments of the present application collects, by a server, a plurality of multimedia files for evaluating the quality of a real-time speech translation model, and sends the multimedia files to a client; receiving an audio stream which is sent by a client and corresponds to the playing progress of the multimedia file; determining a voice translation text of the audio stream through the translation model, and returning the voice translation text to a client; receiving voice translation quality information which is sent by a client and corresponds to the multimedia file; determining quality information of the translation model according to the quality information of a plurality of multimedia files; the client plays the multimedia file through a browser and extracts the audio stream; and displaying the speech translation text in a player; determining the voice translation quality information according to the voice translation text; the processing mode enables multimedia files with rich contents to be easily collected and used as model evaluation data, and a specially-assigned person is not required to generate real-time voice data in a conference mode; therefore, the model evaluation efficiency can be effectively improved, the model evaluation cost is reduced, and the online period of the model is shortened.
Ninth embodiment
Corresponding to the above-mentioned speech translation model quality evaluation system, the present application also provides a speech translation model quality evaluation method, and the execution subject of the method includes but is not limited to a client. Parts of this embodiment that are the same as the eighth embodiment will not be described again, please refer to corresponding parts in the eighth embodiment.
The quality evaluation method of the speech translation model provided by the application can comprise the following steps:
Step 1: playing, through a browser, a multimedia file for evaluating the quality of a real-time speech translation model;
Step 2: extracting an audio stream corresponding to the playing progress of the multimedia file, and sending the audio stream to a server;
Step 3: displaying, in the player, the speech translation text of the audio stream returned by the server;
Step 4: determining speech translation quality information corresponding to the multimedia file according to the speech translation text, and sending the quality information to the server.
Tenth embodiment
In the foregoing embodiment, a method for evaluating the quality of a speech translation model is provided, and correspondingly, an apparatus for evaluating the quality of a speech translation model is also provided. The apparatus corresponds to the embodiment of the method described above. Parts of this embodiment that are the same as the foregoing embodiments are not described again; please refer to the corresponding parts above.
The application provides a speech translation model quality evaluation device includes:
the playing unit is used for playing a multimedia file for evaluating the quality of the real-time speech translation model through a browser;
the extraction unit is used for extracting the audio stream corresponding to the playing progress of the multimedia file and sending the audio stream to a server;
the display unit is used for displaying the voice translation text of the audio stream returned by the server side in the player;
and the determining unit is used for determining the voice translation quality information corresponding to the multimedia file according to the voice translation text and sending the quality information to a server.
Eleventh embodiment
The application also provides an electronic device. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An electronic device of the present embodiment includes: a processor and a memory; a memory for storing a program for implementing the speech translation model quality evaluation method, the apparatus performing the following steps after being powered on and running the program of the method by the processor: playing a multimedia file for evaluating the quality of the real-time voice translation model through a browser; extracting an audio stream corresponding to the playing progress of the multimedia file, and sending the audio stream to a server; displaying the voice translation text of the audio stream returned by the server side in a player; and determining voice translation quality information corresponding to the multimedia file according to the voice translation text, and sending the quality information to a server.
Twelfth embodiment
Corresponding to the above-mentioned speech translation model quality assessment system, the present application also provides a speech translation model quality assessment method, and the execution subject of the method includes but is not limited to a server. Parts of this embodiment that are the same as the eighth embodiment will not be described again, please refer to corresponding parts in the eighth embodiment.
The speech translation model quality evaluation method provided by the application can comprise the following steps:
Step 1: collecting a plurality of multimedia files for evaluating the quality of a real-time speech translation model, and sending the multimedia files to a client;
Step 2: receiving an audio stream, sent by the client, corresponding to the playing progress of a multimedia file;
Step 3: determining a speech translation text of the audio stream through the translation model, and returning the speech translation text to the client;
Step 4: receiving speech translation quality information, sent by the client, corresponding to the multimedia file;
Step 5: determining quality information of the translation model according to the quality information of the plurality of multimedia files.
Thirteenth embodiment
In the foregoing embodiment, a method for evaluating the quality of a speech translation model is provided, and correspondingly, an apparatus for evaluating the quality of a speech translation model is also provided. The apparatus corresponds to the embodiment of the method described above.
The application provides a speech translation model quality evaluation device includes:
the collecting unit is used for collecting a plurality of multimedia files for evaluating the quality of the real-time voice translation model and sending the multimedia files to the client;
the receiving unit is used for receiving an audio stream which is sent by a client and corresponds to the playing progress of the multimedia file;
the translation unit is used for determining a voice translation text of the audio stream through the translation model and returning the voice translation text to the client;
the quality information receiving unit is used for receiving the speech translation quality information which is sent by the client and corresponds to the multimedia file;
a determining unit for determining quality information of the translation model according to the quality information of the plurality of multimedia files.
Fourteenth embodiment
The application also provides an electronic device. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An electronic device of the present embodiment includes: a processor and a memory; a memory for storing a program for implementing the speech translation model quality evaluation method, the apparatus performing the following steps after being powered on and running the program of the method by the processor: collecting a plurality of multimedia files for evaluating the quality of a real-time speech translation model, and sending the multimedia files to a client; receiving an audio stream which is sent by a client and corresponds to the playing progress of the multimedia file; determining a voice translation text of the audio stream through the translation model, and returning the voice translation text to a client; receiving voice translation quality information which is sent by a client and corresponds to the multimedia file; determining quality information of the translation model based on the quality information of the plurality of multimedia files.
Fifteenth embodiment
Corresponding to the multimedia file playing system, the application also provides a multimedia file playing control method; the execution subject of the method includes but is not limited to a client. Parts of this embodiment that are the same as the first embodiment are not described again; please refer to corresponding parts in the first embodiment.
The multimedia file playing method provided by the application can comprise the following steps:
step 1: and aiming at the multimedia file currently played by the player, extracting an audio stream corresponding to the playing progress, and sending the audio stream to the server.
In the present embodiment, a user may watch or listen to a multimedia file in order to learn the original (source) language, for example, to practice English listening by watching an English lecture video. In this case, the "what you hear is what you see" real-time caption effect may be undesirable, since immediate display would hinder the listening practice. To meet this requirement, the method provided by this embodiment of the application controls the display of the translated text (speech translation text) subtitles and displays them with a delay, thereby helping the user learn the language.
The multimedia file can be an audio file, such as an English speech audio; or a video file such as a movie or a theatrical work.
The server side can determine the voice translation text through a voice translation model and send the voice translation text back to the client side.
Step 2: and determining display delay time length information of the voice translation text.
The display delay duration is the time difference between the second time, at which the translated subtitle is displayed, and the first time, at which the audio stream is played. If the display delay duration is 0.1 second, the corresponding translated subtitle is displayed 0.1 second after the audio stream is played.
In one example, the display delay duration may be determined as follows: determining the duration information according to the listening level information of the user.
For example, if the listening level of the user is Cambridge English FCE and the listening level of the multimedia file is Cambridge English KET, the user's level exceeds that of the file, and the display delay may be set short, such as 1 millisecond, to achieve the real-time "what you hear is what you see" subtitle effect and let the user quickly check whether he or she understood what was heard. Alternatively, no display delay may be set and the translated subtitles not displayed at all, so that the user is not disturbed by subtitles.
For another example, if the listening level of the user is Cambridge English KET and the listening level of the multimedia file is Cambridge English FCE, the user's level does not reach that of the file, and the display delay may be set longer, for example, 5 or 10 seconds, to achieve a more obvious delayed-subtitle effect and give the user sufficient time to work out the meaning of what was heard. Alternatively, the display delay can be set short so that the user learns quickly, improving learning efficiency.
The display delay duration can be preset information, such as 10 seconds; information entered by the user while watching, such as 5 seconds; or information the device adjusts automatically according to the user's shadowing (read-along) performance — for example, slow shadowing indicates that the user's listening level is somewhat weak, so a longer delay is used.
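The level-based delay choice described above can be sketched as follows. This is a hypothetical sketch: the Cambridge English level ordering and the specific delay values (1 millisecond and 5 seconds, taken from the examples in the text) are illustrative assumptions, not a specified mapping.

```python
# Cambridge English qualifications ordered from lower to higher level.
LEVELS = {"KET": 1, "PET": 2, "FCE": 3}

def display_delay_seconds(user_level, file_level):
    """Pick a subtitle display delay by comparing the user's listening
    level with the multimedia file's level (illustrative values)."""
    if LEVELS[user_level] >= LEVELS[file_level]:
        return 0.001  # near real-time "what you hear is what you see"
    return 5.0        # delayed display, leaving time to think
```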
Step 3: displaying, in the player and according to the duration information, the speech translation text of the audio stream returned by the server.
In one example, the audio may be played for a period and then paused for a period, with the text subtitles of the audio just played being displayed during the pause. That is, playback can stop for a while at intervals, for example playing for 1 minute and then pausing for 10 seconds; this both gives the user time to think and avoids displaying the translation of an earlier passage while a later passage is playing, which would interfere with the user's learning.
In one example, the method may further comprise the following steps: if the listening difficulty of the speech exceeds the user's hearing level, pausing playback of the multimedia file and repeatedly playing the file segment already played; and adjusting the duration information according to the number of repeated plays. With this processing, the user can repeatedly listen to original-text passages that are beyond his or her level, which effectively improves the learning effect.
The number of repeated plays may be fixed or adjusted in real time. For example, the device may adjust it automatically according to the user's read-after performance: if the user's read-after speed is slow, the number of plays is increased; if the user reads along fluently, playback can move on to the next segment.
In a specific implementation, the larger the number of repeated plays, the shorter the display delay; for example, the display delay on the 10th play may be lower than on the 5th play. The opposite rule may also be used, and the specific rule can be determined according to actual requirements.
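One possible rule of this kind can be sketched as follows; the base delay and the step size are illustrative assumptions, not values prescribed by the embodiment.

```python
def delay_for_repeat(base_delay_ms: int, repeat_count: int,
                     step_ms: int = 500) -> int:
    """Shrink the subtitle display delay as a segment is replayed more
    often, flooring at zero; the opposite rule would instead add step_ms
    per repetition."""
    return max(0, base_delay_ms - step_ms * (repeat_count - 1))
```

Under this rule the delay on the 10th play is strictly lower than on the 5th play, matching the example in the text.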
In one example, the method may further comprise the following step: if the original text of the audio stream contains words not included in the user's source-language word list, repeatedly playing the audio stream. For example, if the word "professional" is not in the user's source-language word list, subsequent audio may be paused at this word and the word played back repeatedly. In a specific implementation, the translation of "professional" can be displayed synchronously when the repeated playing starts, and then displayed with an increasing delay as the number of repetitions grows, until it is no longer displayed. This processing helps the user learn key words through repetition, which effectively improves the learning effect.
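The word-list check above can be sketched as follows. The whitespace tokenization and the representation of the word list as a plain set of lowercase words are simplifying assumptions.

```python
def words_to_drill(transcript: str, known_words: set) -> list:
    """Return, in order of appearance, the words of the source-language
    transcript that are missing from the user's word list; playback would
    pause and loop on each of these."""
    seen, missing = set(), []
    for token in transcript.lower().split():
        word = token.strip(".,!?;:\"'")
        if word and word not in known_words and word not in seen:
            seen.add(word)
            missing.append(word)
    return missing
```

For the example in the text, a transcript containing "professional" is flagged when that word is absent from the user's list.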
In one example, repeatedly playing the audio stream may include the following steps: determining read-after duration information; and determining the playing interval between two adjacent plays of the audio stream according to the read-after duration information. For example, the user's read-after performance can be captured through the camera and microphone to determine the read-after duration. In general, the longer the read-after duration, the weaker the user's grasp of the passage, so a longer interval is used to give the user sufficient time to read along.
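Deriving the interval between two adjacent plays from the measured read-after duration could look like this sketch; the floor value and margin factor are illustrative assumptions.

```python
def playback_interval_s(read_after_s: float, floor_s: float = 1.0,
                        margin: float = 1.5) -> float:
    """A longer measured read-after duration suggests a weaker grasp, so
    the gap before the next play grows proportionally, never dropping
    below a small floor."""
    return max(floor_s, read_after_s * margin)
```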
In one example, the method may further comprise the following steps: collecting the user's read-after speech data, for example through a microphone; determining a read-after score from the speech data, where a short read-after duration and a waveform close to the reference waveform yield a higher score; and determining the number of repeated plays of the audio stream according to the score. For example, the higher the read-after score, the fewer the repeated plays. This processing helps the user learn key passages through repetition, which effectively improves the learning effect.
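Mapping the read-after score to a repeat count could look like this sketch; the 0–100 score scale and the repeat-count bounds are assumptions, not part of the embodiment.

```python
def repeat_count(read_after_score: float, max_repeats: int = 10,
                 min_repeats: int = 1) -> int:
    """Higher read-after score (0..100) means fewer repeated plays."""
    score = min(max(read_after_score, 0.0), 100.0)  # clamp to the scale
    span = max_repeats - min_repeats
    return max_repeats - round(span * score / 100.0)
```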
In one example, the method may further comprise the following steps: extracting the file segments of the multimedia file whose listening difficulty exceeds the user's hearing level; and storing those segments so that they can be played repeatedly. With this processing, the user can re-listen to unfamiliar audio passages at any time, realizing a replay function that effectively improves the language learning effect.
As can be seen from the foregoing embodiments, the multimedia file playing control method provided in the embodiments of the present application extracts, for the multimedia file currently played by the player, the audio stream corresponding to the playing progress and sends it to the server; determines the display delay duration information of the speech translation text; and displays the speech translation text of the audio stream returned by the server in the player according to that duration information. This processing controls how the translated subtitles are displayed, enabling both a delayed-subtitle effect and a real-time "see what you hear" effect; it therefore effectively improves the user experience, meets the user's language learning needs, and improves the user's learning results.
Sixteenth embodiment
Corresponding to the multimedia file playing method described above, the present application also provides a multimedia file playing system. Parts of this embodiment that are the same as the first embodiment are not described again; please refer to the corresponding parts of the first embodiment.
The multimedia file playing system provided by the present application may comprise a server and a client. The client is used for extracting, for the multimedia file currently played by the player, the audio stream corresponding to the playing progress; sending the audio stream to the server; and playing, in the player, the target-language speech data of the audio stream returned by the server. The server is used for determining the speech translation text through a speech translation model, determining the target-language speech data through a speech synthesis model, and returning the target-language speech data to the client.
For example, a user is playing an English lecture video (the original text is English) and at the same time wants to hear corresponding speech in a language other than his or her native language. Say the native language is Chinese and the user wants to hear a German rendition of the lecture, so as to learn German from the content of the English video, or to learn English and German together.
For another example, a user is playing a newly released English movie, the user's native language is Chinese, and the user wants to hear Chinese speech for the movie so as to watch it more comfortably.
In a specific implementation, the speech synthesis model can use mature prior art to convert the translated text into speech; since mature prior art can be adopted, it is not described again here.
As can be seen from the foregoing embodiments, in the multimedia file playing system provided in the embodiments of the present application, the client extracts, for the multimedia file currently played by the player, the audio stream corresponding to the playing progress; sends the audio stream to the server; and plays, in the player, the target-language speech data of the audio stream returned by the server. The server determines the speech translation text through a speech translation model, determines the target-language speech data through a speech synthesis model, and returns the target-language speech data to the client. This processing converts source-language speech into target-language speech and plays it for the user to hear; it therefore effectively meets the user's listening needs and improves the user experience.
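The division of labor in this embodiment can be sketched end to end as follows. The two model calls are stubbed out here, since the embodiment only requires that some speech translation model and some speech synthesis model exist; the stub names and return formats are illustrative assumptions.

```python
def translate_speech(audio_chunk: bytes, target_lang: str) -> str:
    # Stand-in for the speech translation model of the embodiment.
    return "[%s text for %d audio bytes]" % (target_lang, len(audio_chunk))

def synthesize_speech(text: str) -> bytes:
    # Stand-in for the speech synthesis model of the embodiment.
    return text.encode("utf-8")

def serve_audio_chunk(audio_chunk: bytes, target_lang: str) -> bytes:
    """Server side: source-language speech -> translated text ->
    target-language speech, which is returned to the client for playback."""
    return synthesize_speech(translate_speech(audio_chunk, target_lang))
```

The client's role is then simply to extract the audio stream at the current playing progress, call the server with it, and play the returned bytes in the player.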
Although the present application has been described with reference to preferred embodiments, they are not intended to limit the present application. Those skilled in the art can make variations and modifications without departing from the spirit and scope of the present application; therefore, the scope of protection of the present application should be determined by the appended claims.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media does not include transitory computer readable media (transitory media), such as modulated data signals and carrier waves.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Claims (28)

1. A multimedia file playing system, comprising:
the client is used for extracting an audio stream corresponding to the playing progress aiming at the multimedia file currently played by the player; sending the audio stream to a server; displaying the voice translation text of the audio stream returned by the server side in the player;
and the server is used for determining the voice translation text through a voice translation model and returning the voice translation text to the client.
2. A method for playing a multimedia file, comprising:
extracting an audio stream corresponding to the playing progress aiming at the multimedia file currently played by the player;
sending the audio stream to a server;
and displaying the voice translation text of the audio stream returned by the server side in the player.
3. The method of claim 2,
the player comprises a browser player;
the extracting of the audio stream corresponding to the playing progress includes:
and acquiring the audio stream through a data stream capturing module of the browser player.
4. The method of claim 2,
the audio stream comprises an audio stream of millisecond duration.
5. The method of claim 2,
the method further comprises the following steps:
performing compression processing on the audio stream;
the sending the audio stream to a server includes:
and sending the compressed audio stream to the server.
6. The method of claim 5, wherein the compression processing is performed on the audio stream in at least one of the following ways:
performing a down-sampling process on the audio stream;
and performing gain reduction processing on the audio stream according to the volume data of the audio stream.
7. The method of claim 6, wherein the performing down-sampling processing on the audio stream comprises:
determining a down-sampling rate;
and performing down-sampling processing on the audio stream according to the down-sampling rate.
8. The method of claim 5,
the player comprises a browser player;
the performing compression processing on the audio stream includes:
creating an audio input node according to the audio stream;
creating an audio handler for the audio stream according to the audio input node;
performing compression processing on the audio stream by an audio processing program.
9. The method of claim 2,
the extracting of the audio stream corresponding to the playing progress includes:
extracting an audio stream to be played;
after the audio stream is sent to the server, the audio stream to be played is played through the player, so that when the audio stream to be played is played, the speech translation text of the audio stream to be played is displayed.
10. The method of claim 2, further comprising:
and sending target language information to the server so that the server translates the audio stream into a text of the target language.
11. A method for playing a multimedia file, comprising:
receiving an audio stream corresponding to the playing progress of a currently played multimedia file sent by a client;
determining a speech translation text of the audio stream through a speech translation model;
and returning the voice translation text to the client so that the client displays the voice translation text when playing the audio stream.
12. A multimedia file playback apparatus, comprising:
the audio stream extraction unit is used for extracting an audio stream corresponding to the playing progress aiming at the multimedia file currently played by the player;
the audio stream sending unit is used for sending the audio stream to a server;
and the text display unit is used for displaying the voice translation text of the audio stream returned by the server side in the player.
13. An electronic device, comprising:
a processor; and
a memory for storing a program for implementing a multimedia file playing method, the device executing the following steps after being powered on and running the program of the method through the processor: extracting an audio stream corresponding to the playing progress aiming at the multimedia file currently played by the player; sending the audio stream to a server; and displaying the voice translation text of the audio stream returned by the server side in the player.
14. A multimedia file playback apparatus, comprising:
the data receiving unit is used for receiving an audio stream which is sent by the client and corresponds to the playing progress of the currently played multimedia file;
the translation unit is used for determining a voice translation text of the audio stream through a voice translation model;
and the text returning unit is used for returning the voice translation text to the client so that the client displays the voice translation text when playing the audio stream.
15. An electronic device, comprising:
a processor; and
a memory for storing a program for implementing a multimedia file playing method, the device executing the following steps after being powered on and running the program of the method through the processor: receiving an audio stream corresponding to the playing progress of a currently played multimedia file sent by a client; determining a speech translation text of the audio stream through a speech translation model; and returning the voice translation text to the client so that the client displays the voice translation text when playing the audio stream.
16. A speech translation model quality assessment system, comprising:
the server is used for collecting at least one multimedia file for evaluating the quality of the real-time voice translation model and sending the multimedia file to the client; receiving an audio stream which is sent by a client and corresponds to the playing progress of the multimedia file; determining a voice translation text of the audio stream through the translation model, and returning the voice translation text to a client; receiving voice translation quality information which is sent by a client and corresponds to the multimedia file; determining quality information of the translation model according to the quality information of at least one multimedia file;
the client is used for playing the multimedia file through a browser and extracting the audio stream; and displaying the speech translation text in a player; and determining the voice translation quality information according to the voice translation text.
17. A speech translation model quality assessment method is characterized by comprising the following steps:
collecting at least one multimedia file for evaluating the quality of a real-time speech translation model, and sending the multimedia file to a client;
receiving an audio stream which is sent by a client and corresponds to the playing progress of the multimedia file;
determining a voice translation text of the audio stream through the translation model, and returning the voice translation text to a client;
receiving voice translation quality information which is sent by a client and corresponds to the multimedia file;
determining quality information of the translation model based on the quality information of at least one multimedia file.
18. A speech translation model quality assessment method is characterized by comprising the following steps:
playing a multimedia file for evaluating the quality of the real-time voice translation model through a browser;
extracting an audio stream corresponding to the playing progress of the multimedia file, and sending the audio stream to a server;
displaying the voice translation text of the audio stream returned by the server side in a player;
and determining voice translation quality information corresponding to the multimedia file according to the voice translation text, and sending the quality information to a server.
19. A multimedia file playing control method is characterized by comprising the following steps:
aiming at a multimedia file currently played by a player, extracting an audio stream corresponding to the playing progress, and sending the audio stream to a server;
determining display delay time length information of the voice translation text;
and displaying the voice translation text of the audio stream returned by the server side in the player according to the duration information.
20. The method of claim 19, wherein determining the display delay duration information of the speech translation text comprises:
and determining the duration information according to the hearing level information of the user.
21. The method of claim 19, further comprising:
if the voice hearing difficulty exceeds the hearing level of the user, pausing playing the multimedia file and repeatedly playing the played file segment;
and adjusting the duration information according to the repeated playing times.
22. The method of claim 19, further comprising:
and if the original text of the audio stream comprises words which are not included in the user source language word list, repeatedly playing the audio stream.
23. The method of claim 22, wherein said repeatedly playing the audio stream comprises:
determining reading following duration information;
and determining the playing time interval of the two adjacent audio streams according to the follow-up reading time length information.
24. The method of claim 19, further comprising:
collecting the following reading voice data of a user;
determining a reading following score according to the reading following voice data;
and determining the repeated playing times of the audio stream according to the reading-after score.
25. The method of claim 19, further comprising:
intercepting a file segment of the multimedia file, wherein the voice hearing difficulty of the file exceeds the hearing level of a user;
and storing the file segments so as to repeatedly play the file segments.
26. A multimedia file playing system, comprising:
the client is used for extracting an audio stream corresponding to the playing progress aiming at the multimedia file currently played by the player; sending the audio stream to a server; playing the voice data of the target language of the audio stream returned by the server side in the player;
the server is used for determining the voice translation text through a voice translation model; determining voice data of the target language through a voice synthesis model; and returning the voice data of the target language to the client.
27. A method for playing a multimedia file, comprising:
extracting an audio stream corresponding to the playing progress aiming at the multimedia file currently played by the player;
sending the audio stream to a server;
and playing the voice data of the target language of the audio stream returned by the server side in the player.
28. A method for playing a multimedia file, comprising:
receiving an audio stream corresponding to the playing progress of a currently played multimedia file sent by a client;
determining the voice translation text through a voice translation model;
determining voice data of the target language through a voice synthesis model;
and returning the voice data of the target language to the client.
CN202010376043.5A 2020-05-06 2020-05-06 Multimedia file playing system, related method, device and equipment Pending CN113630620A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010376043.5A CN113630620A (en) 2020-05-06 2020-05-06 Multimedia file playing system, related method, device and equipment


Publications (1)

Publication Number Publication Date
CN113630620A true CN113630620A (en) 2021-11-09

Family

ID=78376686

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010376043.5A Pending CN113630620A (en) 2020-05-06 2020-05-06 Multimedia file playing system, related method, device and equipment

Country Status (1)

Country Link
CN (1) CN113630620A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114630144A (en) * 2022-03-03 2022-06-14 广州方硅信息技术有限公司 Audio replacement method, system and device in live broadcast room and computer equipment
WO2023098412A1 (en) * 2021-11-30 2023-06-08 华为技术有限公司 Subtitle control method, electronic device, and computer-readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103226947A (en) * 2013-03-27 2013-07-31 广东欧珀移动通信有限公司 Mobile terminal-based audio processing method and device
JP2014056241A (en) * 2010-03-30 2014-03-27 Polycom Inc Method and system for adding translation in videoconference
CN105848004A (en) * 2016-05-16 2016-08-10 乐视控股(北京)有限公司 Caption playing method and caption playing device
CN108600773A (en) * 2018-04-25 2018-09-28 腾讯科技(深圳)有限公司 Caption data method for pushing, subtitle methods of exhibiting, device, equipment and medium



Similar Documents

Publication Publication Date Title
US11252444B2 (en) Video stream processing method, computer device, and storage medium
US11463779B2 (en) Video stream processing method and apparatus, computer device, and storage medium
CN107911646B (en) Method and device for sharing conference and generating conference record
CN110557678B (en) Video processing method, device and equipment
CN111050201B (en) Data processing method and device, electronic equipment and storage medium
CN108259971A (en) Subtitle adding method, device, server and storage medium
CN107979763B (en) Virtual reality equipment video generation and playing method, device and system
EP4099709A1 (en) Data processing method and apparatus, device, and readable storage medium
KR101983107B1 (en) Method for inserting information push into live video streaming, server and terminal
CN112653902B (en) Speaker recognition method and device and electronic equipment
Waltl et al. Increasing the user experience of multimedia presentations with sensory effects
CN112616062B (en) Subtitle display method and device, electronic equipment and storage medium
CN102655585B (en) Video conference system and time delay testing method, device and system thereof
CN112423081B (en) Video data processing method, device and equipment and readable storage medium
CN113035199B (en) Audio processing method, device, equipment and readable storage medium
CN111629253A (en) Video processing method and device, computer readable storage medium and electronic equipment
CN112437337A (en) Method, system and equipment for realizing live broadcast real-time subtitles
US20180286381A1 (en) Information processing method
CN113630620A (en) Multimedia file playing system, related method, device and equipment
CN111479124A (en) Real-time playing method and device
CN111629222B (en) Video processing method, device and storage medium
US20230300429A1 (en) Multimedia content sharing method and apparatus, device, and medium
CN114341866A (en) Simultaneous interpretation method, device, server and storage medium
CN110324702B (en) Information pushing method and device in video playing process
US20220007078A1 (en) An apparatus and associated methods for presentation of comments

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20211109