CN113630620A

CN113630620A - Multimedia file playing system, related method, device and equipment

Info

Publication number: CN113630620A
Application number: CN202010376043.5A
Authority: CN
Inventors: 周明智; 龙舟
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2020-05-06
Filing date: 2020-05-06
Publication date: 2021-11-09

Abstract

The application discloses a multimedia file playing system and a related method, apparatus and device. For a multimedia file currently being played by a client player, the client extracts the audio stream corresponding to the playing progress, sends the audio stream to a server, and displays in the player the voice translation text of the audio stream returned by the server; the server determines the voice translation text through a voice translation model and returns it to the client. With this processing mode, the voice translation service is called on the audio stream currently being played for the user, so that speech is translated instantly. A user can therefore watch a newly added file with subtitles displayed synchronously, achieving the real-time subtitle effect of 'what you hear is what you see', and the subtitle viewing requirements of users of different languages can be met.

Description

Multimedia file playing system, related method, device and equipment

Technical Field

The application relates to the technical field of speech processing, and in particular to a multimedia file playing system, method and apparatus, a voice translation model quality evaluation system and method, and an electronic device.

Background

With the continuous development of internet technology, video websites have been increasingly widely used. When a user watches the audio and video files, the video website can accurately match the current playing progress of the audio and video files and display multi-language subtitles in real time, so that the user can better understand the audio and video contents.

At present, video websites mainly adopt an offline voice translation scheme that generates multi-language subtitles from the video file. Specifically, the scheme calls a voice recognition and translation service on the complete voice file provided by the user to recognize the whole file; only after the whole voice file has been translated can the user see translated subtitles synchronized with the sound and picture.

However, in the process of implementing the invention, the inventors found that this technical scheme has at least the following problems: 1) for newly added audio and video, because the translated subtitles of the new file are generated by offline voice translation, the user has to wait; synchronized translated subtitles become available only after the system has completed voice recognition and translation of the entire new file, and until then the file can only be watched without subtitles, so the real-time 'what you hear is what you see' subtitle effect cannot be achieved; 2) offline voice translation usually generates translated subtitles in only one common language and cannot meet the subtitle viewing requirements of users of different languages. In summary, how to implement real-time voice translation so that subtitles are synchronized with sound and picture, and so that the viewing requirements of users of different languages are met, is a technical problem urgently needing to be solved by those skilled in the art.

Disclosure of Invention

The application provides a multimedia file playing system, which aims to solve the problem that subtitles cannot be displayed when a new file is watched in the prior art. The application further provides a multimedia file playing method and device, a voice translation model quality evaluation system and method and electronic equipment.

The application provides a multimedia file playing system, comprising:

the client is used for extracting, for the multimedia file currently played by the player, an audio stream corresponding to the playing progress; sending the audio stream to a server; and displaying, in the player, the voice translation text of the audio stream returned by the server;

and the server is used for determining the voice translation text through a voice translation model and returning the voice translation text to the client.

The application also provides a multimedia file playing method, which comprises the following steps:

extracting, for the multimedia file currently played by the player, an audio stream corresponding to the playing progress;

sending the audio stream to a server;

and displaying the voice translation text of the audio stream returned by the server side in the player.

Optionally, the player comprises a browser player;

the extracting of the audio stream corresponding to the playing progress includes:

and acquiring the audio stream through a data stream capturing module of the browser player.

Optionally, the audio stream includes an audio stream of millisecond duration.

Optionally, the method further includes:

performing compression processing on the audio stream;

the sending the audio stream to a server includes:

and sending the compressed audio stream to the server.

Optionally, the compressing the audio stream is performed by at least one of the following methods:

performing a down-sampling process on the audio stream;

and performing gain reduction processing on the audio stream according to the volume data of the audio stream.

Optionally, the performing down-sampling processing on the audio stream includes:

determining a down-sampling rate;

and performing down-sampling processing on the audio stream according to the down-sampling rate.

Optionally, the player comprises a browser player;

the performing compression processing on the audio stream includes:

creating an audio input node according to the audio stream;

creating an audio handler for the audio stream according to the audio input node;

and performing compression processing on the audio stream by the audio handler.

Optionally, the extracting the audio stream corresponding to the playing progress includes:

extracting an audio stream to be played;

after the audio stream is sent to the server, the audio stream to be played is played through the player, so that when the audio stream to be played is played, the speech translation text of the audio stream to be played is displayed.

Optionally, the method further includes:

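and sending target language information to the server so that the server translates the audio stream into a text of the target language.

The application also provides a multimedia file playing method, which comprises the following steps: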

receiving an audio stream corresponding to the playing progress of a currently played multimedia file sent by a client;

determining a speech translation text of the audio stream through a speech translation model;

and returning the voice translation text to the client so that the client displays the voice translation text when playing the audio stream.

The present application further provides a multimedia file playing apparatus, including:

the audio stream extraction unit is used for extracting, for the multimedia file currently played by the player, an audio stream corresponding to the playing progress;

the audio stream sending unit is used for sending the audio stream to a server;

and the text display unit is used for displaying the voice translation text of the audio stream returned by the server side in the player.

The present application further provides an electronic device, comprising:

a processor; and

a memory for storing a program implementing a multimedia file playing method, wherein after the device is powered on and the program is run by the processor, the following steps are executed: extracting, for the multimedia file currently played by the player, an audio stream corresponding to the playing progress; sending the audio stream to a server; and displaying, in the player, the voice translation text of the audio stream returned by the server.

The present application further provides a multimedia file playing apparatus, including:

the data receiving unit is used for receiving an audio stream which is sent by the client and corresponds to the playing progress of the currently played multimedia file;

the translation unit is used for determining a voice translation text of the audio stream through a voice translation model;

and the text returning unit is used for returning the voice translation text to the client so that the client displays the voice translation text when playing the audio stream.

The present application further provides an electronic device, comprising:

a processor; and

a memory for storing a program for implementing a multimedia file playing method, the device executing the following steps after being powered on and running the program of the method through the processor: receiving an audio stream corresponding to the playing progress of a currently played multimedia file sent by a client; determining a speech translation text of the audio stream through a speech translation model; and returning the voice translation text to the client so that the client displays the voice translation text when playing the audio stream.

The present application further provides a speech translation model quality evaluation system, including:

the server is used for collecting at least one multimedia file for evaluating the quality of the real-time voice translation model and sending the multimedia file to the client; receiving an audio stream which is sent by a client and corresponds to the playing progress of the multimedia file; determining a voice translation text of the audio stream through the translation model, and returning the voice translation text to a client; receiving voice translation quality information which is sent by a client and corresponds to the multimedia file; determining quality information of the translation model according to the quality information of at least one multimedia file;

the client is used for playing the multimedia file through a browser and extracting the audio stream; and displaying the speech translation text in a player; and determining the voice translation quality information according to the voice translation text.

The application also provides a method for evaluating the quality of the voice translation model, which comprises the following steps:

collecting at least one multimedia file for evaluating the quality of a real-time speech translation model, and sending the multimedia file to a client;

receiving an audio stream which is sent by a client and corresponds to the playing progress of the multimedia file;

determining a voice translation text of the audio stream through the translation model, and returning the voice translation text to a client;

receiving voice translation quality information which is sent by a client and corresponds to the multimedia file;

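and determining quality information of the translation model based on the quality information of at least one multimedia file.

The application also provides a method for evaluating the quality of the voice translation model, which comprises the following steps: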

playing a multimedia file for evaluating the quality of the real-time voice translation model through a browser;

extracting an audio stream corresponding to the playing progress of the multimedia file, and sending the audio stream to a server;

displaying the voice translation text of the audio stream returned by the server side in a player;

and determining voice translation quality information corresponding to the multimedia file according to the voice translation text, and sending the quality information to a server.
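
The final server-side step, deriving model quality from per-file quality information, can be illustrated with a minimal Python sketch. The application does not specify the form of the quality information or the aggregation rule; the numeric scores, the arithmetic mean, the function name model_quality and the file names below are all assumptions made for illustration.

```python
from statistics import mean
from typing import Dict


def model_quality(file_scores: Dict[str, float]) -> float:
    """Aggregate per-file translation quality into one model-level score.

    Assumption: each client reports one numeric score per multimedia file,
    and the model score is the plain average of those scores.
    """
    if not file_scores:
        raise ValueError("at least one multimedia file is required")
    return mean(file_scores.values())


# Hypothetical per-file scores reported by clients.
scores = {"lecture.mp4": 0.75, "movie.mp4": 0.25}
print(model_quality(scores))  # 0.5
```

A weighted average (for example, by file duration) would fit the same interface if some files should count more than others.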

The application also provides a multimedia file playing control method, which comprises the following steps:

extracting, for a multimedia file currently played by a player, an audio stream corresponding to the playing progress, and sending the audio stream to a server;

determining display delay time length information of the voice translation text;

and displaying the voice translation text of the audio stream returned by the server side in the player according to the duration information.

Optionally, the determining the display delay duration information of the speech translation text includes:

and determining the duration information according to the listening level information of the user.

Optionally, the method further includes:

if the listening difficulty of the speech exceeds the listening level of the user, pausing playback of the multimedia file and replaying the file segment that has just been played;

and adjusting the duration information according to the number of times the segment is replayed.
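
One possible way to derive and adjust the display delay is sketched below in Python. The application only states that the delay depends on the user's hearing (listening comprehension) level and is adjusted by the number of replays; the linear mapping, the direction of the adjustment, the parameter values and the function names initial_delay and adjusted_delay are assumptions made for illustration.

```python
def initial_delay(listening_level: int, max_level: int = 5,
                  max_delay_s: float = 5.0) -> float:
    """Map a listening level to a subtitle display delay in seconds.

    Assumption: a stronger listener gets a longer delay, so that the
    translation appears only after the user has tried to understand.
    """
    if not 0 <= listening_level <= max_level:
        raise ValueError("listening_level out of range")
    return max_delay_s * listening_level / max_level


def adjusted_delay(delay_s: float, repeat_count: int, step_s: float = 0.5) -> float:
    """Shorten the delay by step_s per replay, never below zero.

    Assumption: more replays mean the segment is hard for this user,
    so the translated text should appear sooner.
    """
    return max(0.0, delay_s - step_s * repeat_count)


delay = initial_delay(listening_level=3)
print(delay)                      # 3.0
print(adjusted_delay(delay, 2))   # 2.0
```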

Optionally, the method further includes:

and if the original text of the audio stream comprises words which are not included in the user's source-language vocabulary list, repeatedly playing the audio stream.

Optionally, the repeatedly playing the audio stream includes:

determining follow-along reading duration information;

and determining the interval between two successive playbacks of the audio stream according to the follow-along reading duration information.

Optionally, the method further includes:

collecting follow-along reading voice data of the user;

determining a follow-along reading score according to the follow-along reading voice data;

and determining the number of times the audio stream is repeatedly played according to the follow-along reading score.
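
The relation between the follow-along (reading-after) score and the number of repetitions is not specified in the application. The Python sketch below shows one hypothetical rule in which lower scores lead to more repetitions; the 0 to 100 score scale, the passing threshold, the cap and the function name repeats_for_score are assumptions.

```python
def repeats_for_score(score: float, passing: float = 80.0,
                      max_repeats: int = 3) -> int:
    """Return how many more times the audio stream should be replayed.

    Assumption: a score at or above `passing` needs no replay; below it,
    the replay count grows with the shortfall, capped at `max_repeats`.
    """
    if score >= passing:
        return 0
    shortfall = (passing - score) / passing  # fraction in (0, 1]
    return min(max_repeats, 1 + int(shortfall * max_repeats))


for s in (90.0, 60.0, 40.0, 0.0):
    print(s, repeats_for_score(s))
```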

Optionally, the method further includes:

extracting, from the multimedia file, a file segment whose listening difficulty exceeds the listening level of the user;

and storing the file segments so as to repeatedly play the file segments.

The present application further provides a multimedia file playing system, including:

the client is used for extracting, for the multimedia file currently played by the player, an audio stream corresponding to the playing progress; sending the audio stream to a server; and playing, in the player, the voice data of the target language of the audio stream returned by the server;

the server is used for determining the voice translation text through a voice translation model; determining voice data of the target language through a voice synthesis model; and returning the voice data of the target language to the client.

The present application further provides a multimedia file playing method, including:

extracting, for the multimedia file currently played by the player, an audio stream corresponding to the playing progress;

sending the audio stream to a server;

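and playing, in the player, the voice data of the target language of the audio stream returned by the server.

The present application further provides a multimedia file playing method, including:

receiving an audio stream, sent by a client, corresponding to the playing progress of a currently played multimedia file;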

determining the voice translation text through a voice translation model;

determining voice data of the target language through a voice synthesis model;

and returning the voice data of the target language to the client.

The present application also provides a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to perform the various methods described above.

The present application also provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the various methods described above.

Compared with the prior art, the method has the following advantages:

According to the multimedia file playing system provided by the embodiment of the application, for the multimedia file currently played by the client player, the client extracts the audio stream corresponding to the playing progress, sends the audio stream to a server, and displays in the player the voice translation text of the audio stream returned by the server; the server determines the voice translation text through a voice translation model and returns it to the client. This processing mode calls the voice translation service on the audio stream currently being played for the user, so that speech is translated instantly; a user can therefore watch a newly added file with subtitles displayed synchronously, achieving the real-time subtitle effect of 'what you hear is what you see', and the subtitle viewing requirements of users of different languages can be met.

According to the voice translation model quality evaluation system provided by the embodiment of the application, the server collects a plurality of multimedia files for evaluating the quality of a real-time voice translation model and sends them to a client; receives the audio stream, sent by the client, corresponding to the playing progress of a multimedia file; determines the voice translation text of the audio stream through the translation model and returns it to the client; receives the voice translation quality information, sent by the client, corresponding to the multimedia file; and determines quality information of the translation model according to the quality information of the plurality of multimedia files. The client plays the multimedia file through a browser, extracts the audio stream, displays the voice translation text in a player, and determines the voice translation quality information according to the voice translation text. With this processing mode, multimedia files with rich content can easily be collected and used as model evaluation data, and no dedicated personnel are needed to produce real-time voice data by holding a conference; model evaluation efficiency is therefore effectively improved, model evaluation cost is reduced, and the time needed to bring the model online is shortened.

According to the multimedia file playing control method provided by the embodiment of the application, for the multimedia file currently played by a player, an audio stream corresponding to the playing progress is extracted and sent to a server; display delay duration information of the voice translation text is determined; and the voice translation text of the audio stream returned by the server is displayed in the player according to the duration information. This processing mode makes the display of translated subtitles controllable: translated subtitles can be displayed after a delay, or displayed in real time as the audio is heard; user experience is therefore effectively improved, the language learning requirements of the user are met, and the language learning effect is improved.

According to the multimedia file playing system provided by the embodiment of the application, for the multimedia file currently played by the player, the client extracts an audio stream corresponding to the playing progress, sends the audio stream to a server, and plays in the player the target-language voice data of the audio stream returned by the server; the server determines the voice translation text through a voice translation model, determines the voice data of the target language through a voice synthesis model, and returns the voice data of the target language to the client. This processing mode converts speech in the source language into speech in the target language and plays it to the user; the listening requirements of the user are therefore effectively met and user experience is improved.

Drawings

FIG. 1 is a schematic diagram of a multimedia file playing system according to an embodiment of the present application;

FIG. 2 is a schematic diagram of an application scenario of an embodiment of a multimedia file playing system provided in the present application;

FIG. 3 is a schematic diagram of device interaction of an embodiment of a multimedia file playing system provided in the present application;

FIG. 4 is a schematic interaction diagram of an embodiment of a multimedia file playing system provided by the present application;

FIG. 5 is a schematic processing flow diagram illustrating an embodiment of a multimedia file playback system provided by the present application;

FIG. 6 is a schematic diagram illustrating an application scenario of an embodiment of a speech translation model quality assessment system provided by the present application;

FIG. 7 is a schematic diagram illustrating device interaction of an embodiment of a speech translation model quality assessment system provided by the present application.

Detailed Description

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. The application can, however, be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from the spirit of the application; the application is therefore not limited to the specific implementations disclosed below.

In the application, a multimedia file playing system, a multimedia file playing method and a multimedia file playing device, a voice translation model quality evaluation system and a voice translation model quality evaluation method and electronic equipment are provided. Each of the schemes is described in detail in the following examples.

First embodiment

Please refer to fig. 1, which is a block diagram of an embodiment of a multimedia file playing system according to the present application. The system comprises: a server 1 and a client 2.

The server 1 may be a server deployed on a cloud server, or may be a server dedicated to playing a multimedia file, and may be deployed in a data center. The server may be a cluster server or a single server.

The client 2 includes, but is not limited to, a mobile communication device such as a mobile phone or smart phone, and may also be a terminal device such as a personal computer, a PAD, or an iPad.

Please refer to FIG. 2, which is a schematic diagram of an application scenario of an embodiment of a multimedia file playing system according to the present application. The server and the client can be connected through a network; for example, the client can connect through WIFI or the like. A user plays, through the client, a multimedia file provided by the server, and the file has no subtitles. While the user watches the file, the client sends the audio stream corresponding to the current viewing progress to the server; the target language text of the audio stream is determined through the voice translation model of the server and sent back to the client, and the client displays the text. The user can thus watch translated subtitles synchronously while watching a multimedia file that has no subtitles, achieving the real-time subtitle effect of 'what you hear is what you see', so that the user can better understand the audio and video content.

Please refer to FIG. 3, which is a schematic device interaction diagram of an embodiment of a multimedia file playing system according to the present application. In one example, for the multimedia file currently played by the client player, the client extracts an audio stream corresponding to the playing progress, sends the audio stream to the server, and displays in the player the voice translation text of the audio stream returned by the server; the server determines the voice translation text through a voice translation model and returns it to the client.

The multimedia file can be an audio file, such as an English speech audio; or a video file such as a movie or a theatrical work.

The client player may be a browser (e.g., the IE browser), a desktop player (e.g., the Microsoft multimedia player), a mobile application player installed on a smart phone (e.g., the Xiami Music application player), and the like.

In one example, the client player is a browser, through which a user opens a video website and looks for multimedia files of interest to view. While the multimedia file is playing, the audio stream can be acquired through a data stream capturing module of the browser player. For example, HTMLMediaElement.captureStream() can be used to acquire the audio stream originating from an <audio> or <video> element in a web page.

The speech translation model deployed at the server is a speech processing model that can transcribe speech into text in the source language and translate it into text in the target language. Using the model, the server can call the voice recognition and translation service in real time on the audio the user is currently playing, generating a voice translation subtitle stream that is provided to the user for viewing.

In practical applications, different users may require subtitles in different languages, so in a specific implementation the user can specify a target language, and the client can send target language information to the server so that the server translates the audio stream into text in the target language. The server may include speech translation models for multiple languages and select the speech translation model corresponding to the target language to determine the translation text of the audio stream.
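
Selecting a model by target language can be sketched as a simple lookup. The following Python sketch is illustrative only: the SpeechTranslator type, the select_translator function, the language codes, the fallback to a default language and the toy lambda "models" are all assumptions and are not described in the application.

```python
from typing import Callable, Dict

# Hypothetical stand-in for a per-language speech translation model:
# it maps raw audio bytes to translated text.
SpeechTranslator = Callable[[bytes], str]


def select_translator(models: Dict[str, SpeechTranslator],
                      target_language: str,
                      default_language: str = "en") -> SpeechTranslator:
    """Return the model for the requested target language."""
    if target_language in models:
        return models[target_language]
    # Falling back to a default language is an assumption of this sketch.
    return models[default_language]


# Toy models that only report how much audio they received.
models: Dict[str, SpeechTranslator] = {
    "en": lambda audio: f"[en] {len(audio)} bytes",
    "ja": lambda audio: f"[ja] {len(audio)} bytes",
}
translate = select_translator(models, "ja")
print(translate(b"\x00" * 320))  # [ja] 320 bytes
```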

It should be noted that the client sends the audio stream to the server for recognition and translation while it is playing that audio stream, and the server then returns the translated text to the client for display, so the text arrives slightly after the sound. However, because each audio stream may have a duration on the order of milliseconds, such as 10 milliseconds, the user still perceives the sound and the subtitles as synchronized.
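
To make the millisecond-scale granularity concrete, the following Python sketch splits decoded samples into consecutive frames of a fixed duration. It is a language-neutral illustration rather than the browser implementation; the function name split_into_frames and the 16 kHz sample rate are assumptions, while the 10 millisecond duration is taken from the example in the text.

```python
from typing import Iterator, List, Sequence


def split_into_frames(samples: Sequence[float],
                      sample_rate: int,
                      frame_ms: int = 10) -> Iterator[List[float]]:
    """Yield consecutive frames, each covering frame_ms milliseconds."""
    frame_len = sample_rate * frame_ms // 1000
    if frame_len <= 0:
        raise ValueError("frame too short for this sample rate")
    for start in range(0, len(samples), frame_len):
        # The final frame may be shorter than frame_len.
        yield list(samples[start:start + frame_len])


one_second = [0.0] * 16000  # one second of silence at an assumed 16 kHz
frames = list(split_into_frames(one_second, sample_rate=16000, frame_ms=10))
print(len(frames), len(frames[0]))  # 100 160
```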

Please refer to FIG. 4, which is a schematic device interaction diagram of an embodiment of a multimedia file playing system according to the present application. In this embodiment, for the multimedia file currently played by the client player, the client extracts an audio stream corresponding to the playing progress, performs compression processing on the audio stream, and sends the compressed audio stream to the server; the client displays in the player the voice translation text of the audio stream returned by the server; and the server determines the voice translation text through a voice translation model and returns it to the client. With this processing mode, the amount of data transmitted over the network is reduced, so network resource consumption is reduced, the speed at which translated subtitles are displayed synchronously is effectively improved, and user experience is improved.

In one example, the compression processing on the audio stream may be performed as follows: down-sampling processing is performed on the audio stream. In a specific implementation, the down-sampling processing may include the following steps: determining a down-sampling rate; and performing down-sampling processing on the audio stream according to the down-sampling rate. For example, with a 48 kHz sampling rate, the volume of the audio stream may be reduced by about 1/3.
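
The two down-sampling steps can be sketched as follows in Python. This is a minimal illustration under stated assumptions, not the method of the application: it handles only integer ratios, uses block averaging as a crude low-pass filter, and the function name downsample and the 48000 to 16000 rates in the usage line are chosen for the example only.

```python
from typing import List, Sequence


def downsample(samples: Sequence[float],
               source_rate: int,
               target_rate: int) -> List[float]:
    """Reduce the sample rate by averaging each group of `factor` samples."""
    # Step 1: determine the down-sampling rate (here, an integer factor).
    if target_rate <= 0 or source_rate % target_rate != 0:
        raise ValueError("source_rate must be an integer multiple of target_rate")
    factor = source_rate // target_rate
    # Step 2: down-sample; a trailing partial group is dropped.
    usable = len(samples) - len(samples) % factor
    return [
        sum(samples[i:i + factor]) / factor
        for i in range(0, usable, factor)
    ]


frame = [0.0, 3.0, 6.0, 1.0, 1.0, 1.0]
print(downsample(frame, source_rate=48000, target_rate=16000))  # [3.0, 1.0]
```

With a factor of 3 the output holds one third as many samples as the input; other ratios change the saving accordingly.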

In another example, the compression processing on the audio stream may be performed as follows: gain reduction processing is performed on the audio stream according to the volume data of the audio stream. For example, if the volume of the audio stream is large and exceeds a certain threshold, gain reduction processing can be performed on the audio stream to lower its volume, so as to achieve a compression effect.
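
Threshold-based gain reduction can be sketched as below. The application does not state how the volume is measured or how much gain is removed; measuring the peak amplitude of a frame, the threshold value of 0.5 and the function name reduce_gain are assumptions. How much this lowers the transmitted data volume depends on the audio encoding, which the text does not specify.

```python
from typing import List, Sequence


def reduce_gain(samples: Sequence[float],
                peak_threshold: float = 0.5) -> List[float]:
    """Scale a frame down so that its peak does not exceed peak_threshold."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak <= peak_threshold:
        return list(samples)  # quiet enough: leave the frame unchanged
    gain = peak_threshold / peak
    return [s * gain for s in samples]


print(reduce_gain([0.25, -1.0, 0.5]))  # [0.125, -0.5, 0.25]
```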

In this embodiment, the player comprises a browser player. In a specific implementation, performing compression processing on the audio stream includes: creating an audio input node according to the audio stream; creating an audio handler (audio processing program) for the audio stream according to the audio input node; and performing compression processing on the audio stream by the audio handler.

Please refer to FIG. 5, which is a schematic processing flow diagram of an embodiment of a multimedia file playing system according to the present application. In this embodiment, a user plays a multimedia file through a browser, and the subtitle synchronization process includes the following steps:

1. The browser plays the audio or video file through an <audio> or <video> tag.

2. The audio stream emitted by the <audio> or <video> element in step 1 is acquired with HTMLMediaElement.captureStream().

3. An audio input node is created from the audio stream in conjunction with the audio context.

4. An audio handler is created with BaseAudioContext.createScriptProcessor(), and the node created in step 3 is used as the input of the audio handler.

5. In the audio handler, the down-sampling module reduces the transmission volume of the audio stream as actually needed, and the audio stream is then sent to the cloud speech recognition and translation service. The translation result returned in real time is presented to the user.

6. While recognition is in progress, the audio stream is also sent to the user's default playing device (such as the browser), so that sound, picture and subtitles are synchronized.
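
The per-frame work of steps 5 and 6 can be sketched as a small loop. The Python sketch below is a language-neutral illustration, not the browser code: the function name run_pipeline and all callback names are hypothetical, the translation call is synchronous for simplicity (a real service call would be asynchronous), and the toy compress and translate callbacks exist only to make the example runnable.

```python
from typing import Callable, Iterable, List, Sequence

Frame = Sequence[float]


def run_pipeline(frames: Iterable[Frame],
                 compress: Callable[[Frame], Frame],
                 translate: Callable[[Frame], str],
                 show_subtitle: Callable[[str], None],
                 play: Callable[[Frame], None]) -> None:
    """Handle each captured frame: compress, translate, display, play."""
    for frame in frames:
        text = translate(compress(frame))  # step 5: reduce volume, call service
        if text:
            show_subtitle(text)            # step 5: present the returned result
        play(frame)                        # step 6: pass the original audio on


shown: List[str] = []
played: List[Frame] = []
run_pipeline(
    frames=[[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]],
    compress=lambda f: f[::2],                  # toy 2:1 decimation
    translate=lambda f: f"{len(f)} samples",    # fake translation service
    show_subtitle=shown.append,
    play=played.append,
)
print(shown)  # ['2 samples', '2 samples']
```

Passing the stages in as callbacks keeps the loop independent of any particular capture, network or playback API.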

In one example, extracting the audio stream corresponding to the playing progress may include the following steps: extracting an audio stream that is about to be played; and, after that audio stream has been sent to the server, playing it through the player, so that its voice translation text is displayed when it is played. With this processing mode, the audio stream is translated in advance, which effectively improves the synchronization between the sound and picture and the translated subtitles.

As can be seen from the foregoing embodiments, in the multimedia file playing system provided in the embodiments of the present application, for the multimedia file currently played by the client player, the client extracts an audio stream corresponding to the playing progress, sends the audio stream to a server, and displays in the player the voice translation text of the audio stream returned by the server; the server determines the voice translation text through a voice translation model and returns it to the client. This processing mode calls the voice translation service on the audio stream currently being played for the user, so that speech is translated instantly; a user can therefore watch a newly added file with subtitles displayed synchronously, achieving the real-time subtitle effect of 'what you hear is what you see', and the subtitle viewing requirements of users of different languages can be met.

Second embodiment

Corresponding to the multimedia file playing system described above, the application also provides a multimedia file playing method. The method may be executed by, but is not limited to, a client, and may also be executed by other terminal devices. Parts of this embodiment that are the same as the first embodiment are not described again; please refer to the corresponding parts of the first embodiment.

In this embodiment, the method includes the steps of:

Step 1: extracting, for the multimedia file currently played by the player, an audio stream corresponding to the playing progress;

Step 2: sending the audio stream to a server;

Step 3: displaying, in the player, the voice translation text of the audio stream returned by the server.

The player includes, but is not limited to, a browser player; the audio stream corresponding to the playing progress may be extracted as follows: acquiring the audio stream through a data stream capturing module of the browser player.

The audio stream comprises an audio stream of millisecond duration.

In one example, the method may further comprise the step of: performing compression processing on the audio stream. Correspondingly, the sending of the audio stream to the server may be performed as follows: sending the compressed audio stream to the server.

In one example, the compression processing on the audio stream may use at least one of the following: 1) performing down-sampling processing on the audio stream; 2) performing gain reduction processing on the audio stream according to the volume data of the audio stream.
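As an illustration of the second option, gain reduction driven by volume data might look like the sketch below. The text does not say how the volume data is measured or how much the gain is reduced; using the peak absolute sample value as the volume measure and a hypothetical `target_peak` of 0.5 are assumptions made only for this example.

```python
from typing import List


def reduce_gain(samples: List[float], target_peak: float = 0.5) -> List[float]:
    """Scale samples down so that the peak absolute value does not exceed target_peak.

    Assumption: samples are floats in [-1.0, 1.0] and the peak value is the
    "volume data" used to choose the gain.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak <= target_peak:
        return list(samples)       # already quiet enough; leave unchanged
    scale = target_peak / peak     # gain factor between 0 and 1
    return [s * scale for s in samples]


quiet = reduce_gain([0.1, -0.2, 0.05])
loud = reduce_gain([1.0, -0.5, 0.25])
```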

In one example, the performing down-sampling processing on the audio stream may include the sub-steps of: determining a down-sampling rate; and performing down-sampling processing on the audio stream according to the down-sampling rate.
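The two sub-steps can be sketched as follows, assuming the audio stream is available as a list of float samples. The function name `downsample`, the averaging of each group of samples, and the restriction to integer factors are illustrative choices, not details given in the text; a production resampler would normally apply a proper low-pass filter first.

```python
from typing import List


def downsample(samples: List[float], source_rate: int, target_rate: int) -> List[float]:
    """Reduce the sample rate by averaging each group of `factor` samples."""
    if target_rate <= 0 or source_rate % target_rate != 0:
        raise ValueError("this sketch only handles integer down-sampling factors")
    factor = source_rate // target_rate  # sub-step 1: determine the down-sampling rate
    out = []
    # Sub-step 2: replace every complete group of `factor` samples by its mean.
    for start in range(0, len(samples) - factor + 1, factor):
        group = samples[start:start + factor]
        out.append(sum(group) / factor)
    return out


reduced = downsample([0.0, 2.0, 4.0, 6.0, 8.0, 10.0], source_rate=48000, target_rate=16000)
```

With a factor of 3, the six input samples become two output samples, so the data sent to the server shrinks by the same factor.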

In one example, the player comprises a browser player, and the compression processing on the audio stream may include the following sub-steps: creating an audio input node according to the audio stream; creating an audio handler for the audio stream according to the audio input node; and performing compression processing on the audio stream through the audio handler.

In one example, the extracting of the audio stream corresponding to the playing progress may include the following sub-step: extracting the audio stream that is about to be played. After that audio stream has been sent to the server, it is played through the player, so that its speech translation text is displayed while it is being played.

In one example, the method may further comprise the step of: sending target language information to the server, so that the server translates the audio stream into text in the target language.

Third embodiment

In the foregoing embodiment, a multimedia file playing method is provided, and correspondingly, a multimedia file playing apparatus is also provided in the present application. The apparatus corresponds to an embodiment of the method described above.

Parts of this embodiment that are the same as the first embodiment are not described again; please refer to the corresponding parts of the first embodiment. The multimedia file playing device provided by the present application includes:

the audio stream sending unit is used for sending the audio stream to a server;

Fourth embodiment

The application also provides an electronic device. Since the device embodiment is substantially similar to the method embodiment, it is described relatively briefly; for relevant points, refer to the corresponding description of the method embodiment. The device embodiment described below is merely illustrative.

The electronic device of this embodiment includes a processor and a memory. The memory is used for storing a program implementing the multimedia file playing method; after the device is powered on and runs the program through the processor, it executes the following steps: extracting, for the multimedia file currently played by the player, an audio stream corresponding to the playing progress; sending the audio stream to a server; and displaying, in the player, the speech translation text of the audio stream returned by the server.

Fifth embodiment

Corresponding to the multimedia file playing system, the present application also provides a multimedia file playing method. The executing entity of the method includes, but is not limited to, a server, and may also be any device capable of implementing the method. Parts of this embodiment that are the same as the first embodiment are not described again; please refer to the corresponding parts of the first embodiment.

In this embodiment, the method includes the steps of:

Step 1: receiving an audio stream, sent by a client, that corresponds to the playing progress of the currently played multimedia file;

Step 2: determining a speech translation text of the audio stream through a speech translation model;

Step 3: returning the speech translation text to the client, so that the client displays the speech translation text when playing the audio stream.
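The server-side steps can be sketched as a single request handler. Everything named here is hypothetical: the request fields `audio`, `progress_ms` and `target_language`, the default target language, and the `stub_model` callable that stands in for the speech translation model are assumptions for illustration, not an interface defined by the application.

```python
from typing import Any, Callable, Dict


def handle_audio_request(
    request: Dict[str, Any],
    speech_translation_model: Callable[[bytes, str], str],
) -> Dict[str, Any]:
    """Translate one audio chunk and echo back its playing progress."""
    audio = request["audio"]                                 # step 1: audio stream received from the client
    target_language = request.get("target_language", "en")  # assumed default when none is sent
    text = speech_translation_model(audio, target_language)  # step 2: determine the translation text
    # Step 3: return the text with the progress so the client can show it
    # when it plays this part of the audio.
    return {"progress_ms": request["progress_ms"], "text": text}


def stub_model(audio: bytes, target_language: str) -> str:
    # Hypothetical stand-in for a real speech translation model.
    return f"[{target_language}] {len(audio)} bytes of speech"


response = handle_audio_request(
    {"audio": b"\x00" * 320, "progress_ms": 1500, "target_language": "zh"},
    stub_model,
)
```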

Sixth embodiment

Seventh embodiment

The application also provides an electronic device embodiment. Since the device embodiment is substantially similar to the method embodiment, it is described relatively briefly; for relevant points, refer to the corresponding description of the method embodiment. The device embodiment described below is merely illustrative.

The electronic device of this embodiment includes a processor and a memory. The memory is used for storing a program implementing the multimedia file playing method; after the device is powered on and runs the program through the processor, it executes the following steps: receiving an audio stream, sent by a client, that corresponds to the playing progress of the currently played multimedia file; determining a speech translation text of the audio stream through a speech translation model; and returning the speech translation text to the client, so that the client displays the speech translation text when playing the audio stream.

Eighth embodiment

Corresponding to the multimedia file playing system described above, the present application also provides a speech translation model quality evaluation system. Parts of this embodiment that are the same as the first embodiment are not described again; please refer to the corresponding parts of the first embodiment.

In practical applications, a common speech translation scenario is machine simultaneous interpretation of live conference content, with the translation result displayed on a large screen for the audience. Machine simultaneous interpretation is mainly achieved by combining real-time speech recognition with machine translation, which together form the real-time speech translation model. To ensure good translation quality, the speech translation model needs to be evaluated before it goes online.

At present, a typical method of evaluating speech translation model quality is to simulate a live conference, collect people's speech in real time, translate it in real time through a real-time speech translation system, and have technicians annotate the quality of the translation results to determine the translation quality of the model.

However, in the process of implementing the invention, the inventor found that this technical scheme has at least the following problem: simulating a live conference and collecting real-time speech data from people consumes considerable human and equipment resources.

Fig. 6 is a diagram of an application scenario of the speech translation model quality evaluation system provided by the present application. In this embodiment, the system includes a speech translation server, multimedia file source servers, and a client. The server and the client can be connected through a network; for example, the client can connect via WiFi or the like. The speech translation server can collect multimedia files from various multimedia source servers and send them to the user's client. The user plays, through the client, a multimedia file provided by the speech translation server; the file has no subtitles. While the user watches the file, the client sends the audio stream corresponding to the current viewing progress to the server, the server determines the target-language text of the audio stream through the speech translation model to be evaluated and returns the text to the client, and the client displays the text. The user can thus see translated subtitles synchronously while watching a file that has no subtitles, evaluate the translation against the sound, and upload the evaluation information to the speech translation server. The speech translation server aggregates the users' translation quality evaluation information over multiple multimedia files, determines the quality of the speech translation model, and decides whether to put the model into use.

Please refer to fig. 7, which is a schematic diagram of device interaction in an embodiment of the speech translation model quality evaluation system of the present application. The server collects a plurality of multimedia files for evaluating the quality of a real-time speech translation model and sends the multimedia files to the client; receives the audio stream, sent by the client, that corresponds to the playing progress of a multimedia file; determines the speech translation text of the audio stream through the translation model and returns it to the client; receives the speech translation quality information, sent by the client, that corresponds to the multimedia file; and determines the quality information of the translation model according to the quality information of the plurality of multimedia files. The client plays the multimedia file through a browser and extracts the audio stream, displays the speech translation text in a player, and determines the speech translation quality information according to the speech translation text.

Table 1 shows the model evaluation data of the present example.

Multimedia file identification	Speech translation quality information
File A	60
File B	80
File C	76
…

Table 1. Model evaluation data

The server stores the speech translation quality information of each multimedia file. The average of the quality scores can be used as the model score, and whether the model is put into use is decided according to that score.
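Using the three scores listed in Table 1, the averaging can be sketched as below. The release threshold of 70 is a purely hypothetical value chosen for the example; the text does not state what score is required.

```python
from typing import Dict


def model_score(file_scores: Dict[str, float]) -> float:
    """Average the per-file speech translation quality scores."""
    return sum(file_scores.values()) / len(file_scores)


scores = {"File A": 60, "File B": 80, "File C": 76}  # the rows shown in Table 1
score = model_score(scores)

RELEASE_THRESHOLD = 70           # assumption: not specified in the text
put_into_use = score >= RELEASE_THRESHOLD
```

For these three files the model score is (60 + 80 + 76) / 3 = 72.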

In one example, the speech translation server fetches multimedia files of different languages from multimedia source servers of different languages to evaluate speech translation models of the various languages.

As can be seen from the foregoing embodiment, in the speech translation model quality evaluation system provided by the embodiments of the present application, the server collects a plurality of multimedia files for evaluating the quality of a real-time speech translation model and sends the multimedia files to the client; receives the audio stream, sent by the client, that corresponds to the playing progress of a multimedia file; determines the speech translation text of the audio stream through the translation model and returns it to the client; receives the speech translation quality information, sent by the client, that corresponds to the multimedia file; and determines the quality information of the translation model according to the quality information of the plurality of multimedia files. The client plays the multimedia file through a browser and extracts the audio stream, displays the speech translation text in a player, and determines the speech translation quality information according to the speech translation text. With this processing mode, multimedia files with rich content can easily be collected and used as model evaluation data, and no dedicated personnel are needed to produce real-time speech data in a simulated conference. The efficiency of model evaluation can therefore be effectively improved, the cost of model evaluation reduced, and the time needed to bring the model online shortened.

Ninth embodiment

Corresponding to the above speech translation model quality evaluation system, the present application also provides a speech translation model quality evaluation method. The executing entity of the method includes, but is not limited to, a client. Parts of this embodiment that are the same as the eighth embodiment are not described again; please refer to the corresponding parts of the eighth embodiment.

The quality evaluation method of the speech translation model provided by the application can comprise the following steps:

Step 1: playing, through a browser, a multimedia file for evaluating the quality of the real-time speech translation model;

Step 2: extracting an audio stream corresponding to the playing progress of the multimedia file, and sending the audio stream to a server;

Step 3: displaying, in a player, the speech translation text of the audio stream returned by the server;

Step 4: determining speech translation quality information corresponding to the multimedia file according to the speech translation text, and sending the quality information to the server.

Tenth embodiment

In the foregoing embodiment, a speech translation model quality evaluation method is provided; correspondingly, the present application also provides a speech translation model quality evaluation device. The device corresponds to the embodiment of the method described above. Parts of this embodiment that are the same as the first embodiment are not described again; please refer to the corresponding parts of the first embodiment.

The speech translation model quality evaluation device provided by the present application includes:

the playing unit is used for playing a multimedia file for evaluating the quality of the real-time speech translation model through a browser;

the extraction unit is used for extracting the audio stream corresponding to the playing progress of the multimedia file and sending the audio stream to a server;

the display unit is used for displaying the voice translation text of the audio stream returned by the server side in the player;

and the determining unit is used for determining the voice translation quality information corresponding to the multimedia file according to the voice translation text and sending the quality information to a server.

Eleventh embodiment

The electronic device of this embodiment includes a processor and a memory. The memory is used for storing a program implementing the speech translation model quality evaluation method; after the device is powered on and runs the program through the processor, it executes the following steps: playing, through a browser, a multimedia file for evaluating the quality of the real-time speech translation model; extracting an audio stream corresponding to the playing progress of the multimedia file, and sending the audio stream to a server; displaying, in a player, the speech translation text of the audio stream returned by the server; and determining speech translation quality information corresponding to the multimedia file according to the speech translation text, and sending the quality information to the server.

Twelfth embodiment

Corresponding to the above speech translation model quality evaluation system, the present application also provides a speech translation model quality evaluation method. The executing entity of the method includes, but is not limited to, a server. Parts of this embodiment that are the same as the eighth embodiment are not described again; please refer to the corresponding parts of the eighth embodiment.

The speech translation model quality evaluation method provided by the present application can comprise the following steps:

Step 1: collecting a plurality of multimedia files for evaluating the quality of a real-time speech translation model, and sending the multimedia files to a client;

Step 2: receiving an audio stream, sent by the client, that corresponds to the playing progress of a multimedia file;

Step 3: determining a speech translation text of the audio stream through the translation model, and returning the speech translation text to the client;

Step 4: receiving speech translation quality information, sent by the client, that corresponds to the multimedia file;

Step 5: determining quality information of the translation model according to the quality information of the plurality of multimedia files.

Thirteenth embodiment

the collecting unit is used for collecting a plurality of multimedia files for evaluating the quality of the real-time voice translation model and sending the multimedia files to the client;

the first receiving unit is used for receiving an audio stream which is sent by a client and corresponds to the playing progress of the multimedia file;

the translation unit is used for determining a voice translation text of the audio stream through the translation model and returning the voice translation text to the client;

the second receiving unit is used for receiving the voice translation quality information which is sent by the client and corresponds to the multimedia file;

a determining unit for determining quality information of the translation model according to the quality information of the plurality of multimedia files.

Fourteenth embodiment

The electronic device of this embodiment includes a processor and a memory. The memory is used for storing a program implementing the speech translation model quality evaluation method; after the device is powered on and runs the program through the processor, it executes the following steps: collecting a plurality of multimedia files for evaluating the quality of a real-time speech translation model, and sending the multimedia files to a client; receiving an audio stream, sent by the client, that corresponds to the playing progress of a multimedia file; determining a speech translation text of the audio stream through the translation model, and returning the speech translation text to the client; receiving speech translation quality information, sent by the client, that corresponds to the multimedia file; and determining quality information of the translation model according to the quality information of the plurality of multimedia files.

Fifteenth embodiment

Corresponding to the multimedia file playing system, the present application also provides a multimedia file playing control method. The executing entity of the method includes, but is not limited to, a client. Parts of this embodiment that are the same as the first embodiment are not described again; please refer to the corresponding parts of the first embodiment.

The multimedia file playing control method provided by the present application can comprise the following steps:

Step 1: extracting, for the multimedia file currently played by the player, an audio stream corresponding to the playing progress, and sending the audio stream to the server.

In this embodiment, one reason a user watches and listens to a multimedia file may be to learn the original (source) language, for example to practise English listening by watching an English lecture video. In this case, the "what you hear is what you see" real-time subtitle effect may not be wanted, because displaying subtitles immediately would interfere with the listening practice. To meet this need, the method provided by this embodiment can control the display of the translated-text subtitles (the speech translation text) and display them with a delay, thereby helping the user learn the language.

The server side can determine the voice translation text through a voice translation model and send the voice translation text back to the client side.

Step 2: determining display delay duration information of the speech translation text.

The display delay duration is the time difference between the second time, at which the translated-text subtitle is displayed, and the first time, at which the audio stream is played. For example, if the display delay duration is 0.1 second, the corresponding translated-text subtitle is displayed 0.1 second after the audio stream is played.

In one example, the display delay duration may be determined as follows: determining the duration information according to the hearing level information of the user.

For example, if the hearing level of the user is Cambridge English FCE and the hearing level of the multimedia file is Cambridge English KET, the user's hearing level exceeds that of the file, and the display delay duration may be set shorter, such as 1 millisecond, achieving the "what you hear is what you see" real-time subtitle effect so that the user can quickly check whether they understood what they heard. Alternatively, no display delay duration is set and the translated-text subtitles are not displayed at all, so that the user is not disturbed by subtitles.

For another example, if the hearing level of the user is Cambridge English KET and the hearing level of the multimedia file is Cambridge English FCE, the user's hearing level does not reach that of the file, and the display delay duration may be set somewhat longer, for example 5 seconds or 10 seconds, producing a more noticeable delayed-subtitle effect that gives the user enough time to work out the meaning of what was heard. Alternatively, the display delay duration can be set short, so that the user can learn quickly and learning efficiency is improved.

The information used to set the delay can be preset, for example a delay of 10 seconds; it can be entered by the user during viewing, for example 5 seconds; or the device can adjust it automatically according to how the user reads along, for example a slow read-along speed indicates that the user's hearing level is somewhat weak.
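The two examples can be condensed into a small sketch. The level ordering KET < PET < FCE follows the Cambridge English exam scale; the delay values 0.001 s and 5 s are taken from the examples in the text, while the 1-second delay used when the two levels are equal is an assumption, since the text does not cover that case. The function names are hypothetical.

```python
LEVELS = ["KET", "PET", "FCE"]  # ascending difficulty


def display_delay_seconds(user_level: str, file_level: str) -> float:
    """Choose a subtitle display delay from the user's and the file's levels."""
    user_rank = LEVELS.index(user_level)
    file_rank = LEVELS.index(file_level)
    if user_rank > file_rank:
        return 0.001   # user is above the file's level: near real-time subtitles
    if user_rank < file_rank:
        return 5.0     # user is below the file's level: clearly delayed subtitles
    return 1.0         # assumption: equal levels are not covered by the text


def subtitle_display_time(play_time_s: float, delay_s: float) -> float:
    """Second time (subtitle shown) = first time (audio played) + display delay."""
    return play_time_s + delay_s


delay = display_delay_seconds("KET", "FCE")
show_at = subtitle_display_time(12.0, delay)
```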

Step 3: displaying, in the player, the speech translation text of the audio stream returned by the server according to the duration information.

In one example, the audio may be played for a period and then paused for a period, and the translated-text subtitles of the audio already played may be displayed while playback is paused. That is, playback can stop for a while after each stretch, for example pausing for 10 seconds after every 1 minute of playback. This gives the user time to think and also avoids displaying the translation of earlier audio while later audio is playing, which would interfere with learning.

In one example, the method may further comprise the steps of: if the listening difficulty of the speech exceeds the hearing level of the user, pausing the playing of the multimedia file and replaying the file segment already played; and adjusting the duration information according to the number of replays. With this processing, the user can repeatedly watch original-language passages whose difficulty exceeds their level, which can effectively improve the learning effect.

The number of replays may be fixed or adjusted in real time. For example, the device may adjust it automatically according to how the user reads along: if the user reads along slowly, the number of replays is increased, and if the user reads along fluently, playback can continue with the next segment.

In specific implementation, the more times a segment has been replayed, the shorter the display delay duration can be; for example, the display delay duration on the 10th play may be shorter than on the 5th play. The opposite rule may also be used; the specific rule can be determined according to actual requirements.
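One possible rule of this kind is sketched below. Dividing a base delay by the play number is only an assumed formula chosen because it shrinks monotonically; the text leaves the actual rule open, and the base delay of 10 seconds is likewise hypothetical.

```python
def delay_for_play(base_delay_s: float, play_number: int) -> float:
    """Return a display delay that shrinks as the same segment is replayed.

    Assumption: delay = base delay / play number (1-based).
    """
    if play_number < 1:
        raise ValueError("play_number starts at 1")
    return base_delay_s / play_number


fifth = delay_for_play(10.0, 5)
tenth = delay_for_play(10.0, 10)
```

Under this rule the delay on the 10th play (1 second) is shorter than on the 5th play (2 seconds), matching the example in the text.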

In one example, the method may further comprise the step of: if the original text of the audio stream includes a word that is not in the user's source-language word list, replaying the audio stream. For example, if the user's source-language word list does not include the word "professional", playback of the subsequent audio can be paused at this word and the word replayed repeatedly. In specific implementation, the translation of "professional" can be displayed synchronously when replay begins, then displayed with increasing delay as the number of replays grows, until it is no longer displayed. This processing mode helps the user study key words repeatedly, and can therefore effectively improve the learning effect.
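The word-list check can be sketched as follows, assuming the original text of the audio stream is available as a transcript string and the user's word list as a set of lowercase words. The function name `unknown_words` and the simple regular-expression tokenizer are illustrative assumptions.

```python
import re
from typing import List, Set


def unknown_words(transcript: str, known_words: Set[str]) -> List[str]:
    """Return words of the transcript missing from the user's word list, in order."""
    found: List[str] = []
    for word in re.findall(r"[a-z']+", transcript.lower()):
        if word not in known_words and word not in found:
            found.append(word)  # each unknown word is reported once
    return found


to_replay = unknown_words(
    "She is a professional pilot.",
    {"she", "is", "a", "pilot"},
)
```

A non-empty result would then trigger replay of the audio containing those words.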

In one example, the replaying of the audio stream may include the following steps: determining read-along duration information; and determining the interval between two adjacent plays of the audio stream according to the read-along duration information. For example, the user's read-along can be captured through a camera and a microphone to determine the read-along duration. Generally, a longer read-along duration indicates that the user has not yet mastered the content, so a longer interval is used to give the user sufficient time to read along.

In one example, the method may further comprise the steps of: collecting the user's read-along speech data, for example through a microphone; determining a read-along score from the read-along speech data, where a shorter read-along duration and a read-along waveform closer to the standard waveform give a higher score; and determining the number of times the audio stream is replayed according to the read-along score. For example, the higher the read-along score, the fewer the replays. This processing mode helps the user study key passages repeatedly, and can therefore effectively improve the learning effect.
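The relation between the score and the number of replays can be sketched as below. How the score itself is computed is left out; the 0 to 100 score range, the linear mapping, and the hypothetical `max_repeats` of 5 are assumptions made for the example, the text only requiring that a higher score lead to fewer replays.

```python
import math


def repeat_count(score: float, max_repeats: int = 5) -> int:
    """Map a read-along score in [0, 100] to a number of replays."""
    score = min(max(score, 0.0), 100.0)  # clamp out-of-range scores
    # Higher score, fewer replays; round up so any shortfall earns at least one replay.
    return math.ceil(max_repeats * (100.0 - score) / 100.0)


perfect = repeat_count(100.0)
middling = repeat_count(60.0)
poor = repeat_count(0.0)
```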

In one example, the method may further comprise the steps of: extracting a file segment of the multimedia file whose listening difficulty exceeds the hearing level of the user; and storing the file segment so that it can be replayed. With this processing mode, the user can listen again to unfamiliar audio passages at any time, which provides a repeat function and can effectively improve the language learning effect.

As can be seen from the foregoing embodiment, the multimedia file playing control method provided by the embodiments of the present application extracts, for the multimedia file currently played by the player, an audio stream corresponding to the playing progress and sends the audio stream to the server; determines display delay duration information of the speech translation text; and displays, in the player, the speech translation text of the audio stream returned by the server according to the duration information. This processing mode can control the display of the translated-text subtitles: it can display them with a delay, and it can also achieve the "what you hear is what you see" subtitle effect. It can therefore effectively improve the user experience, meet the user's language learning needs, and improve the user's language learning effect.

Sixteenth embodiment

Corresponding to the multimedia file playing system described above, the present application also provides another multimedia file playing system. Parts of this embodiment that are the same as the first embodiment are not described again; please refer to the corresponding parts of the first embodiment.

The multimedia file playing system provided by the application can comprise: a server and a client. The client is used for extracting an audio stream corresponding to the playing progress aiming at the multimedia file currently played by the player; sending the audio stream to a server; playing the voice data of the target language of the audio stream returned by the server side in the player; the server is used for determining the voice translation text through a voice translation model; determining voice data of the target language through a voice synthesis model; and returning the voice data of the target language to the client.

For example, the user is playing an English speech video (the original language is English) and wants to hear the speech in a language other than their native language; for instance, the user's native language is Chinese and they want to hear the speech in German. The user can then learn German from the content of the English speech video, or learn English and German together.

For another example, the user is playing a newly released English movie, the user's native language is Chinese, and the user wants to hear the movie's dialogue in Chinese, so that the movie can be watched more comfortably.

In specific implementation, the speech synthesis model can use mature existing technology to convert the translated text into speech, so it is not described further here.
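The server-side chain of this embodiment, translation followed by synthesis, can be sketched as follows. Both models are passed in as callables and replaced by stubs here; their signatures and the function name `translate_to_speech` are hypothetical and not defined by the application.

```python
from typing import Callable


def translate_to_speech(
    audio: bytes,
    target_language: str,
    speech_translation_model: Callable[[bytes, str], str],
    speech_synthesis_model: Callable[[str, str], bytes],
) -> bytes:
    """Turn source-language audio into target-language audio."""
    text = speech_translation_model(audio, target_language)  # speech translation text
    return speech_synthesis_model(text, target_language)     # speech data of the target language


def stub_translate(audio: bytes, target_language: str) -> str:
    # Hypothetical stand-in for the speech translation model.
    return f"{target_language}:{len(audio)}"


def stub_synthesize(text: str, target_language: str) -> bytes:
    # Hypothetical stand-in for the speech synthesis model.
    return text.encode("utf-8")


speech = translate_to_speech(b"\x00" * 4, "de", stub_translate, stub_synthesize)
```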

As can be seen from the foregoing embodiment, in the multimedia file playing system provided by the embodiments of the present application, the client extracts, for the multimedia file currently played by the player, an audio stream corresponding to the playing progress, sends the audio stream to the server, and plays in the player the target-language speech data of the audio stream returned by the server; the server determines the speech translation text through a speech translation model, determines the target-language speech data through a speech synthesis model, and returns the target-language speech data to the client. This processing mode converts speech in the source language into speech in the target language and plays it to the user. It can therefore effectively meet the user's listening needs and improve the user experience.

Although the present application has been described with reference to the preferred embodiments, it is not intended to limit the present application, and those skilled in the art can make variations and modifications without departing from the spirit and scope of the present application, therefore, the scope of the present application should be determined by the claims that follow.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

1. Computer-readable media, including permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media does not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

2. As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Claims

1. A multimedia file playing system, comprising:

2. A method for playing a multimedia file, comprising:

sending the audio stream to a server;

3. The method of claim 2,

the player comprises a browser player;

4. The method of claim 2,

the audio stream comprises an audio stream of millisecond duration.

5. The method of claim 2,

the method further comprises the following steps:

performing compression processing on the audio stream;

the sending the audio stream to a server includes:

and sending the compressed audio stream to the server.

6. The method of claim 5, wherein the performing the compression process on the audio stream is performed by at least one of:

performing a down-sampling process on the audio stream;

7. The method of claim 6, wherein the performing down-sampling processing on the audio stream comprises:

determining a down-sampling rate;

8. The method of claim 5,

the player comprises a browser player;

the performing compression processing on the audio stream includes:

creating an audio input node according to the audio stream;

9. The method of claim 2,

extracting an audio stream to be played;

10. The method of claim 2, further comprising:

11. A method for playing a multimedia file, comprising:

12. A multimedia file playback apparatus, comprising:

the audio stream sending unit is used for sending the audio stream to a server;

13. An electronic device, comprising:

a processor; and

14. A multimedia file playback apparatus, comprising:

15. An electronic device, comprising:

a processor; and

16. A speech translation model quality assessment system, comprising:

17. A speech translation model quality assessment method, comprising:

18. A speech translation model quality assessment method, comprising:

19. A multimedia file playing control method, comprising:

20. The method of claim 19, wherein determining the display delay duration information of the speech translation text comprises:

21. The method of claim 19, further comprising:

and adjusting the duration information according to the repeated playing times.

22. The method of claim 19, further comprising:

23. The method of claim 22, wherein said repeatedly playing the audio stream comprises:

determining follow-along reading duration information;

24. The method of claim 19, further comprising:

collecting follow-along reading voice data of a user;

25. The method of claim 19, further comprising:

and storing the file segments so as to repeatedly play the file segments.

26. A multimedia file playing system, comprising:

27. A method for playing a multimedia file, comprising:

sending the audio stream to a server;

28. A method for playing a multimedia file, comprising:

determining the voice translation text through a voice translation model;

determining voice data of the target language through a voice synthesis model;

and returning the voice data of the target language to the client.
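
By way of non-limiting illustration, the down-sampling named in claims 6 and 7 (determining a down-sampling rate, then reducing the audio stream before it is sent to the server) might be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the function name `downsample`, its parameters, the mono `Float32Array` input, and the window-averaging strategy are all hypothetical and do not appear in the application.

```typescript
// Hypothetical helper; no name or parameter here comes from the application.
// Reduces the sample rate of mono PCM audio by averaging each window of
// input samples that maps onto one output sample.
function downsample(
  input: Float32Array,
  sourceRate: number,
  targetRate: number,
): Float32Array {
  if (sourceRate <= 0 || targetRate <= 0) {
    throw new RangeError("sample rates must be positive");
  }
  if (targetRate >= sourceRate) {
    // Nothing to reduce; return a copy so the caller may mutate it safely.
    return input.slice();
  }

  // Assumed meaning of "down-sampling rate": input samples per output sample.
  const ratio = sourceRate / targetRate;
  const outLength = Math.floor(input.length / ratio);
  const output = new Float32Array(outLength);

  for (let i = 0; i < outLength; i++) {
    const start = Math.floor(i * ratio);
    const end = Math.min(input.length, Math.floor((i + 1) * ratio));
    let sum = 0;
    for (let j = start; j < end; j++) {
      sum += input[j];
    }
    // ratio > 1 guarantees end > start, so the window is never empty.
    output[i] = sum / (end - start);
  }
  return output;
}

// Usage: halve the rate of an eight-sample buffer (48 kHz to 24 kHz).
const reduced = downsample(
  new Float32Array([0, 1, 2, 3, 4, 5, 6, 7]),
  48000,
  24000,
);
console.log(Array.from(reduced)); // [0.5, 2.5, 4.5, 6.5]
```

Window averaging serves here only as a crude low-pass step. A production client would more likely apply a proper anti-aliasing filter, and the target rate would be chosen to match whatever the speech translation service accepts.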