CN112118309A - Audio translation method and system - Google Patents

Audio translation method and system

Info

Publication number
CN112118309A
CN112118309A (application CN202010972182.4A)
Authority
CN
China
Prior art keywords
edge node
translation
translated
node
audio stream
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010972182.4A
Other languages
Chinese (zh)
Inventor
高伟 (Gao Wei)
张勇 (Zhang Yong)
孙晔 (Sun Ye)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Samsung Electronics China R&D Center
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics China R&D Center
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Samsung Electronics China R&D Center, Samsung Electronics Co Ltd filed Critical Samsung Electronics China R&D Center
Priority to CN202010972182.4A
Publication of CN112118309A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network
    • H04L 67/1001 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L 67/1004 Server selection for load balancing
    • H04L 67/1008 Server selection for load balancing based on parameters of servers, e.g. available memory or workload
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

A terminal device sends a translation service request to a central node, and the central node sends feature information of at least one candidate edge node to the terminal device according to the translation service request. The terminal device determines a target edge node based on the feature information of each candidate edge node, and then sends translation configuration information and the audio stream to be translated to the target edge node. The target edge node translates the audio stream to be translated based on the translation configuration information to obtain a translation result, and sends the translation result to the terminal device. Because the central node is only responsible for assigning edge nodes to terminal devices and does not itself provide the translation service, the load of the central node can be effectively reduced. Each edge node needs to provide the translation service for only a small number of terminal devices, which effectively prevents the load of any edge node from becoming too high and allows each edge node to provide the translation service in real time.

Description

Audio translation method and system
Technical Field
The present application relates to speech translation technology, and in particular, to an audio translation method and system.
Background
Existing video programs are increasingly rich in both variety and quantity; however, most are shot and broadcast in the local mainstream language, and even multilingual programs generally support only a few widely used languages. On the one hand, for video distributors, the language differences between countries and regions are enormous, and because of cost and technical limitations, dedicated translation cannot be produced for every language, so video versions can be provided in only a few common languages; on the other hand, because of bandwidth limitations, only a few such versions can be selected for playback.
Because of these limitations, video distributors cannot deliver their programs to a wider market, so the viewing audience of a video program is limited, and viewers cannot understand programs in other languages because of the limits of their language abilities. Existing techniques that provide translation through a conventional cloud server cannot provide real-time translation for users because of shortcomings in the server's performance and responsiveness; moreover, when the cloud server receives a large number of translation service requests, it is easily subjected to a huge impact, and its load becomes too high.
Disclosure of Invention
Exemplary embodiments of the present disclosure may address at least the above-mentioned problems.
In one aspect, there is provided an audio translation method performed by a terminal device, including: sending a translation service request to the central node; receiving feature information of at least one candidate edge node sent by the central node for the translation service request; determining a target edge node among the at least one candidate edge node based on the feature information of the at least one candidate edge node; transmitting translation configuration information and an audio stream to be translated to the determined target edge node; and receiving a translation result of the audio stream to be translated, which is sent by the target edge node and is obtained based on the translation configuration information.
Optionally, the method further comprises: and when the audio stream to be translated is played, synchronously displaying the translation result with the audio stream to be translated.
Optionally, the at least one candidate edge node and the terminal device belong to the same preset area; or the at least one candidate edge node and the terminal device belong to the same preset area, and the load of the at least one candidate edge node meets a preset load condition.
Optionally, the feature information includes performance information capable of reflecting performance of the candidate edge node; the step of determining a target edge node among the at least one candidate edge node based on the feature information of the at least one candidate edge node comprises: and determining a candidate edge node with the optimal performance from the at least one candidate edge node based on the performance information of the at least one candidate edge node, and taking the candidate edge node with the optimal performance as the target edge node.
Optionally, the performance information includes translation capability, load and network delay of the candidate edge node; the step of determining a candidate edge node with the best performance among the at least one candidate edge node based on the performance information of the at least one candidate edge node comprises: scoring the performance of each candidate edge node based on at least one of translation capability, load, and network delay of each candidate edge node; and taking the candidate edge node with the highest performance score as the candidate edge node with the optimal performance.
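As a concrete illustration of the scoring step, the sketch below ranks candidate edge nodes by a weighted combination of translation accuracy, load, and network delay. The field names, weights, and normalization are illustrative assumptions; the disclosure does not fix a particular scoring formula.

```python
# Hypothetical sketch: score candidate edge nodes by translation
# capability, load, and network delay, then pick the highest scorer.
# All field names and weights below are illustrative assumptions.

def score_node(node, w_capability=0.5, w_load=0.3, w_delay=0.2):
    """Higher is better: reward capability, penalize load and delay."""
    capability = node["translation_accuracy"]            # 0.0 .. 1.0
    load = node["load"]                                  # 0.0 .. 1.0 (fraction busy)
    delay = min(node["network_delay_ms"] / 1000.0, 1.0)  # clamp to 1 s
    return w_capability * capability + w_load * (1 - load) + w_delay * (1 - delay)

def pick_target_edge_node(candidates):
    """Return the candidate edge node with the highest performance score."""
    return max(candidates, key=score_node)

candidates = [
    {"id": "edge-a", "translation_accuracy": 0.92, "load": 0.70, "network_delay_ms": 40},
    {"id": "edge-b", "translation_accuracy": 0.90, "load": 0.20, "network_delay_ms": 35},
]
best = pick_target_edge_node(candidates)  # edge-b: lighter load outweighs accuracy gap
```

Note that when only one of the three metrics is used, the same `max` call applies with a single-term score, matching the "at least one of" language above.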
Optionally, the step of sending the translation configuration information and the audio stream to be translated to the target edge node includes: separating the audio stream to be translated from the audio and video data to be played, and sending the audio stream to be translated to the target edge node, wherein when the audio stream to be translated is an encrypted audio stream, the audio stream to be translated is decrypted before the audio stream to be translated is sent.
Optionally, the step of sending the separated audio stream to the target edge node includes: periodically separating the audio stream to be translated from the audio and video data to be played; storing the separated audio streams in a data queue in order; and when the amount of data in the data queue reaches a preset data threshold, sending all the audio streams to be translated in the data queue to the target edge node together.
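The queue-and-threshold batching described above might be sketched as follows; the chunk sizes, the byte threshold, and the `send_fn` callback are illustrative assumptions rather than part of the disclosure.

```python
from collections import deque

class AudioBatcher:
    """Accumulate periodically separated audio chunks and flush them
    to the target edge node once a preset data threshold is reached."""

    def __init__(self, send_fn, threshold_bytes=64 * 1024):
        self.send_fn = send_fn            # hypothetical sender (socket, HTTP, ...)
        self.threshold = threshold_bytes  # preset data threshold
        self.queue = deque()
        self.queued_bytes = 0

    def push(self, chunk: bytes):
        """Store one separated audio chunk in order; flush when full."""
        self.queue.append(chunk)
        self.queued_bytes += len(chunk)
        if self.queued_bytes >= self.threshold:
            self.flush()

    def flush(self):
        """Send all queued audio to the target edge node together."""
        if self.queue:
            self.send_fn(b"".join(self.queue))
            self.queue.clear()
            self.queued_bytes = 0

sent = []
batcher = AudioBatcher(send_fn=sent.append, threshold_bytes=10)
batcher.push(b"12345")   # below threshold, stays queued
batcher.push(b"67890")   # reaches threshold, flushed as one payload
```

Batching like this trades a small amount of latency for fewer, larger network transfers to the edge node.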
In another aspect, there is provided an audio translation method performed by a central node, including: receiving a translation service request sent by a terminal device; determining at least one candidate edge node among a plurality of edge nodes for the translation service request; and sending feature information of the at least one candidate edge node to the terminal device, wherein the terminal device determines one of the at least one candidate edge node as a target edge node, and the target edge node is configured to translate an audio stream to be translated, which is sent by the terminal device, based on translation configuration information sent by the terminal device, to obtain a translation result.
Optionally, the step of determining at least one candidate edge node among a plurality of edge nodes for the translation service request includes: determining, among the plurality of edge nodes, edge nodes in the same preset area as the terminal device, and taking at least one determined edge node as a candidate edge node.
Optionally, the step of determining at least one candidate edge node among a plurality of edge nodes for the translation service request includes: determining, among the plurality of edge nodes, edge nodes in the same preset area as the terminal device; determining, based on the loads of the determined edge nodes, at least one edge node whose load meets a preset load condition; and taking the at least one edge node whose load meets the preset load condition as the candidate edge node.
Optionally, the method further includes: when the target edge node lacks a speech translation model that can match the requirements of the translation configuration information sent by the terminal device, providing the speech translation model to the target edge node.
In another aspect, there is provided an audio translation method performed by a target edge node, including: receiving translation configuration information and an audio stream to be translated, which are sent by a terminal device; translating the audio stream to be translated based on the translation configuration information to obtain a translation result; and sending the translation result of the audio stream to be translated to the terminal device, wherein the target edge node is determined by the terminal device among at least one candidate edge node based on feature information of the at least one candidate edge node, and the at least one candidate edge node is determined by a central node among a plurality of edge nodes for a translation service request sent by the terminal device.
Optionally, the step of translating the audio stream to be translated based on the translation configuration information to obtain a translation result includes: acquiring a speech translation model capable of matching the requirements of the translation configuration information; and translating the audio stream to be translated by using the acquired speech translation model based on the translation configuration information to obtain a translation result.
Optionally, the step of acquiring a speech translation model capable of matching the requirements of the translation configuration information includes: determining, based on the translation configuration information, a speech translation model capable of matching the requirements of the translation configuration information among preset speech translation models; or, when such a speech translation model is absent, downloading, based on the translation configuration information, a speech translation model capable of matching the requirements of the translation configuration information from the central node.
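A minimal sketch of this model lookup, assuming models are keyed by a (source language, target language) pair and that the central node exposes some download mechanism; both the key format and the download callback are assumptions for illustration:

```python
# Hypothetical sketch: use a locally preset speech translation model when
# one matches the translation configuration, otherwise download a matching
# model from the central node and cache it for later requests.

def get_speech_model(config, local_models, download_from_central):
    """Return a model matching (source_lang, target_lang) in `config`."""
    key = (config["source_lang"], config["target_lang"])
    if key in local_models:                # a preset model matches
        return local_models[key]
    model = download_from_central(key)     # fall back to the central node
    local_models[key] = model              # cache so the download happens once
    return model

local = {("en", "zh"): "model-en-zh"}      # preset models on this edge node
fetched = []

def fake_download(key):
    """Stand-in for the central node's model download."""
    fetched.append(key)
    return f"model-{key[0]}-{key[1]}"

m1 = get_speech_model({"source_lang": "en", "target_lang": "zh"}, local, fake_download)
m2 = get_speech_model({"source_lang": "ja", "target_lang": "zh"}, local, fake_download)
```

Caching the downloaded model is what lets the preset model set on each edge node track the languages actually requested in its area.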
Optionally, the type of the preset speech translation model in each edge node is related to the area where the edge node is located.
In another aspect, an audio translation system is provided, the audio translation system comprising a terminal device, a center node, and an edge node;
the terminal device is configured to execute the above-described audio translation method executed by the terminal device;
the central node is configured to perform the audio translation method described above as being performed by the central node;
the edge node is configured to perform the audio translation method described above as being performed by the edge node.
In another aspect, an audio translation apparatus preset in a terminal device is provided, which includes a request sending module, an information receiving module, a target node determining module, a data sending module and a result receiving module;
the request sending module is configured to: sending a translation service request to the central node;
the information receiving module is configured to: receiving feature information of at least one candidate edge node sent by the central node for the translation service request;
the target node determination module is configured to: determining a target edge node among the at least one candidate edge node based on the feature information of the at least one candidate edge node;
the data transmission module is configured to: transmitting translation configuration information and an audio stream to be translated to the determined target edge node;
the result receiving module is configured to: and receiving a translation result of the audio stream to be translated, which is sent by the target edge node and is obtained based on the translation configuration information.
In another aspect, an audio translation apparatus is provided, which includes a request receiving module, a candidate node determining module, and a feature information transmitting module;
the request receiving module is configured to: receiving a translation service request sent by a terminal device;
the candidate node determination module is configured to: determining at least one candidate edge node among a plurality of edge nodes for the translation service request;
the feature information sending module is configured to: sending feature information of the at least one candidate edge node to the terminal device;
wherein the terminal device determines one of the at least one candidate edge node as a target edge node, and the target edge node is configured to translate an audio stream to be translated, which is sent by the terminal device, based on translation configuration information sent by the terminal device, to obtain a translation result.
In another aspect, an audio translation apparatus is provided, which includes a data receiving module, a translation module, and a result sending module;
the data receiving module is configured to: receiving translation configuration information and an audio stream to be translated, which are sent by terminal equipment;
the translation module is configured to: translating the audio stream to be translated to obtain a translation result based on the translation configuration information;
the result sending module is configured to: sending a translation result of the audio stream to be translated to the terminal device;
the target edge node is determined by the terminal device among the at least one candidate edge node based on the feature information of the at least one candidate edge node, and the at least one candidate edge node is determined by a central node among a plurality of edge nodes for a translation service request sent by the terminal device.
In another aspect, there is provided a computer-readable storage medium storing instructions that, when executed by at least one computing device, cause the at least one computing device to perform the method of any of the above aspects.
In another aspect, a terminal device is provided comprising at least one computing apparatus and at least one storage apparatus storing instructions that, when executed by the at least one computing apparatus, cause the at least one computing apparatus to perform the method described above as being performed by a terminal device.
In another aspect, there is provided a central node comprising at least one computing device and at least one storage device storing instructions, wherein the instructions, when executed by the at least one computing device, cause the at least one computing device to perform the method described above as being performed by the central node.
In another aspect, there is provided an edge node comprising at least one computing device and at least one storage device storing instructions, wherein the instructions, when executed by the at least one computing device, cause the at least one computing device to perform the method described above as being performed by the edge node.
In the audio translation method provided by the exemplary embodiments of the present invention, the central node is only responsible for assigning edge nodes to terminal devices and does not itself provide the translation service, so the load of the central node can be effectively reduced. Each edge node needs to provide the translation service for only a small number of terminal devices, which effectively prevents the load of any edge node from becoming too high and allows each edge node to provide the translation service in real time.
Drawings
These and/or other aspects and advantages of the present invention will become apparent and more readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 shows an architecture diagram of an audio translation system provided by an exemplary embodiment of the present invention.
Fig. 2 shows a flowchart of an audio translation method provided by an exemplary embodiment of the present invention.
Fig. 3 is a block diagram illustrating a first audio translation apparatus according to an exemplary embodiment of the present invention.
Fig. 4 shows a block diagram of a second audio translation apparatus according to an exemplary embodiment of the present invention.
Fig. 5 is a block diagram illustrating a third audio translation apparatus provided in an exemplary embodiment of the present invention.
Detailed Description
The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of embodiments of the invention defined by the claims and their equivalents. Various specific details are included to aid understanding, but these are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness.
In the present disclosure, the expression "at least one of the items" covers three parallel cases: "any one of the items", "any combination of a plurality of the items", and "all of the items". For example, "including at least one of A and B" covers the following three parallel cases: (1) including A; (2) including B; (3) including A and B. For another example, "performing at least one of step one and step two" covers the following three parallel cases: (1) performing step one; (2) performing step two; (3) performing step one and step two.
Fig. 1 shows an architecture diagram of an audio translation system provided by an exemplary embodiment of the present invention.
Referring to fig. 1, an audio translation system includes a center node, an edge node, and a terminal device. One central node can manage a plurality of edge nodes, the edge nodes can provide translation services, and terminal equipment can play video programs for users to watch.
The central node (or edge node) may include one server, a plurality of servers, a cloud computing center, and the like. The terminal device may be an electronic device having a video playing function, for example, the terminal device may include a mobile phone, a smart television, a tablet computer, a personal computer, a wearable display device, and the like.
When a user needs a translation service, the terminal device may send a translation service request to the central node, and the central node sends feature information of at least one candidate edge node to the terminal device for the translation service request. The terminal device determines a target edge node based on the feature information of each candidate edge node, and then sends translation configuration information and the audio stream to be translated to the target edge node. The target edge node translates the audio stream to be translated based on the translation configuration information to obtain a translation result, and then sends the translation result to the terminal device. The terminal device can display the translation result to the user while playing the video, so that the user can understand the content of the video program.
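The interaction just described can be sketched end to end. Everything below, including the function names, message fields, and the simple lowest-load selection rule, is an illustrative assumption, not the patented implementation:

```python
# Illustrative end-to-end flow: terminal -> central node -> edge node.

def central_handle_request(request, edge_nodes):
    """Central node: return feature info of candidate edge nodes only;
    it never performs any translation itself."""
    return [n for n in edge_nodes if n["area"] == request["area"]]

def edge_translate(audio_stream, config):
    """Edge node: translate the audio stream per the configuration
    (a real node would run a speech translation model here)."""
    return {"text": f"[{config['target_lang']}] transcript of {len(audio_stream)} bytes"}

def terminal_session(edge_nodes):
    """Terminal device: request candidates, pick a target, send audio."""
    request = {"area": "east", "type": "translation_service"}
    candidates = central_handle_request(request, edge_nodes)
    target = min(candidates, key=lambda n: n["load"])  # simplest selection rule
    config = {"source_lang": "en", "target_lang": "zh"}
    return edge_translate(b"\x00" * 16, config), target

edge_nodes = [
    {"id": "edge-a", "area": "east", "load": 0.6},
    {"id": "edge-b", "area": "east", "load": 0.2},
    {"id": "edge-c", "area": "west", "load": 0.1},
]
result, target = terminal_session(edge_nodes)
```

The split is the point: the central node only filters and reports candidates, while translation work lands entirely on the chosen edge node.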
According to the audio translation method provided by the exemplary embodiment of the present invention, the central node is only responsible for assigning edge nodes to terminal devices and does not itself provide the translation service, so the load of the central node can be effectively reduced. Each edge node needs to provide the translation service for only a small number of terminal devices, which effectively prevents the load of any edge node from becoming too high and allows each edge node to provide the translation service in real time.
Fig. 2 shows a flowchart of an audio translation method provided by an exemplary embodiment of the present invention. It should be understood that fig. 2 shows the flow for a single terminal device; the flow is the same for any terminal device.
Referring to fig. 2, the terminal device transmits a translation service request to the center node at step S101.
Alternatively, the terminal device may transmit a translation service request to the center node in response to an operation of the user requesting a translation service.
Alternatively, the transmission of the translation service request may be configured as an action the terminal device performs by default in a specified scenario. For example, the terminal device may automatically send a translation service request to the central node upon start-up.
Alternatively, the terminal device may determine whether to send a translation service request to the central node based on the channel or application to which the currently playing video (or the video to be played) belongs. For example, some channels or applications are in languages not commonly used by the user; when the terminal device determines that a video from such a channel or application is currently playing (or is about to be played), it sends a translation service request to the central node.
It should be understood that a language not commonly used by the user is a language the user cannot understand or is unfamiliar with, and which languages count as such may be determined based on the user's selection. Such languages may include the language of a country, the language of an ethnic group, or the dialect of a particular group or region. For example, for a user who understands only Mandarin Chinese, all languages and dialects other than Mandarin may be treated as languages not commonly used by that user.
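This per-user notion of an uncommon language reduces to a simple membership test against the set of languages the user has selected as familiar; a minimal sketch, where the language codes are illustrative:

```python
# Illustrative check: decide whether to request translation based on the
# channel/application language and the user's selected familiar languages.

def needs_translation(content_language: str, user_languages: set) -> bool:
    """A language outside the user's familiar set is 'non-common' for them."""
    return content_language not in user_languages

user_languages = {"zh-mandarin"}  # e.g. a user who understands only Mandarin
request_for_english = needs_translation("en", user_languages)        # True
request_for_mandarin = needs_translation("zh-mandarin", user_languages)  # False
```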
In step S102, the central node receives a translation service request sent by the terminal device, and determines at least one candidate edge node in the plurality of edge nodes for the translation service request.
The central node may determine, for the translation service request, one or more candidate edge nodes from among a plurality of edge nodes managed by the central node, where one candidate edge node of the at least one candidate edge node is determined by the terminal as a target edge node in a subsequent step, and the target edge node is configured to translate, based on translation configuration information sent by the terminal device, an audio stream to be translated sent by the terminal device to obtain a translation result.
It is to be understood that at least one filtering condition may be preset, and the candidate edge node may be determined among the plurality of edge nodes based on the filtering condition. The specific steps for determining at least one candidate edge node among a plurality of edge nodes for a translation service request are described below.
Optionally, the central node determines, among the plurality of edge nodes, an edge node in the same preset area as the terminal device, and takes at least one determined edge node as a candidate edge node. The number of candidate edge nodes may be a default value, may be set by a user, or may be determined in real time by the central node based on the actual situation.
The following describes specific implementation steps for determining an edge node in the same preset area as the terminal device among a plurality of edge nodes.
As an example, a plurality of regions each including one or more edge nodes may be previously divided based on the distribution of the edge nodes. The translation service request sent by the terminal device may carry the location information of the terminal device, the central node may determine the area where the terminal device is located according to the location information of the terminal device, and the edge node in the area where the terminal device is located is the edge node in the same preset area as the terminal device.
As an example, a plurality of regions each including one or more edge nodes may be previously divided based on the distribution of the edge nodes. For a terminal device with a relatively fixed position (such as a non-mobile terminal device), the area where the terminal device is located is determined, so that an association relationship between the terminal device and an edge node in the area where the terminal device is located can be established in advance, when the central node receives a translation service request sent by the terminal device, the edge node associated with the terminal device is determined, and the edge node associated with the terminal device is an edge node in the same preset area as the terminal device.
For example, an association relationship is established between the identity information of the terminal device and the edge node identity information in the area where the terminal device is located in advance, the translation service request sent by the terminal device may carry the identity information of the terminal device, when the central node receives the translation service request sent by the terminal device, the edge node associated with the terminal device is determined based on the identity information of the terminal device, and the edge node associated with the terminal device is an edge node in the same preset area as the terminal device.
As an example, the translation service request sent by the terminal device may carry location information of the terminal device, and the central node may obtain location information of each edge node. When the central node receives a translation service request sent by the terminal equipment, the distance between the terminal equipment and each edge node can be calculated based on the position information of the terminal equipment and the position information of each edge node; and the central node takes the edge node with the distance from the terminal equipment smaller than the preset threshold distance as the edge node in the same preset area with the terminal equipment.
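The distance-based rule can be sketched with a great-circle distance check; the threshold value and the (latitude, longitude) coordinate format are illustrative assumptions:

```python
import math

# Hypothetical sketch of the distance-based rule above: treat edge nodes
# within a preset threshold distance of the terminal device as being "in
# the same preset area". Coordinates are (lat, lon) in degrees.

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points, in km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def nodes_in_same_area(terminal_pos, edge_nodes, threshold_km=100.0):
    """Keep edge nodes closer to the terminal than the preset threshold."""
    return [n for n in edge_nodes
            if haversine_km(terminal_pos, n["pos"]) < threshold_km]

terminal = (31.23, 121.47)                        # illustrative: near Shanghai
edges = [{"id": "near", "pos": (31.30, 121.50)},  # a few km away
         {"id": "far",  "pos": (39.90, 116.40)}]  # ~1000 km away (near Beijing)
same_area = nodes_in_same_area(terminal, edges)
```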
In addition to determining candidate edge nodes based on the area in which the terminal device is located, the candidate edge nodes may also be determined based on the load of the edge nodes. Optionally, the step of determining at least one candidate edge node among the plurality of edge nodes for the translation service request includes: determining, among the plurality of edge nodes, edge nodes in the same preset area as the terminal device; determining, based on the loads of the determined edge nodes, at least one edge node whose load meets a preset load condition; and taking the at least one edge node whose load meets the preset load condition as the candidate edge node.
Here, the specific implementation steps for determining, among the plurality of edge nodes, an edge node in the same preset area as the terminal device may be the same as the related specific implementation steps described above, and are not described herein again.
After the edge nodes in the same preset area as the terminal device are determined, it is further determined whether their loads meet the preset load condition.
As an example, the edge nodes in the same preset area as the terminal device may be sorted in ascending order of load, and the first N edge nodes, i.e., those with the smallest loads, are determined as edge nodes whose loads meet the preset load condition, where N is a positive integer not less than 1.
As an example, a load threshold may be preset, and an edge node whose load is smaller than the load threshold in edge nodes in the same preset area as the terminal device is determined as an edge node whose load meets a preset load condition, where the load threshold may be determined according to actual design requirements.
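The two load conditions described in the examples above, keeping the N least-loaded nodes or keeping all nodes under a load threshold, can each be sketched in a few lines; the load values are illustrative:

```python
# Illustrative sketches of the two preset load conditions described above.

def top_n_least_loaded(nodes, n):
    """Sort by ascending load and keep the first N nodes."""
    return sorted(nodes, key=lambda x: x["load"])[:n]

def under_load_threshold(nodes, threshold):
    """Keep only nodes whose load is below the preset load threshold."""
    return [x for x in nodes if x["load"] < threshold]

nodes = [{"id": "a", "load": 0.9}, {"id": "b", "load": 0.3}, {"id": "c", "load": 0.5}]
best_two = top_n_least_loaded(nodes, 2)   # keeps b and c
light = under_load_threshold(nodes, 0.6)  # also keeps b and c
```

The top-N form guarantees a fixed number of candidates; the threshold form guarantees a maximum load, but may return no candidates when every node is busy.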
In step S103, the central node sends feature information of at least one candidate edge node to the terminal device.
After determining at least one candidate edge node among the plurality of edge nodes for the translation service request, the central node obtains the feature information of the determined at least one candidate edge node and sends the feature information of the at least one candidate edge node to the terminal device that sent the translation service request.
Optionally, the feature information of the candidate edge node may include performance information capable of reflecting the performance of the candidate edge node. Here, the performance information may include translation capability, load, network delay, and the like of the candidate edge node.
It should be noted that the translation capability may include at least one of a type of a language that the edge node can translate, a translation speed of the edge node, and a translation accuracy of the edge node.
Optionally, the feature information of the candidate edge node may further include IP (Internet Protocol) information and port information of the candidate edge node, and the like.
In step S104, the terminal device receives feature information of at least one candidate edge node sent by the central node for the translation service request.
Optionally, at least one candidate edge node in step S104 and the terminal device belong to the same preset area.
Optionally, at least one candidate edge node in step S104 and the terminal device belong to the same preset region, and a load of the at least one candidate edge node meets a preset load condition.
In step S105, the terminal device determines a target edge node among the at least one candidate edge node based on the feature information of the at least one candidate edge node.
As previously described, the feature information includes performance information that can reflect the performance of the candidate edge node. Optionally, the step of determining a target edge node in the at least one candidate edge node based on the feature information of the at least one candidate edge node includes: and determining a candidate edge node with the optimal performance from the at least one candidate edge node based on the performance information of the at least one candidate edge node, and taking the candidate edge node with the optimal performance as a target edge node.
It is understood that the candidate edge node with the best performance may refer to the node that ranks best among all candidate edge nodes on a single performance metric, the node that ranks best among all candidate edge nodes on at least two performance metrics, or the node whose comprehensive evaluation over at least two performance metrics is the best among all candidate edge nodes.
The performance information includes translation capabilities, load and network latency of the candidate edge nodes. It will be appreciated that the translation capability, load and network latency are each a property of the candidate edge nodes. Optionally, the step of determining, based on the performance information of the at least one candidate edge node, a candidate edge node with the best performance from the at least one candidate edge node includes: scoring the performance of each candidate edge node based on at least one of translation capability, load, and network delay of each candidate edge node; and taking the candidate edge node with the highest performance score as the candidate edge node with the optimal performance.
As an example, when the performance of each candidate edge node is scored based on one of the translation capability, the load and the network delay of the candidate edge node, the score of the performance may be directly used as the performance score of the corresponding candidate edge node. For example, the performance of each candidate edge node is scored based on the translation capability of each candidate edge node, and the score of the translation capability may be directly used as the performance score of the corresponding candidate edge node.
As an example, when the performance of each candidate edge node is scored based on at least two of the translation capability, the load, and the network latency of the candidate edge node, a composite score of the at least two performances may be taken as the performance score of the corresponding candidate edge node.
The composite score for at least two properties may be any of: an average of the scores of the at least two properties, a weighted average of the scores of the at least two properties, a sum of the scores of the at least two properties. When the composite score of the at least two performances is a weighted average of the scores of the at least two performances, the weight of each performance can be determined according to the actual design requirement.
For example, when the performance of each candidate edge node is scored based on the translation capability, the load, and the network latency of the candidate edge node, the terminal device may take a weighted average of the translation-capability score, the load score, and the network-latency score as the performance score of the corresponding candidate edge node. The terminal device then determines the candidate edge node with the highest performance score among all candidate edge nodes and takes it as the candidate edge node with the best performance.
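The weighted-average scoring just described can be sketched as follows. The metric names, the 0–100 per-metric scores, and the weight values are assumptions for illustration; load and network latency are "lower is better" quantities, so they are assumed to have already been converted into higher-is-better scores.

```python
def performance_score(node, weights):
    """Weighted average of per-metric scores (higher is better)."""
    total_w = sum(weights.values())
    return sum(node[metric] * w for metric, w in weights.items()) / total_w

def pick_target(candidates, weights):
    """The terminal device selects the candidate with the highest score."""
    return max(candidates, key=lambda node: performance_score(node, weights))

# Hypothetical weights, determined per the actual design requirements.
weights = {"translation": 0.5, "load": 0.3, "delay": 0.2}
candidates = [
    {"id": "edge-1", "translation": 90, "load": 60, "delay": 70},
    {"id": "edge-2", "translation": 80, "load": 95, "delay": 90},
]
target = pick_target(candidates, weights)
```

Here edge-1 scores 0.5·90 + 0.3·60 + 0.2·70 = 77, while edge-2 scores 86.5, so edge-2 is chosen as the target edge node despite its weaker translation-capability score.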
In step S106, the terminal device sends the translation configuration information and the audio stream to be translated to the determined target edge node.
Here, the translation configuration information is used to characterize the user's translation requirements for the audio stream to be translated.
Optionally, the translation configuration information may include: the target language into which the audio stream to be translated is to be translated, and the form of the translation result of the audio stream to be translated. The form of the translation result includes a speech form and/or a subtitle form.
Optionally, when the form of the translation result includes a speech form, the translation configuration information may further include a timbre and/or pitch of the speech.
Alternatively, when the form of the translation result includes a subtitle form, the translation configuration information may further include at least one of a color, a font, and a font size of the subtitle.
Alternatively, the terminal device may generate translation configuration information based on a user's selection for a current translation service or an operation of inputting a translation configuration, and transmit the generated translation configuration information to the target edge node.
Alternatively, the terminal device may obtain pre-stored translation configuration information and send the pre-stored translation configuration information to the target edge node. The pre-stored translation configuration information may be translation configuration information generated based on a user's selection for a previous translation service or an operation of inputting a translation configuration; it may also be translation configuration information pre-configured in the terminal device, e.g., translation configuration information pre-configured by the provider of the terminal device.
It should be noted that, during a single translation service, when the user's translation requirement for the audio stream to be translated does not change, the terminal device may send the translation configuration information only once; when the user's translation requirement for the audio stream to be translated changes, the terminal device may resend new translation configuration information generated according to the changed requirement.
The terminal device can store the audio and video to be played in advance locally, and can also cache the audio and video to be played on line, wherein the audio stream to be translated is the audio stream in the audio and video to be played.
As an example, for an audio and video to be played that is stored locally in advance, the terminal device may send all audio streams to be translated in the audio and video to be played to the target edge node together, or may send audio stream segments to be translated in the audio and video to be played to the target edge node.
As an example, for an online cached audio/video to be played, after the audio/video caching is completed, the terminal device sends all audio streams to be translated in the cached audio/video to the target edge node together, or sends the audio streams to be translated in the audio/video to be played cached in the period to the target edge node every time a period passes.
Optionally, the step of sending the translation configuration information and the audio stream to be translated to the target edge node includes: and separating the audio stream to be translated from the audio and video data to be played, and sending the audio stream to be translated to the target edge node.
As an example, for an audio/video to be played that is stored locally in advance, the terminal device may separate all audio streams to be translated in the audio/video to be played together, and then send the separated audio streams to be translated to the target edge node. The terminal device may also segmentally separate the audio stream to be translated in the audio/video to be played, and send the separated audio stream to be translated to the target edge node.
As an example, for an online cached audio/video to be played, after the audio/video caching is completed, the terminal device separates all audio streams to be translated in the cached audio/video together, and sends the separated audio streams to be translated to the target edge node; the terminal device may also separate the audio stream to be translated from the audio/video to be played, which is cached in the period, every time a period passes, and send the separated audio stream to be translated to the target edge node.
Optionally, the step of sending the separated audio stream to the target edge node includes: periodically separating audio streams to be translated from the audio and video data to be played; storing each audio stream to be translated sequentially in a data queue; and when the data amount in the data queue reaches a preset data threshold, sending all the audio streams to be translated in the data queue together to the target edge node. Sending the audio stream to be translated in this way avoids excessive fragmentation of the data, which helps improve translation accuracy.
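The queue-and-flush behaviour above can be sketched with a small batcher. The class, its field names, and the byte-count threshold are illustrative assumptions; in practice the `send` callable would transmit the batch to the target edge node over the network.

```python
class SegmentBatcher:
    """Accumulate separated audio segments in a data queue and flush them
    in one batch once the queued amount reaches a preset data threshold,
    so the edge node receives larger, less fragmented chunks."""

    def __init__(self, threshold_bytes, send):
        self.threshold = threshold_bytes
        self.send = send          # callable that ships a batch to the edge node
        self.queue = []
        self.queued_bytes = 0

    def push(self, segment: bytes):
        self.queue.append(segment)
        self.queued_bytes += len(segment)
        if self.queued_bytes >= self.threshold:
            self.send(list(self.queue))  # send everything queued, together
            self.queue.clear()
            self.queued_bytes = 0

sent = []
batcher = SegmentBatcher(threshold_bytes=10, send=sent.append)
for seg in [b"aaaa", b"bbbb", b"cccc"]:  # 4 bytes each
    batcher.push(seg)
```

After the third segment the queue holds 12 bytes, crossing the 10-byte threshold, so all three segments go out as one batch and the queue is emptied.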
As described above, the feature information of the candidate edge node may include IP information and port information of the candidate edge node, and the like. The terminal device may send the translation configuration information and the audio stream to be translated to the target edge node based on the IP information and the port information of the target edge node.
Optionally, the terminal device includes a demultiplexer, and the terminal device separates the audio stream to be translated from the audio/video data to be played through the demultiplexer.
It should be noted that the terminal device may determine whether the audio stream to be translated is an encrypted audio stream. When the audio stream to be translated is an unencrypted audio stream, before the audio stream to be translated is sent, the terminal equipment does not need to decrypt the audio stream to be translated; when the audio stream to be translated is an encrypted audio stream, the terminal device decrypts the audio stream to be translated before transmitting the audio stream to be translated.
Optionally, the terminal device may include an audio decoder. The terminal device may obtain the decryption key from a decoding tool; taking a smart television as an example, the smart television may obtain the decryption key from a television plug-in card or a decryption USB disk. The terminal device may decrypt the encrypted audio stream to be translated based on the key and the audio decoder.
Alternatively, the terminal device and the target edge node may communicate based on a communication technology with high-speed, low-latency characteristics; for example, the terminal device may communicate based on 5G (fifth-generation mobile communication) technology.
In step S107, the target edge node receives the translation configuration information and the audio stream to be translated, which are transmitted by the terminal device.
In step S108, the target edge node translates the audio stream to be translated based on the translation configuration information to obtain a translation result.
As previously described, the translation configuration information may include: the audio stream to be translated is desired to be translated into the target language, the form of the translation result of the audio stream to be translated. Wherein, the form of the translation result comprises a voice form and/or a caption form. The target edge node may translate the audio stream to be translated into speech and/or subtitles in the target language based on the translation configuration information.
As previously mentioned, the translation configuration information may also include the timbre and/or pitch of the speech. Optionally, the target edge node may adjust the timbre and/or pitch of the speech in the target language based on the timbre and/or pitch in the translation configuration information.
As described above, the translation configuration information further includes at least one of the font size, font style, and color of the subtitle. Optionally, the target edge node may further adjust at least one of a font size, a font style, and a color of the subtitle of the target language based on at least one of a font size, a font style, and a color in the translation configuration information.
Optionally, before translating the audio stream to be translated, the audio stream to be translated may be further preprocessed, and preprocessing the audio stream to be translated may include: cutting off the silence of the audio stream to be translated, reducing the noise of the audio stream to be translated, framing the audio stream to be translated, separating the audios of different frequency bands of the audio stream to be translated, and the like.
Optionally, after the audio stream to be translated is translated, a timestamp may be generated for the translation result of the audio stream to be translated, where the timestamp includes a start presentation time and an end presentation time of the translation result of the audio stream to be translated. The terminal device may present the translation result of the audio stream to be translated at the corresponding time based on the time stamp.
Optionally, the step of translating, by the target edge node, the audio stream to be translated to obtain a translation result based on the translation configuration information includes: acquiring a voice translation model which can match the requirement of the translation configuration information; and translating the audio stream to be translated by using the acquired voice translation model based on the translation configuration information to obtain a translation result.
Here, the speech translation model is a machine learning model with speech translation capability that is trained in advance by a specified machine learning method. The specified machine learning method may include Transformer, GPT-2, BERT, GAN, and the like.
A speech translation model may translate an audio stream in one language into speech and/or subtitles in a single target language; it may also translate an audio stream in one language into speech and/or subtitles in multiple target languages; it may even translate audio streams in each of multiple languages into speech and/or subtitles in multiple target languages.
For example, the speech translation model a can translate an audio stream in english into speech and/or subtitles in chinese; the voice translation model b can translate the audio stream of English into voice and/or subtitles of Chinese or Japanese; the speech translation model c may translate an audio stream in english or french into speech and/or subtitles in korean or japanese.
Alternatively, the target edge node may locally preset at least one speech translation model. The step of obtaining a speech translation model that matches the requirements of the translation configuration information includes: and the target edge node determines a voice translation model which can match the requirement of the translation configuration information in the preset voice translation models based on the translation configuration information.
Optionally, the central node may also preset at least one speech translation model, and when a speech translation model capable of matching the requirement of the translation configuration information is absent in the target edge node, the central node may provide the target edge node with the speech translation model. The step of obtaining a speech translation model that matches the requirements of the translation configuration information includes: when the target edge node lacks a speech translation model capable of matching the requirements of the translation configuration information, the target edge node downloads a speech translation model capable of matching the requirements of the translation configuration information in the central node based on the translation configuration information.
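The local-lookup-with-central-fallback step described above can be sketched as follows. The class and method names, the language-pair keys, and the model identifiers are all illustrative assumptions; `download` stands in for whatever transfer mechanism the central node actually exposes.

```python
class CentralNode:
    """Holds a larger catalogue of speech translation models."""

    def __init__(self, catalogue):
        self.catalogue = catalogue   # (src, tgt) -> model identifier
        self.downloads = []          # record of served downloads

    def download(self, src, tgt):
        self.downloads.append((src, tgt))
        return self.catalogue[(src, tgt)]

class EdgeNode:
    """Use a locally preset model when one matches the requested language
    pair; otherwise download a matching model from the central node."""

    def __init__(self, local_models, central):
        self.local_models = dict(local_models)  # (src, tgt) -> model identifier
        self.central = central

    def get_model(self, src_lang, tgt_lang):
        key = (src_lang, tgt_lang)
        if key not in self.local_models:
            # model absent locally: fall back to the central node's catalogue
            self.local_models[key] = self.central.download(src_lang, tgt_lang)
        return self.local_models[key]

central = CentralNode({("en", "ja"): "model-en-ja"})
edge = EdgeNode({("en", "zh"): "model-en-zh"}, central)
```

A first request for en→zh is served from the edge node's preset models with no download; a request for en→ja triggers one download, after which the model is kept locally for later requests.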
More types of voice translation models can be preset in the central node, so that the audio stream to be translated can be translated into voices and/or subtitles of multiple languages, the translation requirements of different users are met, the cost of video translation work of a video publisher can be reduced, and the video can be favorably spread in different language regions.
Optionally, the type of speech translation model preset in each edge node is related to the area where the edge node is located. Each edge node may preset only the speech translation models that are frequently used in its area, which greatly reduces the data storage burden of the edge node.
In step S109, the target edge node transmits the translation result of the audio stream to be translated to the terminal device.
The target edge node may send the translation results of all received audio streams to be translated to the terminal device together; the target edge node may also send the translation results of multiple segments of the audio stream to be translated to the terminal device sequentially, in the order in which the segments were received.
As an example, for an audio/video to be played that is stored locally in advance, the terminal device may send all audio streams to be translated in the audio/video to be played to the target edge node together. The target edge node can transmit the translation results of all the audio streams to be translated to the terminal equipment together after all the audio streams to be translated are translated; the target edge node may also translate the audio stream segment to be translated, and when the translation of the audio stream segment to be translated is completed, the translation result of the audio stream segment to be translated may be sent to the terminal device.
As an example, for an audio/video to be played that is stored locally in advance, the terminal device may send an audio stream segment to be translated in the audio/video to be played to the target edge node. The target edge node may translate the audio stream to be translated each time it receives the audio stream to be translated, and send the translation results of all the audio streams to be translated to the terminal device after all the audio streams to be translated are translated, or send the translation results of the audio stream to be translated to the terminal device each time the target edge node completes the translation of one audio stream to be translated.
As an example, for an online cached audio/video to be played, the terminal device may send all audio streams to be translated in the cached audio/video to the target edge node after the audio/video caching is completed. The target edge node can transmit the translation results of all the audio streams to be translated to the terminal equipment together after all the audio streams to be translated are translated; the target edge node may also translate the audio stream segment to be translated, and when the translation of the audio stream segment to be translated is completed, the translation result of the audio stream segment to be translated may be sent to the terminal device.
As an example, for an online cached audio/video to be played, the terminal device may send, to the target edge node, an audio stream to be translated in the audio/video to be played cached in a cycle every time the cycle passes. The target edge node may receive the audio stream to be translated in one period each time, that is, translate the audio stream to be translated in the period, and after all the audio streams to be translated are translated, send the translation results of all the audio streams to be translated together to the terminal device, or the target edge node may send the translation results of the audio stream to be translated in the period to the terminal device each time the audio stream to be translated in one period is translated.
In step S110, the terminal device receives a translation result of the audio stream to be translated, which is sent by the target edge node and is obtained based on the translation configuration information.
The terminal device may receive the translation results of all the audio streams to be translated at one time, or may receive the translation results of multiple segments of audio streams to be translated in sequence. After the terminal device receives the translation result of the audio stream to be translated, the translation result may be stored in its memory.
Optionally, after step S110, the audio translation method further includes: and when the audio stream to be translated is played, the terminal equipment synchronously displays the translation result with the audio stream to be translated.
As described above, the target edge node may generate a time stamp for the translation result of the audio stream to be translated, and the terminal device may present the translation result of the audio stream to be translated at a corresponding time based on the time stamp, so that the terminal device may present the translation result synchronously when the audio stream to be translated is played.
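The timestamp-driven synchronization above can be sketched as a lookup of the translation result whose presentation window covers the current playback time. The result structure, field names, and times are illustrative assumptions; results are assumed sorted by start time.

```python
import bisect

def subtitle_at(results, t):
    """Return the translation result whose [start, end) presentation window
    covers playback time t (seconds), or None if no window covers t."""
    starts = [r["start"] for r in results]
    i = bisect.bisect_right(starts, t) - 1  # last result starting at or before t
    if i >= 0 and results[i]["start"] <= t < results[i]["end"]:
        return results[i]
    return None

# Hypothetical translated results with start/end presentation timestamps.
results = [
    {"start": 0.0, "end": 3.0, "text": "你好"},
    {"start": 3.0, "end": 6.0, "text": "世界"},
]
```

During playback, the terminal device would call `subtitle_at` with the current position and display the returned text (or play the returned speech), keeping the translation result in step with the audio stream to be translated.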
As an example, when the form of the translation result of the audio stream to be translated includes a speech form, the terminal device may eliminate the speech in the audio stream to be translated and play the speech in the target language in the translation result.
As an example, when the form of the translation result of the audio stream to be translated includes a caption form, the terminal device may synchronously display the caption of the target language in the translation result while playing the voice in the audio stream to be translated.
As an example, when the form of the translation result of the audio stream to be translated includes a speech form and a caption form, the terminal device may eliminate speech in the audio stream to be translated, play speech in the target language in the translation result, and synchronously display captions in the target language in the translation result.
Taking a smart television as an example, the flow of a real-time audio translation method for the smart television is described below.
Step a: and the intelligent television sends a translation service request to the central node.
Step b: the central node receives a translation service request sent by the intelligent television, determines at least one candidate edge node in the plurality of edge nodes aiming at the translation service request, and sends the characteristic information of the at least one candidate edge node to the intelligent television.
Step c: the intelligent television receives feature information of at least one candidate edge node sent by the central node aiming at the translation service request, and determines a target edge node in the at least one candidate edge node based on the feature information of the at least one candidate edge node.
Step d: and the intelligent television sends translation configuration information to the determined target edge node.
Step e: and the target edge node receives the translation configuration information and acquires a voice translation model which can match the requirement of the translation configuration information.
Step f: the intelligent television separates audio streams to be translated from the cached audio and video to be played in real time, stores the separated audio streams to be translated in the data queue in sequence, and sends all the audio streams to be translated in the data queue to the target edge node together when the data amount in the data queue reaches a preset data threshold value.
Step g: and when the target edge node receives the audio stream to be translated each time, translating the received audio stream to be translated by using the obtained voice translation model based on the translation configuration information to obtain a translation result, and sending the translation result of the audio stream to be translated to the intelligent television.
Step h: during or before the intelligent television plays the video, the intelligent television can receive and store the translation result of the audio stream to be translated sent by the target edge node each time, and when a certain segment of the audio stream to be translated is played, the corresponding translation result is displayed synchronously with the segment of the audio stream to be translated. For example, when the audio stream of the 3 rd to 6 th seconds is played, the translation result (e.g., the voice and/or the subtitle of the target language) corresponding to the audio stream of the 3 rd to 6 th seconds is displayed.
The architecture of the terminal devices, the central node and the edge nodes is described below.
The terminal device may include at least one computing apparatus and at least one storage apparatus storing instructions that, when executed by the at least one computing apparatus, cause the at least one computing apparatus to perform the method described above as being performed by the terminal device.
The central node may comprise at least one computing device and at least one storage device storing instructions that, when executed by the at least one computing device, cause the at least one computing device to perform the method described above as being performed by the central node.
The edge node may include at least one computing device and at least one storage device storing instructions that, when executed by the at least one computing device, cause the at least one computing device to perform the method described above as being performed by the edge node.
Fig. 3 is a block diagram illustrating a first audio translation apparatus according to an exemplary embodiment of the present invention.
Referring to fig. 3, the first audio translation apparatus includes a request sending module 210, an information receiving module 220, a target node determination module 230, a data sending module 240, and a result receiving module 250.
The request sending module 210 is configured to: send a translation service request to the central node.
The information receiving module 220 is configured to: receive feature information of at least one candidate edge node sent by the central node for the translation service request.
The target node determination module 230 is configured to: determine a target edge node among the at least one candidate edge node based on the feature information of the at least one candidate edge node.
The data sending module 240 is configured to: send the translation configuration information and the audio stream to be translated to the determined target edge node.
The result receiving module 250 is configured to: receive a translation result of the audio stream to be translated, which is sent by the target edge node and is obtained based on the translation configuration information.
Optionally, the result receiving module 250 is further configured to: display the translation result synchronously with the audio stream to be translated when the audio stream to be translated is played.
Optionally, the at least one candidate edge node and the terminal device belong to the same preset area; or the at least one candidate edge node and the terminal device belong to the same preset area, and the load of the at least one candidate edge node meets the preset load condition.
Optionally, the feature information includes performance information capable of reflecting the performance of the candidate edge nodes. The target node determination module 230 is further configured to: determine a candidate edge node with the best performance from the at least one candidate edge node based on the performance information of the at least one candidate edge node, and take the candidate edge node with the best performance as the target edge node.
Optionally, the performance information includes the translation capability, load, and network latency of the candidate edge nodes. The target node determination module 230 is further configured to: score the performance of each candidate edge node based on at least one of the translation capability, load, and network latency of each candidate edge node; and take the candidate edge node with the highest performance score as the candidate edge node with the best performance.
Optionally, the step of sending the translation configuration information and the audio stream to be translated to the target edge node includes: separating the audio stream to be translated from the audio and video data to be played, and sending the audio stream to be translated to the target edge node, wherein when the audio stream to be translated is an encrypted audio stream, the audio stream to be translated is decrypted before being sent.
Optionally, the data sending module 240 is further configured to: periodically separate audio streams to be translated from the audio and video data to be played; store each audio stream to be translated sequentially in a data queue; and when the data amount in the data queue reaches a preset data threshold, send all the audio streams to be translated in the data queue together to the target edge node.
Fig. 4 shows a block diagram of a second audio translation apparatus according to an exemplary embodiment of the present invention.
Referring to fig. 4, the second audio translation apparatus includes a request receiving module 310, a candidate node determining module 320, and a feature information transmitting module 330.
The request receiving module 310 is configured to: receive a translation service request sent by the terminal device.
The candidate node determination module 320 is configured to: determine at least one candidate edge node among a plurality of edge nodes for the translation service request.
The feature information sending module 330 is configured to: send the feature information of the at least one candidate edge node to the terminal device.
The terminal device determines one candidate edge node of the at least one candidate edge node as the target edge node, and the target edge node translates the audio stream to be translated sent by the terminal device, based on the translation configuration information sent by the terminal device, to obtain a translation result.
Optionally, the candidate node determination module 320 is further configured to: determine, among the plurality of edge nodes, edge nodes located in the same preset area as the terminal device, and take the at least one determined edge node as a candidate edge node.
Optionally, the candidate node determination module 320 is further configured to: determine, among the plurality of edge nodes, at least one edge node located in the same preset area as the terminal device; determine, based on the load of the at least one determined edge node, at least one edge node whose load meets a preset load condition; and take the at least one edge node whose load meets the preset load condition as a candidate edge node.
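The two candidate-selection variants above can be sketched as follows; the node dict fields and the load threshold are assumptions, since the disclosure leaves the preset load condition unspecified:

```python
# Sketch of the central node's two candidate-selection variants described
# above. The node dict fields and the load threshold are assumptions.

def candidates_by_region(edge_nodes, terminal_region):
    """Variant 1: every edge node in the terminal device's preset area."""
    return [n for n in edge_nodes if n["region"] == terminal_region]

def candidates_by_region_and_load(edge_nodes, terminal_region, max_load=0.8):
    """Variant 2: same-area nodes whose load meets the preset condition."""
    return [n for n in candidates_by_region(edge_nodes, terminal_region)
            if n["load"] <= max_load]
```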
Optionally, the second audio translation apparatus further includes a model providing module, which is configured to: when the target edge node lacks a speech translation model that matches the requirements of the translation configuration information sent by the terminal device, provide the speech translation model to the target edge node.
Fig. 5 is a block diagram illustrating a third audio translation apparatus provided in an exemplary embodiment of the present invention.
Referring to fig. 5, the third audio translation apparatus includes a data receiving module 410, a translation module 420, and a result sending module 430.
The data receiving module 410 is configured to: receive the translation configuration information and the audio stream to be translated sent by the terminal device.
The translation module 420 is configured to: translate the audio stream to be translated based on the translation configuration information to obtain a translation result.
The result sending module 430 is configured to: send the translation result of the audio stream to be translated to the terminal device.
Here, the target edge node is the edge node determined by the terminal device among the at least one candidate edge node based on the feature information of the at least one candidate edge node, and the at least one candidate edge node is determined by the central node among the plurality of edge nodes for the translation service request sent by the terminal device.
Optionally, the translation module 420 is further configured to: acquire a speech translation model that matches the requirements of the translation configuration information, and translate the audio stream to be translated using the acquired speech translation model, based on the translation configuration information, to obtain the translation result.
Optionally, the translation module 420 is further configured to: determine, based on the translation configuration information, a speech translation model matching its requirements among preset speech translation models; or, when no matching speech translation model is present, download a speech translation model matching the requirements of the translation configuration information from the central node based on the translation configuration information.
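The look-up-then-download-on-miss behaviour of the translation module can be sketched as a small registry; the (source, target) language key and the downloader callable are hypothetical, as the disclosure does not fix the shape of the translation configuration information:

```python
# Sketch of the edge node's model resolution: use a preset local speech
# translation model when one matches the translation configuration,
# otherwise download a matching model from the central node and cache it.

class ModelRegistry:
    def __init__(self, preset_models, download_from_central):
        self._models = dict(preset_models)        # {(src, dst): model}
        self._download = download_from_central    # callable(src, dst) -> model

    def get_model(self, config):
        key = (config["source_lang"], config["target_lang"])
        if key not in self._models:
            # No matching preset model: fetch from the central node and cache.
            self._models[key] = self._download(*key)
        return self._models[key]
```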
Optionally, the type of the preset speech translation model in each edge node is related to the area where the edge node is located.
The audio translation method and system according to the exemplary embodiment of the present disclosure have been described above with reference to fig. 1 to 5.
In the audio translation method provided by the exemplary embodiments of the present invention, the central node is responsible only for allocating edge nodes to terminal devices and does not itself provide the translation service, so its load can be effectively reduced. Each edge node needs to provide translation services for only a small number of terminal devices, which keeps the load of any edge node from becoming too high and allows each edge node to provide the translation service in real time.
Each unit in each audio translation apparatus shown in fig. 3 to 5 may be configured as software, hardware, firmware, or any combination thereof that performs a specific function. For example, each unit may correspond to an application-specific integrated circuit, to pure software code, or to a module combining software and hardware. Furthermore, one or more functions implemented by the respective units may also be uniformly executed by components in a physical entity device (e.g., a processor, a client, a server, or the like).
Further, the audio translation method described with reference to fig. 2 may be implemented by a program (or instructions) recorded on a computer-readable storage medium. For example, according to an exemplary embodiment of the present disclosure, a computer-readable storage medium storing instructions may be provided, wherein the instructions, when executed by at least one computing device, cause the at least one computing device to perform an audio translation method according to the present disclosure.
The computer program in the computer-readable storage medium may be executed in an environment deployed on a computer device such as a client, a host, a proxy device, or a server. Note that the computer program may also be used to perform additional steps beyond those described above, or more specific processing within those steps; since the content of such additional steps and processing has already been mentioned in the description of the related method with reference to fig. 2, it is not repeated here.
It should be noted that each unit in the audio translation apparatus according to the exemplary embodiments of the present disclosure may rely entirely on the execution of a computer program to realize its function; that is, each unit corresponds to a step in the functional architecture of the computer program, so that the entire system is invoked through a dedicated software package (e.g., a library) to realize the corresponding functions.
On the other hand, each unit shown in fig. 3 to 5 may also be implemented by hardware, software, firmware, middleware, microcode, or any combination thereof. When implemented in software, firmware, middleware or microcode, the program code or code segments to perform the corresponding operations may be stored in a computer-readable medium such as a storage medium, so that a processor may perform the corresponding operations by reading and executing the corresponding program code or code segments.
For example, exemplary embodiments of the present disclosure may also be implemented as a computing device including a storage component having stored therein a set of computer-executable instructions that, when executed by a processor, perform an audio translation method according to exemplary embodiments of the present disclosure.
In particular, the computing device may be deployed in a server or a client, or on a node device in a distributed network environment. Further, the computing device may be a PC, a tablet device, a personal digital assistant, a smartphone, a web appliance, or any other device capable of executing the above set of instructions.
The computing device need not be a single computing device; it may be any device or collection of circuits capable of executing the above instructions (or instruction sets), individually or in combination. The computing device may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
In a computing device, a processor may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
Some of the operations described in the audio translation method according to the exemplary embodiments of the present disclosure may be implemented by software, some of the operations may be implemented by hardware, and further, the operations may be implemented by a combination of hardware and software.
The processor may execute instructions or code stored in one of the memory components, which may also store data. The instructions and data may also be transmitted or received over a network via a network interface device, which may employ any known transmission protocol.
The memory component may be integral to the processor, e.g., having RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, the storage component may comprise a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The storage component and the processor may be operatively coupled or may communicate with each other, such as through an I/O port, a network connection, etc., so that the processor can read files stored in the storage component.
In addition, the computing device may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the computing device may be connected to each other via a bus and/or a network.
The audio translation method according to the exemplary embodiments of the present disclosure may be described in terms of various interconnected or coupled functional blocks or functional diagrams. However, these functional blocks or diagrams may equally be integrated into a single logic device or operate with non-exact boundaries.
Thus, the audio translation method described with reference to fig. 2 may be implemented by a system comprising at least one computing device and at least one storage device storing instructions.
According to an exemplary embodiment of the present disclosure, the at least one computing device is configured to perform the audio translation method according to the exemplary embodiments of the present disclosure, and the storage device stores a set of computer-executable instructions that, when executed by the at least one computing device, cause the audio translation method described with reference to fig. 2 to be performed.
While various exemplary embodiments of the present disclosure have been described above, it should be understood that the above description is exemplary only, and not exhaustive, and that the present disclosure is not limited to the disclosed exemplary embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure. Therefore, the protection scope of the present disclosure should be subject to the scope of the claims.

Claims (23)

1. An audio translation method performed by a terminal device, comprising:
sending a translation service request to the central node;
receiving feature information of at least one candidate edge node sent by the central node for the translation service request;
determining a target edge node among the at least one candidate edge node based on the feature information of the at least one candidate edge node;
transmitting translation configuration information and an audio stream to be translated to the determined target edge node;
and receiving a translation result of the audio stream to be translated, which is sent by the target edge node and is obtained based on the translation configuration information.
2. The method of claim 1, further comprising: when the audio stream to be translated is played, displaying the translation result in synchronization with the audio stream to be translated.
3. The method of claim 1, wherein,
the at least one candidate edge node and the terminal equipment belong to the same preset area;
or the at least one candidate edge node and the terminal device belong to the same preset region, and the load of the at least one candidate edge node meets a preset load condition.
4. The method of claim 1, wherein,
the characteristic information comprises performance information capable of reflecting the performance of the candidate edge node;
the step of determining a target edge node among the at least one candidate edge node based on the feature information of the at least one candidate edge node comprises:
determining, based on the performance information of the at least one candidate edge node, a candidate edge node with optimal performance among the at least one candidate edge node, and taking the candidate edge node with the optimal performance as the target edge node.
5. The method of claim 4, wherein,
the performance information comprises translation capability, load and network delay of the candidate edge node;
the step of determining a candidate edge node with the best performance among the at least one candidate edge node based on the performance information of the at least one candidate edge node comprises:
scoring the performance of each candidate edge node based on at least one of translation capability, load, and network delay of each candidate edge node; and taking the candidate edge node with the highest performance score as the candidate edge node with the optimal performance.
6. The method of claim 1, wherein the step of sending translation configuration information and the audio stream to be translated to the target edge node comprises:
separating the audio stream to be translated from the audio/video data to be played, sending the audio stream to be translated to the target edge node,
when the audio stream to be translated is an encrypted audio stream, the audio stream to be translated is decrypted before being sent.
7. The method of claim 6, wherein the sending of the separated audio stream to the target edge node comprises:
periodically separating the audio stream to be translated from the audio and video data to be played;
sequentially storing the audio streams to be translated in a data queue;
and when the data volume in the data queue reaches a preset data threshold value, sending all the audio streams to be translated in the data queue to the target edge node together.
8. An audio translation method performed by a central node, comprising:
receiving a translation service request sent by terminal equipment;
determining at least one candidate edge node among a plurality of edge nodes for the translation service request;
sending feature information of at least one candidate edge node to the terminal equipment;
the terminal determines a candidate edge node of the at least one candidate edge node to be a target edge node, and the target edge node is configured to translate an audio stream to be translated, which is sent by the terminal device, based on translation configuration information sent by the terminal device to obtain a translation result.
9. The method of claim 8, wherein the determining at least one candidate edge node among a plurality of edge nodes for the translation service request comprises:
determining, among the plurality of edge nodes, edge nodes located in the same preset area as the terminal device, and taking the at least one determined edge node as a candidate edge node.
10. The method of claim 9, wherein the determining at least one candidate edge node among a plurality of edge nodes for the translation service request comprises:
determining, among the plurality of edge nodes, edge nodes located in the same preset area as the terminal device;
determining, based on the load of the at least one determined edge node, at least one edge node whose load meets a preset load condition; and
taking the at least one edge node whose load meets the preset load condition as a candidate edge node.
11. The method of claim 8, further comprising:
when a speech translation model that matches the requirements of the translation configuration information sent by the terminal device is absent from the target edge node, providing the speech translation model to the target edge node.
12. An audio translation method performed by an edge node, comprising:
receiving translation configuration information and an audio stream to be translated, which are sent by terminal equipment;
translating the audio stream to be translated to obtain a translation result based on the translation configuration information;
transmitting a translation result of the audio stream to be translated to the terminal device,
wherein the edge node is a target edge node determined by the terminal device among at least one candidate edge node based on feature information of the at least one candidate edge node, and the at least one candidate edge node is determined by a central node among a plurality of edge nodes for a translation service request sent by the terminal device.
13. The method of claim 12, wherein the translating the audio stream to be translated to obtain a translation result based on the translation configuration information comprises:
acquiring a speech translation model that matches the requirements of the translation configuration information;
and translating the audio stream to be translated by using the acquired voice translation model based on the translation configuration information to obtain a translation result.
14. The method of claim 13, wherein the step of obtaining a speech translation model that can match requirements of the translation configuration information comprises:
determining, based on the translation configuration information, a speech translation model that matches the requirements of the translation configuration information among preset speech translation models;
or, when such a speech translation model is absent, downloading a speech translation model that matches the requirements of the translation configuration information from the central node based on the translation configuration information.
15. The method of claim 13, wherein the type of the preset speech translation model in each edge node is related to the area where the edge node is located.
16. An audio translation system, comprising:
a terminal device configured to perform the method of any one of claims 1-7;
a central node configured to perform the method of any one of claims 8-11;
an edge node configured to perform the method of any one of claims 12-15.
17. An audio translation device comprising:
a request sending module configured to: sending a translation service request to the central node;
an information receiving module configured to: receiving feature information of at least one candidate edge node sent by the central node for the translation service request;
a target node determination module configured to: determining a target edge node among the at least one candidate edge node based on the feature information of the at least one candidate edge node;
a data transmission module configured to: transmitting translation configuration information and an audio stream to be translated to the determined target edge node;
a result receiving module configured to: and receiving a translation result of the audio stream to be translated, which is sent by the target edge node and is obtained based on the translation configuration information.
18. An audio translation device comprising:
a request receiving module configured to: receiving a translation service request sent by terminal equipment;
a candidate node determination module configured to: determining at least one candidate edge node among a plurality of edge nodes for the translation service request;
a feature information sending module configured to: sending feature information of at least one candidate edge node to the terminal equipment;
wherein the terminal device determines one candidate edge node of the at least one candidate edge node as a target edge node, and the target edge node is configured to translate an audio stream to be translated sent by the terminal device, based on translation configuration information sent by the terminal device, to obtain a translation result.
19. An audio translation device comprising:
a data receiving module configured to: receiving translation configuration information and an audio stream to be translated, which are sent by terminal equipment;
a translation module configured to: translating the audio stream to be translated to obtain a translation result based on the translation configuration information;
a result sending module configured to: sending a translation result of the audio stream to be translated to the terminal equipment;
wherein the target edge node is determined by the terminal device among at least one candidate edge node based on feature information of the at least one candidate edge node, and the at least one candidate edge node is determined by a central node among a plurality of edge nodes for a translation service request sent by the terminal device.
20. A computer-readable storage medium storing instructions that, when executed by at least one computing device, cause the at least one computing device to perform the method of any of claims 1 to 15.
21. A terminal device comprising at least one computing apparatus and at least one storage apparatus storing instructions that, when executed by the at least one computing apparatus, cause the at least one computing apparatus to perform the method of any of claims 1 to 7.
22. A central node comprising at least one computing device and at least one storage device storing instructions that, when executed by the at least one computing device, cause the at least one computing device to perform the method of any of claims 8 to 11.
23. An edge node comprising at least one computing device and at least one storage device storing instructions that, when executed by the at least one computing device, cause the at least one computing device to perform the method of any of claims 12 to 15.
CN202010972182.4A 2020-09-16 2020-09-16 Audio translation method and system Pending CN112118309A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010972182.4A CN112118309A (en) 2020-09-16 2020-09-16 Audio translation method and system


Publications (1)

Publication Number Publication Date
CN112118309A true CN112118309A (en) 2020-12-22

Family

ID=73802045

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010972182.4A Pending CN112118309A (en) 2020-09-16 2020-09-16 Audio translation method and system

Country Status (1)

Country Link
CN (1) CN112118309A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112799789A (en) * 2021-03-22 2021-05-14 腾讯科技(深圳)有限公司 Node cluster management method, device, equipment and storage medium
CN113810493A (en) * 2021-09-16 2021-12-17 中国电信股份有限公司 Translation method, system, device and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108810170A (en) * 2018-07-19 2018-11-13 中国联合网络通信集团有限公司 resource allocation method and system
CN110300184A (en) * 2019-07-10 2019-10-01 深圳市网心科技有限公司 Fringe node distribution method, device, dispatch server and storage medium
CN110769265A (en) * 2019-10-08 2020-02-07 深圳创维-Rgb电子有限公司 Simultaneous caption translation method, smart television and storage medium
CN111026547A (en) * 2019-11-28 2020-04-17 云南大学 Edge computing server resource allocation method based on auction mechanism


Similar Documents

Publication Publication Date Title
US20200314460A1 (en) Video stream processing method, computer device, and storage medium
WO2019205886A1 (en) Method and apparatus for pushing subtitle data, subtitle display method and apparatus, device and medium
US11917344B2 (en) Interactive information processing method, device and medium
JP6629774B2 (en) Interactive method and apparatus applied to live distribution
US20200312327A1 (en) Method and system for processing comment information
US10306328B2 (en) Systems and methods for rendering text onto moving image content
WO2020186904A1 (en) Text adaptive display method and apparatus, electronic device, server, and storage medium
US11736749B2 (en) Interactive service processing method and system, device, and storage medium
CN110798501B (en) Data processing method and device and readable storage medium
CN110933467B (en) Live broadcast data processing method and device and computer readable storage medium
WO2020155964A1 (en) Audio/video switching method and apparatus, and computer device and readable storage medium
US8700702B2 (en) Data extraction system, terminal apparatus, program of the terminal apparatus, server apparatus, and program of the server apparatus for extracting prescribed data from web pages
CN112118309A (en) Audio translation method and system
CN111683266A (en) Method and terminal for configuring subtitles through simultaneous translation of videos
US11540028B2 (en) Information presenting method, terminal device, server and system
CN110996160B (en) Video processing method and device, electronic equipment and computer readable storage medium
CN113796087B (en) Content modification system using multiple fingerprint data type features
CN115623264A (en) Live stream subtitle processing method and device and live stream playing method and device
GB2330429A (en) Data stream enhancement
KR102157790B1 (en) Method, system, and computer program for operating live quiz show platform including characters
CN113691886B (en) Downloading method and device of streaming media file
CN114339300B (en) Subtitle processing method, subtitle processing device, electronic equipment, computer readable medium and product
CN114422814B (en) Method and device for processing direct broadcast audio and video, server and readable storage medium
CN115349264B (en) Method and apparatus for receiving media data
CN112562733A (en) Media data processing method and device, storage medium and computer equipment

Legal Events

Date Code Title Description
PB01: Publication
SE01: Entry into force of request for substantive examination
RJ01: Rejection of invention patent application after publication (application publication date: 2020-12-22)