CN116016779A

CN116016779A - Voice call translation assisting method, system, computer equipment and storage medium

Info

Publication number: CN116016779A
Application number: CN202211650023.8A
Authority: CN
Inventors: 张猷健
Original assignee: iFlytek Co Ltd
Current assignee: iFlytek Co Ltd
Priority date: 2022-12-21
Filing date: 2022-12-21
Publication date: 2023-04-25

Abstract

The application provides a voice call translation assisting method, a voice call translation assisting system, computer equipment and a storage medium, wherein the voice call translation assisting method comprises the following steps: acquiring call audio of a target client; if the language type corresponding to the call audio is the target language, cutting off a voice channel between the target client and the target server; processing the call audio, outputting a translation text of a preset language corresponding to the call audio, and displaying the translation text on a display screen of the target server; and obtaining reply information of a preset language corresponding to the translation text, processing the reply information, generating reply audio corresponding to the target language, and transmitting the reply audio to the target client. According to the voice call translation assisting method, call interaction between the target client and the target server can be achieved automatically over the full flow without introducing any third-party personnel, and the service efficiency of a call center after a call in a non-specific language is received is effectively improved.

Description

Voice call translation assisting method, system, computer equipment and storage medium

Technical Field

The present invention relates to the field of call processing technologies, and in particular, to a voice call translation assisting method, a system, a computer device, and a storage medium (computer-readable storage medium).

Background

A call center, also known as a customer service center, is a relatively centralized location where telephone calls from subscribers are routed to a group of service personnel, who communicate with the subscribers and handle their service needs.

At present, call centers can handle telephone service requests but are limited by the language capabilities of their customer service staff, and can generally serve only requests in specific languages. When a call in a non-specific language is connected, the service staff must first raise a translation request and then contact a third-party translator through a three-party call function, which keeps the calling customer waiting and reduces service efficiency.

Disclosure of Invention

Accordingly, in order to solve the above-mentioned problems, it is necessary to provide a voice call translation assisting method, system, computer device and storage medium for solving the technical problem that the service efficiency is low after receiving the service requirement of the non-specific language in the existing call center.

In a first aspect, the present application provides a voice call translation assisting method, including:

acquiring call audio of a target client;

if the language type corresponding to the call audio is the target language, cutting off a voice channel between the target client and the target server;

processing the call audio, outputting a translation text of a preset language corresponding to the call audio, and displaying the translation text on a display screen of the target server;

and obtaining reply information of a preset language corresponding to the translation text, processing the reply information, generating reply audio corresponding to the target language, and transmitting the reply audio to the target client.

As a possible embodiment of the present application, the processing the call audio, outputting a translation text of a preset language corresponding to the call audio, includes:

inquiring a preset database according to the target language, and acquiring an audio processing model associated with the target language;

and inputting the call audio into the audio processing model for processing, and outputting a translation text of a preset language corresponding to the call audio.

As a possible embodiment of the present application, after outputting the translated text of the preset language corresponding to the call audio and displaying the translated text on the display screen of the target server, the method further includes:

processing the translation text to obtain semantic information corresponding to the translation text;

inquiring a preset database according to the semantic information, acquiring an associated text associated with the semantic information, and displaying the associated text on the display screen;

and responding to a selection instruction of the associated text on the display screen, and taking text information corresponding to the selection instruction in the associated text as reply information corresponding to the translation text.

As a possible embodiment of the present application, the obtaining reply information corresponding to the translated text includes:

responding to a first operation instruction of a preset button on the display screen, and starting a preset recording device;

responding to a second operation instruction of a preset button on the display screen, and closing a preset recording device;

and generating reply information corresponding to the translation text according to the audio information acquired by the preset recording device.

As a possible embodiment of the present application, the generating reply information corresponding to the translation text according to the audio information collected by the preset recording device includes:

processing the audio information acquired by the preset recording device, generating a reply text of a preset language corresponding to the audio information, and displaying the reply text on the display screen;

and responding to a modification operation instruction of the reply text on the display screen, generating a modified reply text, and taking the modified reply text as reply information corresponding to the translation text.

As a possible embodiment of the present application, after the obtaining the reply information of the preset language corresponding to the translated text, the method further includes:

and processing the reply information to generate a reply text of the target language, and transmitting the reply text of the target language to the target client.

As a possible embodiment of the present application, after the obtaining the call audio of the target client, the method further includes:

acquiring a conversation interval corresponding to the conversation audio and preset reminding voice information corresponding to the target language;

if the call interval is greater than a preset duration threshold, transmitting the preset reminding voice information to the target client before the step of generating the reply audio corresponding to the target language.

As a possible embodiment of the present application, the cutting off the voice path between the target client and the target server refers to cutting off a unidirectional voice path sent by the target server to the target client.

In a second aspect, the present application provides a voice call translation assisting system, including:

the voice call subsystem is used for acquiring call audio of the target client; if the language type corresponding to the call audio is the target language, cutting off a voice channel between the target client and the target server;

the voice translation subsystem is used for processing the call audio, outputting a translation text of a preset language corresponding to the call audio and displaying the translation text on a display screen of the target server;

and the voice synthesis subsystem is used for acquiring the reply information of the preset language corresponding to the translation text, processing the reply information, generating the reply audio corresponding to the target language and transmitting the reply audio to the target client.

In a third aspect, the present application further provides a computer device, wherein the computer device includes:

a processor; and

one or more applications, wherein the one or more applications are stored in the processor, which when executing the one or more applications, is configured to implement a voice call translation assisting method as described in any of the above.

In a fourth aspect, the present application further provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, is configured to implement a voice call translation assisting method according to any one of the above.

According to the voice call translation auxiliary method, when call audio of the target client is obtained, if the language type corresponding to the call audio is detected to be the target language, a voice channel between the target client and the target server is cut off, a translation text of a preset language corresponding to the call audio is output and provided on a display screen of the target server, and after a service person feeds back reply information of the preset language according to the translation text, the reply information can be processed to generate reply audio corresponding to the target language and provided for the target client. According to the voice call translation auxiliary method, call interaction between the target client and the target server can be automatically achieved in a full-flow mode on the basis that no third party personnel are introduced, and service efficiency of a call center after a call of a non-specific language is received is effectively improved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is an application scenario schematic diagram of a voice call translation assisting method provided in an embodiment of the present application;

Fig. 2 is a schematic flow chart of steps of a voice call translation assisting method according to an embodiment of the present application;

Fig. 3 is a flowchart illustrating steps of processing call audio by acquiring a corresponding audio processing model based on a language type according to an embodiment of the present application;

Fig. 4 is a realization scheme for synchronously displaying associated text on a display screen to assist customer service personnel in replying according to the embodiment of the application;

Fig. 5 is a schematic flowchart of a step of obtaining reply information based on a voice manner according to an embodiment of the present application;

Fig. 6 is a flowchart illustrating steps of displaying reply text to assist in modifying reply information according to one embodiment of the present application;

Fig. 7 is a flowchart illustrating steps for providing reminding voice information to a target client according to an embodiment of the present application;

Fig. 8 is a schematic structural diagram of a voice call translation assisting system according to an embodiment of the present application;

Fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.

In the description of the present application, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more of the described features. In the description of the present application, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.

In the description of the present application, the term "for example" is used to mean "serving as an example, instance, or illustration." Any embodiment described herein as "for example" is not necessarily to be construed as preferred or advantageous over other embodiments. The following description is presented to enable any person skilled in the art to make and use the invention. In the following description, details are set forth for purposes of explanation. It will be apparent to one of ordinary skill in the art that the present invention may be practiced without these specific details. In other instances, well-known structures and processes have not been described in detail so as not to obscure the description of the invention with unnecessary detail. Thus, the present invention is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.

In order to facilitate understanding of the complete implementation scheme of the voice call translation assisting method provided by the embodiment of the present application, an implementation scenario of the voice call translation assisting method is first described in detail below.

The voice call translation assisting method provided by the embodiment of the invention is mainly used in the service scenario of a call center, in which a customer communicates with customer service personnel of the call center through the communication function between the client on a terminal device and the corresponding server of the call center, in order to obtain the required services. In most cases, however, the customer service personnel can only provide services in specific languages. When they receive a call from a target customer in a non-specific language, they must first initiate a translation request, after which a third-party translator is added to the conversation through the three-party call function of the call center. This process is cumbersome, and the calling customer often has to wait for some time, which reduces the service efficiency of the call center.

In order to solve the above problems, embodiments of the present application provide a voice call translation assisting method, a system, a computer device, and a storage medium. The voice call translation assisting method is installed in a voice call translation assisting system in the form of a program, the voice call translation assisting system is arranged in a computer device in the form of a processor, and the voice call translation assisting system in the computer device implements the voice call translation assisting method provided by the embodiment of the present application by executing the software program corresponding to the method. The voice call translation assisting system may be provided by a call center, in which case it has the function of exchanging data with the server of the call center, or it may be provided by a communication carrier, in which case it has the function of communicating with both the client and the server. Specifically, as shown in fig. 1, fig. 1 is a schematic application scenario of the voice call translation assisting method provided in the embodiment of the present application. The scenario includes a target client 100 located on a user terminal device, a voice call translation assisting system 200 provided by a call center, and a target server 300 located at the call center. Call audio input by a user through the target client 100 is transmitted to the voice call translation assisting system 200. After identifying that the language type of the call audio is a target language that needs to be processed, the voice call translation assisting system 200 processes the call audio, converts it into a translation text of a preset language, and transmits the translation text to the target server 300, where it is shown to a customer service person on the display screen of the target server 300. The voice call translation assisting system 200 then receives the reply information that the customer service person provides based on the translation text, synthesizes reply audio of the corresponding target language, and sends the reply audio to the target client 100.

It should be noted that the application scenario schematic diagram of the voice call translation assisting system provided above is only one possible implementation scheme, and is not meant to limit the scheme of the embodiment of the present application, for example, fig. 1 is only an example of a single target client 100, and in fact, the voice call translation assisting system provided in the embodiment of the present application may be aimed at a plurality of different target clients 100 at the same time, which is not repeated herein.

In addition, the target client 100 on the user terminal device may exist in different forms, for example, the target client may be a web page, APP software, applet, etc., and the voice call translation assisting system 200 may be a server, or may be a cloud server, or may be a server cluster formed by a plurality of servers, which is not described herein again.

In the above application scenario, the embodiment of the present application provides a step flow diagram of a voice call translation assisting method, which specifically includes steps 201 to 204:

201, acquiring call audio of the target client.

In this embodiment of the present application, after a target customer initiates a communication request to a target server of a call center through a target client, and the target server accepts the request and establishes a communication relationship with the target client, the call information input by the target customer through the target client is sent to the voice call translation assisting system in the form of an audio stream; this audio stream is the call audio.

202, if the language type corresponding to the call audio is the target language, cutting off a voice path between the target client and the target server.

In this embodiment of the present application, once the voice call translation assisting system obtains the call audio, the language type corresponding to the call audio may be detected through a trained language recognition model. The language recognition model may be trained with a supervised neural network algorithm; for example, through deep learning, the model learns the association between different language types and the acoustic spectrum features or phoneme features in the audio. Of course, besides automatic detection by the voice call translation assisting system, the target customer may also select a language type from the language type options displayed by the target client when initiating the communication request. In that case, the selected language type is sent to the voice call translation assisting system and serves as the language type of the call audio acquired later.

On the basis of determining the type of the language corresponding to the call audio, if the type of the language corresponding to the call audio is a preset target language, that is, a non-specific language other than the specific language service provided by the call center, the voice call translation auxiliary system cuts off a voice path between the target client and the target server so as to prevent the call audio provided by the target client from being directly transmitted to the target server and also prevent the audio of the non-target language replied by the target server from being directly transmitted to the target client, thereby causing unnecessary misunderstanding. Of course, as a possible embodiment of the present application, cutting off the voice path between the target client and the target server refers to cutting off the unidirectional voice path sent from the target server to the target client, that is, the call audio of the target client will be transmitted to the target server and played, so as to reduce the risk of errors in identifying and translating the call audio, but the audio of the non-target language replied by the target server will not be directly transmitted to the target client.
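
The language decision and the two cut-off variants described above can be illustrated with a minimal sketch. The names `SUPPORTED_LANGUAGES`, `VoicePath`, `resolve_language` and `cut_voice_path`, and the language codes, are assumptions made for illustration only; the source does not specify any data structure or interface for the voice path.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical set of "specific" languages the call center staff serve directly.
SUPPORTED_LANGUAGES = {"zh"}


@dataclass
class VoicePath:
    """Direction flags for the audio link between client and server."""
    client_to_server: bool = True
    server_to_client: bool = True


def resolve_language(detected: str, selected: Optional[str] = None) -> str:
    """Prefer the language the customer selected; otherwise use the detected one."""
    return selected if selected else detected


def cut_voice_path(path: VoicePath, language: str, one_way: bool = False) -> VoicePath:
    """Cut the path when the call is in a target (non-specific) language.

    With one_way=True only the server-to-client direction is cut, so the
    customer service person still hears the caller's original audio.
    """
    if language in SUPPORTED_LANGUAGES:
        return path  # no translation needed, leave the call untouched
    return VoicePath(client_to_server=one_way, server_to_client=False)
```

In this sketch, `cut_voice_path(VoicePath(), "en", one_way=True)` leaves the client-to-server direction open, which corresponds to the unidirectional embodiment.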

203, processing the call audio, outputting a translation text of a preset language corresponding to the call audio, and displaying the translation text on a display screen of the target server.

In the embodiment of the application, so that the customer service personnel can understand call audio in the target language, the voice call translation assisting system processes the call audio to output the translation text of the preset language, and provides the translation text to the target server. Specifically, the processing of call audio mainly comprises a speech recognition process and a text translation process: the speech recognition process recognizes call audio of the target language as call text of the target language, and the text translation process translates that call text into a translation text of the preset language. Both processes can be implemented based on trained neural network models. For example, the speech recognition process can use an existing speech recognition model, which determines the information content of a piece of audio from the spectral feature information in the audio. As a possible embodiment of the present application, the speech recognition model, such as an HMM (Hidden Markov Model) or a DNN (Deep Neural Network) model, may be trained in advance on training sample data. The training process of the speech recognition model generally comprises the following steps: constructing a network model with initialized parameters; inputting sample audio carrying a label text into the initial network model to obtain a predicted text; updating the parameters of the network model based on the difference between the predicted text and the label text, so that the predicted text produced by the updated network model is closer to the label text; and repeating this iteration until the difference between the predicted text and the label text is smaller than a preset difference threshold. The network model at that point is the trained speech recognition model, which can accurately convert a section of audio into the corresponding text content. Similarly, the text translation process may be implemented with a model that has text encoding and decoding functions, such as a Transformer model or an RNN (Recurrent Neural Network) model, which is not described in detail herein.
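
The stopping rule in the training procedure described above (iterate until the difference between prediction and label falls below a preset threshold) can be shown on a toy problem. The sketch below is not a speech recognition model: as a stand-in it fits a single parameter `w` in `y = w * x` by gradient descent. The function name, learning rate and threshold are illustrative assumptions.

```python
def train_until_threshold(samples, lr=0.05, threshold=1e-6, max_iters=10_000):
    """Fit y = w * x, stopping once the mean squared difference between
    predictions and labels drops below `threshold`."""
    w = 0.0  # parameter initialization
    for step in range(max_iters):
        # Difference between the predicted and the labelled outputs.
        loss = sum((w * x - y) ** 2 for x, y in samples) / len(samples)
        if loss < threshold:
            return w, step
        # Update the parameter in the direction that reduces the difference.
        grad = sum(2 * (w * x - y) * x for x, y in samples) / len(samples)
        w -= lr * grad
    return w, max_iters


# Labelled samples generated by y = 2 * x.
weight, steps = train_until_threshold([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
```

A real recognizer would replace the squared error with a sequence-level loss over text, but the loop structure (predict, compare, update, check threshold) is the same.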

Of course, it should be noted that because the spectral features of audio differ between languages, it is often difficult for a single speech recognition model to recognize audio in different languages; similarly, it is difficult for a single text translation model to translate text in different languages. Therefore, to improve the applicability of the voice call translation assisting method provided in the embodiment of the present application, as a possible implementation, audio processing models associated with different languages are deployed in the database of the voice call translation assisting system, so that an appropriate audio processing model can be selected when processing the call audio, improving the translation quality. For a specific implementation, refer to fig. 3 and its explanation.

In addition, as another possible solution of the present application, in order to further improve the service efficiency of the customer service personnel, besides displaying the translation text, the display screen of the target service end also synchronously displays the associated text related to the translation text for the customer service personnel to reply, and the specific implementation scheme can refer to the following fig. 4 and the explanation content thereof.

204, obtaining the reply information of the preset language corresponding to the translation text, processing the reply information, generating the reply audio corresponding to the target language, and transmitting the reply audio to the target client.

In this embodiment of the present invention, since the voice path between the target client and the target server has already been cut off, the reply information of the preset language that the customer service personnel feed back on the basis of the translation text displayed on the display screen of the target server is not transmitted directly to the target client. Instead, it is transmitted to the voice call translation assisting system for processing: the system translates the reply information into a reply text of the target language, synthesizes the reply text into reply audio of the target language, and transmits the reply audio to the target client.
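
The two-stage handling of reply information in step 204 (translate, then synthesize) can be sketched as follows. The translation and synthesis engines are passed in as callables; `fake_translate` and `fake_synthesize` are placeholder stand-ins invented for this example and perform no real translation or speech synthesis.

```python
from typing import Callable, Tuple


def build_reply_audio(
    reply_info: str,
    target_language: str,
    translate: Callable[[str, str], str],
    synthesize: Callable[[str, str], bytes],
) -> Tuple[str, bytes]:
    """Translate the reply into the target language, then synthesize audio."""
    reply_text = translate(reply_info, target_language)
    reply_audio = synthesize(reply_text, target_language)
    return reply_text, reply_audio


# Stand-in engines for demonstration only.
def fake_translate(text: str, language: str) -> str:
    return f"[{language}] {text}"


def fake_synthesize(text: str, language: str) -> bytes:
    return text.encode("utf-8")


text, audio = build_reply_audio("Hello", "en", fake_translate, fake_synthesize)
```

Returning the intermediate reply text alongside the audio keeps it available in case the text itself is also to be sent to the target client.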

Specifically, the reply information of the preset language corresponding to the translated text can be input to the target server in various modes, for example, after the associated text is displayed, the customer service personnel can select the text information corresponding to the selection instruction on the display screen as the reply information in a selection mode, but as another simple and feasible implementation scheme, the voice information replied by the customer service personnel can be obtained by starting and stopping the recording device through different operation instructions of buttons on the display screen. A specific implementation may refer to fig. 5 and the explanation thereof.
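
The start and stop of the recording device through button operations can be sketched as a small state holder. The class `ReplyRecorder` and its methods are hypothetical, and treating the first and second operation instructions as successive presses of the same button is an assumption of this sketch.

```python
from typing import List, Optional


class ReplyRecorder:
    """Collects audio chunks between a start press and a stop press."""

    def __init__(self) -> None:
        self.recording = False
        self._chunks: List[bytes] = []

    def on_button(self) -> Optional[bytes]:
        """First press starts recording; the next press stops it and
        returns the audio collected in between."""
        if not self.recording:
            self.recording = True
            self._chunks = []
            return None
        self.recording = False
        return b"".join(self._chunks)

    def feed(self, chunk: bytes) -> None:
        """Store a chunk from the recording device; ignored while stopped."""
        if self.recording:
            self._chunks.append(chunk)
```

The bytes returned by the second press would then be passed on for recognition into reply text of the preset language.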

Further, as another alternative embodiment of the present application, it is considered that in the processing of the reply information by the voice call translation assisting system, a reply text of the target language is generated first, then a reply audio of the target language is synthesized and provided to the target client, and in this process, to further improve the service effect of the call center, the voice call translation assisting system may also transmit the reply text of the target language to the target client.

In addition, in the embodiment of the present application, the added speech recognition, text translation and speech synthesis processes tend to keep the target customer waiting. To offset the adverse effect of this waiting, after the call audio of the target client is obtained, the voice call translation assisting system records a call interval corresponding to the call audio, that is, the time elapsed since the target customer finished the call audio. If the call interval exceeds a preset duration threshold and no reply audio of the target language has been generated yet, the voice call translation assisting system plays preset reminding information to the target client. For a specific implementation, refer to the following description of fig. 7 and its explanation.
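
The reminder condition described above reduces to a simple check. In the sketch below, `REMINDERS`, `pick_reminder`, the language code and the prompt wording are illustrative assumptions; a text string stands in for the preset reminding voice information.

```python
from typing import Optional

# Hypothetical preset reminders, keyed by target language code.
REMINDERS = {
    "en": "Please hold, your request is being processed.",
}


def pick_reminder(
    language: str,
    call_interval: float,
    threshold: float,
    reply_audio_ready: bool,
) -> Optional[str]:
    """Return the reminder to play, or None when no reminder is due.

    A reminder is due only if the call interval exceeds the threshold
    and the reply audio has not been generated yet.
    """
    if reply_audio_ready or call_interval <= threshold:
        return None
    return REMINDERS.get(language)
```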

As shown in fig. 3, fig. 3 is a flowchart illustrating steps of processing call audio by acquiring a corresponding audio processing model based on a language type according to an embodiment of the present application, and specifically includes steps 301 to 302:

301, inquiring a preset database according to the target language, and acquiring an audio processing model associated with the target language.

In this embodiment of the present application, audio processing models associated with different languages are pre-stored in a database of the voice call translation auxiliary system, including an audio recognition model and a text translation model, where the audio processing models associated with different languages are respectively obtained by training texts of the languages associated with the audio processing models, for example, the audio processing models associated with the language a are obtained by training texts of the language a, and the audio processing models associated with the language B are obtained by training texts of the language B.

Further, in addition to the foregoing provided speech recognition model and text translation model, as can be seen from the description related to the foregoing step 204, the speech call translation assisting system can also synthesize the reply text of the target language into the reply audio of the target language through the speech synthesis model, so that the audio processing model can also include the speech synthesis model, and the training of the model is also realized based on the training text of the corresponding language.

302, inputting the call audio to the audio processing model for processing, and outputting a translation text of a preset language corresponding to the call audio.

In the embodiment of the application, the call audio is input into the audio processing model for processing, and the audio processing model is realized based on the training text of the target language, so that the recognition and translation effect of the call audio can be effectively completed, and the accuracy of the output translation text is ensured.
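
Steps 301 and 302 can be sketched as a lookup followed by recognition and translation. The in-memory dictionary `MODEL_DB` stands in for the preset database, and the lambda "models" are trivial placeholders rather than trained models; all names here are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class AudioProcessingModel:
    recognize: Callable[[bytes], str]  # audio -> text in the target language
    translate: Callable[[str], str]    # target-language text -> preset-language text


# Placeholder database: one audio processing model per target language.
MODEL_DB: Dict[str, AudioProcessingModel] = {
    "en": AudioProcessingModel(
        recognize=lambda audio: audio.decode("utf-8"),
        translate=lambda text: {"hello": "你好"}.get(text, text),
    ),
}


def process_call_audio(audio: bytes, language: str,
                       db: Dict[str, AudioProcessingModel]) -> str:
    """Step 301: look up the model for the language.
    Step 302: recognize the audio, then translate the recognized text."""
    model = db.get(language)
    if model is None:
        raise LookupError(f"no audio processing model for language {language!r}")
    return model.translate(model.recognize(audio))


translation_text = process_call_audio(b"hello", "en", MODEL_DB)
```

Raising an error for an unregistered language is one possible design choice; the source does not say how that case is handled.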

As shown in fig. 4, fig. 4 is a realization scheme for synchronously displaying associated text on a display screen to assist customer service personnel in replying, which specifically includes steps 401 to 403:

401, processing the translation text to obtain semantic information corresponding to the translation text.

In the embodiment of the application, in order to improve the service effect of customer service personnel, the voice call translation assisting system further performs semantic recognition processing on the obtained translation text, that is, determines semantic information of the translation text by extracting keywords in the text. This process may generally be implemented with a trained semantic recognition model; semantic recognition models based on NLP (Natural Language Processing) technology are common and are not described in detail herein.

In the embodiment of the present application, the semantic information obtained by processing the translated text generally describes the type of question, such as a life question or a work question, on which the target customer is seeking help or giving feedback.

402, inquiring a preset database according to the semantic information, acquiring an associated text associated with the semantic information, and displaying the associated text on the display screen.

In this embodiment of the present application, after the semantic information corresponding to the translated text is identified, the voice call translation assisting system may query a corresponding database based on the semantic information, so as to obtain the associated text associated with the semantic information directly from the database and provide it to the customer service staff through the display screen. The associated text may be, for example, policies or regulatory documents associated with the semantic information of the translated text.

Of course, when the translation text and the associated text are displayed at the same time, they may be made easy to tell apart on the display screen. For example, they may be displayed in different positions, with the translation text in an upper area of the display screen and the associated text in a lower area, or they may be distinguished by different colors, which is not repeated herein.

403, responding to a selection instruction of the associated text on the display screen, and taking text information corresponding to the selection instruction in the associated text as reply information corresponding to the translation text.

In this embodiment of the present invention, after the associated text of the translated text is displayed on the display screen, if some text information in the associated text can be used to reply to the call audio, customer service personnel in the call center may directly input a selection instruction on the display screen, for example by box selection or by clicking. The voice call translation assisting system then automatically uses the selected text information as the reply information corresponding to the translated text for subsequent processing, thereby generating the corresponding reply audio.
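Steps 401 to 403 can be illustrated with a small keyword-based sketch. It is only an assumption-laden stand-in: the embodiment refers to a trained semantic recognition model, whereas the code below uses plain keyword matching, and `TOPIC_KEYWORDS`, `ASSOCIATED_TEXTS` and the three function names are hypothetical, with placeholder entries rather than real policy texts.

```python
import re
from typing import Dict, List

# Hypothetical keyword table used in place of a trained semantic model.
TOPIC_KEYWORDS: Dict[str, List[str]] = {
    "work": ["employer", "salary", "contract"],
    "life": ["housing", "noise", "water"],
}

# Hypothetical preset database: semantic topic -> associated texts.
ASSOCIATED_TEXTS: Dict[str, List[str]] = {
    "work": ["Sample policy text on wage payment.",
             "Sample regulation text on labor contracts."],
    "life": ["Sample policy text on housing support."],
}


def extract_semantic_info(translation_text: str) -> List[str]:
    """Step 401: derive topics from keywords found in the translation text."""
    words = set(re.findall(r"[a-z]+", translation_text.lower()))
    return sorted(topic for topic, keywords in TOPIC_KEYWORDS.items()
                  if words & set(keywords))


def query_associated_texts(topics: List[str]) -> List[str]:
    """Step 402: collect the associated texts to show on the display screen."""
    results: List[str] = []
    for topic in topics:
        results.extend(ASSOCIATED_TEXTS.get(topic, []))
    return results


def select_reply(associated_texts: List[str], selected_index: int) -> str:
    """Step 403: the text picked by the selection instruction becomes the reply."""
    return associated_texts[selected_index]
```

A call such as `select_reply(query_associated_texts(extract_semantic_info(text)), 0)` chains the three steps; in a real interface the index would come from the box-selection or click event on the display screen.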

Of course, the above is only one simplified implementation for replying to the translated text. Besides directly selecting the associated text, customer service personnel may also reply based on their own business experience. To make this easier, the target server may further include a recording device for recording the customer service personnel's voice as reply information. Specifically, as shown in fig. 5, fig. 5 is a schematic flowchart of the steps for obtaining reply information by voice provided in the embodiment of the present application, which specifically includes steps 501 to 503:

501, responding to a first operation instruction of a preset button on the display screen, and starting a preset recording device.

502, closing the preset recording device in response to a second operation instruction of a preset button on the display screen.

In this embodiment, the display screen may provide a single preset button or multiple preset buttons, for example two. When only one preset button is provided, the first operation instruction may be a first click, which starts the recording device of the target server so that it begins recording the voice information currently being input. The second operation instruction may then be a second click on the same button. To distinguish the two clicks, the preset button may be displayed differently in each state. For example, before the first click a triangular pattern may be displayed on the preset button; after the first operation instruction is received, a circular pattern is displayed and the recording device is started; and after the second operation instruction is received, the triangular pattern is displayed again and the recording device is turned off.

Of course, the foregoing description takes a single preset button only as an example. The preset buttons may also include two or more buttons. For example, with two buttons, the first operation instruction may be a click on the first button and the second operation instruction may be a click on the second button. The preset buttons may also include further buttons corresponding to different functions of the recording device, such as noise reduction, waveform enhancement, and echo cancellation, which are not described herein again.

503, generating reply information corresponding to the translation text according to the audio information acquired by the preset recording device.

In this embodiment of the present invention, while the recording device is running, that is, between the first operation instruction and the second operation instruction of the preset button on the display screen, the voice information recorded by the recording device is the reply information of the preset language corresponding to the translation text.
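The single-button behaviour of steps 501 to 503 amounts to a two-state toggle, sketched below. `RecordingControl`, `click`, `feed` and the `"triangle"`/`"circle"` icon names are hypothetical; a real system would drive an actual recording device rather than collect byte chunks in memory.

```python
from typing import List, Optional


class RecordingControl:
    """Hypothetical single preset button that starts and stops recording."""

    def __init__(self) -> None:
        self.recording = False
        self._chunks: List[bytes] = []
        self.reply_audio: Optional[bytes] = None

    @property
    def icon(self) -> str:
        # Triangle while idle, circle while recording, as in the example above.
        return "circle" if self.recording else "triangle"

    def click(self) -> None:
        if not self.recording:
            # First operation instruction: start the recording device.
            self._chunks = []
            self.reply_audio = None
            self.recording = True
        else:
            # Second operation instruction: stop and keep what was recorded.
            self.recording = False
            self.reply_audio = b"".join(self._chunks)

    def feed(self, chunk: bytes) -> None:
        # Audio arriving outside the two clicks is ignored.
        if self.recording:
            self._chunks.append(chunk)
```

Only the audio fed between the two clicks ends up in `reply_audio`, which matches the statement that the voice information recorded between the first and second operation instructions is the reply information.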

On the basis of the above scheme, to further improve the reply effect of the customer service personnel, the audio information may first be converted into text information in the preset language and displayed on the display screen so that the customer service personnel can conveniently modify it. Specifically, as shown in fig. 6, fig. 6 is a schematic flowchart of the steps for displaying the reply text to assist in modifying the reply information provided by the embodiment of the present application, which specifically includes steps 601 to 602:

601, processing the audio information acquired by the preset recording device, generating a reply text of a preset language corresponding to the audio information, and displaying the reply text on the display screen.

In this embodiment of the present application, the voice call translation assisting system may first perform speech recognition on the audio information collected by the recording device, convert the audio information into a reply text in the preset language, and display the reply text on the display screen. The speech recognition process may be similar to the foregoing steps and is performed by a speech recognition model. It should be noted, however, that this speech recognition model is trained on training text in the preset language, which is not described herein again.

602, generating a modified reply text in response to a modification operation instruction of the reply text on the display screen, and taking the modified reply text as reply information corresponding to the translation text.

In this embodiment, in addition to displaying the initial reply text of the customer service personnel on the display screen, the system allows the customer service personnel to modify part of its content through a modification operation instruction for the reply text on the display screen. The modified reply text is then used as the final reply information for subsequently generating the reply audio that is provided to the target client.
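Steps 601 and 602 can be sketched as a draft that is first produced by speech recognition and then optionally edited. The names `ReplyDraft`, `recognize_reply` and `apply_modification` are hypothetical, the recognizer is passed in as a plain callable, and a single substring replacement is only one assumed form that a modification operation instruction might take.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ReplyDraft:
    """Reply text in the preset language as shown on the display screen."""
    text: str
    modified: bool = False


def recognize_reply(audio: bytes, recognize: Callable[[bytes], str]) -> ReplyDraft:
    """Step 601: convert the recorded audio into a displayable reply text."""
    return ReplyDraft(text=recognize(audio))


def apply_modification(draft: ReplyDraft, old: str, new: str) -> ReplyDraft:
    """Step 602: apply one edit; the result is the final reply information."""
    if old not in draft.text:
        raise ValueError(f"{old!r} does not occur in the reply text")
    return ReplyDraft(text=draft.text.replace(old, new, 1), modified=True)
```

Returning a new draft instead of editing in place keeps the originally recognized text available, for example if the customer service personnel want to undo a change.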

As shown in fig. 7, fig. 7 is a schematic flowchart of a step of providing reminding voice information to a target client according to an embodiment of the present application, and specifically includes steps 701 to 702:

701, obtaining a call interval corresponding to the call audio and preset reminding voice information corresponding to the target language.

In this embodiment of the present application, when the voice call translation auxiliary system obtains the call audio of the target client, it records the call interval of the call audio with a timing device and then obtains the preset reminding voice information stored for the target language, for example, "Your inquiry is being processed, please wait a moment". Other reminding information is of course also feasible, which is not described here again.

702, if the call interval is greater than a preset duration threshold, transmitting the preset reminding voice information to the target client before the step of generating the reply audio corresponding to the target language.

In this embodiment of the present invention, if the call interval exceeds a preset duration threshold, for example 2 minutes, and the voice call translation auxiliary system has not yet generated the reply audio of the target language, the system may transmit the preset reminding voice information to the target client to ask the client to wait. Once the reply audio of the corresponding target language is generated, the system stops playing the reminding voice information and transmits the reply audio to the target client.
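The decision made in steps 701 and 702 can be written as a small pure function. This is a sketch under stated assumptions: `Action` and `next_action` are hypothetical names, the default threshold of 120 seconds simply reuses the 2-minute example above, and the caller is assumed to supply the elapsed call interval from its own timing device.

```python
from enum import Enum


class Action(Enum):
    WAIT = "wait"
    PLAY_REMINDER = "play_reminder"
    SEND_REPLY = "send_reply"


def next_action(elapsed_seconds: float,
                reply_ready: bool,
                threshold_seconds: float = 120.0) -> Action:
    """Decide what to transmit to the target client at this moment."""
    if reply_ready:
        # Reply audio exists: stop any reminder and transmit the reply.
        return Action.SEND_REPLY
    if elapsed_seconds > threshold_seconds:
        # Interval exceeded and no reply yet: play the preset reminder.
        return Action.PLAY_REMINDER
    return Action.WAIT
```

The comparison is strict, so an interval exactly equal to the threshold still yields `Action.WAIT`, consistent with the condition that the call interval be greater than the preset duration threshold.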

To make the voice call translation assisting method provided in the embodiment of the present application easier to understand, the embodiment of the present application further provides, on the basis of that method, a voice call translation assisting system. Specifically, as shown in fig. 8, fig. 8 is a schematic structural diagram of the voice call translation assisting system provided in the embodiment of the present application, which specifically includes subsystems 810 to 830:

a voice call subsystem 810 for acquiring call audio of the target client; if the language type corresponding to the call audio is the target language, cutting off a voice channel between the target client and the target server;

the voice translation subsystem 820 is used for processing the call audio, outputting a translation text of a preset language corresponding to the call audio, and displaying the translation text on a display screen of the target server;

the speech synthesis subsystem 830 is configured to obtain reply information of a preset language corresponding to the translated text, process the reply information, generate reply audio corresponding to the target language, and transmit the reply audio to the target client.
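How the three subsystems might cooperate on one call can be sketched as below. Everything here is illustrative: `AssistSystem` and its fields are hypothetical, each subsystem is reduced to a callable supplied by the caller, a list stands in for the display screen, and a single boolean stands in for the state of the voice channel between the target client and the target server.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class AssistSystem:
    target_language: str
    detect_language: Callable[[bytes], str]  # used by voice call subsystem 810
    translate: Callable[[bytes], str]        # voice translation subsystem 820
    collect_reply: Callable[[str], str]      # reply entered by customer service
    synthesize: Callable[[str], bytes]       # speech synthesis subsystem 830
    channel_open: bool = True
    screen: List[str] = field(default_factory=list)

    def handle(self, call_audio: bytes) -> Optional[bytes]:
        # 810: only calls in the target language take the translation path.
        if self.detect_language(call_audio) != self.target_language:
            return None
        self.channel_open = False  # cut off the voice channel
        # 820: translate into the preset language and display the text.
        translation = self.translate(call_audio)
        self.screen.append(translation)
        # 830: turn the preset-language reply into target-language audio.
        reply = self.collect_reply(translation)
        return self.synthesize(reply)
```

Returning `None` for calls in other languages leaves the channel untouched, so ordinary calls are handled directly by customer service personnel as before.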

In this embodiment of the present application, the speech translation subsystem is further configured to query a preset database according to the target language, and obtain an audio processing model associated with the target language; and inputting the call audio into the audio processing model for processing, and outputting a translation text of a preset language corresponding to the call audio.

In this embodiment of the present application, after outputting a translation text of a preset language corresponding to the call audio and displaying the translation text on a display screen of the target server, the speech translation subsystem is further configured to process the translation text to obtain semantic information corresponding to the translation text; inquiring a preset database according to the semantic information, acquiring an associated text associated with the semantic information, and displaying the associated text on the display screen; responding to a selection instruction of the associated text on the display screen, and taking text information corresponding to the selection instruction in the associated text as reply information corresponding to the translation text.

In this embodiment of the present application, the speech synthesis subsystem is further configured to start a preset recording device in response to a first operation instruction for a preset button on the display screen; responding to a second operation instruction of a preset button on the display screen, and closing a preset recording device; and generating reply information corresponding to the translation text according to the audio information acquired by the preset recording device.

In this embodiment of the present application, the speech synthesis subsystem is further configured to process the audio information collected by the preset recording device, generate a reply text of a preset language corresponding to the audio information, and display the reply text on the display screen; and responding to a modification operation instruction of the reply text on the display screen, generating a modified reply text, and taking the modified reply text as reply information corresponding to the translation text.

In this embodiment of the present application, after obtaining the reply information of the preset language corresponding to the translated text, the speech synthesis subsystem is further configured to process the reply information, generate a reply text of the target language, and transmit the reply text of the target language to the target client.

In this embodiment of the present application, after obtaining the call audio of the target client, the voice call subsystem is further configured to obtain a call interval corresponding to the call audio and preset alert voice information corresponding to the target language; at this time, if the call interval is greater than a preset duration threshold, the speech synthesis subsystem is configured to transmit the preset alert speech information to the target client before the step of generating the reply audio corresponding to the target language.

For specific limitations of the voice call translation assisting system, refer to the above description of the voice call translation assisting method, which is not repeated here. The modules of the voice call translation assisting system may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to each module.

In some embodiments of the present application, the voice call translation assistance system 200 may be implemented in the form of a computer program that is executable on a computer device as shown in fig. 9. The memory of the computer device may store therein various program modules constituting the voice call translation assisting system 200, and a computer program constituted by the various program modules causes a processor to execute the steps in the voice call translation assisting method of the various embodiments of the present application described in the present specification.

For example, the computer device shown in fig. 9 includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external computer device through a network connection. The computer program, when executed by the processor, implements a voice call translation assisting method.

It will be appreciated by those skilled in the art that the structure shown in fig. 9 is merely a block diagram of a portion of the structure associated with the present application and is not limiting of the computer device to which the present application applies, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.

In some embodiments of the present application, a computer device is provided that includes one or more processors; a memory; and one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the processor to perform the steps of:

acquiring call audio of a target client;

In some embodiments of the present application, the processor when executing the computer program further performs the steps of: inquiring a preset database according to the target language, and acquiring an audio processing model associated with the target language; and inputting the call audio into the audio processing model for processing, and outputting a translation text of a preset language corresponding to the call audio.

In some embodiments of the present application, the processor when executing the computer program further performs the steps of: processing the translation text to obtain semantic information corresponding to the translation text; inquiring a preset database according to the semantic information, acquiring an associated text associated with the semantic information, and displaying the associated text on the display screen; responding to a selection instruction of the associated text on the display screen, and taking text information corresponding to the selection instruction in the associated text as reply information corresponding to the translation text.

In some embodiments of the present application, the processor when executing the computer program further performs the steps of: responding to a first operation instruction of a preset button on the display screen, and starting a preset recording device; responding to a second operation instruction of a preset button on the display screen, and closing a preset recording device; and generating reply information corresponding to the translation text according to the audio information acquired by the preset recording device.

In some embodiments of the present application, the processor when executing the computer program further performs the steps of: processing the audio information acquired by the preset recording device, generating a reply text of a preset language corresponding to the audio information, and displaying the reply text on the display screen; and responding to a modification operation instruction of the reply text on the display screen, generating a modified reply text, and taking the modified reply text as reply information corresponding to the translation text.

In some embodiments of the present application, the processor when executing the computer program further performs the steps of: and processing the reply information to generate a reply text of the target language, and transmitting the reply text of the target language to the target client.

In some embodiments of the present application, the processor when executing the computer program further performs the steps of: acquiring a conversation interval corresponding to the conversation audio and preset reminding voice information corresponding to the target language; if the call interval is greater than a preset duration threshold, transmitting the preset reminding voice information to the target client before the step of generating the reply audio corresponding to the target language.

In some embodiments of the present application, a computer readable storage medium is provided, storing a computer program, the computer program being loaded by a processor, causing the processor to perform the steps of:

acquiring call audio of a target client;

Those skilled in the art will appreciate that all or part of the methods of the above embodiments may be implemented by a computer program stored on a non-transitory computer readable storage medium, and that the program, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein can include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (Read-Only Memory, ROM), magnetic tape, floppy disk, flash memory, optical memory, or the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory. By way of illustration, and not limitation, RAM can take many forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), among others.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered within the scope of this description.

The voice call translation assisting method, system, computer device and storage medium provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to illustrate the principles and embodiments of the present invention, and the description of the above embodiments is intended only to help in understanding the method and core ideas of the present invention. Meanwhile, those skilled in the art may, in light of the ideas of the present invention, make changes to the specific embodiments and the scope of application. In summary, the content of this description should not be construed as limiting the present invention.

Claims

1. A voice call translation assisting method, comprising:

acquiring call audio of a target client;

2. The voice call translation assisting method according to claim 1, wherein the processing the call audio and outputting the translated text of the preset language corresponding to the call audio comprise:

3. The voice call translation assisting method according to claim 1, wherein after outputting the translation text of the preset language corresponding to the call audio and displaying the translation text on the display screen of the target server, the method further comprises:

4. The voice call translation assisting method according to claim 1, wherein the obtaining reply information corresponding to the translated text includes:

5. The voice call translation assisting method according to claim 4, wherein the generating reply information corresponding to the translation text according to the audio information collected by the preset recording device comprises:

6. The voice call translation assisting method according to claim 1, wherein after the reply information of the preset language corresponding to the translated text is obtained, the method further comprises:

7. The voice call translation assisting method according to claim 1, wherein after the call audio of the target client is acquired, the method further comprises:

8. The voice call translation assisting method according to any one of claims 1 to 7, wherein the cutting off the voice path between the target client and the target server means cutting off a unidirectional voice path transmitted from the target server to the target client.

9. A voice call translation assisting system, comprising:

10. A computer device, the computer device comprising:

a processor; and

one or more applications, wherein the one or more applications are stored in a memory of the computer device, and the processor, when executing the one or more applications, is configured to implement the voice call translation assisting method according to any one of claims 1 to 8.

11. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, is adapted to carry out the voice call translation assisting method according to any one of claims 1 to 8.