WO2020198799A1 - Instant messaging/chat system with translation capability


Info

Publication number: WO2020198799A1 (PCT/AU2020/050328)
Authority: WO (WIPO (PCT))
Prior art keywords: content, instant messaging, chat, translation, language
Application number: PCT/AU2020/050328
Other languages: French (fr)
Inventors: Danny Stephen MAY, Muhammad Zubair
Original assignees: Lingmo International Pty Ltd, Hangzhou Lingwosheng Intelligent Tech Co Ltd
Application filed by Lingmo International Pty Ltd and Hangzhou Lingwosheng Intelligent Tech Co Ltd
Publication of WO2020198799A1

Classifications

    • G06F16/3337 Translation of the query language, e.g. Chinese to English
    • G06F16/3343 Query execution using phonetics
    • G06F16/683 Retrieval characterised by using metadata automatically derived from the content
    • G06F16/685 Retrieval using automatically derived transcript of audio data, e.g. lyrics
    • G06F40/263 Language identification
    • G06F40/35 Discourse or dialogue representation
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • H04L51/04 Real-time or near real-time messaging, e.g. instant messaging [IM]
    • H04L51/046 Interoperability with other network applications or services
    • H04L51/063 Content adaptation, e.g. replacement of unsuitable content
    • H04L51/066 Format adaptation, e.g. format conversion or compression
    • H04L67/14 Session management
    • H04L67/55 Push-based network services
    • H04L67/56 Provisioning of proxy services
    • G10L15/005 Language recognition
    • G10L15/07 Adaptation to the speaker
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L21/0208 Noise filtering
    • G10L25/18 The extracted parameters being spectral information of each sub-band
    • G10L25/84 Detection of presence or absence of voice signals for discriminating voice from noise
    • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]


Abstract

In an instant messaging/chat system, a method and system of translating a message are provided utilising an intermediary translation system that is configured to: receive content; determine a language of the content of the message; if the determined language is different from a required language, translate the content to produce translated content; and post the translated content on the instant messaging/chat system within an established session. In some forms, the method and system include audio pre-processing of an audio stream associated with the instant messaging/chat system.

Description

INSTANT MESSAGING/CHAT SYSTEM WITH TRANSLATION CAPABILITY
Technical Field
Embodiments relate to real-time communication systems and in particular to instant messaging and chat systems, and to systems and methods for processing audio and/or text content. The disclosure has particular application to improvements to instant messaging/chat systems to cater for multiple languages.
Background
Instant messaging/chat systems that provide real-time communication between users with text and/or voice messages have become prevalent, particularly on mobile devices. In the context of the specification, the term "instant messaging/chat systems" includes all types of IP telephony services including VOIP services, video conferencing, instant messaging, live chat systems for websites, dedicated interactive kiosks (at point of sale or point of service) and the like. Whilst the popularity of such services has increased dramatically in recent times, such services suffer from the disadvantage that a message can only be understood to the extent that the user understands its language. This is naturally problematic when the users do not share a common language.
There is a need for improvements in instant messaging and chat services that cater for multiple languages through a translation service without unduly affecting the user experience and/or which can be integrated readily into existing systems. There is also a need for improvements in translation systems to improve accuracy, particularly in relation to translation of audio content in an instant messaging/chat session.
Summary of the Disclosure
An embodiment relates to an intermediary translation system implemented using computer processing and memory resources and configured to integrate with one or more instant messaging/chat systems and one or more remote translation systems via a communication network, the system comprising:
an instant messaging/chat system interface configured to communicate with the at least one instant messaging/chat system to obtain and send content data within an established session of the instant messaging/chat system;
a translation system interface configured to communicate with the at least one translation system to obtain and send content data; and
a content processor which, within the established session of the instant messaging/chat system, is configured to:
determine a language of the content received from the instant messaging/chat system;
determine a required language of a recipient of the content within the instant messaging/chat system and, in response to the required language being different from the determined language:
forward the content via the translation system interface to a remote translation system for translating the content to produce translated content in the required language;
receive the translated content from the remote translation system; and
forward the translated content to the recipient within the established session via the instant messaging/chat system interface.
An advantage of the above disclosed system is that content translation can occur within an established instant messaging/chat session in a seamless real-time fashion without requiring additional user input in the established session. A further advantage of the system is that the intermediary translation system is separated from both the instant messaging/chat platform and the translation service. This provides flexibility in design and application of the system. For example, the system can be deployed (through the appropriate communication interfaces) into existing messaging/chat platforms and can similarly access different translation services depending on system requirements.
In some forms, the content processor determines the language of the content by parsing the content.
In some forms, the content processor determines the language of the content by referencing information of the sender of the content within the instant messaging/chat system.
In some forms, the content processor determines the language of the content by reference to user input. In some forms, the intermediary translation system further comprises a user profile store that retains user data including the required language of the user. In some forms, the user data is derived from the instant messaging/chat system.
In some forms, the instant messaging/chat system interface is a machine to machine (M2M) interface. In some forms, the M2M interface utilises one or more application programming interfaces (APIs). In some forms, the communication is performed utilising the HTTP protocol with a push notification service.
In some forms, the translation service interface is an M2M interface. In some forms, the M2M interface utilises one or more APIs. In some forms, the communication is performed utilising the HTTP protocol with a push notification service.
In some forms, the content processor further comprises a pre-processing module that is configured to process the content prior to forwarding the content to the translation system.
The content may be processed in different ways. In one form, the pre-processing module may be configured to insert punctuation into text content to aid in establishing context to assist in translation.
In other forms, the pre-processing module may modify audio content to improve translation accuracy.
In some forms, the pre-processing module is operative to create audio data packets from an audio stream that allow for improved translation. The audio packets may equate to sentences or other parts of speech. The applicant has found that dividing audio streams into smaller sub-groups can allow for improved translation accuracy, as it enables translation without contextual bias of the translation service.
Also disclosed is a method of translating audio content from a source language into a target language, the method being executed by a computer system using computer processing and memory resources, the method comprising: separating an audio stream into audio frames of predetermined duration; detecting voice activity within individual ones of the frames; using the detected voice activity to group frames into audio data packets; and using the audio data packets to create input data packets for a translation service to allow for translation of the audio content into the target language.
In some forms, the pre-processing module may also analyse the audio packets and vary the audio data dependent on a characterisation of that data to improve subsequent translation accuracy. In one form, data in one or more data packets that is characterised as being from a sender in an instant messaging/chat session is promoted, whereas data from other audio (e.g. background, near noise or other speakers) is suppressed.
In some forms, the content processor is operative to extract features from audio content. In one form, this step extracts features that have components representative of speech which can be reliably modelled. In some forms, the feature extraction is undertaken on audio data packets that have been established from the audio data stream.
In some forms, the extracted features are compared to a stored model of target language dialects so as to identify accents or dialects of the speech which can then be associated with the audio content so that the translation accuracy can be improved.
Also disclosed is a method of translating audio content from a source language into a target language, the method being executed by a computer system using computer processing and memory resources, the method comprising: extracting features from the audio content, the features extracted being compatible with dialect models of the target language;
comparing the extracted features with the language dialect models to identify any dialect indicated by the extracted features; associating any identified dialect with the audio content; and forwarding the audio content with information on any identified dialect to a translation service for translating the audio content.
In some forms, the content processor also includes a post-processing module that is operative to combine translated content and to incorporate punctuation into the combined text. In one form, the post-processing module uses a punctuation model which takes characteristics from the pre-processing module as inputs to improve accuracy. The applicant has found that the ability to include correct punctuation significantly improves translation accuracy and contributes significantly to allowing contextual translations.
Also disclosed is a method of translating audio content from a source language into a target language, the method being executed by a computer system using computer processing and memory resources, the method comprising: providing the audio content; extracting features from the audio content; generating a translated text of the audio content in the target language; and using a punctuation model to add punctuation to the translated text, wherein the punctuation model uses the extracted features from the audio content in determining placement of punctuation in the translated text.
A further embodiment relates to an intermediary translation system implemented using computer processing and memory resources and configured to integrate with one or more instant messaging/chat systems and one or more remote translation systems via a communication network, the system comprising:
an instant messaging/chat system interface configured to communicate with the at least one instant messaging/chat system to obtain and send content data within an established session of the instant messaging/chat system;
a translation system interface configured to communicate with the at least one translation system to obtain and send content data; and
a content processor which, within the established session of the instant messaging/chat system, is configured to:
determine a language of the content received from the instant messaging/chat system;
determine a required language of a recipient of the content within the instant messaging/chat system and, in response to the required language being different from the determined language:
extract features from the audio content, the features extracted being compatible with dialect models of the target language;
compare the extracted features with the language dialect models to identify any dialect indicated by the extracted features;
associate any identified dialect with the audio content;
forward the audio content with information on any identified dialect to a translation service for translating the audio content;
receive the translated content from the remote translation system; and
forward the translated content to the recipient within the established session via the instant messaging/chat system interface.
A further embodiment relates to an intermediary translation system implemented using computer processing and memory resources and configured to integrate with one or more instant messaging/chat systems and one or more remote translation systems via a communication network, the system comprising:
an instant messaging/chat system interface configured to communicate with the at least one instant messaging/chat system to obtain and send content data within an established session of the instant messaging/chat system;
a translation system interface configured to communicate with the at least one translation system to obtain and send content data; and
a content processor which, within the established session of the instant messaging/chat system, is configured to:
determine a language of the content received from the instant messaging/chat system;
determine a required language of a recipient of the content within the instant messaging/chat system and, in response to the required language being different from the determined language:
separate an audio stream into audio frames of predetermined duration;
detect voice activity within individual ones of the frames;
use the detected voice activity to group frames into audio data packets; use the audio data packets to create input data packets;
forward the input data packets via the translation system interface to a remote translation system for translating the input data packets to produce translated content in the required language;
receive the translated content from the remote translation system; and
forward the translated content to the recipient within the established session via the instant messaging/chat system interface.
In some forms, the content processor is further configured to:
extract features from the audio content;
receive the translated content from the remote translation system in the form of text; and use a punctuation model to add punctuation to the translated text, wherein the punctuation model uses the extracted features from the audio content in determining placement of punctuation in the translated text.
A further embodiment relates to an intermediary translation system implemented using computer processing and memory resources and configured to integrate with one or more instant messaging/chat systems and one or more remote translation systems via a communication network, the system comprising:
an instant messaging/chat system interface configured to communicate with the at least one instant messaging/chat system to obtain and send content data within an established session of the instant messaging/chat system;
a translation system interface configured to communicate with the at least one translation system to obtain and send content data; and
a content processor which, within the established session of the instant messaging/chat system, is configured to:
determine a language of the content received from the instant messaging/chat system;
determine a required language of a recipient of the content within the instant messaging/chat system and, in response to the required language being different from the determined language:
extract features from the audio content; forward the content via the translation system interface to a remote translation system for translating the content to produce translated content in the required language;
receive the translated content from the remote translation system; use a punctuation model to add punctuation to the translated text, wherein the punctuation model uses the extracted features from the audio content in determining placement of punctuation in the translated text; and forward the translated content to the recipient within the established session via the instant messaging/chat system interface.
In some forms, the translation system has a plurality of sub-systems, and the content processor is configured to select a translation sub-system for the received content in response to a characteristic of the content and/or a system setting and to forward the received content to the selected translation sub-system.
In some forms, the translation sub-systems comprise one or more of text-to-text translation systems, speech-to-text translation systems, and text-to-speech translation systems.
In some forms, the received content is in the form of any one or more of text and audio information.
In some forms, the translated content is in the form of any one of text and audio information.
The system can be implemented using computer processing and memory resources in the form of one or more network connected servers and databases, these hardware resources executing software programmed to implement the functions as described above.
Alternatively, the computer processing and memory resources may be network accessible distributed "cloud based" resources, executing software to implement the system functionality as described above. Some embodiments may utilise a combination of dedicated hardware and shared resources. A variety of different system architectures are contemplated within the scope of the present disclosure.
A further embodiment relates to an instant messaging/chat system comprising: an instant messaging/chat client; an instant messaging/chat host configured for communicatively coupling to said instant messaging/chat client, said host having logic for receiving application input from the instant messaging/chat client as a content posting to an established instant messaging/chat session between instant messaging/chat clients; one or more translation systems implemented using computer processing and memory resources and configured to translate content from an instant messaging/chat session; and an intermediary translation system implemented using computer processing and memory resources and configured to integrate with the instant messaging/chat host and the one or more remote translation systems via a communication network, the intermediary translation system comprising:
an instant messaging/chat host interface configured to communicate with the instant messaging/chat host to obtain and send content data within an established session of the instant messaging/chat system;
a translation system interface configured to communicate with the at least one translation system to obtain and send content data; and a content processor which, within the established session of the instant messaging/chat system, is configured to:
determine a language of the content received from the instant messaging/chat session;
determine a required language of a recipient of the content within the instant messaging/chat session and, in response to the required language being different from the determined language:
forward the content via the translation system interface to a remote translation system for translating the content to produce translated content in the required language;
receive the translated content from the remote translation system; and
forward the translated content to the instant messaging/chat host as an output to the recipient of the posted content within the established session.
The instant messaging/chat system may further comprise features of any embodiments of the intermediary translation system (or combinations thereof) as disclosed above.
A further embodiment of the disclosure relates to a method of translating instant messaging/chat content within an established session of an instant messaging/chat system that includes a sender and a recipient, the method being executed by an intermediary translation system using computer processing and memory resources via a communication network, the method comprising the steps of: receiving content from the sender within the established session;
determining a language of the content of the message;
determining a required language of the recipient; and
if the required language is different from the determined language, forwarding the content to a remote translation system for translating the content to produce translated content in the required language;
receiving the translated content from the remote translation system; and
forwarding the translated content to the recipient within the established session.
In some forms, the method of the intermediary translation system includes the step of determining the language of the content by parsing the content. In some forms, the method of the intermediary translation system includes the step of determining the language of the content by referencing information for the sender of the message.
In some forms, the method of the intermediary translation system includes the step of determining the language of the content by reference to user input. In some forms, the user input is stored by the intermediary translation system. In some forms, the user input is derived from the instant messaging/chat system.
In some forms, the intermediary translation system communicates with the instant messaging/chat system via an M2M interface. In some forms, the M2M interface utilises one or more APIs. In some forms, the communication is performed utilising the HTTP protocol with a push notification service.
In some forms, the intermediary translation system communicates with the remote translation system via an M2M interface. In some forms, the M2M interface utilises one or more APIs. In some forms, the communication is performed utilising the HTTP protocol with a push notification service.
In some forms, the method further comprises modifying the content prior to forwarding the content to the remote translation system. The content may be modified in different ways. For example, punctuation may be added to text content to aid in establishing context to assist in translation. In other forms, noise reduction may be applied to audio content to aid in translation.
In some forms, the method may also include any of the steps disclosed in other embodiments of methods (or combinations thereof) as disclosed above.
In some forms, the translation system has a plurality of sub-systems, and the method further comprises the step of selecting a translation sub-system for the received content in response to a characteristic of the content and/or a system setting and forwarding the received content to the selected translation sub-system.
In some forms, the translation sub-systems comprise one or more of text-to-text translation systems, speech-to-text translation systems, and text-to-speech translation systems. In some forms, the received content is in the form of any one or more of text and audio information.
In some forms, the translated content is in the form of any one of text and audio information.
Description of Accompanying Drawings
Embodiments are described with reference to the accompanying drawings in which:
Fig. 1 is a schematic representation of an instant messaging/chat system configured to allow in-session translation of content;
Fig. 2 is a block diagram illustrating client-side process and content flow of the instant messaging/chat system of Fig. 1;
Fig. 3 is a block diagram illustrating components of the instant messaging/chat system of Fig. 1;
Fig. 4 is a flow chart illustrating content processing and routing between components of the system of Fig. 1;
Fig. 5 is an example of a first stage of pre-processing of audio content in the content processor of Fig. 4;
Fig. 6 is an example of a second stage of pre-processing of audio content in the content processor of Fig. 4;
Fig. 7 is an example of a third stage of pre-processing of audio content in the content processor of Fig. 4; and
Fig. 8 is a block diagram of a post-processing procedure in the content processor of Fig. 4.

Detailed Description of Specific Embodiment
An embodiment of the present disclosure is a system and method for allowing in-session translation of content in an instant messaging/chat system. In accordance with the disclosure, users in a session are able to post content (via text or speech) in one language and, within the session, the content may be processed, modified and, if required, translated to a required language so that the posted content is received in the required language. The system and method of the disclosure allow the processing and translation of the content to occur within an established instant messaging/chat session in a seamless real-time fashion without requiring additional user input in the established session. An intermediary translation system, which provides the logic to conduct the processing of content and enables communication with a translation system to provide the translation, is separated from both the messaging platform and the translation service. This provides flexibility in design and application of the system.
The system can be deployed (through the appropriate communication interfaces) into existing messaging/chat platforms and can similarly access different translation services depending on system requirements. Further, the content processing operations within the system allow for improved accuracy of translation by the translation service, allowing, amongst other benefits, audio processing to improve context and accuracy of the translation through audio filtering, dialect identification and improved punctuation. These processes are described in more detail below.
The system and method can be implemented using computer processing and memory resources in the form of one or more network connected servers and databases, these hardware resources executing software programmed to implement the functions as described above. Alternatively, the computer processing and memory resources may be network accessible distributed "cloud based" resources, executing software to implement the system functionality as described above. Some embodiments may utilise a combination of dedicated hardware and shared resources. A variety of different system architectures are contemplated within the scope of the present disclosure.
Fig. 1 is a schematic representation of an instant messaging/chat system 100 configured to allow in-session translation of content. The system 100 includes one or more instant messaging/chat session clients 110, 112 that communicate via a data network 120. An instant messaging host 130 also communicates with the clients 110, 112 to provide instant messaging/chat communications between the instant messaging/chat clients 110, 112.
Whilst not shown, an operating system of the client 110, 112 supports the instant messaging/chat process that provides a user interface 150 (Fig. 2) through which a user can both receive content of an instant messaging/chat session and also add content to the instant messaging/chat session. Typically, the client resides on a mobile device (such as a smart phone or watch) but may reside on other computing devices. As used herein, "content" can mean SMS messages, MMS messages, messages on dedicated platforms such as WhatsApp, Messenger, Instagram etc., and includes messages having only a textual content as well as those having an audio content, or a mixture of these content types.
In addition to the functionality that allows for instant messaging/chat client sessions, the system 100 also includes a capability to allow in-session translation of content.
Specifically, the system 100 includes an intermediary translation system 10 implemented using computer processing and memory resources and configured to integrate with the instant messaging/chat host 130 and one or more remote translation systems 140. The intermediary translation system 10 communicates with the instant messaging/chat host and the one or more translation systems via a machine to machine (M2M) interface. In some forms, the M2M interface utilises one or more application programming interfaces (APIs). In some forms, the communication is performed utilising the HTTP protocol with a push notification service.
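As a concrete illustration of this M2M exchange, the sketch below shows how a chat host might forward a content posting to the intermediary over HTTP. The endpoint URL, payload fields and response shape are assumptions for illustration only; the disclosure specifies just that APIs over HTTP with a push notification service are used, not this schema.

```python
# A minimal sketch of the host-to-intermediary M2M call, assuming a
# hypothetical REST endpoint and payload; not the actual interface.
import requests

INTERMEDIARY_URL = "https://translate-intermediary.example.com/api/v1"  # hypothetical

def post_session_content(session_id: str, sender_id: str,
                         content: str, content_type: str = "text") -> dict:
    """Forward a content posting from the chat host to the intermediary."""
    payload = {
        "session_id": session_id,      # established IM/chat session
        "sender_id": sender_id,        # used to look up source-language preference
        "content_type": content_type,  # "text" or "audio"
        "content": content,
    }
    resp = requests.post(f"{INTERMEDIARY_URL}/content", json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()  # e.g. {"status": "queued", "content_id": "..."}
```

In this arrangement, the translated result would be delivered back to the host asynchronously via the push notification service rather than in the HTTP response itself.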
Fig. 2 is a block diagram illustrating the client-side process of the user interface 150 and content flow of the instant messaging/chat system 100. In a typical scenario, a first user operating on client 110 is able to enter the instant messaging system 100 via a registration or login process 152. The client operating system provides a menu functionality 154 allowing user details and preferences to be inputted. In one form, this may include specific language preferences of the user. In another form, this language preference may be obtained from other information, such as location data, device setting data, or from parsing of content of the data, or as a default language. Once the language preference is established it may be stored in one or more locations (including locally on the device, at the instant messaging/chat host 130 and at the intermediary translation service 10 in memory 12).
The client 110 allows one-on-one sessions to be initiated with other clients 112, or with multiple clients in a group chat session. Once an instant messaging session is established, users are able to post content which is then managed by the instant messaging/chat host 130. In initiating a content posting from client 110, the host may instigate an initialising routine to seek the language preferences of the recipients (if not already known by the host 130). This initialising process may involve a push notification 156 to client(s) 112 regarding incoming content and requesting a language preference for the content. This initialising routine occurs before posting of the content with clients 112.
Where there is a multi-party chat session, each of clients 112 may input their own unique language preference such that the chat session may be conducted in more than two languages. It is to be understood that this initialising routine may not occur in instances where a language preference of clients 112 is known (say from previous user input), or where it is determined from other information, or where a default language is assumed.
The information on the language preferences of the clients in the session is then provided to the intermediary translation system 10 to determine whether the content requires translation. The intermediary translation system 10 includes a content processor 14 which, within the established instant messaging/chat session, is designed to pre-process and/or post-process content to aid translation accuracy (as will be described in more detail below). The content processor is configured to determine a language of the content received from the instant messaging/chat system (the 'source language'), to determine a required language of a recipient of the content within the instant messaging/chat system (the 'target language') and, in response to the required language being different from the determined language, to forward the content via the translation system interface to a remote translation system for translating the content to produce translated content in the required language.
The translation system 140 used for translating content may be a proprietary translation system of the intermediary translation system, may be a commercially available translation system, or may be a hybrid system (where the translation is conducted within a commercially available translation service using proprietary data). An example of a hybrid system is one in which a unique corpus (such as one pertaining to a specific technical field or dialect) is used. Further, the intermediary translation system may be configured to route the content for translation to one of a number of translation systems or subsystems (e.g. 140a, 140b or 140c of Fig. 3). A feature of the arrangement is that the separation of the intermediary translation system 10 from the translation service 140 allows flexibility of operation to bring new translation systems online, or to route the content to a particular translation system dependent on the language required for the translation or other factors (such as content type (text or speech), latency, user preference etc.). One exemplary translation system is the IBM Watson Translator, which can identify the language of text and translate it into different languages programmatically.
All content received by the intermediary translation system 10 is logged, and routing of the content to and from the translation system 140 and back to the instant messaging/chat host 130 for issuing to the recipient is performed using meta information applied to the content. This process is managed by the content processor of the intermediary translation system 10 and a communication system (160, Fig. 3), which acts as a message bus and is able to allow synchronous routing of content within the instant messaging/chat session and, if required, asynchronous routing of content.
Fig. 3 is a block diagram illustrating components of the instant messaging/chat architecture that provides the language translation.
As illustrated, the architecture utilising the intermediary translation system 10 is designed so that it can integrate more easily into existing instant messaging/chat platforms. These platforms may have various client device structures that may operate over multiple host platforms (although only a single instant messaging/chat host is shown). Using secure APIs, the intermediary translation system provides a secure communication channel for users to chat multilingually via an HTTP layer. Each request is logged, analysed and may be modified to improve context for translation. The intermediary translation system automatically routes the content requests to one or more related sub-systems of the translation system.
In one form, the translation sub-systems may comprise one or more of speech-to-text translation systems 140a, text-to-text translation systems 140b, and text-to-speech translation systems 140c.
The speech-to-text system 140a may accept requests routed from the intermediary translation system 10 in raw audio format and generate respective transcribed text. The system 140a may support more than 100 different dialects/accents of each language and may support 27 languages. On generation of the transcribed text, the content request may be returned to the intermediary system 10 for further processing before being forwarded to the translation sub-system 140b for translation. Alternatively, the translation system 140b may incorporate its own models to enhance contextual translations.
The speech-to-text translation systems provide automated conversion of speech to text and may use machine learning training systems that process the training data over deep learning RNN models. This allows trainees to train the system via automated routines.
The translation sub-system 140b accepts requests routed from the intermediary translation system 10 in text and generates the requested target language text. The system 140b may support 27 languages.
The translation system provides automated translation of text and may use machine learning training systems that process the training data over deep learning RNN models. This allows trainees to train the system via automated routines.
The text-to-speech sub-system 140c may accept requests routed from the intermediary translation system 10 in text and generate respective audio files. The system 140c may support audio formats such as WAV (both mono and stereo) and FLAC and may generate both male and female voices. The system 140c may support 27 languages.
The text-to-speech sub-system 140c provides automated synthesis of speech from text and may use machine learning training systems that process the training data over deep learning RNN models. This allows trainees to train the system via automated routines.
Accordingly, with the three sub-systems of the translation system 140, content may be received in text or audio form and translated content may be provided in either text or audio form.
In addition to the routing of audio and/or text content between the messaging platform 150 and the translation platform 140, the content processor 14 of the intermediary translation system 10 is designed to pre-process and post-process (i.e. after translation) the content to improve translation accuracy. These processes are described with reference to Figs. 4 to 8. In general, pre-processing helps to improve speech input work-flows, while post-processing helps to improve text results through sentence and punctuation identification.
As shown in Fig. 4, the processing of the content depends on the nature of the content from the messaging platform 150 (through communication host 130). If the content is audio, the content is passed through an audio pre-processing module 16 of the content processor 14, which is typically in the form of a digital signal processor. The processed audio is then passed to the speech-to-text module 140a, then to the translation module 140b. After translation, the content is returned to a post-processing module 18, where the translated content is assembled and punctuation is added (as will be described in more detail below).
As required, the assembled translated text content is then passed either to the Text to Speech module 140c, or passed back to the messaging platform 150 via the communication system 130.
If the content is initially text, a simpler processing route is provided, with an optional initial text processing step 20 to check for incomplete punctuation. This step may be bypassed such that the original text is passed directly to the translation module 140b. The text is then translated and returned to the post-processing module 18 for punctuation checking before again either passing directly to the messaging platform 150, or through the Text to Speech module 140c for outputting as audio.
If the language of the content is the same as that required by the recipient, then the content can be routed straight back to the instant messaging/chat host for posting with the recipient client.
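The routing logic of Fig. 4 can be summarised as follows. The module objects below are hypothetical stand-ins for components 16 (audio pre-processing), 140a (speech to text), 140b (translation), 18 (post-processing) and 140c (text to speech); the control flow follows the description above, including the same-language shortcut.

```python
# Sketch of the Fig. 4 content routing; module interfaces are assumed.
def route_content(content, content_type, source_lang, target_lang,
                  audio_preprocessor, speech_to_text, translator,
                  postprocessor, text_to_speech, want_audio_reply=False):
    # Same language: route straight back to the chat host untranslated.
    if source_lang == target_lang:
        return content

    if content_type == "audio":
        # Module 16 splits the stream into sentence-like packets (Figs. 5-7).
        packets = audio_preprocessor.process(content)
        texts = [speech_to_text.transcribe(p, source_lang) for p in packets]
        # Each packet is translated separately to avoid contextual bias.
        parts = [translator.translate(t, source_lang, target_lang) for t in texts]
        translated = postprocessor.assemble_and_punctuate(parts)
    else:
        # Text may optionally pass through the punctuation check (step 20).
        translated = translator.translate(content, source_lang, target_lang)

    # Return synthesised audio if the recipient expects speech output.
    if want_audio_reply:
        return text_to_speech.synthesise(translated, target_lang)
    return translated
```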
The audio pre-processing module 16 includes three stages, which are described with reference to Figs. 5 to 7. These stages are named Silence Detector, Speaker Identification, and Noise Purifier.
In a first stage (Silence Detector) shown in Fig. 5, the raw audio stream 260 is processed so as to be grouped into audio data packets 262 for further processing in subsequent stages. It has been found that reducing the audio stream into smaller components and then subsequently reassembling them as long text strings after translation can lead to improved results, as it provides more direct translation (each packet is translated separately) and therefore avoids contextual bias which the translation services may erroneously apply to larger audio content. To allow the translation to be contextual, the system is designed to extract characteristics of the audio content in pre-processing which it will then use in post-processing, when the translated text is reassembled and punctuated. These characteristics are selected to work with a punctuation model in the post-processing module 18, which is appropriately trained and has machine learning capability.
In the first stage, one objective is to detect silences within the audio content as a means of parsing sentences from the audio content. The audio content is initially framed 264 to be analysed. Speech has statistical properties which are not constant across time. In an exemplary embodiment, to extract the spectral features from a small window of speech, an assumption is made that the signal is stationary within the window. Frame blocks of 20 ms with 60% overlap are used. Overlapping frames are preferred, as non-overlapping samples abruptly cut the signal at frame boundaries, which may cause problems when Fourier analysis is used for voice activity detection (VAD). Typical steps in conducting VAD on the frames are based on calculation of the energy levels within the audio frames. This process, conducted through digital signal processing, may include conducting multiple linear Fourier analyses and calculating the mean and standard deviation of the first 500 ms of samples of the given utterance. The noise and silence are characterised by the calculated mean and standard deviation.
Once the noise and silence statistics are calculated, for each sample (from first to last) a determination is made whether the 1D Mahalanobis distance is greater than a threshold value. Under a Gaussian distribution, the threshold rejects up to 97% of noise frames, accepting only voiced samples. In this regard, the silence determined by this process gives an indication of language structure, in particular sentence lengths. Accordingly, the silence characteristic within the audio frames is captured for subsequent use in the post-processing stage. Other characteristics that may be captured for subsequent use include the frequency spectrum, magnitude spectrum, thresholding and power spectral density (PSD) estimation. In a third step, the voiced samples from the windowed array are collected in a new array of samples, with consecutive runs of voiced samples being combined to generate packets for the next stage of processing. These collected sample arrays are delimited by threshold lengths of silence (as determined above); an exemplary silence duration is 1 sec. These pauses in speech activity are representative of sentence boundaries, such that the newly combined data packets are representative of sentences in the audio content.
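An illustrative implementation of this Silence Detector stage follows, assuming 16 kHz mono PCM input. The frame length (20 ms), 60% overlap, 500 ms noise-estimation window and 1 sec silence gap follow the description above, while the simple normalised-energy distance stands in for the full Mahalanobis computation.

```python
# Sketch of energy-based VAD segmentation under the assumptions above.
import numpy as np

def segment_speech(samples: np.ndarray, sr: int = 16000,
                   frame_ms: float = 20.0, overlap: float = 0.6,
                   silence_gap_s: float = 1.0, threshold: float = 3.0):
    frame_len = int(sr * frame_ms / 1000)
    hop = max(1, int(frame_len * (1 - overlap)))          # 60% overlap
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len, hop)]
    energies = np.array([float(np.mean(f.astype(np.float64) ** 2)) for f in frames])

    # Characterise noise/silence from the first 500 ms of the utterance.
    n_noise = max(1, int(0.5 * sr / hop))
    mu, sigma = energies[:n_noise].mean(), energies[:n_noise].std() + 1e-12

    voiced = (energies - mu) / sigma > threshold          # 1-D distance test

    # Group consecutive voiced frames; a silence of >= silence_gap_s
    # closes the current packet (taken as a sentence boundary).
    packets, current, gap = [], [], 0
    max_gap = int(silence_gap_s * sr / hop)
    for frame, is_voiced in zip(frames, voiced):
        if is_voiced:
            current.append(frame)
            gap = 0
        elif current:
            gap += 1
            if gap >= max_gap:
                packets.append(np.concatenate(current))
                current, gap = [], 0
    if current:
        packets.append(np.concatenate(current))
    return packets  # each packet approximates one spoken sentence
```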
In the second stage of the pre-processing of the content data (as shown in Fig. 6), feature extraction 266 occurs through spectral analysis of the audio data packets. The features which are extracted are compatible with models to establish audio fingerprints to identify speakers within the voice samples, and with other models to aid translation, including identifying dialects of the target language based on established language dialect models stored in the memory 12 or retrieved on the fly. The features extracted may be based on frequency coefficients, including long-term spectral divergence (LTSD). Pitch and distortion factors may also be established. Other features which may be captured include speech rate, articulatory rate, syllables-per-minute rate and phonation-time ratios, to assist the post-processing model for punctuation.
Following from this second stage of pre-processing, audio fingerprints are established which enable speakers to be identified in the audio data packets. This identification allows for enhanced filtering in the subsequent stage of pre-processing of the audio data. The extracted features are also able to be compared with the language dialect models stored by the content processor so as to identify any specific dialect of the target language. This dialect can then be associated with the data packet and conveyed to the translation platform 140 to improve translation accuracy.
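A minimal sketch of the dialect comparison step follows. The model store (one reference feature vector per dialect label) and the nearest-neighbour distance scoring are assumptions for illustration, as the disclosure does not fix the comparison metric.

```python
# Sketch of dialect identification by nearest reference vector (assumed metric).
import numpy as np

def identify_dialect(features, dialect_models, max_distance=10.0):
    """features: 1-D feature vector extracted from a packet;
    dialect_models: {dialect_label: reference feature vector}.
    Returns the closest dialect label, or None if nothing is close enough."""
    best_label, best_dist = None, float("inf")
    for label, reference in dialect_models.items():
        dist = float(np.linalg.norm(features - reference))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label if best_dist <= max_distance else None
```

Any label returned here would be attached to the packet's metadata and passed along with the translation request, as described above.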
In the final stage of pre-processing (Fig. 7), the audio data packets are filtered to enhance the target speaker's voice and to suppress any other noises (such as other speakers, background and near noise). Digital filters 268 are applied to boost and cut certain characteristics of the sampled signal to make it better suited to the mathematical models. Two different filters are needed for this process. Environmental noise causing microphone bias (e.g. a door being shut) is low in frequency but high in energy, and is dealt with by a high-pass filter. In order to boost the lower energy levels of the high frequencies present in human speech, a pre-emphasis filter is used. A pre-emphasis filter boosts the high frequencies while attenuating the lower ones, which flattens the spectrum.
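The two filters can be sketched as below, assuming 16 kHz input. The cutoff frequency, filter order and pre-emphasis coefficient are illustrative values, not ones specified in the disclosure; the pre-emphasis step is the standard y[n] = x[n] - a*x[n-1] form.

```python
# Sketch of the Noise Purifier filters with assumed parameter values.
import numpy as np
from scipy.signal import butter, lfilter

def purify(packet: np.ndarray, sr: int = 16000,
           highpass_hz: float = 80.0, pre_emphasis: float = 0.97) -> np.ndarray:
    # High-pass: suppress low-frequency, high-energy events (e.g. a door shutting).
    b, a = butter(4, highpass_hz / (sr / 2), btype="highpass")
    filtered = lfilter(b, a, packet.astype(np.float64))
    # Pre-emphasis: boosts highs relative to lows, flattening the spectrum.
    return np.append(filtered[0], filtered[1:] - pre_emphasis * filtered[:-1])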
Following this noise purifying step, the audio data packets are then able to be passed to the translation platform 140 as discussed above. Typically, these data packets include information to initiate the translation request, including "Source Language", "Target Language", audio information (sample size, sample rate, encoding format), and possible dialect.
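An illustrative request assembled for one packet might look like the following; the field names are hypothetical, but the contents follow the list above.

```python
# Hypothetical shape of a translation request for one audio packet.
translation_request = {
    "source_language": "en",
    "target_language": "zh",
    "audio": {
        "sample_rate": 16000,   # Hz
        "sample_size": 16,      # bits per sample
        "encoding": "FLAC",
    },
    "dialect": "en-AU",         # included only when identified in pre-processing
}
```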
Post-processing of the translated data packets is required to reassemble the text (if the content was audio and subject to the pre-processing stages described above) and to add punctuation to improve context and meaning. An exemplary post-processing stage is illustrated in Fig. 8. To effect this process, a punctuation model 60 is provided which programmatically adds punctuation to the assembled and translated text. Typically, for each audio stream event, multiple audio data packets are generated in the pre-processing stage and these are individually translated. The post-processing stage waits for each of these translated data packets to be returned. The data packets are assembled in order and the punctuation model applies punctuation to the assembled text. The model 60 is typically trained on grammar and on punctuated and unpunctuated text. To further aid the model, the characteristics obtained in the pre-processing stage are also inputted into the model to aid in the model's decision making. These characteristics are also used in training of the model, with the output of the post-processing stage being periodically checked by language experts. Further routine analysis can be performed which will analyse inputs from a prior period and calculate comparison matrices that allow for comparison under different sample sizes. This feedback can then be used to tune the post-processing models to improve accuracy.
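In outline, the reassembly and punctuation step can be sketched as follows. The `punctuation_model` object stands in for the trained model 60, and the `(sequence_no, text, features)` packet shape is an assumption for illustration.

```python
# Sketch of the post-processing reassembly step under assumed packet shape.
def assemble_and_punctuate(translated_packets, punctuation_model):
    """translated_packets: iterable of (sequence_no, text, features) tuples,
    arriving asynchronously from the translation platform."""
    ordered = sorted(translated_packets, key=lambda p: p[0])  # restore order
    joined_text = " ".join(text for _, text, _ in ordered)
    # Pre-processing characteristics (e.g. silence lengths, speech rate)
    # are fed to the model to guide punctuation placement.
    features = [f for _, _, f in ordered]
    return punctuation_model.punctuate(joined_text, features)
```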
It has been found that utilising the combination of pre-processing audio data to parse sentences from the audio and post-processing the translated text using punctuation models fed with data from the pre-processing stages realises significant improvements in translation accuracy as compared to existing translation services.
Accordingly, an intermediary translation platform is provided that is able to deliver in-session translation of audio and text content. Procedures are also disclosed to improve the accuracy of translation, particularly when based on audio data.
It will be understood by persons skilled in the art of the invention that many modifications may be made without departing from the spirit and scope of the invention. It is to be understood that, if any prior art publication is referred to herein, such reference does not constitute an admission that the publication forms a part of the common general knowledge in the art.
In the claims which follow and in the preceding description, except where the context requires otherwise due to express language or necessary implication, the word "comprise" or variations such as "comprises" or "comprising" is used in an inclusive sense, i.e. to specify the presence of the stated features but not to preclude the presence or addition of further features in various embodiments of the disclosure.

Claims

1. An intermediary translation system implemented using computer processing and
memory resources and configured to integrate with one or more instant messaging /chat systems and one or more remote translation systems via a communication network, the system comprising:
an instant messaging /chat system interface configured to communicate with the at least one instant messaging /chat system to obtain and send content data within an established session of the instant messaging /chat session;
a translation system interface configured to communicate with the at least one translation system to obtain and send content data; and
a content processor which, within the established session of the instant messaging /chat session, is configured to:
determine a language of the content received from the instant messaging /chat system;
determine a required language of a recipient of the content within the instant messaging /chat system and, in response to the required language being different from the determined language;
to forward the content via the translation system interface to a remote translation system for translating the content to produce translated content in the required language; and
to receive the translated content from the remote translation system; and to forward the translated content to the recipient within the established session via the instant messaging /chat system interface.
2. An intermediary translation system according to claim 1, wherein the content
processor determines the language of the content by parsing the content.
3. An intermediary translation system according to claim 1 or 2, wherein the content processor determines the language of the content by referencing information of the sender of the content within the instant messaging /chat system.
4. An intermediary translation system according to any preceding claim, wherein the content processor determines the language of the content by reference to user input.
5. An intermediary translation system according to any preceding claim, wherein the intermediary translation system further comprises a user profile store that retains user data including the required language of the user. In some forms, the user input is derived from the instant messaging/chat system.
6. An intermediary translation system according to any preceding claim, wherein the instant messaging /chat system interface is a machine to machine (M2M) interface.
7. An intermediary translation system according to claim 6, wherein the instant
messaging /chat system interface utilises one or more application programming interfaces (APIs).
8. An intermediary translation system according to any preceding claim, wherein the communication is performed utilising the HTTP protocol with a push notification service.
9. An intermediary translation system according to any preceding claim, wherein the translation interface is an M2M interface.
10. An intermediary translation system according to claim 9, wherein the translation
service interface utilises one or more APIs.
11. An intermediary translation system according to any preceding claim, wherein the intermediary translation system further comprises a content modifying module that is configured to modify the content prior to forwarding the content to the remote translation system.
12. An intermediary translation system according to any preceding claim, wherein the translation system has a plurality of sub-systems, and the content processor is configured to select a translation sub-system for the received content in response to a characteristic of the content and/or a system setting and to forward the received content to the selected translation sub-system.
13. An intermediary translation system according to claim 12, wherein the translation sub-systems comprise one or more of a text to text translation system, a speech to text system, and a text to speech system.
14. An intermediary translation system according to any preceding claim, wherein the received content is in the form of any one or more of text and audio information.
15. An intermediary translation system according to any preceding claim, wherein the translated content is in the form of any one of text and audio information.
16. An instant messaging/chat system comprising:
an instant messaging /chat client;
an instant messaging /chat host configured for communicatively coupling to said instant messaging /chat client, said host having logic for receiving application input from the instant messaging /chat client as a content posting to an established instant messaging /chat client session between instant messaging /chat clients;
one or more translation systems implemented using computer processing and memory resources and configured to translate content from an instant messaging/chat session; and
an intermediary translation system implemented using computer processing and memory resources and configured to integrate with the instant messaging /chat host and the one or more remote translation systems via a communication network, the intermediary translation system comprising:
an instant messaging /chat host interface configured to communicate with the instant messaging /chat host to obtain and send content data within an established session of the instant messaging /chat session;
a translation system interface configured to communicate with the at least one translation system to obtain and send content data; and
a content processor which, within the established session of the instant messaging /chat session, is configured to:
determine a language of the content received from the instant messaging /chat session;
determine a required language of a recipient of the content within the instant messaging /chat session and, in response to the required language being different from the determined language;
to forward the content via the translation system interface to a remote translation system for translating the content to produce translated content in the required language; and
to receive the translated content from the remote translation system; and to forward the translated content to the instant messaging /chat host as an output to the recipient of the posted content within the established session.
17. A method of translating audio content from a source language into a target language, the method being executed by a computer system using computer processing and memory resources, the method comprising:
separating an audio stream into audio frames of predetermined duration;
detecting voice activity within individual ones of the frames;
using the detected voice activity to group frames into audio data packets; and
using the audio data packets to create input data packets for a translation service to allow for translation of the audio content into the target language.
18. A method according to claim 17, further comprising establishing a frame energy level of the respective audio frames and wherein the voice activity is detected using the energy levels of the individual frames.
19. A method according to claim 17 or 18, wherein the grouping into the audio data
packets is based on identifying audio frames exhibiting one or more specified characteristics associated with the detected voice activity.
20. A method according to claim 19, wherein the groupings are established with a data packet containing some or all of the audio frames occurring between the identified audio frames.
21. A method according to claim 18 or 19, when dependent on claim 18, wherein the one or more specified characteristics are indicative of the presence of an interruption in detected voice activity over a specified duration.
22. A method according to claim 17 or 18, wherein the input data packets are audio packets which are converted to text before translation into the target language.
23. A method according to claim 18 or 19, wherein the input data packets are translated into the target language separately to provide translated text packets.
24. A method according to claim 23, wherein the translated text packets are combined to provide a translation of the audio content.
25. A method according to claim 24, wherein a punctuation model is used to
programmatically add punctuation to the translated text.
26. A method according to claim 25, wherein the punctuation model uses features extracted from the audio content to influence decision making on the incorporation of punctuation in the text.
27. A method of translating audio content from a source language into a target language, the method being executed by a computer system using computer processing and memory resources, the method comprising:
providing the audio content;
extracting features from the audio content;
generating a translated text of the audio content in the target language; and
using a punctuation model to add punctuation to the translated text, wherein the punctuation model uses the extracted features from the audio content in deciding on placement of punctuation in the translated text.
28. A method according to claim 27, further comprising separating the audio content into parts; translating the audio content parts separately into the target language; and assembling the translated audio content parts to generate the translated text of the audio content.
29. A method of translating audio content from a source language into a target language, the method being executed by a computer system using computer processing and memory resources, the method comprising:
extracting features from the audio content, the features extracted being compatible with dialect models of the target language;
comparing the extracted features with the language dialect models to identify any dialect indicated by the extracted features;
associating any identified dialect with the audio content; and
forwarding the audio content with information on any identified dialect to a translation service for translating the audio content.
30. A method of translating instant messaging/chat content within an established session of an instant messaging/chat system that includes a sender and a recipient, the method being executed by an intermediary translation system using computer processing and memory resources via a communication network, the method comprising the steps of:
receiving content from the sender within the established session;
determining a language of the content of the message;
determining a required language of the recipient; and
if the required language is different from the determined language:
forwarding the content to a remote translation system for translating the content to produce translated content in the required language;
receiving the translated content from the remote translation system; and
forwarding the translated content to the recipient within the established session.
31. A method of translating instant messaging/chat content according to claim 30, wherein the method of the intermediary translation system includes the step of determining the language of the content by parsing the content.
32. A method of translating instant messaging/chat content according to claim 30 or 31, wherein the method of the intermediary translation system includes the step of determining the language of the content by referencing information for the sender of the message.
33. A method of translating instant messaging/chat content according to any one of claims 30 to 32, wherein the method of the intermediary translation system includes the step of determining the language of the content by reference to user input.
34. A method of translating instant messaging/chat content according to any one of claims 30 to 33, wherein the method further comprises modifying the content prior to forwarding the content to the remote translation system.
35. A method of translating instant messaging/chat content according to any one of claims 30 to 34, wherein the translation system has a plurality of sub-systems, and the method further comprises the step of selecting a translation sub-system for the received content in response to a characteristic of the content and/or a system setting and forwarding the received content to the selected translation sub-system.
36. A method of translating instant messaging/chat content according to claim 35, wherein the translation sub-systems comprise one or more of a text to text translation system, a speech to text system, and a text to speech system.
37. A method of translating instant messaging/chat content according to any one of claims 30 to 36, wherein the received content is in the form of any one or more of text and audio information.
38. A method of translating instant messaging/chat content according to any one of claims 30 to 37, wherein the translated content is in the form of any one of text and audio information.
PCT/AU2020/050328 2019-04-02 2020-04-02 Instant messaging/chat system with translation capability WO2020198799A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910260669.7A CN110119514A (en) 2019-04-02 2019-04-02 The instant translation method of information, device and system
CN201910260669.7 2019-04-02

Publications (1)

Publication Number Publication Date
WO2020198799A1 true WO2020198799A1 (en) 2020-10-08

Family

ID=67520686

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/AU2020/050328 WO2020198799A1 (en) 2019-04-02 2020-04-02 Instant messaging/chat system with translation capability

Country Status (2)

Country Link
CN (1) CN110119514A (en)
WO (1) WO2020198799A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113676394A (en) * 2021-08-19 2021-11-19 维沃移动通信(杭州)有限公司 Information processing method and information processing apparatus
WO2022093192A1 (en) * 2020-10-27 2022-05-05 Google Llc Method and system for text-to-speech synthesis of streaming text

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113076760B (en) * 2020-01-03 2024-01-26 阿里巴巴集团控股有限公司 Translation and commodity retrieval method and device, electronic equipment and computer storage medium
CN111261162B (en) * 2020-03-09 2023-04-18 北京达佳互联信息技术有限公司 Speech recognition method, speech recognition apparatus, and storage medium
CN114124864B (en) * 2021-09-28 2023-07-07 维沃移动通信有限公司 Message processing method and device
CN116227504B (en) * 2023-02-08 2024-01-23 广州数字未来文化科技有限公司 Communication method, system, equipment and storage medium for simultaneous translation

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060133585A1 (en) * 2003-02-10 2006-06-22 Daigle Brian K Message translations
US20070168450A1 (en) * 2006-01-13 2007-07-19 Surendra Prajapat Server-initiated language translation of an instant message based on identifying language attributes of sending and receiving users
EP2131537B1 (en) * 2008-06-04 2015-12-16 Broadcom Corporation Phone based text message language translation
US20180089172A1 (en) * 2016-09-27 2018-03-29 Intel Corporation Communication system supporting blended-language messages

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101957814A (en) * 2009-07-16 2011-01-26 刘越 Instant speech translation system and method
CN104252861B (en) * 2014-09-11 2018-04-13 百度在线网络技术(北京)有限公司 Video speech conversion method, device and server
CN106598955A (en) * 2015-10-20 2017-04-26 阿里巴巴集团控股有限公司 Voice translating method and device
CN107515862A (en) * 2017-09-01 2017-12-26 北京百度网讯科技有限公司 Voice translation method, device and server

Also Published As

Publication number Publication date
CN110119514A (en) 2019-08-13

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20783289

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20783289

Country of ref document: EP

Kind code of ref document: A1