AU2020103587A4 - A system and a method for cross-linguistic automatic speech recognition - Google Patents

A system and a method for cross-linguistic automatic speech recognition

Info

Publication number
AU2020103587A4
AU2020103587A4
Authority
AU
Australia
Prior art keywords
speech
model
file
corpus
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2020103587A
Inventor
Poonam Bhargav
Rohit Daid
Sukhpreet Kaur
Yogesh Kumar
Ranbir Singh Batth
Ruchi Singla
Sushil Kumar
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bhargav Poonam Ms
Kaur Sukhpreet Dr
Singh Batth Ranbir Dr
Singla Ruchi Dr
Original Assignee
Bhargav Poonam Ms
Kaur Sukhpreet Dr
Singh Batth Ranbir Dr
Singla Ruchi Dr
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bhargav Poonam Ms, Kaur Sukhpreet Dr, Singh Batth Ranbir Dr, Singla Ruchi Dr
Priority to AU2020103587A
Application granted
Publication of AU2020103587A4
Legal status: Ceased
Anticipated expiration

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L15/16: Speech classification or search using artificial neural networks
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L15/18: Speech classification or search using natural language modelling
    • G10L15/183: Speech classification or search using natural language modelling using context dependencies, e.g. language models

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure relates to a system and a method for cross-linguistic automatic speech recognition. The method includes receiving a corpus of speech from various sources, including a phone file and input speech signals of various languages, using an input unit; training the corpus of speech and thereby creating a dictionary file by employing a training module; extracting phones from the dictionary file and extracting unique words from the transcriptions of the phone file and input speech signal by means of a dynamic feature extraction unit; making an utterance by deploying an acoustic model, wherein the acoustic model is built from statistical representations, one for each distinct sound, and each statistical representation is assigned a tag related to a phoneme; decoding various languages into a particular language; and generating and classifying a robust spontaneous speech model for a multilingual speech system by employing a machine learning model for generating transcriptions in different languages.

Description

A SYSTEM AND A METHOD FOR CROSS-LINGUISTIC AUTOMATIC SPEECH RECOGNITION

FIELD OF THE INVENTION
The present disclosure relates to automatic speech recognition systems. More specifically, the present disclosure relates to a system and a method for cross-linguistic automatic speech recognition.
BACKGROUND OF THE INVENTION
Human speech is probably the most natural and comfortable man-computer interface. Speech input provides the advantage of hands-free operation, giving access to physically challenged users or to users whose hands are occupied by a different task, such as driving a car. Thus, speech recognition is an important technology for smart phones and other programmable, portable devices requiring user interaction. Such technology enables a user to perform various tasks on any programmable portable device through audio commands.
Currently, various speech recognition systems and methods are available that identify the speaker, rather than what they are saying. Recognizing the speaker can simplify the task of translating speech in systems that have been trained on a specific person's voice, or it can be used to authenticate or verify the identity of a speaker as part of a security process. Modern general-purpose speech recognition systems are based on Hidden Markov Models. These are statistical models that output a sequence of symbols or quantities. Hidden Markov Models are used in speech recognition because a speech signal can be viewed as a piecewise stationary signal or a short-time stationary signal; on a short time-scale, speech can be approximated as a stationary process. For many stochastic purposes, speech can therefore be modeled as a Markov process. Furthermore, some speech recognition systems use acoustic phoneme models for mapping text spellings of words to acoustic word models. A phoneme is a representation of any of the small units of speech sound in a language that helps to distinguish one word from another. An acoustic phoneme model is a model of the different possible acoustics that are associated with a given phoneme.
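For illustration only (not part of the claimed system), the following minimal Python sketch shows the forward algorithm for a discrete Hidden Markov Model of the kind described above, with hidden states standing in for phonemes and emissions for quantized acoustic symbols; all probabilities are hypothetical toy values.

    import numpy as np

    def forward_likelihood(pi, A, B, observations):
        """Return P(observations | HMM) via the forward algorithm.

        pi: (S,) initial state probabilities
        A:  (S, S) transition probabilities, A[i, j] = P(state j | state i)
        B:  (S, V) emission probabilities over V discrete symbols
        observations: sequence of symbol indices in [0, V)
        """
        alpha = pi * B[:, observations[0]]      # absorb the first symbol
        for symbol in observations[1:]:
            alpha = (alpha @ A) * B[:, symbol]  # propagate states, then emit
        return alpha.sum()

    # Toy example: two phoneme-like states, three quantized acoustic symbols.
    pi = np.array([0.6, 0.4])
    A = np.array([[0.7, 0.3],
                  [0.2, 0.8]])
    B = np.array([[0.5, 0.4, 0.1],
                  [0.1, 0.3, 0.6]])
    print(forward_likelihood(pi, A, B, [0, 1, 2, 2]))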
Multilingual spontaneous speech recognition is a significant real-world problem as more and more spoken-dialogue applications are being deployed in Asia. Because an intrinsic bias can arise in acoustic scores from different languages, it is also theoretically challenging to devise methods that compensate for this score bias, which can stem from different acoustic and recording environments, different sizes of the training set for each language, and different acoustic and phonetic resolution in modeling each of the languages of interest. A multilingual phoneme set is obtained from monolingual models by merging acoustically similar phones. The model grouping is based on the assumption that the articulatory descriptions of phones are so alike across languages that they can be treated as units independent of the original languages.
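The phone-merging idea can be pictured with a toy sketch, under the simplifying assumption that each monolingual phone model is summarized by a mean feature vector and that phones are grouped greedily by Euclidean distance; the threshold and vectors below are hypothetical.

    import numpy as np

    def merge_phone_sets(phone_means, threshold=0.5):
        """Greedily group (language, phone) keys whose model means are
        within `threshold` of an existing group's representative mean."""
        groups = []
        for key, mean in phone_means.items():
            for group in groups:
                if np.linalg.norm(mean - group["mean"]) < threshold:
                    group["members"].append(key)
                    break
            else:
                groups.append({"mean": mean, "members": [key]})
        return [g["members"] for g in groups]

    phones = {("en", "AA"): np.array([1.0, 0.20]),
              ("hi", "aa"): np.array([1.1, 0.25]),  # acoustically close to en AA
              ("en", "S"):  np.array([-2.0, 3.0])}
    print(merge_phone_sets(phones))  # [[('en','AA'), ('hi','aa')], [('en','S')]]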
In one solution, anchored speech detection and speech recognition is provided. A system configured to process speech commands may classify incoming audio as desired speech, undesired speech, or non-speech. Desired speech is speech that is from the same speaker as reference speech. The reference speech may be obtained from a configuration session or from a first portion of input speech that includes a wake word. The reference speech may be encoded using a recurrent neural network (RNN) encoder to create a reference feature vector. The reference feature vector and incoming audio data may be processed by a trained neural network classifier to label the incoming audio data (for example, frame-by-frame) as to whether each frame is spoken by the same speaker as the reference speech. The labels may be passed to an automatic speech recognition (ASR) component, allowing the ASR component to focus its processing on the desired speech.
In another solution, a method and system for considering information about an expected response when performing speech recognition is provided. A speech recognition system receives and analyzes speech input from a user in order to recognize and accept a response from the user. Under certain conditions, information about the response expected from the user may be available. In these situations, the available information about the expected response is used to modify the behavior of the speech recognition system. The modified behavior has several embodiments, including: comparing the observed speech features to the models of the expected response separately from the usual hypothesis search in order to speed up the recognition system; modifying the usual hypothesis search to emphasize the expected response; and updating and adapting the models when the recognized speech matches the expected response to improve the accuracy of the recognition system.
In another solution, local speech recognition of frequent utterances is provided. In a distributed automatic speech recognition (ASR) system, speech models may be employed on a local device to allow the local device to process frequently spoken utterances while passing other utterances to a remote device for processing. Upon receiving an audio signal, the local device compares the audio signal to the speech models of the frequently spoken utterances to determine whether the audio signal matches one of the speech models. When the audio signal matches one of the speech models, the local device processes the utterance, for example by executing a command. When the audio signal does not match any of the speech models, the local device transmits the audio signal to a second device for ASR processing. This reduces latency and the number of audio signals that are sent to the second device for ASR processing.
However, the existing prior art solutions are ineffective because existing systems are designed to recognize speech in a single language. Further, existing systems are not trained in a real-time training environment and are not able to identify different speakers' voices. In view of the foregoing discussion, there exists a need for a system and a method for cross-linguistic automatic speech recognition using machine learning classifications.
SUMMARY OF THE INVENTION
The present disclosure seeks to provide a system and a method for cross-linguistic automatic speech recognition using machine learning classifications.
In an embodiment, a system for cross-linguistic automatic speech recognition is provided. The system includes an input module for receiving a corpus of speech from various sources, including a phone file and input speech signals of various languages.
The system further includes a training module connected to the input module for training the corpus of speech and thereby creating a dictionary file, wherein the dictionary file maps every word to a sequence of sound units associated with each signal.
The system further includes a dynamic feature extraction unit associated with the training module for extracting phones from the dictionary file and extracting unique words from the transcriptions of the phone file and input speech signal, wherein the phone file is a record of the individual sound units used for creating a word.
The system further includes an acoustic model in connection with the dynamic feature extraction unit for making an utterance, wherein the acoustic model is built from statistical representations, one for each distinct sound, and each statistical representation is assigned a tag related to a phoneme.
The system further includes a language model in association with the acoustic model for decoding the various languages into a particular language, wherein the language model represents word-level language structure, which can be characterized by a number of pluggable implementations.
The system further includes a processing unit equipped with machine learning for generating and classifying a robust spontaneous speech model for the multilingual speech system by employing a machine learning model for generating transcriptions in different languages.
In an embodiment, a set of transcripts is provided for a database (in a single file) together with two dictionaries, wherein legitimate words in the language are mapped to sequences of sound units (or sub-word units) in a first dictionary, and non-speech sounds are mapped to the corresponding non-speech or speech-like sound units in a second dictionary.
In an embodiment, the sounds of words in speech are drawn from a set of sounds, i.e. phones, which may be regarded as sub-word units, wherein a sound model is constructed by taking a wide corpus of words and, with the help of particular training algorithms, generating statistical representations for every phoneme in a language.
In an embodiment, the language model maps the series of acoustic units representing an utterance to words. In an embodiment, a spontaneous speech model is designed for the multilingual system containing a dataset of various languages from various sources, including presentations, live debates, interviews, telephonic conversations and one-to-one human communications.
In another embodiment, a method for cross-linguistic automatic speech recognition is provided. The method includes receiving a corpus of speech from various sources, including a phone file and input speech signals of various languages.
The method further includes training the corpus of speech and thereby creating a dictionary file, wherein the dictionary file maps every word to a sequence of sound units associated with each signal. The method further includes extracting phones from the dictionary file and extracting unique words from the transcriptions of the phone file and input speech signal.
The method further includes making an utterance, wherein the acoustic model is built from statistical representations, one for each distinct sound, and each statistical representation is assigned a tag related to a phoneme.
The method further includes decoding the various languages into a particular language. The method further includes generating and classifying a robust spontaneous speech model for the multilingual speech system by employing a machine learning model for generating transcriptions in different languages.
In an embodiment, the method for cross-linguistic automatic speech recognition further comprises receiving a corpus of speech from various sources, including a phone file and input speech signals of various languages; acquiring a dictionary file of multilingual speech from a database for comparing and separating the phone file and input speech signal in a particular language from the multilingual speech; training the corpus of speech using a training module and thereupon creating a dictionary file of the trained corpus of speech; extracting phones from the dictionary file and extracting unique words from the transcriptions of the phone file and input speech signal using a dynamic feature extraction mechanism; and generating and classifying a robust spontaneous speech model for the multilingual speech system by employing a machine learning model for generating transcriptions in different languages.
In an embodiment, real-time training is performed on the corpus of speech using a machine learning model to improve speech recognition.
An object of the present disclosure is to develop a system for cross-linguistic automatic speech recognition.
Another object of the present disclosure is to train the multilingual acoustic model in a real-time environment.
Another object of the present disclosure is to train the system with different speakers having different speaking styles and speaking different languages.
Another object of the present disclosure is to train the system in both noisy and noise-free environments.
Another object of the present disclosure is to apply the dynamic feature extraction mechanism for real-time training of the proposed model on a large vocabulary text corpus.
Another object of the present disclosure is to build a robust acoustic model for the multilingual system which recognizes the sounds of any speaker.
Another object of the present disclosure is to apply the machine learning model for generation and classification purposes and to check the performance of the proposed model in a live environment.
Another object of the present disclosure is to compute the performance of the system using parameters such as recognition accuracy, word error rate, convergence ratio and overall likelihood per frame.
Another object of the present disclosure is to use voice as a natural and helpful technique for human-device communication, predominantly relevant to hands-free scenarios (e.g., while driving) and to communication with small form-factor devices (e.g., wearables).
Another object of the present disclosure is to develop an automatic speech recognition system for use in customer service to process repetitive phone requests, or in healthcare and legal settings for documentation processes.
Another object of the present disclosure is to help companies, universities and colleges to improve communications and convert them into a data format that is easy to manage and search.
Yet another object of the present invention is to deliver an expeditious and cost-effective method for cross-linguistic automatic speech recognition.
To further clarify the advantages and features of the present disclosure, a more particular description of the invention will be rendered by reference to specific embodiments thereof, which are illustrated in the appended drawings. It is appreciated that these drawings depict only typical embodiments of the invention and are therefore not to be considered limiting of its scope. The invention will be described and explained with additional specificity and detail with the accompanying drawings.
BRIEF DESCRIPTION OF FIGURES
These and other features, aspects, and advantages of the present disclosure will become better understood when the following detailed description is read with reference to the accompanying drawings in which like characters represent like parts throughout the drawings, wherein:
Figure 1 illustrates a schematic block diagram of a system for cross-linguistic automatic speech recognition in accordance with an embodiment of the present disclosure;
Figure 2 illustrates a flow chart of a method for cross-linguistic automatic speech recognition in accordance with an embodiment of the present disclosure; and
Figure 3 illustrates a block diagram of a system for cross-linguistic automatic speech recognition in accordance with an embodiment of the present disclosure.
Further, skilled artisans will appreciate that elements in the drawings are illustrated for simplicity and may not necessarily have been drawn to scale. For example, the flow charts illustrate the method in terms of the most prominent steps involved, to help improve understanding of aspects of the present disclosure. Furthermore, in terms of the construction of the device, one or more components of the device may have been represented in the drawings by conventional symbols, and the drawings may show only those specific details that are pertinent to understanding the embodiments of the present disclosure, so as not to obscure the drawings with details that will be readily apparent to those of ordinary skill in the art having the benefit of the description herein.
DETAILED DESCRIPTION
For the purpose of promoting an understanding of the principles of the invention, reference will now be made to the embodiment illustrated in the drawings and specific language will be used to describe the same. It will nevertheless be understood that no limitation of the scope of the invention is thereby intended, such alterations and further modifications in the illustrated system, and such further applications of the principles of the invention as illustrated therein being contemplated as would normally occur to one skilled in the art to which the invention relates.
It will be understood by those skilled in the art that the foregoing general description and the following detailed description are exemplary and explanatory of the invention and are not intended to be restrictive thereof.
Reference throughout this specification to "an aspect", "another aspect" or similar language means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Thus, appearances of the phrase "in an embodiment", "in another embodiment" and similar language throughout this specification may, but do not necessarily, all refer to the same embodiment.
The terms "comprises", "comprising", or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a process or method that comprises a list of steps does not include only those steps but may include other steps not expressly listed or inherent to such process or method. Similarly, one or more devices or sub-systems or elements or structures or components proceeded by "comprises...a" does not, without more constraints, preclude the existence of other devices or other sub-systems or other elements or other structures or other components or additional devices or additional sub-systems or additional elements or additional structures or additional components.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The system, methods, and examples provided herein are illustrative only and not intended to be limiting.
Embodiments of the present disclosure will be described below in detail with reference to the accompanying drawings.
Referring to Figure 1, a schematic block diagram of a system for cross-linguistic automatic speech recognition is illustrated in accordance with an embodiment of the present disclosure. The system 100 facilitates development of a spontaneous speech model 114 for multilingual speech recognition. The system 100 includes an input module 102 for receiving a corpus of speech from various sources, including a phone file and input speech signals of various languages.
In an embodiment, a training module 104 is connected to the input module 102 for training the corpus of speech and thereby creating a dictionary file. The dictionary file maps every word to a sequence of sound units associated with each signal.
In an embodiment, a dynamic feature extraction unit 106 is associated with the training module 104 for extracting phones from the dictionary file and extracting unique words from the transcriptions of the phone file and input speech signal. The phone file is a record of the individual sound units used for creating a word.
In an embodiment, an acoustic model 108 is in connection with the dynamic feature extraction unit 106 for making an utterance. The acoustic model 108 is built from statistical representations, one for each distinct sound. Each statistical representation is assigned a tag related to a phoneme.
In an embodiment, a language model 110 is in association with the acoustic model 108 for decoding the various languages into a particular language. The language model 110 represents word-level language structure, which can be characterized by a number of pluggable implementations.
In an embodiment, a processing unit 112 is equipped with machine learning for generating and classifying a robust spontaneous speech model 114 for the multilingual speech system by employing a machine learning model for generating transcriptions in different languages.
In an embodiment, a set of transcripts is provided for a database (in a single file) together with two dictionaries, wherein legitimate words in the language are mapped to sequences of sound units (or sub-word units) in a first dictionary, and non-speech sounds are mapped to the corresponding non-speech or speech-like sound units in a second dictionary.
In an embodiment, the sounds of words in speech are drawn from a set of sounds, i.e. phones, which may be regarded as sub-word units, wherein a sound model is constructed by taking a wide corpus of words and, with the help of particular training algorithms, generating statistical representations for every phoneme in a language.
In an embodiment, the language model 110 maps the series of acoustic units representing an utterance to words. In an embodiment, a spontaneous speech model 114 is designed for the multilingual system containing a dataset of various languages from various sources, including presentations, live debates, interviews, telephonic conversations and one-to-one human communications.
Figure 2 illustrates a flow chart of a method for cross-linguistic automatic speech recognition in accordance with an embodiment of the present disclosure. At step 202, the method 200 includes receiving a corpus of speech from various sources, including a phone file and input speech signals of various languages, through an input unit.
At step 204, the method 200 includes training the corpus of speech and thereby creating a dictionary file by means of a training module 104, wherein the dictionary file maps every word to a sequence of sound units associated with each signal. At step 206, the method 200 includes extracting phones from the dictionary file and extracting unique words from the transcriptions of the phone file and input speech signal by employing a dynamic feature extraction unit 106.
At step 208, the method 200 includes making an utterance using an acoustic model 108, wherein the acoustic model 108 is built from statistical representations, one for each distinct sound, and each statistical representation is assigned a tag related to a phoneme.
At step 210, the method 200 includes decoding the various languages into a particular language by using a language model 110. At step 212, the method 200 includes generating and classifying a robust spontaneous speech model 114 for the multilingual speech system by employing a machine learning model for generating transcriptions in different languages. Decoding is the process of computing which series of words, in different languages, matches the acoustic signal represented by the feature vectors.
In an embodiment, the method for cross-linguistic automatic speech recognition further comprises receiving a corpus of speech from various sources, including a phone file and input speech signals of various languages. The method further includes acquiring a dictionary file of multilingual speech from a database for comparing and separating the phone file and input speech signal in a particular language from the multilingual speech. The method further includes training the corpus of speech using a training module 104 and thereupon creating a dictionary file of the trained corpus of speech. The method further includes extracting phones from the dictionary file and extracting unique words from the transcriptions of the phone file and input speech signal using a dynamic feature extraction mechanism. The method further includes generating and classifying a robust spontaneous speech model 114 for the multilingual speech system by employing a machine learning model for generating transcriptions in different languages.
In an embodiment, real-time training is performed on the corpus of speech using a machine learning model to improve speech recognition.
Figure 3 illustrates a block diagram of a system for cross-linguistic automatic speech recognition in accordance with an embodiment of the present disclosure. Figure 3 includes various speech sources, including a dictionary file for multilingual speech 302, a phone file 304 (different sound units), and an input speech signal 306 (a recorded WAV file).
In block 302, Figure 3 includes the dictionary file for multilingual speech, in which the dictionary file for a text corpus of independent words is stored; the dictionary file maps every word to a sequence of sound units associated with each signal. Thus, in addition to the speech signals, a set of transcripts for the database (in a single file) and two dictionaries are given: one in which legitimate words in the language are mapped to sequences of sound units (or sub-word units), and another in which non-speech sounds are mapped to the corresponding non-speech or speech-like sound units.
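For illustration, the two dictionaries can be sketched as plain mappings; the ARPAbet-style phone symbols and filler units below are assumptions for the example, not taken from the disclosure.

    word_dictionary = {                     # first dictionary: legitimate words
        "hello":   ["HH", "AH", "L", "OW"],
        "speech":  ["S", "P", "IY", "CH"],
        "namaste": ["N", "AH", "M", "AH", "S", "T", "EY"],  # cross-lingual entry
    }

    filler_dictionary = {                   # second dictionary: non-speech sounds
        "<laugh>":   ["+LAUGH+"],
        "<breath>":  ["+BREATH+"],
        "<silence>": ["SIL"],
    }

    def lookup(token):
        """Map a transcript token to its sound units, falling back to fillers."""
        return word_dictionary.get(token) or filler_dictionary.get(token)

    print(lookup("hello"))      # ['HH', 'AH', 'L', 'OW']
    print(lookup("<silence>"))  # ['SIL']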
In block 304, Figure 3 includes the phone file, which is a record of the individual sound units needed to make a word. In block 306, Figure 3 includes the input speech signal, in which a recorded WAV file is provided as input. The WAV file may be recorded from any recording device, including a tape recorder, a mobile phone, and the like.
In block 308, Figure 3 includes dynamic feature extraction, which acquires the corpus from different sources. To train the corpus, the dictionary file is created, the phones are extracted from the dictionary file, and the unique words are extracted from the transcription. Recording is then performed for the same transcription, and thereafter the dynamic feature extraction mechanism is applied, as sketched below.
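One plausible realization of this step extracts MFCCs and appends their first and second derivatives (the delta and delta-delta "dynamic" features common in ASR front ends); the librosa library and the 16 kHz, 13-coefficient configuration are assumptions, as the disclosure names no specific toolkit.

    import numpy as np
    import librosa

    def extract_dynamic_features(wav_path, n_mfcc=13):
        y, sr = librosa.load(wav_path, sr=16000)        # the recorded WAV input
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
        delta = librosa.feature.delta(mfcc)             # first derivative
        delta2 = librosa.feature.delta(mfcc, order=2)   # second derivative
        return np.vstack([mfcc, delta, delta2]).T       # (frames, 3 * n_mfcc)

    features = extract_dynamic_features("utterance.wav")  # hypothetical file
    print(features.shape)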
In block 310, Figure 3 includes the generation of the acoustic model 108. The acoustic model 108 is a system built from statistical representations, one for each distinct sound, that helps in making an utterance. Each statistical representation is assigned a tag related to a phoneme. The sounds of words in speech are drawn from a set of sounds, i.e. phones, which may be regarded as sub-word units. A sound model is constructed by taking a wide corpus of words and, with the help of particular training algorithms, generating statistical representations for every phoneme in a language.
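As a simplified sketch of that idea, the toy model below gives each phoneme tag a single diagonal-covariance Gaussian over feature frames and scores new frames against every model; real acoustic models use GMM-HMMs or neural networks, and the training frames here are synthetic.

    import numpy as np

    class PhonemeModel:
        def __init__(self, tag, frames):
            self.tag = tag                        # phoneme tag for this model
            self.mean = frames.mean(axis=0)
            self.var = frames.var(axis=0) + 1e-6  # floor to avoid division by zero

        def log_likelihood(self, frame):
            d = frame - self.mean
            return -0.5 * np.sum(np.log(2 * np.pi * self.var) + d * d / self.var)

    # One model per phoneme, trained on (synthetic) labeled feature frames.
    rng = np.random.default_rng(0)
    models = [PhonemeModel("AH", rng.normal(0.0, 1.0, (200, 39))),
              PhonemeModel("S",  rng.normal(2.0, 1.0, (200, 39)))]

    frame = rng.normal(2.0, 1.0, 39)
    best = max(models, key=lambda m: m.log_likelihood(frame))
    print(best.tag)  # most likely phoneme tag for this frame ("S")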
In block 312, Figure 3 includes the language model 110, used for decoding in the speech model 114. The language model 110 of the linguist represents word-level language structure, which can be characterized by any number of pluggable implementations. The primary purpose of the language model 110 is decoding: it maps the series of acoustic units representing an utterance to words. Decoding is the process of computing which series of words, in different languages, is most likely to match the acoustic signal represented by the feature vectors.
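A minimal sketch of this decoding step follows, assuming a hypothetical two-slot word lattice with per-word acoustic log-scores and a toy bigram language model; a production decoder would run Viterbi or beam search over a far larger space.

    import itertools
    import math

    bigram_logprob = {("<s>", "speech"): math.log(0.4),
                      ("<s>", "such"): math.log(0.1),
                      ("speech", "recognition"): math.log(0.5),
                      ("such", "recognition"): math.log(0.05)}

    # Acoustic log-scores for word hypotheses in two time slots.
    lattice = [{"speech": -10.0, "such": -9.5}, {"recognition": -12.0}]

    def decode(lattice):
        best, best_score = None, -math.inf
        for words in itertools.product(*[slot.keys() for slot in lattice]):
            score, prev = 0.0, "<s>"
            for slot, word in zip(lattice, words):
                # combine the acoustic score with the language-model score
                score += slot[word] + bigram_logprob.get((prev, word), math.log(1e-4))
                prev = word
            if score > best_score:
                best, best_score = words, score
        return best

    print(decode(lattice))  # ('speech', 'recognition')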
In block 314, Figure 3 includes applying the desired machine learning for classification and generation of the robust model. For training and testing, machine learning is used to make the system more robust and reliable for recognition in a live environment. The real-time training is performed using the machine learning model to improve the recognition accuracy of the proposed system. In block 316, Figure 3 includes the final outcome: transcriptions in different languages according to the requirements of the user.
In an embodiment, the primary objective of the work includes the development of a spontaneous speech recognizer for multilingual speech systems, which involves the generation of an acoustic model 108 for multilingual speech, phonetic decoding, language modelling and also the development of the speech corpus for training and testing purposes. The system includes a step-by-step process for training the multilingual speech model and testing it in a live environment using different speakers with different accents and speaking styles. In spontaneous speech, the sounds are usually unprompted and unplanned and are commonly characterized by repetitions, repairs, false starts, partial words, non-planned words, silence gaps, etc. The system is focused on the development of the spontaneous speech model 114 for the recognition of multilingual speech recognition systems. So far, no work has been done on spontaneous speech recognition for a multilingual speech system which includes different Indian and foreign languages, including English, Roman French, Spanish, Punjabi, Hindi and others.
In an embodiment, in order to build up the spontaneous speech model 114 for the multilingual system, a vast dataset of the different languages has to be taken from presentations, live debates, interviews, telephonic conversations and one-to-one human communications. For training and testing, machine learning will be used to make the system more robust and reliable for recognition in a live environment. The real-time training is performed using the machine learning model to improve the recognition accuracy of the proposed system. An objective of the proposed solution is that it will be speaker independent and trained with both male and female voices of all the mentioned languages for multilingual compatibility. Speaker independent voice recognition is the opposite of speaker dependent voice recognition: it does not need any training by the speaker and can recognize speech from a large variety of speakers, although more processing capability is required by speaker independent systems than by speaker dependent systems. The performance of the proposed model is evaluated using the speech recognition accuracy in both noisy and noise-free environments; other parameters will also be computed, such as word error rate, convergence ratio and overall likelihood per frame.
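The disclosure does not specify how the noisy condition is produced; one common assumption, sketched below, is to mix white Gaussian noise into the clean signal at a chosen signal-to-noise ratio and evaluate the recognizer on both versions.

    import numpy as np

    def add_noise(signal, snr_db, seed=0):
        """Mix white Gaussian noise into `signal` at the given SNR in dB."""
        rng = np.random.default_rng(seed)
        signal_power = np.mean(signal ** 2)
        noise_power = signal_power / (10 ** (snr_db / 10))
        return signal + rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)

    clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # stand-in audio
    noisy = add_noise(clean, snr_db=10)  # test in noisy and noise-free conditions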
In an embodiment, problems in the existing systems can be overcome by training the multilingual acoustic model 108 in a real-time environment. The disclosed system is trained by different speakers having different speaking styles and speaking different languages. The disclosed system is trained in both noisy and noise-free environments, and the performance of the proposed model can be tested using live testing of multilingual speech. The dynamic feature extraction mechanism is applied for real-time training of the proposed model on a large vocabulary text corpus. A robust acoustic model 108 is built for the multilingual system which recognizes the sounds of any speaker. The machine learning model is applied for generation and classification purposes, and the performance of the proposed model is checked in a live environment. The performance is computed using parameters such as recognition accuracy, word error rate, convergence ratio and overall likelihood per frame.
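Of the parameters listed, word error rate has a standard definition that can be sketched directly: the Levenshtein edit distance between the reference and hypothesis word sequences, normalized by the reference length.

    def word_error_rate(reference, hypothesis):
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = edits turning the first i reference words into the first j hypothesis words
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
        return dp[-1][-1] / len(ref)

    print(word_error_rate("recognize speech in any language",
                          "wreck a nice speech in language"))  # 4 edits / 5 words = 0.8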
In an embodiment, automatic spontaneous multilingual speech recognition (ASR) has become progressively more relevant, tracking the explosive growth of mobile devices. The use of voice as a natural and helpful technique for human-device communication is predominantly relevant to hands-free scenarios (e.g., while driving) and to communication with small form-factor devices (e.g., wearables). The quality of the user experience in these situations is basically determined by the transcription accuracy and real-time responsiveness of the ASR framework for the multilingual system. Automatic recognition systems can also be used in customer service to process repetitive phone requests, or in healthcare and legal settings for documentation processes. The automatic spontaneous speech model 114 for the multilingual system can also help companies, universities and colleges to improve communications and convert them into a data format that is easy to manage and search.
In an embodiment, the use of voice as a natural and helpful technique for human-device communication is predominantly relevant to hands-free scenarios (e.g., while driving) and to communication with small form-factor devices (e.g., wearables). The quality of the user experience in these situations is basically determined by the transcription accuracy and real-time responsiveness of the automatic speech recognition (ASR) framework for the multilingual system. In our daily routine, whether at the office, at home or talking to someone else, we always use mixed language for communication, and we do the same when communicating on a mobile device. That is why the main consideration is the development of a multilingual speech system instead of a single-language speech recognition system; whatever we speak, the proposed model will transcribe it in that particular language. Automatic recognition systems can also be used in customer service to process repetitive phone requests, or in healthcare and legal settings for documentation processes. The automatic spontaneous speech model 114 for the multilingual system can also help companies, universities and colleges to improve communications and convert them into a data format that is easy to manage and search.
In an embodiment, multilingual spontaneous speech recognition is a significant real-world problem as more and more spoken-dialogue applications are being deployed in Asia. Because an intrinsic bias can arise in acoustic scores from different languages, it is also theoretically challenging to devise methods that compensate for this score bias, which can stem from different acoustic and recording environments, different sizes of the training set for each language, and different acoustic and phonetic resolution in modeling each of the languages of interest. The multilingual phoneme set is obtained from monolingual models by merging acoustically similar phones. The model grouping is based on the assumption that the articulatory descriptions of phones are so alike across languages that they can be treated as units independent of the original languages.
In an embodiment, the system is focused on the development of the spontaneous speech model 114 for the recognition of multilingual speech recognition systems. So far, no work has been done on spontaneous speech recognition for a multilingual speech system which includes different Indian and foreign languages, including English, Roman French, Spanish, Punjabi, Hindi and others. In order to build up the spontaneous speech model 114 for the multilingual system, a vast dataset of the different languages has to be taken from presentations, live debates, interviews, telephonic conversations and one-to-one human communications. For training and testing, machine learning will be used to make the system more robust and reliable for recognition in a live environment.
In an embodiment, the real-time training is done using the machine learning model to improve the recognition accuracy of the proposed system. Another objective of the proposed solution is that it will be speaker independent and trained with both male and female voices of all the mentioned languages for multilingual compatibility. Speaker independent voice recognition is the opposite of speaker dependent voice recognition: it does not need any training by the speaker and can recognize speech from a large variety of speakers. More processing capability is required by speaker independent voice recognition systems as compared with speaker dependent systems.
In an embodiment, industrial applications of the system include use by persons having disabilities, in-car systems, military use, and, similarly, use by businesses for transcription or to convert audio and video files into transcripts. Hospitals using voice recognition technology see a significant improvement in the reliability and accuracy of electronic health records. Most medical errors tend to happen during clinical hand-off, when a patient is referred to another physician. During this hand-off, if the entire information, thought process and actions are not transferred appropriately, the chances of medical errors increase. With voice recognition technology, all this information is adequately and accurately transferred between the physicians, thereby reducing the chances of medical errors. The automatic spontaneous speech model 114 for the multilingual system can also help companies, universities and colleges to improve communications and convert them into a data format that is easy to manage and search. The use of voice as a natural and helpful technique for human-device communication is predominantly relevant to hands-free scenarios (e.g., while driving) and to communication with small form-factor devices (e.g., wearables). The quality of the user experience in these situations is basically determined by the transcription accuracy and real-time responsiveness of the ASR framework for the multilingual system.
The system developed in accordance with the present disclosure improves on existing approaches to cross-linguistic automatic speech recognition. The system facilitates training the multilingual acoustic model in a real-time environment. The disclosed system is trained by different speakers having different speaking styles and speaking different languages. The disclosed system is trained for both noisy and noise-free environments. The disclosed system applies the dynamic feature extraction mechanism for real-time training of the proposed model on a large vocabulary text corpus. The disclosed system builds a robust acoustic model for the multilingual system which recognizes the sounds of any speaker. The disclosed system applies the machine learning model for generation and classification purposes and checks the performance of the proposed model in a live environment. The system computes its performance using parameters such as recognition accuracy, word error rate, convergence ratio and overall likelihood per frame. The system uses voice as a natural and helpful technique for human-device communication, predominantly relevant to hands-free scenarios (e.g., while driving) and to communication with small form-factor devices (e.g., wearables). Further, the disclosed system provides an automatic speech recognition system that can be used in customer service to process repetitive phone requests, or in healthcare and legal settings for documentation processes. Furthermore, the disclosed system helps companies, universities and colleges to improve communications and convert them into a data format that is easy to manage and search.
The drawings and the foregoing description give examples of embodiments. Those skilled in the art will appreciate that one or more of the described elements may well be combined into a single functional element. Alternatively, certain elements may be split into multiple functional elements. Elements from one embodiment may be added to another embodiment. For example, the orders of processes described herein may be changed and are not limited to the manner described herein. Moreover, the actions of any flow diagram need not be implemented in the order shown; nor do all of the acts necessarily need to be performed. Also, those acts that are not dependent on other acts may be performed in parallel with the other acts. The scope of embodiments is by no means limited by these specific examples. Numerous variations, whether explicitly given in the specification or not, such as differences in structure, dimension, and use of material, are possible. The scope of embodiments is at least as broad as given by the following claims.
Benefits, other advantages, and solutions to problems have been described above with regard to specific embodiments. However, the benefits, advantages, solutions to problems, and any component(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as a critical, required, or essential feature or component of any or all the claims.

Claims (8)

We Claim
1. A system for cross-linguistic automatic speech recognition, the system comprises:
an input module for receiving a corpus of speech from various sources including a phone file and an input speech signal of various languages; a training module connected to the input module for training the corpus of speech and thereby creating a dictionary file, wherein the dictionary file maps every word to a sequence of sound units associated with each signal; a dynamic feature extraction unit associated with the training module for extracting phones from the dictionary file and extracting unique words from the transcriptions of the phone file and input speech signal, wherein the phone file is a record of individual sound units used for creating a word; an acoustic model in connection with the dynamic feature extraction unit for making an utterance, wherein the acoustic model is built from statistical representations, one for each distinct sound, and each statistical representation is assigned a tag related to a phoneme; a language model in association with the acoustic model for decoding the various languages into a particular language, wherein the language model represents word-level language structure, which can be characterized by a number of pluggable implementations; and a processing unit equipped with machine learning for generating and classifying a robust spontaneous speech model for the multilingual speech system by employing a machine learning model for generating transcriptions in different languages.
2. The system as claimed in claim 1, wherein a set of transcripts is provided for a database (in a single file) with two dictionaries, wherein legitimate words in the language are mapped to sequences of sound units (or sub-word units) in a first dictionary, and non-speech sounds are mapped to the corresponding non-speech or speech-like sound units in a second dictionary.
3. The system as claimed in claim 1, wherein the sounds of words in speech are drawn from a set of sounds, i.e. phones, which may be regarded as sub-word units, wherein a sound model is constructed by taking a wide corpus of words and, with the help of particular training algorithms, generating statistical representations for every phoneme in a language.
4. The system as claimed in claim 1, wherein the language model maps the series of acoustic units representing an utterance to words.
5. The system as claimed in claim 1, wherein a spontaneous speech model is designed for the multilingual system containing a dataset of various languages from various sources including presentations, live debates, interviews, telephonic conversations and one-to-one human communications.
6. A method for cross-linguistic automatic speech recognition, the method comprises:
receiving a corpus of speech from various sources including a phone file and an input speech signal of various languages; training the corpus of speech and thereby creating a dictionary file, wherein the dictionary file maps every word to a sequence of sound units associated with each signal; extracting phones from the dictionary file and extracting unique words from the transcriptions of the phone file and input speech signal; making an utterance by employing an acoustic model, wherein the acoustic model is built from statistical representations, one for each distinct sound, and each statistical representation is assigned a tag related to a phoneme; decoding the various languages into a particular language; and generating and classifying a robust spontaneous speech model for the multilingual speech system by employing a machine learning model for generating transcriptions in different languages.
7. The method as claimed in claim 6, wherein the method for cross-linguistic automatic speech recognition further comprises:
receiving a corpus of speech from various sources including a phone file and an input speech signal of various languages; acquiring a dictionary file of multilingual speech from a database for comparing and separating the phone file and input speech signal in a particular language from the multilingual speech; training the corpus of speech using a training module and thereupon creating a dictionary file of the trained corpus of speech; extracting phones from the dictionary file and extracting unique words from the transcriptions of the phone file and input speech signal using a dynamic feature extraction mechanism; and generating and classifying a robust spontaneous speech model for the multilingual speech system by employing a machine learning model for generating transcriptions in different languages.
8. The method as claimed in claim 6, wherein real-time training is performed on the corpus of speech using a machine learning model to improve speech recognition.
Figure 1 (block diagram): Input Module 102, Training Module 104, Dynamic Feature Extraction Unit 106, Acoustic Model 108, Language Model 110, Processing Unit 112, and Spontaneous Speech Model 114.

Figure 2 (flow chart of method 200): step 202, receiving corpus of speech from various sources including phone file and input speech signal of various languages; step 204, training corpus of speech and thereby creating a dictionary file, wherein the dictionary file maps every word to a sequence of sound units associated with each signal; step 206, extracting phones from the dictionary file and extracting unique words from the transcriptions of the phone file and input speech signal; step 208, making an utterance, wherein the acoustic model is built from statistical representations, each assigned a tag related to a phoneme; step 210, decoding various languages into a particular language; step 212, generating and classifying a robust spontaneous speech model for the multilingual speech system by employing a machine learning model for generating transcriptions in different languages.

Figure 3 (block diagram): blocks 302 to 316, as described in the detailed description.
AU2020103587A 2020-11-20 2020-11-20 A system and a method for cross-linguistic automatic speech recognition Ceased AU2020103587A4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2020103587A AU2020103587A4 (en) 2020-11-20 2020-11-20 A system and a method for cross-linguistic automatic speech recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2020103587A AU2020103587A4 (en) 2020-11-20 2020-11-20 A system and a method for cross-linguistic automatic speech recognition

Publications (1)

Publication Number Publication Date
AU2020103587A4 true AU2020103587A4 (en) 2021-02-04

Family

ID=74236446

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2020103587A Ceased AU2020103587A4 (en) 2020-11-20 2020-11-20 A system and a method for cross-linguistic automatic speech recognition

Country Status (1)

Country Link
AU (1) AU2020103587A4 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115223549A (en) * 2022-07-09 2022-10-21 昆明理工大学 Vietnamese speech recognition corpus construction method
CN118553231A (en) * 2024-07-24 2024-08-27 南京听说科技有限公司 Speech recognition method for multiple languages


Similar Documents

Publication Publication Date Title
O’Shaughnessy Automatic speech recognition: History, methods and challenges
Ghai et al. Literature review on automatic speech recognition
CN108766414B (en) Method, apparatus, device and computer-readable storage medium for speech translation
Arora et al. Automatic speech recognition: a review
US6910012B2 (en) Method and system for speech recognition using phonetically similar word alternatives
Gruhn et al. Statistical pronunciation modeling for non-native speech processing
Lal et al. Cross-lingual automatic speech recognition using tandem features
Ghai et al. Analysis of automatic speech recognition systems for indo-aryan languages: Punjabi a case study
KR20060050361A (en) Hidden conditional random field models for phonetic classification and speech recognition
Basak et al. Challenges and Limitations in Speech Recognition Technology: A Critical Review of Speech Signal Processing Algorithms, Tools and Systems.
Karat et al. Conversational interface technologies
Devi et al. Speaker emotion recognition based on speech features and classification techniques
Karpov An automatic multimodal speech recognition system with audio and video information
AU2020103587A4 (en) A system and a method for cross-linguistic automatic speech recognition
Celin et al. A weighted speaker-specific confusion transducer-based augmentative and alternative speech communication aid for dysarthric speakers
Fellbaum et al. Principles of electronic speech processing with applications for people with disabilities
Këpuska Wake-up-word speech recognition
Thennattil et al. Phonetic engine for continuous speech in Malayalam
Rao et al. Language identification using excitation source features
Wong Automatic spoken language identification utilizing acoustic and phonetic speech information
Sangjamraschaikun et al. Isarn digit speech recognition using HMM
Venkatagiri Speech recognition technology applications in communication disorders
Manjunath et al. Automatic phonetic transcription for read, extempore and conversation speech for an Indian language: Bengali
Sinha et al. Fusion of multi-stream speech features for dialect classification
Nga et al. A Survey of Vietnamese Automatic Speech Recognition

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry