WO2019184942A1 - Audio exchange method and audio exchange system for language semantics, and coding graphic - Google Patents

Audio exchange method and audio exchange system for language semantics, and coding graphic

Info

Publication number
WO2019184942A1
WO2019184942A1 (PCT/CN2019/079834, CN2019079834W)
Authority
WO
WIPO (PCT)
Prior art keywords
language
speech
phoneme
basic
audio
Prior art date
Application number
PCT/CN2019/079834
Other languages
English (en)
French (fr)
Inventor
孔繁泽
Original Assignee
孔繁泽
Priority date
Filing date
Publication date
Application filed by 孔繁泽 filed Critical 孔繁泽
Publication of WO2019184942A1 publication Critical patent/WO2019184942A1/zh

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • G06F40/40 - Processing or translation of natural language
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/005 - Language recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/04 - Segmentation; Word boundary detection
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/1822 - Parsing for meaning understanding
    • G10L2015/025 - Phonemes, fenemes or fenones being the recognition units

Definitions

  • The present application relates to the field of information exchange, and in particular to an audio exchange method for language semantics, an audio exchange system, and a coding graphic.
  • the current language translation is mainly composed of speech recognition, semantic analysis and sentence synthesis.
  • the speech recognition uses high-sensitivity sensors to extract the audio signal set corresponding to the sentence text from the frequency domain or time domain speech signal stream of the initial language.
  • The semantic analysis uses hidden Markov models (HMM), self-learning models, artificial neural networks (ANN), and other models to identify and quantify the text sequence and semantics in the audio signal set, determining the expressed content as closely as possible.
  • The sentence synthesis then uses the identified and quantified data to form an audio signal set or a text sequence in the target language based on the expressed content.
  • The complexity of the semantic analysis model requires a large amount of computing resources. Mobile-terminal applications therefore need a distributed computing architecture and reliable Internet bandwidth to access server-side computing resources, so real-time translation and accuracy are limited.
  • The prior art includes an apparatus for implementing speech-to-text conversion using digital encoding, in which a phoneme storage unit stores first-language phoneme feature data, and a phoneme conversion unit converts the received phoneme signal sequence using the first-language phoneme feature data.
  • This apparatus illustrates the basis of the coding mapping between text and speech; how to use that basis to reduce the resource consumption of converting graphics and audio with the same semantics between languages still requires inventive improvement.
  • The embodiments of the present application are directed to providing an audio exchange method and an audio exchange system for language semantics, so as to solve the technical problem in the prior art that the semantic complexity of language interpretation leads to poor data response and real-time performance.
  • the audio exchange method of the language semantics in the embodiment of the present application forms a speech mapping structure of each language by using a minimum phoneme sequence, and performs semantic inter-language conversion through each speech mapping structure.
  • the language semantic audio exchange system of the embodiment of the present application is configured to form a speech mapping structure of each language by using a minimum phoneme sequence, and perform semantic inter-language conversion through each speech mapping structure.
  • The basic speech coding graphic of the embodiments of the present application is used for the graphical display of language phonemes and includes a basic frame. The basic frame includes a first adapter column, a second adapter column, and an adapter bar; the first adapter column and the second adapter column each provide an adaptation bit group comprising a plurality of adaptation bits, and the two ends of the adapter bar each connect to one adaptation bit of an adapter column.
  • In the audio exchange method and audio exchange system for language semantics of the embodiments of the present application, the minimal phoneme, which forms the shortest audio segment in language composition, is used as the basic data exchange unit for semantic conversion between languages and as the coding basis of data exchange.
  • This changes the basic structure of speech recognition, simplifies the coding length and improves the coding efficiency of audio content in a language, optimizes data exchange efficiency during language translation, reduces the real-time response delay of remote data, and has a positive impact on the storage capacity for the basic data structure and basic data on the local mobile terminal.
  • FIG. 1 is a schematic diagram of a data processing process of an audio exchange method for language semantics according to an embodiment of the present application.
  • FIG. 2 is a schematic diagram showing an encoding process of an audio exchange method for language semantics according to an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a voice mapping structure of an audio exchange method for language semantics according to an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a voice mapping structure of an audio exchange method for language semantics according to an embodiment of the present application.
  • FIG. 5 is a schematic diagram showing language conversion of a language semantic audio exchange method according to an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of an audio exchange system for language semantics according to an embodiment of the present application.
  • FIG. 7 is a schematic diagram showing the structure of a basic speech coding pattern in the audio exchange method of the language semantics according to the embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
  • The speech mapping structure of each language is formed from a minimal phoneme sequence, and the inter-language conversion of semantics is completed through the speech mapping structures.
  • Semantic conversion refers to the conversion between different texts and pronunciations that express the same semantics.
  • In a regional common language, the pronunciation that expresses a given semantics is deterministic, and the pronunciations of words and sentences can be decomposed into different combinations of syllables.
  • Using a basic set of minimal phonemes to form each syllable eliminates redundant audio signals and interference information, because the minimal phoneme has a low signal load; this provides a more compact coding basis for complex data exchange and reduces code length.
  • The number of minimal phonemes and the audio characteristics of these basic pronunciation elements can be determined: fewer than 1000 in total, without repetition, across the roughly 7000 languages of the world.
  • Each Western language uses about 40 minimal phonemes, and Chinese uses no more than 150. A fixed-length code covering the hundreds or thousands range can therefore establish an index, for example a three- or four-digit decimal number, or a 10- to 20-digit binary number.
  • The audio exchange method for language semantics in the embodiments of the present application uses the minimal phoneme, the shortest audio segment in language composition, as the basic data exchange unit for semantic conversion between languages and as the coding basis of data exchange, changing the basic structure of speech recognition.
  • It simplifies the coding length and improves the coding efficiency of audio content in a language, so that the complex audio features formed by the composite information of tone, scale, and sound domain in a language segment are avoided during encoding, while the speech recognition rate is ensured.
  • The mapping structure between speech coding and text coding formed by minimal phonemes optimizes data exchange efficiency in language translation, reduces the real-time response delay of remote data, and improves the storage capacity for the basic data structure and basic data on the local mobile terminal.
  • FIG. 1 is a schematic diagram of a data processing process of a language semantic audio exchange method according to an embodiment of the present application. As shown in Figure 1, it includes:
  • Step 100 Serialize all the smallest phonemes.
  • The serialization process may include: recognition of the syllables, phonemes, scales, and tones in a language; quantitative mathematical description of the identified syllables, phonemes, scales, and tones, such as audio feature data in the time or frequency domain; and structured storage of the quantitative description data, such as one-by-one indexing.
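  • As a minimal sketch (not from the application), the serialization in step 100 can be modeled as assigning each identified minimal phoneme a quantitative feature record and a one-by-one index; the phoneme labels and feature fields below are illustrative assumptions:

```python
# Sketch of step 100: serialize all minimal phonemes into an indexed structure.
# Phoneme labels and feature values are illustrative placeholders only.

def serialize_phonemes(phoneme_features):
    """Assign each minimal phoneme a sequential index (structured storage)."""
    sequence = {}
    for index, (label, features) in enumerate(sorted(phoneme_features.items())):
        sequence[label] = {"index": index, "features": features}
    return sequence

# Quantitative descriptions, e.g. coarse frequency/duration features (made up).
features = {
    "m": {"f1_hz": 250, "duration_ms": 60},
    "a": {"f1_hz": 850, "duration_ms": 120},
}
unified = serialize_phonemes(features)
```

  • Sorting before indexing keeps the assignment deterministic, so the same phoneme set always yields the same unified sequence.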
  • Step 200 Form text-to-speech mapping data of each language by using a subset of all the minimal phonemes.
  • The pronunciation basis of each language is determined by a subset of all the minimal phonemes; combinations of the minimal phonemes in the subset form speech identifiers for the pronunciations of the language's words, and speech recognition is then used to form correspondingly structured mapping data between text and speech identifiers.
  • The mapping data includes the data structure in which the data is stored, and may include text-to-speech mapping data as well as speech-to-speech mapping data.
  • Step 300 Form speech-to-speech mapping data between the languages by using language semantics.
  • This mapping data likewise includes the data structure in which it is stored, and may also include text-to-speech mapping data.
  • Step 400 Form semantic conversion between languages by using the corresponding speech-to-speech mapping data and text-to-speech mapping data.
  • The audio exchange method for language semantics in the embodiments of the present application ensures the coherence and correctness of text-to-speech conversion within a language through text-to-speech mapping data, while the combination of speech-to-speech and text-to-speech mapping data diversifies inter-language conversion, achieving higher data exchange efficiency during conversion while ensuring conversion quality between languages.
  • A further encryption effect can be obtained by varying the mappings between the speech-to-speech mapping data and the text-to-speech mapping data.
  • FIG. 2 is a schematic diagram of an encoding process of a language semantic audio exchange method according to an embodiment of the present application. As shown in FIG. 2, based on the above embodiment, step 100 includes:
  • Step 110 Collect the minimal phonemes of each common language by speech recognition.
  • The speech of a language can be decomposed structurally, from sentence pronunciation to word pronunciation to word syllables to phonemes.
  • Those skilled in the art will appreciate that computer-based audio acquisition and time- or frequency-domain feature analysis of audio segments can determine the audio characteristics of characters, words, and phrases, and identify the minimal phoneme features they contain.
  • Step 120 Form the smallest phoneme into a unified phoneme sequence.
  • speech recognition techniques can identify and determine the smallest phoneme audio features employed in each language.
  • the determined audio features of each of the smallest phonemes are uniformly labeled to form a unified phoneme sequence of all the smallest phonemes.
  • The unified phoneme sequence enables the speech of a language to be accurately deconstructed into a determined combination of at least one minimal phoneme, and that combination obtains its corresponding coding sequence from the unified phoneme sequence.
  • In Chinese, syllables are formed from initials and finals: an initial is formed by one or several minimal phonemes, and a final likewise by one or several minimal phonemes.
  • Similarly, English forms syllables from vowels and consonants: a vowel is formed by a single minimal phoneme or several minimal phonemes, and a consonant by one or several minimal phonemes.
  • Parts of the formed unified phoneme sequence can be tabulated as follows:
  • Each single minimal phoneme in the unified phoneme sequence table has a unique encoding within the sequence.
  • A unique code can be formed using a 10-bit length for the fewer than 1000 minimal phonemes.
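  • Since fewer than 1000 minimal phonemes need distinct codes, a 10-bit fixed length suffices (2^10 = 1024). A minimal sketch of such a fixed-length encoding (function names are illustrative, not from the application):

```python
# A 10-bit fixed-length code can uniquely index fewer than 1000 minimal
# phonemes, since 2**10 = 1024 >= 1000.

BITS = 10

def encode_phoneme_index(index):
    """Return the 10-bit binary code for a phoneme index."""
    if not 0 <= index < 2 ** BITS:
        raise ValueError("index out of range for a 10-bit code")
    return format(index, "010b")

def decode_phoneme_index(code):
    """Recover the phoneme index from its fixed-length binary code."""
    return int(code, 2)
```

  • Fixed length means every phoneme boundary in a concatenated code stream is known in advance, which is what makes the compact coding basis addressable without separators.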
  • The audio exchange method for language semantics in the embodiments of the present application forms a unified phoneme sequence as the basic information carrier for converting text or speech of the same or similar semantics between different languages, avoiding the information interference caused by the redundant information carried by other composite audio carriers (such as syllables), which benefits the accuracy and efficiency of speech recognition.
  • Because the minimal phonemes use a unified phoneme sequence, the sequence can be further updated as languages evolve, keeping pace with the changes in each language's speech.
  • step 200 of the audio exchange method for language semantics in an embodiment of the present application includes:
  • Step 210 Form a first basic speech coding sequence corresponding to the pronunciation of a single character or word in the first language by using a part of the phonemes in the unified phoneme sequence.
  • This part of the phonemes includes all the minimal phonemes of the language, and can be used to form the syllables that make up the pronunciation of each single character or word in the language.
  • The basic speech coding of each character or word in the first language is thus formed, yielding a basic speech coding sequence for all (or the primary) characters and words.
  • For example, a character with the pinyin “ma” (as in “mama”) includes the phonemes “m” and “a”.
  • The encoding of “m” in the unified phoneme sequence is 120, and the encoding of “a” is 010.
  • The code of the character in the Chinese basic speech coding sequence is therefore 120010.
  • Encoding compression methods may also be used; for example, accumulating the codes of the phonemes in the character gives the code 130. Alternatively, the graphical form of basic speech coding may be used.
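  • The “ma” example can be sketched directly, taking “m” → 120 as stated and “a” → 010 as implied by the concatenated code 120010; the accumulation compression simply sums the numeric codes (120 + 10 = 130):

```python
# Sketch of basic speech coding for the syllable "ma": concatenation of
# per-phoneme codes, plus the accumulation-based compression.
# The code values "120" and "010" follow the example in the text.

PHONEME_CODES = {"m": "120", "a": "010"}

def basic_speech_code(phonemes):
    """Concatenate fixed-length phoneme codes into a basic speech code."""
    return "".join(PHONEME_CODES[p] for p in phonemes)

def compressed_code(phonemes):
    """Compression by accumulation: sum the numeric phoneme codes."""
    return sum(int(PHONEME_CODES[p]) for p in phonemes)
```

  • Note that accumulation is lossy (different phoneme combinations can sum to the same value), which is why it serves as a compression option rather than the primary code.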
  • Step 220 Form a first voice mapping structure corresponding to a phrase or sentence pronunciation in the first language by using the first basic speech coding sequence.
  • The speech mapping structure of a phrase or sentence can be formed by extending the basic speech coding sequence.
  • The speech mapping structure can adopt an addressable data structure with address characteristics, such as a static or dynamic queue, array, heap, stack, linked list, tree, or graph, used singly or in combination, and can be implemented with static or dynamic pointers.
  • Each data structure involved in the speech mapping structure may exist independently or be juxtaposed with the others.
  • The above data structures and pointers can be used to form a mapping structure among characters, words, sentences, and semantics according to their semantic relatedness, establishing a partial speech mapping structure organized by semantics.
  • FIG. 3 is a schematic diagram of a voice mapping structure of a language semantic audio exchange method according to an embodiment of the present application.
  • In Figure 3, Chinese is taken as the example, using the characters “fa”, “ming”, “chuang”, and “zao”. Each character is the smallest semantic unit, and its basic speech coding is established from the phonemes of its pronunciation; the basic speech codings of the characters are discrete.
  • Storing single characters in a linked-list structure (as an example only) ensures high-speed filtering efficiency for single-character codes (i.e., phoneme features).
  • Words with semantic meaning formed from single characters, such as “invention” and “creation”, are stored in another linked-list structure; the basic speech coding of each word is formed from the basic speech codings of its characters, and the words' codings are discrete.
  • Phrases with semantic meaning formed from single characters or words are stored in an array structure (as an example only), which ensures efficient rapid addressing and data structure updates; the basic speech codings of the phrases are discrete.
  • Address pointers in the data structures are used to form a mapping tree or mapping structure graph relating characters, words, and phrases according to their semantic relatedness, so that a mapping relationship is formed between speech and semantics; the mapping associations may be static, or parts may be dynamically updated.
  • The data unit of each character, word, or phrase can be expanded, for example into a queue, to store characters (or words, or phrases) that share a pronunciation but differ in semantics, making the speech mapping structure multidimensional.
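  • One minimal way to realize such an addressable mapping structure is sketched below; the application allows queues, arrays, heaps, stacks, linked lists, trees, or graphs, and the containers, entries, and code values here are illustrative assumptions only:

```python
# Sketch of a speech mapping structure: single characters and the words
# built from them live in separate containers linked by references, and a
# word's basic speech code is formed from its characters' codes.
# All entries and code values are illustrative placeholders.

characters = {"fa": "201", "ming": "305"}      # character -> basic speech code

words = {"faming": {"parts": ["fa", "ming"]}}  # word -> its component characters

homophones = {}  # pronunciation code -> queue of same-sounding units

def word_code(word):
    """Form a word's basic speech code from its characters' codes."""
    return "".join(characters[part] for part in words[word]["parts"])

def register_homophone(word):
    """Expand the data unit into a queue of same-pronunciation entries."""
    homophones.setdefault(word_code(word), []).append(word)
```

  • The homophone queue mirrors the expansion into a multidimensional structure: one pronunciation code addresses a whole list of same-sounding, different-semantics units.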
  • The audio exchange method for language semantics in the embodiments of the present application adopts a data storage structure in which speech maps to text, so that the main part of the speech mapping structure is static; structural optimization can be formed by server-side or cloud computing capability, while a small amount of dynamic updating and supplementation is done with modest client-side computing resources. Because the basic speech coding sequence is formed from the phonemes of pronunciation, the complexity and data volume of the semantic speech mapping structure are greatly reduced, so its data storage and processing can complete responses at the client and server in a low-latency state.
  • step 200 of the audio exchange method for language semantics in an embodiment of the present application further includes:
  • Step 230 Form a second basic speech coding sequence of a single word or word pronunciation in the second language by using another partial phoneme in the unified phoneme sequence.
  • Compared with the partial phonemes in step 210 above, the other partial phonemes may include partly identical phonemes, or the same phoneme may be identified by different characters or symbols in different languages.
  • Step 240 Form a second voice mapping structure corresponding to the phrase or sentence pronunciation in the second language by using the second basic speech coding sequence.
  • Words (or symbols) with the same semantics in different languages may have the same pronunciation; as the speech mapping structures of the two languages are formed, the same pronunciation of different same-semantics words produces coding differences.
  • FIG. 4 is a schematic diagram of a voice mapping structure of a language semantic audio exchange method according to an embodiment of the present application.
  • Each word is the smallest semantic unit, and its basic speech coding is established from the phonemes of its pronunciation; the basic speech codings of the words are discrete.
  • Storing words in a table structure of a database ensures high-speed filtering efficiency for word codes (i.e., phoneme features).
  • Phrases with semantic meaning formed from words are stored in another table structure of the database (as an example only), which ensures efficient rapid addressing and data structure updates; the basic speech codings of the phrases are discrete.
  • Address pointers in the data structures are used to form a mapping tree or mapping structure graph relating words and phrases according to their semantic relatedness, so that a mapping relationship is formed between speech and semantics; the mapping associations may be static, or parts may be dynamically updated.
  • the data unit of each word or phrase can be expanded into a queue for storing words or phrases of different semantics of the same pronunciation, and the speech mapping structure is multi-dimensionalized.
  • The audio exchange method for language semantics in the embodiments of the present application adopts a data storage structure in which speech maps to text, so that the main part of the speech mapping structure is static; structural optimization can be formed by server-side or cloud computing capability, while a small amount of dynamic updating and supplementation is done with modest client-side computing resources. Because the basic speech coding sequence is formed from the phonemes of pronunciation, the complexity and data volume of the semantic speech mapping structure are greatly reduced, so its data storage and processing can complete responses at the client and server in a low-latency state.
  • step 300 in the audio exchange method of the language semantics of the embodiment of the present application further includes:
  • Step 310 Form a primary speech conversion structure between the languages by using the same or similar semantic information through the speech mapping structures of the first language and the second language.
  • Between languages that require translation, the speech mapping structures of the two languages are used to form a primary speech conversion structure for single characters or words of the same or similar semantics, based on the same or similar semantic information, storing the basic speech codings of characters, words, phrases, or sentences in the two languages.
  • The primary speech conversion structure can be stored as a “key: key-value” structure to respond efficiently to the filtering demands of a large number of concurrent requests.
  • For bidirectional translation, English basic speech codes and Chinese basic speech codes can serve as each other's keys and values.
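  • Such a primary conversion structure can be sketched as a pair of key-value tables in which the basic speech codes of the two languages serve as each other's keys; all code values below are illustrative placeholders:

```python
# Sketch of the "key: key-value" primary conversion structure for
# bidirectional translation. Code values are illustrative placeholders.

zh_to_en = {"120010": "077023"}  # Chinese basic code -> English basic code
en_to_zh = {v: k for k, v in zh_to_en.items()}  # inverted for the reverse direction

def convert_code(code, table):
    """Look up the same-semantics basic speech code in the other language."""
    return table[code]
```

  • Keeping both directions as flat key-value tables is what makes the lookup a single hash access, matching the stated goal of serving many concurrent filtering requests.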
  • step 300 in the audio exchange method of the language semantics of the embodiment of the present application further includes:
  • Step 320 Form a speech advanced conversion structure between the corresponding (ie, the first and second) speech mapping structures by using the grammar rules of the first language and the second language.
  • The grammar rules of each language form an advanced conversion structure between characters or words based on their roots and parts of speech.
  • the speech advanced conversion structure can be stored in a "key:key-value" structure in response to the filtering efficiency of a large number of concurrent requests.
  • The basic speech codings of single characters, words, or vocabulary with similar semantics under the differing grammars of the two languages can be relatively aggregated, improving coding correlation and thus the filtering efficiency and the efficiency of the computer translation algorithm during translation.
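  • The aggregation described above can be sketched by keying word-form codes on a shared root and part of speech, so related entries cluster under one lookup; the roots, forms, and codes below are illustrative assumptions:

```python
# Sketch of the grammar-based advanced conversion structure: basic speech
# codes of word forms sharing a root and part of speech are aggregated
# under one key. Roots, forms and codes are illustrative placeholders.

advanced = {
    ("invent", "verb"): {"invent": "410", "invents": "411", "invented": "412"},
}

def forms_for(root, part_of_speech):
    """Return the aggregated word-form codes for a root and part of speech."""
    return advanced[(root, part_of_speech)]
```

  • Grouping forms under a (root, part-of-speech) key keeps grammatically related codes adjacent, which is the coding correlation the text says improves filtering efficiency.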
  • FIG. 5 is a schematic diagram of language conversion of a language semantic audio exchange method according to an embodiment of the present application. As shown in FIG. 5, step 400 includes:
  • Step 410 Acquire a sequential phoneme set of an audio input segment of the first language by using speech recognition;
  • Step 420 Determine a first basic speech coding of the sequential phoneme set by using the first basic speech coding sequence in the first language;
  • Step 430 Determine a continuous speech coding of the sequential phoneme set by using the first speech mapping structure of the first language and the first basic speech coding sequence;
  • Step 440 Obtain a second basic speech coding of the second language by using a speech primary conversion structure between the corresponding languages;
  • Step 450 Obtain continuous speech coding of the second language by using a speech advanced conversion structure and a second basic speech coding sequence between the corresponding languages;
  • Step 460 Form a voice pronunciation according to continuous speech coding in the second language.
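  • The steps above can be sketched end to end; every table below is an illustrative stand-in for the structures described earlier (phoneme codes, primary conversion structure, pronunciation lookup), not the application's actual data:

```python
# Sketch of steps 410-460: phoneme set of language 1 -> basic speech code
# -> primary conversion -> basic speech code of language 2 -> pronunciation.
# All tables and values are illustrative placeholders.

phoneme_to_code_l1 = {"m": "120", "a": "010"}  # steps 410-420
code_l1_to_l2 = {"120010": "077023"}           # steps 440-450 (conversion structures)
code_l2_to_speech = {"077023": "mum"}          # step 460 (pronunciation lookup)

def exchange(phonemes):
    """Run the full conversion pipeline on a sequential phoneme set."""
    basic = "".join(phoneme_to_code_l1[p] for p in phonemes)  # steps 410-430
    target = code_l1_to_l2[basic]                             # steps 440-450
    return code_l2_to_speech[target]                          # step 460
```

  • Every stage is a table lookup over small, static structures, which is why the method claims low retrieval difficulty and suitability for local storage and processing.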
  • The audio exchange method for language semantics in the embodiments of the present application uses the formed phoneme sequence, basic speech coding sequence, and speech mapping structures, together with the conversion structures formed between languages, to complete reversible conversion between speech and text across two languages, helping the conversion obtain the corresponding candidate text combinations accurately or nearly accurately.
  • the data and data structure have limited storage size, low retrieval difficulty, and are suitable for local storage and processing.
  • the real-time and bandwidth requirements of the server-side data request response are not high.
  • FIG. 6 is a schematic structural diagram of an audio exchange system for language semantics according to an embodiment of the present application.
  • the audio exchange system in the embodiment of the present application is configured to form a voice mapping structure of each language by using a minimum phoneme sequence, and perform semantic language conversion by each voice mapping structure.
  • the audio switching system in this embodiment of the present application includes:
  • The serialization device 1100 is configured to serialize all the minimal phonemes.
  • The intra-language phoneme mapping forming device 1200 is configured to form text-to-speech mapping data of each language from a subset of all the minimal phonemes.
  • The inter-language phoneme mapping forming device 1300 is configured to form speech-to-speech mapping data between the languages by language semantics.
  • The language conversion device 1400 is configured to form semantic conversion between languages by using the corresponding speech-to-speech mapping data and text-to-speech mapping data.
  • the serialization device 1100 in the audio switching system of the embodiment of the present application includes:
  • the phoneme recognition module 1110 is configured to collect a minimum phoneme of each common language by voice recognition.
  • the phoneme encoding module 1120 is configured to form a minimum phoneme into a unified phoneme sequence.
  • the in-language phoneme mapping forming apparatus 1200 in the audio switching system of the embodiment of the present application includes:
  • the first speech coding establishment module 1210 is configured to form a first basic speech coding sequence corresponding to a pronunciation of a single word or a word in the first language by using a part of the phonemes in the unified phoneme sequence.
  • the first voice mapping establishing module 1220 is configured to form a first voice mapping structure corresponding to a phrase or a sentence pronunciation in the first language by using the first basic voice coding sequence.
  • the second speech coding establishing module 1230 is configured to form a second basic speech coding sequence of a single word or a word pronunciation in the second language by using another partial phoneme in the unified phoneme sequence.
  • the second voice mapping establishing module 1240 is configured to form a second voice mapping structure corresponding to the phrase or sentence pronunciation in the second language by using the second basic speech coding sequence.
  • the inter-lingual phoneme mapping forming apparatus 1300 in the audio switching system of the embodiment of the present application includes:
  • The language structure primary conversion module 1310 is configured to form a primary speech conversion structure between the languages by using the same or similar semantic information through the speech mapping structures of the first language and the second language.
  • the language structure advanced conversion module 1320 is configured to form a speech advanced conversion structure between the corresponding (ie, the first and second) speech mapping structures by using the grammar rules of the first language and the second language.
  • the language conversion device 1400 in the audio exchange system of the embodiment of the present application includes:
  • a phoneme recognition module 1410 configured to acquire a sequential phoneme set of audio input segments of the first language by using voice recognition
  • the first basic code recognition module 1420 is configured to determine, by using the first basic speech coding sequence in the first language, the first basic speech coding of the set of sequential phonemes;
  • the first continuous speech encoding module 1430 is configured to determine a continuous speech encoding of the sequential phoneme set by using the first speech mapping structure and the first basic speech encoding sequence in the first language;
  • a second basic code recognition module 1440 configured to obtain a second basic speech code of the second language by using the primary speech conversion structure between the corresponding languages;
  • a second continuous speech encoding module 1450 configured to obtain the continuous speech code of the second language by using the advanced speech conversion structure between the corresponding languages and the second basic speech coding sequence;
  • the continuous code conversion module 1460 is configured to form a voice pronunciation according to continuous speech coding in the second language.
  • the disclosed systems, devices, and methods may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division of units is only a division by logical function; in actual implementation there may be other divisions: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be in an electrical, mechanical or other form.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the functions, if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a computer-readable storage medium.
  • on this understanding, the part of the technical solution of the present application that is essential or that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes
  • several instructions for causing a computer device (which may be a personal computer, server, or network device, etc.) to perform all or part of the steps of the methods described in various embodiments of the present application.
  • the foregoing storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
  • a basic speech coding sequence for the pronunciation of characters or words in a language is formed using some of the minimal phonemes in the unified phoneme sequence, and each basic speech code can form an additional graphical symbol corresponding to the character or word with that pronunciation.
  • rendering basic speech codes graphically converts the pronunciation recognition of phoneme-formed characters or words into visual recognition, which aids communication between computer vision and computer speech recognition and gives same-semantics speech conversion between languages a computer-vision basis.
  • FIG. 7 is a schematic diagram showing the structure of a basic speech coding pattern in the audio exchange method of the language semantics according to the embodiment of the present application.
  • the graphic structure comprises an H-shaped basic frame 01, which includes a first adapter column 10 (a bar pattern) and a second adapter column 20 (a bar pattern) arranged vertically in parallel,
  • and further includes an adapter bar 30 (a bar pattern) whose two ends are connected to the first adapter column and the second adapter column, respectively.
  • the first adapter column (on the left in the figure) is provided with a first adaptation bit group 11,
  • the second adapter column (on the right in the figure) is provided with a second adaptation bit group 21,
  • and the adapter bar 30 is provided with a third adaptation bit group 31; each end of the adapter bar 30 is connected to an adaptation bit of the adapter column on its side,
  • and each adapter column's adaptation bit group includes at least three adaptation bits (the figure shows five).
  • adjacent adaptation bits in the same adaptation bit group are used to adjust the length of the adapter column: coinciding adaptation bits form a specific adjustment that changes the length of the corresponding column accordingly,
  • and at least two adaptation bits can coincide.
  • the end of the adapter bar 30 can be connected to a coincident adaptation bit of the corresponding adapter column.
  • the phoneme codes of the syllables making up a character's or word's pronunciation, or the syllable codes formed from phonemes, can be reflected in the changing connection shapes of the first adapter column, the second adapter column and the adapter bar;
  • the fixed positions of the adaptation bits and the coincidence changes among them form enough permutations to reflect the coded content of the syllable.
  • an embodiment of the present application may further include an auxiliary adaptation symbol 40 connected to an adaptation bit, the auxiliary adaptation symbol 40 including a vector line segment 41 with a vector direction and a standard symbol 42 without one.
  • the vector line segment 41 may be a line segment or a minor arc,
  • the standard symbol 42 may be a circle or a ring,
  • and there may be one or more vector line segments
  • and one or more standard symbols.
  • vector line segments and standard symbols attached to adaptation bits can combine syllable-related intonation, tone and other additional audio features with the syllable code, increasing the information load of the syllable code.
  • part b shows the graphics corresponding to the speech codes of the characters "后" and "候",
  • and part c shows those of the characters "口" and "寇".
  • the initial consonant of each character's pronounced syllable is expressed by the length change of the first adapter column on the left of the basic frame together with a vector line segment 41, and the final is expressed on the second adapter column on the right of the basic frame by its length change together with a vector line segment 41 and a standard symbol 42.
  • the basic framework and the auxiliary adaptation symbols are smoothed to maintain the aesthetics of the graphics and to ensure the quality of the computer visual recognition.
  • as shown in part d of Fig. 7, using coincident adaptation bits and the connection position of the adapter bar 30, the basic frame 01 can be converted from an H shape to an n shape; as shown in part e of Fig. 7, it can likewise be converted from an H shape to a U shape.
  • the first and second adapter columns around the basic frame directly carry the codes of the minimal phonemes, the number of code digits corresponding to the adaptation bits of the respective column.
  • displaying the minimal phonemes' codes directly within a language's syllable gives a direct visual expression of the language's phonetic alphabet, phoneme codes and speech, so that the basic speech coding graphics of two languages can be converted by computer vision; while speech is converted, computer graphic recognition helps ensure the recognition rate of language recognition.
  • FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
  • the electronic device provided by FIG. 8 is used to perform the audio exchange method of the language semantics mentioned in the above embodiments.
  • the electronic device includes a processor 51, a memory 52, and a bus 53.
  • the processor 51 is configured to call the code stored in the memory 52 through the bus 53 to form a voice mapping structure of each language by using a minimum phoneme sequence, and perform semantic inter-language conversion through each voice mapping structure.
  • the electronic device includes, but is not limited to, an electronic device such as a mobile phone or a tablet computer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

An audio exchange method and system for language semantics, and an audio coding graphic, intended to solve the prior-art technical problems of data-response errors and poor real-time performance caused by semantic complexity during inter-language translation. The method includes forming a speech mapping structure for each language from a sequence of minimal phonemes and completing semantic conversion between languages through the speech mapping structures. The minimal phoneme, the smallest audio segment in the make-up of a language, is used as the basic data-exchange unit for semantic conversion between languages and as the coding basis for data exchange. This changes the underlying structure of speech recognition and optimizes the coding complexity and accuracy of a language's audio content, so that the encoding of language audio avoids being coupled with the complex audio features formed by compound information such as pitch, scale and register in language segments, ensuring the speech recognition rate. The mapping structure between phoneme-based speech codes and text codes improves data-exchange efficiency during language translation.

Description

Audio exchange method and audio exchange system for language semantics, and coding graphic

This application claims priority to Chinese application No. 2018102644603, filed on March 28, 2018, the entire contents of which are incorporated herein by reference.

Technical Field

This application relates to the field of information exchange, and in particular to an audio exchange method and audio exchange system for language semantics, and a coding graphic.

Background

Current language translation mainly consists of speech recognition, semantic analysis and sentence synthesis. Speech recognition uses high-sensitivity sensors to extract, from the frequency-domain or time-domain speech signal stream of the source language, the set of audio signals corresponding to the words of a sentence. Semantic analysis uses models such as hidden Markov models (HMM), self-learning models and artificial neural networks (ANN) to recognize and quantify the word sequences and semantic meaning in that audio signal set so as to determine the expressed content as far as possible. Sentence synthesis then forms the target language's set of audio signals or word sequences from the recognized and quantified data of the expressed content. Depending on the complexity of the semantic analysis model, this process requires massive computing resources; mobile-terminal applications need a distributed computing architecture that reaches server-side computing resources over reliable Internet bandwidth, so the real-time performance and accuracy of translation are limited.

Patent document CN104637482B discloses a device that converts speech to text using digital coding: a phoneme storage unit stores first-language phoneme feature data; a phoneme conversion unit converts a received phoneme signal sequence into first-language phonemes using that feature data; a digital coding unit encodes each first-language phoneme uniquely, forming a first-language phoneme coding sequence, from which character-pronunciation and word-pronunciation coding sequences of the first language are formed; a word storage unit stores the first language's characters, words or graphics and their corresponding coding sequences; and a word conversion unit generates the first language's characters, words, graphics and/or combinations thereof according to the correspondences of the coding sequences. The device shows that a basis for coded mapping exists between text and speech. How to use that coded-mapping basis to reduce the resource consumption of graphic-text-audio conversion of the same semantics between languages calls for inventive improvement.

Summary

In view of this, embodiments of the present application provide an audio exchange method and audio exchange system for language semantics, to solve the prior-art technical problems of poor data response and real-time performance caused by semantic complexity in inter-language translation.

The audio exchange method for language semantics of an embodiment of the present application forms a speech mapping structure for each language from a sequence of minimal phonemes, and completes semantic conversion between languages through the speech mapping structures.

The audio exchange system for language semantics of an embodiment of the present application includes:

a memory for storing program code of the above audio exchange method for language semantics; and

a processor for running the program code.

The audio exchange system for language semantics of an embodiment of the present application is configured to form a speech mapping structure for each language from a sequence of minimal phonemes, and to complete semantic conversion between languages through the speech mapping structures.

The basic speech coding graphic of an embodiment of the present application, used for the graphical display of language phonemes, includes a basic frame comprising a first adapter column, a second adapter column and an adapter bar arranged side by side; the first adapter column and the second adapter column are each provided with an adaptation bit group, each adaptation bit group includes several adaptation bits, and each end of the adapter bar is connected to one adaptation bit of one adapter column.

The audio exchange method, audio exchange system and coding graphic for language semantics of the embodiments of the present application use the minimal phoneme, the smallest audio segment in the make-up of a language, as the basic data-exchange unit for semantic conversion between languages and as the coding basis for data exchange. This changes the underlying structure of speech recognition, simplifies the coding length and coding efficiency of a language's audio content, and optimizes data-exchange efficiency during language translation, with a positive effect on reducing the real-time response delay of remote data and on the storage footprint of the basic data structures and basic data on the local mobile side.
Brief Description of the Drawings

FIG. 1 is a schematic diagram of the data processing flow of an audio exchange method for language semantics according to an embodiment of the present application.

FIG. 2 is a schematic diagram of the coding process of an audio exchange method for language semantics according to an embodiment of the present application.

FIG. 3 is a schematic diagram of a speech mapping structure of an audio exchange method for language semantics according to an embodiment of the present application.

FIG. 4 is a schematic diagram of a speech mapping structure of an audio exchange method for language semantics according to an embodiment of the present application.

FIG. 5 is a schematic diagram of language conversion performed by an audio exchange method for language semantics according to an embodiment of the present application.

FIG. 6 is a schematic diagram of the architecture of an audio exchange system for language semantics according to an embodiment of the present application.

FIG. 7 is a schematic diagram of the graphic structure of a basic speech coding graphic in the audio exchange method for language semantics according to an embodiment of the present application.

FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Modes for Carrying Out the Invention

The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present application without inventive effort fall within the scope of protection of the present application.
The audio exchange method for language semantics of an embodiment of the present application includes:

forming a speech mapping structure for each language from a sequence of minimal phonemes, and completing semantic conversion between languages through the speech mapping structures.

Expressions of the same semantics differ substantially between languages in text and pronunciation; semantic conversion means converting between the different textual and phonetic forms of the same semantics.

In a regional common language, the pronunciation of the characters (one kind of graphical symbol) expressing semantics is deterministic, and the pronunciation rules of words and sentences can be reduced to different combinations of syllables. Building each syllable from a basic set of minimal phonemes exploits the low signal load of minimal phonemes to exclude redundant audio signals and interference, providing a leaner coding basis for complex data exchange and reducing coding length.

According to statistical comparisons of regional common languages by those skilled in the art, the number of minimal phonemes, the basic elements of pronunciation, and their audio features can be determined: fewer than 1000 in total, with roughly 800 distinct minimal phonemes across the world's roughly 7000 languages; each Western language uses about 40 minimal phonemes, and Chinese no more than about 150. A fixed-length code in the hundreds or thousands numeric range is entirely sufficient for building an index, for example three or four decimal digits, or 10 to 20 binary digits.

The audio exchange method for language semantics of the embodiments of the present application uses the minimal phoneme, the smallest audio segment in the make-up of a language, as the basic data-exchange unit for semantic conversion between languages and as the coding basis for data exchange. This changes the underlying structure of speech recognition and simplifies the coding length and efficiency of a language's audio content, so that the encoding of language audio avoids the complex audio features formed by coupled compound information such as pitch, scale and register in language segments, ensuring the speech recognition rate. The mapping structure between phoneme-based speech codes and text codes optimizes data-exchange efficiency during language translation, with a positive effect on reducing the real-time response delay of remote data and on the local mobile-side storage of basic data structures and basic data.
FIG. 1 is a schematic diagram of the data processing flow of an audio exchange method for language semantics according to an embodiment of the present application. As shown in FIG. 1, the method includes:

Step 100: serialize all minimal phonemes.

Serialization may include recognizing the syllables, phonemes, scales and intonation in a language; describing what is recognized quantitatively and mathematically, for example as time-domain or frequency-domain audio feature data; and storing the quantitative descriptions in structured form, for example coding them one by one to form an index.

Step 200: form text-speech mapping data for each language from subsets of all minimal phonemes.

The pronunciation basis of each language is determined by a subset of all minimal phonemes. Combinations of the subset's minimal phonemes form the phonetic identifiers of character pronunciations in the language, and these identifiers in turn form mapping data for the correspondence between text and phonetic identifier; the mapping data include the data structures that store them. The mapping data may include text-to-speech mapping data as well as speech-to-speech mapping data.

Step 300: form inter-speech mapping data between languages through language semantics.

The objectivity of semantics is used to build mapping data between the speech of corresponding meanings across languages; the mapping data include the data structures that store them, and may also include text-to-speech mapping data.

Step 400: form semantic language conversion using the corresponding inter-speech mapping data and text-speech mapping data.

In the audio exchange method for language semantics of the embodiments of the present application, the text-speech mapping data guarantee the coherence and correctness of text-speech conversion within one language. Combining the inter-speech mapping data with the text-speech mapping data lets the diversity of inter-language conversion achieve high exchange efficiency of basic language data during conversion while preserving conversion quality. Varying the mappings between the inter-speech mapping data and the text-speech mapping data can additionally produce an encryption effect.
FIG. 2 is a schematic diagram of the coding process of an audio exchange method for language semantics according to an embodiment of the present application. As shown in FIG. 2, on the basis of the above embodiment, step 100 includes:

Step 110: collect the minimal phonemes of each common language through speech recognition.

Given human physiological characteristics and language evolution, the speech of a language decomposes structurally from sentence pronunciation to word pronunciation to word syllables to the phonemes making up syllables. Those skilled in the art will understand that computer-based audio capture and time- or frequency-domain feature analysis of audio segments can determine the audio features of characters, words and phrases, and identify the minimal phoneme features they contain.

Step 120: form the minimal phonemes into a unified phoneme sequence.

Those skilled in the art will understand that speech recognition, combined with speech analysis and statistics over a sufficient volume of data, can identify and determine the minimal-phoneme audio features used in each language. The audio feature of every determined minimal phoneme is uniformly labeled and coded, forming a unified phoneme sequence of all minimal phonemes. The unified phoneme sequence lets a language's speech be accurately deconstructed into determinate combinations of at least one minimal phoneme, and each combination obtains its corresponding code sequence through the unified phoneme sequence.

For example: Chinese forms syllables from initials (shengmu) and finals (yunmu), an initial being formed from one or several minimal phonemes and a final from one or several minimal phonemes; similarly, English forms syllables from vowels and consonants, each formed from one or several minimal phonemes. Part of the resulting unified phoneme sequence may be as shown in the table below:

[Table image: Figure PCTCN2019079834-appb-000001, an example portion of the unified phoneme sequence]

Each single minimal phoneme in the unified phoneme sequence has a unique code within it. For fewer than 1000 minimal phonemes, a 10-bit length is enough to form unique codes.

The audio exchange method for language semantics of the embodiments forms the unified phoneme sequence as the basic information carrier for converting text or speech of the same or similar semantics between languages, avoiding the information interference caused by the excess redundant information carried by other, compound audio carriers (such as syllables), which helps optimize the accuracy and efficiency of speech recognition. As languages evolve, the unified phoneme sequence of minimal phonemes can be further updated, keeping pace with changes in the speech of each language.
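As a rough illustration of the fixed-length indexing described above, fewer than 1024 minimal phonemes fit in a 10-bit code. The phoneme inventory below is a made-up fragment for the example, not the application's actual table:

```python
# Sketch: indexing a unified phoneme sequence with fixed-length codes.
# The phoneme inventory here is invented for illustration.
phonemes = ["a", "m", "n", "d", "ae"]

code_bits = 10  # 10 bits index up to 1024 phonemes; ~800 exist worldwide per the text
assert len(phonemes) < 2 ** code_bits

# Each minimal phoneme gets a unique fixed-length binary code.
unified_sequence = {p: format(i, f"0{code_bits}b") for i, p in enumerate(phonemes)}

assert unified_sequence["m"] == "0000000001"
assert len(set(unified_sequence.values())) == len(phonemes)  # codes are unique
```

The fixed code length is what makes the later concatenation of phoneme codes unambiguous.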
As shown in FIG. 2, step 200 of the audio exchange method for language semantics in an embodiment of the present application includes:

Step 210: use one part of the phonemes in the unified phoneme sequence to form a first basic speech coding sequence corresponding to the pronunciation of characters or words in a first language.

That part of the phonemes includes all the minimal phonemes of the language's pronunciation; from it, syllables and thus the pronunciations of the language's characters or words are formed. Based on the minimal phonemes' codes in the unified phoneme sequence, the basic speech code of every character or word in the first language is formed, and thus the basic speech coding sequence of all (or the principal) characters or words.

For example: the Chinese character "妈", with pinyin "ma", includes the phonemes "m" and "a". "m" is coded 120 in the unified phoneme sequence and "a" is coded 010, so "妈" is coded 120010 in the Chinese basic speech coding sequence.

In an embodiment of the present application, other code-compression schemes may also be used, for example summing the codes of the phonemes in "妈" to give the code 130; or the basic speech code may be rendered graphically.

Those skilled in the art will understand that the code form in the example contains redundancy; constrained by the minimal-phoneme code length, a basic speech coding sequence using standard bytes can rely on compression coding to keep codes unique and short.

Those skilled in the art will also understand that different characters or words with the same pronunciation may share a basic speech code, and different pronunciations of the same character or word may give that character or word different basic speech codes.
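The "妈" example above can be sketched as simple code concatenation. The three-digit codes are the ones quoted in the text; the lookup table is otherwise hypothetical:

```python
# Sketch: a character's basic speech code is the concatenation of the
# unified-sequence codes of its phonemes, following the "妈" (ma) example.
phoneme_codes = {"m": "120", "a": "010"}  # the codes quoted in the text

def basic_speech_code(phonemes):
    """Concatenate each phoneme's fixed-length code in pronunciation order."""
    return "".join(phoneme_codes[p] for p in phonemes)

assert basic_speech_code(["m", "a"]) == "120010"  # code of "妈"
```

Because every phoneme code has the same length, the concatenated code can be split back into its phonemes without delimiters.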
Step 220: use the first basic speech coding sequence to form a first speech mapping structure corresponding to the pronunciation of phrases or sentences in the first language.

On the basis of the basic speech coding sequence determined for characters or words, the speech mapping structure of phrases or sentences can be formed by extension from the basic speech coding sequence.

The speech mapping structure may use addressable data structures with address features, such as a single form or combination of static or dynamic queues, arrays, heaps, stacks, linked lists, trees or graphs; static or dynamic pointers can implement address arithmetic across the different structural forms, and the data structures involved in the speech mapping structure may contain one another or sit side by side.

In an embodiment of the present application, these data structures and pointers form a mapping structure of speech and semantics among semantically related characters, words, phrases and sentences, building part of the speech mapping structure against semantic meaning.

FIG. 3 is a schematic diagram of a speech mapping structure of the audio exchange method for language semantics according to an embodiment of the present application. As shown in FIG. 3, for Chinese, taking the characters "发", "明", "创" and "造" as examples, each character is a minimal semantic unit; a corresponding basic speech code is built from the phonemes of its pronunciation, and the characters' basic speech codes are mutually discrete. Storing single characters in a linked list (only as an example) preserves high-speed filtering of character codes (i.e. phoneme features). Each semantically meaningful word formed from characters, such as "发明" (invent) and "创造" (create), is stored in another linked list; each word's basic speech code is formed from those of its characters, and the words' basic speech codes are mutually discrete. Each semantically meaningful phrase formed from characters or words is stored in an array (only as an example), preserving fast addressing and efficient structural updates; the phrases' basic speech codes are mutually discrete.

Address pointers in the data structures form mapping trees or mapping graphs of character, word and phrase relevance according to their semantic relatedness, creating a mapping association between speech and semantics; the association may be static or partly dynamically updatable.

In the basic speech coding data structure, each character's (or word's, or phrase's) data unit can be extended, for example into a queue, to store characters (or words, or phrases) with the same pronunciation but different semantics, making the speech mapping structure multidimensional.

The audio exchange method for language semantics of the embodiments adopts a data storage structure in which speech maps to text, keeping the main part of the speech mapping structure static; structural optimization can be done with server-side or cloud computing power, and a small amount of dynamic updating and supplementation can be completed on the client with few computing resources. Because the basic speech coding sequence formed from the phonemes of pronunciation is used, the complexity and data volume of the semantics-oriented speech mapping structure are greatly reduced, so its data storage and processing can be answered on both client and server with low latency.
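One way to sketch the character → word → semantic-link layering described above; the container choices and all codes are illustrative assumptions, not the application's actual structures:

```python
# Sketch: layered speech mapping structure. Characters, words and semantic
# links each live in their own container; dict keys stand in for the address
# pointers. All codes below are invented for illustration.
char_codes = {"发": "101", "明": "202", "创": "303", "造": "404"}

def word_code(word):
    """A word's basic speech code is formed from its characters' codes."""
    return "".join(char_codes[ch] for ch in word)

# Word layer, keyed by basic speech code.
words = {word_code(w): w for w in ["发明", "创造"]}

# Semantic relevance map linking related words ("invent"/"create"),
# a stand-in for the mapping tree or graph in the text.
semantic_links = {word_code("发明"): [word_code("创造")]}

assert words[word_code("发明")] == "发明"
assert word_code("创造") in semantic_links[word_code("发明")]
```

Keeping the layers keyed by code makes lookup a pure address operation, which matches the text's emphasis on fast filtering of phoneme features.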
As shown in FIG. 2, step 200 of the audio exchange method for language semantics in an embodiment of the present application further includes:

Step 230: use another part of the phonemes in the unified phoneme sequence to form a second basic speech coding sequence for the pronunciation of characters or words in a second language.

Compared with the part of the phonemes in the step above, the other part may include partly identical phonemes, or identical phonemes labeled with the characters or symbols of a different language.

For example: the English word "and", with phonetic transcription "ænd", includes the phonemes "æ", "n" and "d". "æ", "n" and "d" are coded 018, 220 and 200 in the unified phoneme sequence, so "and" is coded 018220200 in the English basic speech coding sequence.

Those skilled in the art will understand that the code form in the example contains redundancy; compression coding can keep codes unique and short.

Those skilled in the art will also understand that different characters or words with the same pronunciation may share a basic speech code, and different pronunciations of the same character or word may give it different basic speech codes.

Step 240: use the second basic speech coding sequence to form a second speech mapping structure corresponding to the pronunciation of phrases or sentences in the second language.

In different languages, text (or symbols) of the same semantics may share a pronunciation; the shared pronunciation of different texts with the same semantics acquires coding differences as the two languages' speech mapping structures are formed.

FIG. 4 is a schematic diagram of a speech mapping structure of the audio exchange method for language semantics according to an embodiment of the present application. As shown in FIG. 4, for English, taking "invention" and "creation" as examples, each word is a minimal semantic unit; a corresponding basic speech code is built from the phonemes of its pronunciation, and the words' basic speech codes are mutually discrete. Storing words in database tables (only as an example) preserves high-speed filtering of word codes (i.e. phoneme features). Each semantically meaningful phrase formed from words is stored in a database table (only as an example), preserving fast addressing and efficient structural updates; the phrases' basic speech codes are mutually discrete.

Address pointers in the data structures form mapping trees or mapping graphs of word and phrase relevance according to their semantic relatedness, creating a mapping association between speech and semantics; the association may be static or partly dynamically updatable.

In the basic speech coding data structure, each word's or phrase's data unit can be extended into a queue to store words or phrases with the same pronunciation but different semantics, making the speech mapping structure multidimensional.

The audio exchange method for language semantics of the embodiments adopts a data storage structure in which speech maps to text, keeping the main part of the speech mapping structure static; structural optimization can be done with server-side or cloud computing power, and a small amount of dynamic updating and supplementation can be completed on the client with few computing resources. Because the basic speech coding sequence of the phonemes of pronunciation is used, the complexity and data volume of the semantics-oriented speech mapping structure are greatly reduced, so its data storage and processing can be answered on both client and server with low latency.
As shown in FIG. 2, step 300 of the audio exchange method for language semantics in an embodiment of the present application further includes:

Step 310: use identical or similar semantic information to form a primary speech conversion structure between the corresponding languages through the (i.e., first and second) speech mapping structures of the first language and the second language.

Between the languages to be translated, the two languages' speech mapping structures form, on the basis of identical or similar semantic information, a primary speech conversion structure between characters or words of identical or similar meaning, storing the basic speech codes of the two languages' characters, words, phrases or sentences. The primary speech conversion structure can be stored as "key : key-value" pairs, for filtering efficiency under large numbers of concurrent requests.

For example:

semantics : English basic speech code : Chinese basic speech code

invention/creation (发明创造) : 092072069 : 710169555614

The English basic speech code and the Chinese basic speech code can each serve as the other's key and value, for bidirectional translation.
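The "key : key-value" primary conversion structure can be sketched as a pair of dictionaries usable in either direction; the codes are the ones quoted above:

```python
# Sketch: primary speech conversion structure stored as key:value pairs.
# English and Chinese basic speech codes serve as each other's key and value,
# so the same table supports translation in both directions.
en_to_zh = {"092072069": "710169555614"}  # "invention/creation" example codes
zh_to_en = {v: k for k, v in en_to_zh.items()}

assert zh_to_en["710169555614"] == "092072069"
assert en_to_zh[zh_to_en["710169555614"]] == "710169555614"  # round trip
```

A flat key-value store like this is what allows the filtering of many concurrent requests the text mentions, since each lookup is a single hashed access.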
As shown in FIG. 2, step 300 of the audio exchange method for language semantics in an embodiment of the present application further includes:

Step 320: use the grammar rules of the first language and the second language to form an advanced speech conversion structure between the corresponding (i.e., first and second) speech mapping structures.

Each language's grammar rules include advanced speech conversion structures between characters or words built according to their roots and parts of speech. Following the primary speech conversion structure, the advanced speech conversion structure can be stored as "key : key-value" pairs, for filtering efficiency under large numbers of concurrent requests.

For example:

semantics : grammar : English basic speech code

English "create (noun)" 0001 : 092072069;

English "create (verb)" 0002 : 092072069;

English "create (adverb)" 0003 : 092072069;

semantics : grammar : Chinese basic speech code

Chinese "创造 (noun)" 0001 : 710169555614;

Chinese "创造 (verb)" 0002 : 710169555614;

Chinese "创造 (adverb)" 0003 : 710169555614;

The basic speech codes of characters, words or vocabulary that have similar semantics but are formed under different grammar in the two languages can be relatively clustered, raising coding relevance and improving both filtering efficiency and machine-translation algorithm efficiency during translation.
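The grammar-tagged entries above can be sketched as composite keys, so that same-meaning codes cluster across parts of speech. The tags 0001/0002/0003 follow the text; the dictionary layout itself is an assumption:

```python
# Sketch: advanced conversion structure keyed by (semantic id, grammar tag).
# Entries with the same semantics but different parts of speech share a basic
# speech code, so related codes cluster and filter quickly.
english = {
    ("create", "0001"): "092072069",  # noun
    ("create", "0002"): "092072069",  # verb
    ("create", "0003"): "092072069",  # adverb
}
chinese = {
    ("create", "0001"): "710169555614",
    ("create", "0002"): "710169555614",
    ("create", "0003"): "710169555614",
}

# All grammar variants of one semantic id resolve to one clustered code.
assert {english[k] for k in english if k[0] == "create"} == {"092072069"}
assert chinese[("create", "0002")] == "710169555614"
```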
FIG. 5 is a schematic diagram of language conversion performed by the audio exchange method for language semantics according to an embodiment of the present application. As shown in FIG. 5, step 400 includes:

Step 410: acquire, through speech recognition, the ordered phoneme set of an audio input segment in the first language;

Step 420: determine the first basic speech code of the ordered phoneme set using the first language's first basic speech coding sequence;

Step 430: determine the continuous speech code of the ordered phoneme set using the first language's first speech mapping structure and first basic speech coding sequence;

Step 440: obtain the second language's second basic speech code using the primary speech conversion structure between the corresponding languages;

Step 450: obtain the second language's continuous speech code using the advanced speech conversion structure between the corresponding languages and the second basic speech coding sequence;

Step 460: form spoken pronunciation from the second language's continuous speech code.

When the audio exchange method for language semantics of the embodiments performs language conversion, the formed phoneme sequence - basic speech coding sequence - speech mapping structure, together with the conversion structures formed between languages, completes reversible conversion between speech and text across the two languages, helping speech conversion obtain the corresponding candidate text combinations accurately or relatively accurately. The data and data structures have a limited storage size and low retrieval difficulty, suiting local storage and processing; the whole process places modest demands on the real-time responsiveness and bandwidth of server-side data requests.
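Steps 410 to 460 can be chained into a minimal end-to-end sketch. Every table below is a toy stand-in (the code pairing is illustrative, not a real translation pair), and actual speech recognition and synthesis are outside the scope of this fragment:

```python
# Sketch of the conversion pipeline (steps 410-460): ordered phonemes of the
# first language -> first basic speech code -> primary conversion table ->
# second language's basic speech code.
phoneme_codes_l1 = {"m": "120", "a": "010"}   # step 420 lookup table (toy)
primary_conversion = {"120010": "018220200"}  # step 440 table (toy pairing)

def convert(ordered_phonemes):
    # Step 420: determine the first basic speech code of the phoneme set.
    code_l1 = "".join(phoneme_codes_l1[p] for p in ordered_phonemes)
    # Step 440: map to the second language's basic speech code.
    return primary_conversion[code_l1]

assert convert(["m", "a"]) == "018220200"
```

In the full method, steps 430 and 450 would consult the speech mapping structures to turn these basic codes into continuous speech codes before synthesis in step 460.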
FIG. 6 is a schematic diagram of the architecture of an audio exchange system for language semantics according to an embodiment of the present application. As shown in FIG. 6, the audio exchange system of the embodiment is configured to form a speech mapping structure for each language from a sequence of minimal phonemes and to complete semantic conversion between languages through the speech mapping structures.

As shown in FIG. 6, the audio exchange system of the embodiment of the present application includes:

a serialization apparatus 1100 for serializing all minimal phonemes;

an intra-language phoneme mapping forming apparatus 1200 for forming each language's text-speech mapping data from subsets of all minimal phonemes;

an inter-language phoneme mapping forming apparatus 1300 for forming the languages' inter-speech mapping data through language semantics;

a language conversion apparatus 1400 for forming semantic language conversion using the corresponding inter-speech mapping data and text-speech mapping data.

As shown in FIG. 6, the serialization apparatus 1100 includes:

a phoneme recognition module 1110 for collecting the minimal phonemes of each common language through speech recognition;

a phoneme coding module 1120 for forming the minimal phonemes into a unified phoneme sequence.

As shown in FIG. 6, the intra-language phoneme mapping forming apparatus 1200 includes:

a first speech coding establishing module 1210 for using one part of the phonemes in the unified phoneme sequence to form a first basic speech coding sequence corresponding to the pronunciation of characters or words in the first language;

a first speech mapping establishing module 1220 for using the first basic speech coding sequence to form a first speech mapping structure corresponding to the pronunciation of phrases or sentences in the first language;

a second speech coding establishing module 1230 for using another part of the phonemes in the unified phoneme sequence to form a second basic speech coding sequence for the pronunciation of characters or words in the second language;

a second speech mapping establishing module 1240 for using the second basic speech coding sequence to form a second speech mapping structure corresponding to the pronunciation of phrases or sentences in the second language.

As shown in FIG. 6, the inter-language phoneme mapping forming apparatus 1300 includes:

a language structure primary conversion module 1310 for using identical or similar semantic information to form a primary speech conversion structure between the corresponding languages through the (i.e., first and second) speech mapping structures of the first language and the second language;

a language structure advanced conversion module 1320 for using the grammar rules of the first language and the second language to form an advanced speech conversion structure between the corresponding (i.e., first and second) speech mapping structures.

As shown in FIG. 6, the language conversion apparatus 1400 includes:

a phoneme recognition module 1410 for acquiring, through speech recognition, the ordered phoneme set of an audio input segment in the first language;

a first basic code recognition module 1420 for determining the first basic speech code of the ordered phoneme set using the first language's first basic speech coding sequence;

a first continuous speech coding module 1430 for determining the continuous speech code of the ordered phoneme set using the first language's first speech mapping structure and first basic speech coding sequence;

a second basic code recognition module 1440 for obtaining the second language's second basic speech code using the primary speech conversion structure between the corresponding languages;

a second continuous speech coding module 1450 for obtaining the second language's continuous speech code using the advanced speech conversion structure between the corresponding languages and the second basic speech coding sequence;

a continuous code conversion module 1460 for forming spoken pronunciation from the second language's continuous speech code.
Those of ordinary skill in the art will realize that the units and algorithm steps of the examples described in the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are executed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functions differently for each particular application, but such implementations should not be considered beyond the scope of this application.

Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.

In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division of units is only a division by logical function, and in actual implementation there may be other divisions: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Further, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses or units, and may be electrical, mechanical or of other forms.

The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment's solution.

In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit.

If implemented in the form of software functional units and sold or used as a standalone product, the functions may be stored in a computer-readable storage medium. On this understanding, the part of the technical solution of this application that is essential or that contributes to the prior art, or a part of the technical solution, may be embodied as a software product stored in a storage medium and including several instructions for causing a computer device (which may be a personal computer, server, network device, etc.) to perform all or part of the steps of the methods described in the embodiments of this application. The foregoing storage medium includes various media that can store program code: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.
In the audio exchange method for language semantics of an embodiment of this application, for the basic speech coding sequence of character or word pronunciations in a language formed from some of the minimal phonemes in the unified phoneme sequence, each basic speech code can form an additional graphical symbol corresponding to the character or word with that pronunciation. Rendering basic speech codes graphically converts the pronunciation recognition of phoneme-formed characters or words into visual recognition, aiding communication between computer vision and computer speech recognition and giving inter-language speech conversion of the same semantics a computer-vision basis.

FIG. 7 is a schematic diagram of the graphic structure of a basic speech coding graphic in the audio exchange method for language semantics according to an embodiment of this application. As shown in part a of FIG. 7, the graphic structure includes an H-shaped basic frame 01, which comprises a first adapter column 10 (a bar pattern) and a second adapter column 20 (a bar pattern) arranged vertically in parallel, and further includes an adapter bar 30 (a bar pattern) whose two ends are connected to the first adapter column and the second adapter column, respectively.

The first adapter column (on the left in the figure) is provided with a first adaptation bit group 11, the second adapter column (on the right in the figure) is provided with a second adaptation bit group 21, and the adapter bar 30 is provided with a third adaptation bit group 31. Each end of the adapter bar 30 is connected to an adaptation bit of the adapter column on its side, and each adapter column's adaptation bit group includes at least three adaptation bits (five in the figure).

Adjacent adaptation bits in the same adaptation bit group are used to adjust the length of the adapter column: coinciding adaptation bits form a specific adjustment that changes the length of the corresponding column accordingly, and at least two adaptation bits can coincide. The end of the adapter bar 30 can be connected to a coincident adaptation bit of the corresponding adapter column.

In practice, the phoneme codes of the syllables making up a character's or word's pronunciation, or the syllable codes formed from phonemes, can be reflected in the changing connection shapes of the first adapter column, the second adapter column and the adapter bar; the fixed positions of the adaptation bits and the coincidence changes among them form enough permutations to reflect the syllable's coded content.

As shown in parts b and c of FIG. 7, an embodiment of this application may further include an auxiliary adaptation symbol 40 connected to an adaptation bit; the auxiliary adaptation symbol 40 includes a vector line segment 41 with a vector direction and a standard symbol 42 without one. The vector line segment 41 may be a line segment or a minor arc, the standard symbol 42 may be a circle or a ring, and there may be one or more of each.

In practice, vector line segments and standard symbols attached to adaptation bits can combine syllable-related intonation, tone and other additional audio features with the syllable code, increasing the information load of the syllable code.

In practice, for Chinese for example, as shown in parts b and c of FIG. 7, part b shows the graphics corresponding to the speech codes of the characters "后" and "候", and part c those of "口" and "寇". The initial consonant of each character's pronounced syllable appears in the length variation of the first adapter column on the left of the basic frame together with a vector line segment 41; the final appears on the second adapter column on the right of the basic frame in its length variation together with a vector line segment 41 and a standard symbol 42. Smoothing the basic frame and the auxiliary adaptation symbols both keeps the graphic attractive and preserves computer-vision recognition quality.

As shown in part d of FIG. 7, using coincident adaptation bits and the connection positions of the adapter bar 30 with the adaptation bits, the basic frame 01 can be converted from an H shape to an n shape; as shown in part e of FIG. 7, it can likewise be converted from an H shape to a U shape.

As shown in part d of FIG. 7, the first and second adapter columns around the basic frame (H-, n- or U-shaped) directly carry the codes of the minimal phonemes, the number of code digits corresponding to the adaptation bits of the respective column. Displaying the minimal phonemes' codes directly within a language's syllable gives a direct visual expression of the language's phonetic alphabet, phoneme codes and speech, so that the basic speech coding graphics of two languages can be converted by computer vision; while speech is converted, computer graphic recognition helps ensure the recognition rate of language recognition.
FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of this application. The electronic device of FIG. 8 performs the audio exchange method for language semantics mentioned in the above embodiments. As shown in FIG. 8, the electronic device includes a processor 51, a memory 52 and a bus 53.

The processor 51 is configured to call, via the bus 53, the code stored in the memory 52 to form a speech mapping structure for each language from a sequence of minimal phonemes and to complete semantic conversion between languages through the speech mapping structures.

It should be understood that the electronic device includes, but is not limited to, devices such as mobile phones and tablet computers.

The above are only specific embodiments of this application, but the scope of protection of this application is not limited to them; any variation or substitution readily conceived within the technical scope disclosed in this application by anyone familiar with the art shall fall within the scope of protection of this application.

Claims (19)

  1. An audio exchange method for language semantics, characterized in that a speech mapping structure of each language is formed from a sequence of minimal phonemes, and semantic conversion between languages is completed through the speech mapping structures.
  2. The audio exchange method for language semantics according to claim 1, wherein forming the speech mapping structure of each language from the sequence of minimal phonemes comprises:
    serializing all minimal phonemes;
    forming text-speech mapping data of each language from subsets of all the minimal phonemes;
    forming inter-speech mapping data of the languages through language semantics.
  3. The audio exchange method for language semantics according to claim 2, wherein completing semantic conversion between languages through the speech mapping structures comprises:
    forming semantic language conversion using the corresponding inter-speech mapping data and text-speech mapping data.
  4. The audio exchange method for language semantics according to claim 2 or 3, wherein serializing all minimal phonemes comprises:
    collecting the minimal phonemes of each common language through speech recognition;
    forming the minimal phonemes into a unified phoneme sequence.
  5. The audio exchange method for language semantics according to claim 4, wherein forming the text-speech mapping data of each language from subsets of all the minimal phonemes comprises:
    using one part of the phonemes in the unified phoneme sequence to form a first basic speech coding sequence corresponding to the pronunciation of characters or words in a first language;
    using the first basic speech coding sequence to form a first speech mapping structure corresponding to the pronunciation of phrases or sentences in the first language;
    using another part of the phonemes in the unified phoneme sequence to form a second basic speech coding sequence for the pronunciation of characters or words in a second language;
    using the second basic speech coding sequence to form a second speech mapping structure corresponding to the pronunciation of phrases or sentences in the second language.
  6. The audio exchange method for language semantics according to claim 5, wherein forming the inter-speech mapping data of the languages through language semantics comprises:
    using identical or similar semantic information to form a primary speech conversion structure between the corresponding languages through the speech mapping structures of the first language and the second language;
    using the grammar rules of each language to form an advanced speech conversion structure between the speech mapping structures of the first language and the second language.
  7. The audio exchange method for language semantics according to claim 3, wherein forming semantic language conversion using the corresponding inter-speech mapping data and text-speech mapping data comprises:
    acquiring, through speech recognition, the ordered phoneme set of an audio input segment in the first language;
    determining the first basic speech code of the ordered phoneme set using the first language's first basic speech coding sequence;
    determining the continuous speech code of the ordered phoneme set using the first language's first speech mapping structure and first basic speech coding sequence;
    obtaining the second language's second basic speech code using the primary speech conversion structure between the corresponding languages;
    obtaining the second language's continuous speech code using the advanced speech conversion structure between the corresponding languages and the second basic speech coding sequence;
    forming spoken pronunciation from the second language's continuous speech code.
  8. The audio exchange method for language semantics according to claim 1, wherein the sequence of minimal phonemes is indexed using fixed-length codes in the hundreds or thousands numeric range.
  9. An audio exchange system for language semantics, characterized by comprising:
    a memory for storing program code of the audio exchange method for language semantics according to any one of claims 1 to 8;
    a processor for running the program code.
  10. An audio exchange system for language semantics, configured to form a speech mapping structure of each language from a sequence of minimal phonemes and to complete semantic conversion between languages through the speech mapping structures.
  11. A basic speech coding graphic for the graphical display of language phonemes, characterized by comprising a basic frame, the basic frame comprising a first adapter column, a second adapter column and an adapter bar arranged side by side; the first adapter column and the second adapter column are each provided with an adaptation bit group; each adaptation bit group comprises a plurality of adaptation bits; and the two ends of the adapter bar are respectively connected to an adaptation bit of the first adapter column and an adaptation bit of the second adapter column.
  12. The basic speech coding graphic according to claim 11, wherein the first adapter column, the second adapter column and the adapter bar form a plurality of connection shapes between them, the plurality of connection shapes representing the phoneme codes of the syllables making up a character's or word's pronunciation, or the syllable codes formed from phonemes.
  13. The basic speech coding graphic according to claim 11 or 12, wherein at least two adjacent adaptation bits in the same adaptation bit group coincide.
  14. The basic speech coding graphic according to claim 11, further comprising an auxiliary adaptation symbol connected to an adaptation bit, the auxiliary adaptation symbol representing an additional audio feature.
  15. The basic speech coding graphic according to claim 14, wherein the auxiliary adaptation symbol comprises a vector line segment, the vector line segment having a vector direction.
  16. The basic speech coding graphic according to claim 14, wherein the auxiliary adaptation symbol comprises a standard symbol, the standard symbol having no vector direction.
  17. The basic speech coding graphic according to claim 14, wherein the additional audio feature comprises at least one of tone and intonation.
  18. The basic speech coding graphic according to claim 11, wherein the adaptation bit group of the first adapter column comprises at least three adaptation bits.
  19. The basic speech coding graphic according to claim 11, wherein the adaptation bit group of the second adapter column comprises at least three adaptation bits.
PCT/CN2019/079834 2018-03-28 2019-03-27 Audio exchange method and audio exchange system for language semantics, and coding graphic WO2019184942A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810264460.3A CN108597493B (zh) 2018-03-28 2018-03-28 Audio exchange method and audio exchange system for language semantics
CN201810264460.3 2018-03-28

Publications (1)

Publication Number Publication Date
WO2019184942A1 true WO2019184942A1 (zh) 2019-10-03

Family

ID=63624812

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/079834 WO2019184942A1 (zh) 2018-03-28 2019-03-27 Audio exchange method and audio exchange system for language semantics, and coding graphic

Country Status (2)

Country Link
CN (2) CN109754780B (zh)
WO (1) WO2019184942A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109754780B (zh) * 2018-03-28 2020-08-04 孔繁泽 Basic speech coding graphic and audio exchange method
CN110991148B (zh) * 2019-12-03 2024-02-09 孔繁泽 Information processing method and device, and information interaction method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060229864A1 (en) * 2005-04-07 2006-10-12 Nokia Corporation Method, device, and computer program product for multi-lingual speech recognition
US20070083369A1 (en) * 2005-10-06 2007-04-12 Mcculler Patrick Generating words and names using N-grams of phonemes
CN102063899A (zh) * 2010-10-27 2011-05-18 南京邮电大学 Voice conversion method under non-parallel text conditions
CN104637482A (zh) * 2015-01-19 2015-05-20 孔繁泽 Speech recognition method, device and system, and language exchange system
US20180061417A1 (en) * 2016-08-30 2018-03-01 Tata Consultancy Services Limited System and method for transcription of spoken words using multilingual mismatched crowd
CN108597493A (zh) * 2018-03-28 2018-09-28 孔繁泽 Audio exchange method and audio exchange system for language semantics, and coding graphic

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8219391B2 (en) * 2005-02-15 2012-07-10 Raytheon Bbn Technologies Corp. Speech analyzing system with speech codebook
CN101131689B (zh) * 2006-08-22 2010-08-18 苗玉水 Bidirectional machine translation method with Chinese-foreign-language sentence pattern conversion
KR20080046552A (ko) * 2006-11-22 2008-05-27 가구모토 주니치 스피치 코드를 가지는 프린트, 기록 재현을 위한 방법 및장치, 그리고 상용 모드
WO2012061588A2 (en) * 2010-11-04 2012-05-10 Legendum Pro Vita, Llc Methods and systems for transcribing or transliterating to an iconophonological orthography

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060229864A1 (en) * 2005-04-07 2006-10-12 Nokia Corporation Method, device, and computer program product for multi-lingual speech recognition
US20070083369A1 (en) * 2005-10-06 2007-04-12 Mcculler Patrick Generating words and names using N-grams of phonemes
CN102063899A (zh) * 2010-10-27 2011-05-18 南京邮电大学 Voice conversion method under non-parallel text conditions
CN104637482A (zh) * 2015-01-19 2015-05-20 孔繁泽 Speech recognition method, device and system, and language exchange system
US20180061417A1 (en) * 2016-08-30 2018-03-01 Tata Consultancy Services Limited System and method for transcription of spoken words using multilingual mismatched crowd
CN108597493A (zh) * 2018-03-28 2018-09-28 孔繁泽 Audio exchange method and audio exchange system for language semantics, and coding graphic
CN109754780A (zh) * 2018-03-28 2019-05-14 孔繁泽 Basic speech coding graphic and audio exchange method

Also Published As

Publication number Publication date
CN108597493A (zh) 2018-09-28
CN109754780B (zh) 2020-08-04
CN108597493B (zh) 2019-04-12
CN109754780A (zh) 2019-05-14

Similar Documents

Publication Publication Date Title
US11769480B2 (en) Method and apparatus for training model, method and apparatus for synthesizing speech, device and storage medium
JP6802005B2 (ja) Speech recognition device, speech recognition method, and speech recognition system
JP2020112787 (ja) Real-time speech recognition method, apparatus, device and computer-readable storage medium based on truncated attention
CN113205817 (zh) Speech semantic recognition method, system, device and medium
US11488577B2 (en) Training method and apparatus for a speech synthesis model, and storage medium
CN111243599 (zh) Speech recognition model construction method, device, medium and electronic device
WO2022105472 (zh) Speech recognition method, apparatus and electronic device
US20230127787A1 (en) Method and apparatus for converting voice timbre, method and apparatus for training model, device and medium
KR20160058531 (ko) Method for building a syntactic parsing model using deep learning, and apparatus for performing the same
CN115039171 (zh) Language-agnostic multilingual modeling using effective script normalization
WO2019184942A1 (zh) Audio exchange method and audio exchange system for language semantics, and coding graphic
CN111192572 (zh) Semantic recognition method, device and system
JP2008243080 (ja) Apparatus, method and program for translating speech
WO2023045186 (zh) Intent recognition method and apparatus, electronic device and storage medium
JP7216065 (ja) Speech recognition method and apparatus, electronic device and storage medium
WO2022134164 (zh) Translation method, apparatus, device and storage medium
CN112489634 (zh) Language acoustic model training method and apparatus, electronic device and computer medium
WO2023193442 (zh) Speech recognition method, apparatus, device and medium
KR20240065125 (ko) Large language model data selection for rare-word speech recognition
EP4172985A1 (en) Speech synthesis and speech recognition
CN111428509 (zh) Latin-alphabet-based Uyghur processing method and system
KR101543024 (ko) Pronunciation-based translation method and apparatus
JP7403569 (ja) Speech recognition result processing method and apparatus, electronic device, computer-readable storage medium and computer program
CN117524193 (zh) Training method, apparatus, device and medium for a Chinese-English mixed speech recognition system
CN113515952 (zh) Joint modeling method, system and device for a Mongolian dialogue model

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19774747

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19774747

Country of ref document: EP

Kind code of ref document: A1


32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 24.03.2021)
