CN112233649A - Method, device and equipment for dynamically synthesizing machine simultaneous interpretation output audio


Info

Publication number
CN112233649A
Authority
CN
China
Prior art keywords
current
translation
output audio
speech
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011105784.6A
Other languages
Chinese (zh)
Other versions
CN112233649B (en)
Inventor
王兆育
苏文畅
国丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Tingjian Technology Co ltd
Original Assignee
Anhui Tingjian Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Tingjian Technology Co ltd filed Critical Anhui Tingjian Technology Co ltd
Priority to CN202011105784.6A
Publication of CN112233649A
Application granted
Publication of CN112233649B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method, an apparatus, and a device for dynamically synthesizing machine simultaneous interpretation output audio. Starting from the current simultaneous interpretation scenario, the method first determines, based on preset rules, whether the speech rate of the synthesized audio needs to be adjusted. When intervention is deemed necessary, it obtains in real time the duration of the current source sentence and the predicted duration of the corresponding translated text, computes the difference between the two as well as the cumulative time difference of the interpretation session so far, and then, according to the relationship between the current time difference and/or the cumulative time difference and their respective preset tolerances, dynamically adjusts the translation strategy and/or determines a speech rate adjustment gain parameter. Two kinds of adjustment are involved: directly intervening in the translated text, or leaving the text untouched and attaching a speech rate change coefficient to it. Finally, the speech synthesis of the interpretation audio is completed using these adjustments. The invention achieves dynamic speech rate adjustment of the currently output audio, alleviates the delay inherent in machine simultaneous interpretation, and effectively improves its output quality.

Description

Method, device and equipment for dynamically synthesizing machine simultaneous interpretation output audio
Technical Field
The invention relates to the field of simultaneous interpretation, and in particular to a method, an apparatus, and a device for dynamically synthesizing the output audio of machine simultaneous interpretation.
Background
Against the backdrop of economic globalization, international and multilingual communication has become frequent, and simultaneous interpretation has become a popular mode of translation for international conferences. Demand for simultaneous interpretation at international conferences, large formal events, and informal meetings alike is enormous, while the profession demands a high level of expertise and subject knowledge from its practitioners, whose output efficiency is comparatively low and workload heavy; human simultaneous interpretation is therefore expensive and mismatched with market supply and demand. Moreover, because of the high barrier to entry, accomplished interpreters need a solid command of the languages involved and extensive conference experience, and cannot be trained in quantity on short notice.
With the continued development of intelligent speech technology, speech recognition and machine translation have made great progress, and the prior art has therefore attempted to combine speech transcription, machine translation, and speech synthesis into machine simultaneous interpretation systems, to address the severe shortage of high-level human interpreters in the conference and exhibition market.
However, simultaneous interpretation applications built from speech recognition, machine translation, and speech synthesis still face many problems across diverse conference scenarios and audiences. For example, speech recognition is easily affected by the environment, so recognition accuracy is low and the machine translation results are consequently unsatisfactory.
In addition, the synthesized audio output of the machine translation can be choppy and halting. The present invention focuses in particular on the fact that the synthesized interpretation speech lags noticeably behind the original on-site speech, and that this delay accumulates as the talk proceeds. Once the delay becomes obvious, listeners relying on the interpretation cannot respond to the speaker in time; the content and emotion of the speech become disconnected from the audience's reactions and interaction, greatly diminishing the effect of the talk, and participants fail to obtain a good interpretation listening experience.
Disclosure of Invention
In view of the foregoing, the present invention provides a method, an apparatus, and a device for dynamically synthesizing machine simultaneous interpretation output audio, together with a corresponding computer-readable storage medium and computer program product, which dynamically adjust and synthesize the output translation audio for the specific problems of a given simultaneous interpretation scenario, thereby improving the output quality of machine simultaneous interpretation.
The technical solution adopted by the invention is as follows:
In a first aspect, the present invention provides a method for dynamically synthesizing machine simultaneous interpretation output audio, including:
determining, according to collected current simultaneous interpretation scene information and a predetermined rule, whether the speech rate of the synthesized audio needs to be adjusted;
if so, obtaining in real time a first duration of the current source sentence, and predicting a second duration of the audio to be synthesized from the translated text corresponding to the current source sentence;
calculating the time difference between the first duration and the second duration, and accumulating the cumulative time difference of the interpretation session so far;
adjusting the translation strategy in real time and/or determining a speech rate adjustment gain parameter according to the relationship between the time difference and/or the cumulative time difference and the corresponding preset tolerance;
and performing speech synthesis according to the adjusted translated text and/or the speech rate adjustment gain parameter.
In at least one possible implementation, the current simultaneous interpretation scene information includes one or more of:
the current source language type and the translation direction;
personalized information of the speaker; and
pronunciation features of the current source sentence.
In at least one possible implementation, the adjusting of the translation strategy includes:
performing a second translation of the current source sentence, using words and/or grammar different from the previous translated text so as to change the length of the re-translated text.
In at least one possible implementation, the determining of the speech rate adjustment gain parameter includes:
collecting corresponding corpora in advance according to the simultaneous interpretation scenario;
performing delay analysis with the corpora to determine a preset gain parameter;
and selecting either the preset gain parameter or the ratio as the speech rate adjustment gain parameter, based on the ratio of the time difference and/or the cumulative time difference to the preset tolerance.
In at least one possible implementation, the method further includes:
determining pronunciation adjustment parameters of the current output audio according to the speech rate adjustment gain parameter, or according to the speech rate adjustment gain parameter together with the current simultaneous interpretation scene information;
and synthesizing the current translated text in combination with the pronunciation adjustment parameters.
In at least one possible implementation, determining the pronunciation adjustment parameters of the current output audio according to the current simultaneous interpretation scene information includes:
obtaining a loudness adjustment parameter of the current output audio according to the loudness of the current source sentence; and/or
performing emotion analysis on the current translated text in combination with the current source sentence,
and obtaining mood adjustment parameters of the current output audio according to the emotion analysis result.
In at least one possible implementation, obtaining the loudness adjustment parameter of the current output audio according to the loudness of the current source sentence includes:
continuously obtaining the original volume value of each audio frame of the current source sentence;
calculating the volume difference of each audio frame based on the original volume values;
and determining the volume adjustment parameter of the current output audio by combining the preset default loudness of the current output audio with the volume differences.
In a second aspect, the present invention provides an apparatus for dynamically synthesizing machine simultaneous interpretation output audio, comprising:
a speech rate adjustment decision module, configured to determine, according to collected current simultaneous interpretation scene information and a predetermined rule, whether the speech rate of the synthesized audio needs to be adjusted;
an audio duration calculation module, configured to, when the output of the speech rate adjustment decision module is yes, obtain in real time a first duration of the current source sentence and predict a second duration of the audio synthesized from the corresponding translated text;
a delay calculation module, configured to calculate the time difference between the first duration and the second duration, and to accumulate the cumulative time difference of the interpretation session so far;
a speech rate adjustment parameter determination module, configured to adjust the translation strategy in real time and/or determine a speech rate adjustment gain parameter according to the relationship between the time difference and/or the cumulative time difference and the corresponding preset tolerance;
and a speech synthesis module, configured to perform speech synthesis according to the adjusted translated text and/or the speech rate adjustment gain parameter.
In at least one possible implementation, the current simultaneous interpretation scene information includes one or more of:
the current source language type and the translation direction;
personalized information of the speaker; and
pronunciation features of the current source sentence.
In at least one possible implementation, the speech rate adjustment parameter determination module includes a translation strategy adjustment unit, which specifically includes:
a secondary translation component, configured to perform a second translation of the current source sentence, using words and/or grammar different from the previous translated text so as to change the length of the re-translated text.
In at least one possible implementation, the speech rate adjustment parameter determination module specifically includes:
a corpus acquisition unit, configured to collect corresponding corpora in advance according to the simultaneous interpretation scenario;
a gain parameter unit, configured to perform delay analysis with the corpora and determine a preset gain parameter;
and a speech rate adjustment parameter selection unit, configured to select either the preset gain parameter or the ratio as the speech rate adjustment gain parameter, based on the ratio of the time difference and/or the cumulative time difference to the preset tolerance.
In at least one possible implementation, the apparatus further includes:
a pronunciation adjustment parameter determination module, configured to determine pronunciation adjustment parameters of the current output audio according to the speech rate adjustment gain parameter, or according to the speech rate adjustment gain parameter together with the current simultaneous interpretation scene information;
the speech synthesis module being further configured to synthesize the current translated text in combination with the pronunciation adjustment parameters.
In at least one possible implementation, the pronunciation adjustment parameter determination module includes:
a loudness adjustment unit, configured to obtain a loudness adjustment parameter of the current output audio according to the loudness of the current source sentence; and/or
a mood adjustment unit, which specifically includes:
an emotion analysis component, configured to perform emotion analysis on the current translated text in combination with the current source sentence;
and a mood adjustment parameter acquisition component, configured to obtain the mood adjustment parameters of the current output audio according to the emotion analysis result.
In at least one possible implementation, the loudness adjustment unit includes:
a source volume acquisition component, configured to continuously obtain the original volume value of each audio frame of the current source sentence;
a source volume difference calculation component, configured to calculate the volume difference of each audio frame based on the original volume values;
and a volume adjustment parameter determination component, configured to determine the volume adjustment parameter of the current output audio by combining the preset default loudness of the current output audio with the volume differences.
In a third aspect, the present invention provides a device for dynamically synthesizing machine simultaneous interpretation output audio, comprising:
one or more processors, a memory (which may employ a non-volatile storage medium), and one or more computer programs stored in the memory, the one or more computer programs comprising instructions which, when executed by the device, cause the device to perform the method of the first aspect or of any possible implementation of the first aspect.
In a fourth aspect, the present invention provides a computer-readable storage medium storing a computer program which, when run on a computer, causes the computer to perform at least the method of the first aspect or of any of its possible implementations.
In a fifth aspect, the present invention further provides a computer program product which, when executed by a computer, performs at least the method of the first aspect or of any of its possible implementations.
In at least one possible implementation of the fifth aspect, the program related to the product may be stored wholly or partly in a memory packaged with the processor, or wholly or partly in a storage medium not packaged with the processor.
The core of the invention is to introduce a speech rate adjustment factor when synthesizing audio, shortening as far as possible the delay between the synthesized audio produced by machine simultaneous interpretation and the original speech. Concretely, starting from the current interpretation scenario, it is first determined on preset rules whether the speech rate of the synthesized audio needs adjusting. When intervention is deemed necessary, the duration of the current source sentence and the predicted duration of the corresponding translated text are obtained in real time; the difference between the two and the cumulative time difference of the session so far are computed; and then, considering the relationship between the current time difference and/or the cumulative time difference and their respective preset tolerances, the translation strategy is dynamically adjusted and/or a speech rate adjustment gain parameter is determined. Two adjustment means are involved: one intervenes directly in the translated text; the other leaves the text untouched and attaches a speech rate change coefficient to the current translated text. Finally, speech synthesis is performed according to these two different approaches. By using specific scene information as the condition that triggers adjustment of the synthesized audio, and applying interventions at different levels, the invention achieves dynamic speech rate adjustment of the current output audio, thereby alleviating the delay of machine simultaneous interpretation and effectively improving its output quality.
Further, with the listening experience of participants in mind, other embodiments of the invention also adjust the pronunciation level of the output audio. On the one hand this yields an expression closer to the original speech; on the other hand it assists the speech rate adjustment strategy described above, mitigating any degradation in listening quality that the rate adjustment might cause.
Drawings
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of an embodiment of a method for dynamically synthesizing machine simultaneous interpretation output audio according to the present invention;
FIG. 2 is a flow chart of an embodiment of a method for obtaining loudness adjustment parameters provided by the present invention;
FIG. 3 is a block diagram of an embodiment of an apparatus for dynamically synthesizing machine simultaneous interpretation output audio provided by the present invention;
FIG. 4 is a schematic diagram of an embodiment of a device for dynamically synthesizing machine simultaneous interpretation output audio provided by the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative only and should not be construed as limiting the invention.
Before explaining the embodiments, the design process of the invention is described. As noted above, machine simultaneous interpretation is constrained by the interpretation scenario and the underlying technologies, and its delay problem is pronounced. What characterizes the simultaneous interpretation scenario is that translated audio must be output in real time while the speaker keeps producing sentences, unlike consecutive interpretation, where each sentence is rendered after the speaker finishes it. The inventors' analysis attributes the delay of the synthesized output audio mainly to language differences and to the speaker's personal characteristics and manner of expression. The language difference is the more critical factor: different languages differ in saturation. After the speaker's speech is converted into a source text and that text is machine-translated, the translated text and the source text usually differ in length, and consequently the synthesized translation audio differs in length from the source sentence audio. For example, a Chinese sentence that takes 10 seconds to read aloud may take 15 seconds as machine-synthesized English at a similar cadence. This deviation accumulates over time, so a user connected to a machine interpretation system is likely to find it hard to follow the speaker's rhythm, which is undoubtedly a poor experience for speaker and audience alike.
For example, suppose the current source sentence is "I want to clarify first and then speak the next part" (spoken in Chinese), which takes 4 seconds to read at a normal, natural rate, while its English translation is "I would like to make one point clear before I move on to the next point", whose synthesized audio takes 6 seconds to play. There is thus a 2-second gap (called the time difference in this invention) between the source sentence and the target translation, and by the N-th sentence the offset between the source speech and the corresponding synthesized translation audio will clearly have been amplified as the talk continues and time passes.
Based on this analysis, the inventors seek to keep the duration of the machine-synthesized output audio as close as possible to, or somewhat shorter than, the duration of the speaker's original utterance, since this mitigates the delay effect to a large extent.
Accordingly, the conception of the invention is to match, within a limited time, the speech rate of the synthesized translation audio to the speech rate of the original speech in a simultaneous interpretation scenario. The invention specifically provides at least one embodiment of a method for dynamically synthesizing machine simultaneous interpretation output audio which, as shown in fig. 1, may include the following steps:
and step S1, judging whether the synthetic audio speech rate needs to be adjusted according to the collected current simultaneous interpretation scene information and a preset rule.
The method comprises the steps of firstly evaluating whether the speech rate of the synthesized audio needs to be interfered on the whole under the specific condition of the current machine co-transmission scene, wherein the reason for designing the step is as mentioned above. The present machine co-interpretation scenario information may mainly or mainly surround the foregoing aspects according to actual needs, for example, in some preferred embodiments of the present invention, the present machine co-interpretation scenario information includes one or more of the following:
(1) The current source language type and the translation direction.
The source and target languages of the interpretation scenario determine the difference in saturation. Within the same language family and branch, the saturation of the source and target languages can be considered close, so no prominent duration difference arises; translating among English, French, German, and similar languages does not introduce much time difference. Translation between more distant languages, however, can produce significant time differences, as in, but not limited to, the Chinese-English scenario described above.
(2) Personalized information of the speaker.
Speakers differ in personal characteristics such as age, cognition, social role, gender, speaking style, gestures, and facial expression, and these differences affect the interpretation. For example, an elderly speaker may talk slowly and accompany the speech with gestures, leaving the machine ample time for translation and synthesis, so the duration of the machine's output audio can stay close to the speaker's pace.
(3) Pronunciation features of the current source sentence.
From the acoustic standpoint, pronunciation-level features include prosody, intonation, emotion, timbre, pauses, and the like; different pronunciation features also change the duration of the original utterance. For example, if a stretch of source speech has drawn-out sentence endings, or the pauses between sentences are frequent and long, the machine-synthesized audio can likewise stay close to the duration of the source sentences.
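As an illustration only, the following minimal sketch shows how such scene information might feed the step S1 trigger rule; the language-pair table, field names, and thresholds are assumptions made for this sketch, not values given by the invention.

```python
# Hypothetical trigger rule for step S1. The pair table and thresholds
# below are illustrative assumptions, not part of the patented method.
SIMILAR_SATURATION_PAIRS = {("de", "en"), ("en", "fr"), ("de", "fr")}

def needs_rate_adjustment(scene: dict) -> bool:
    """Decide whether synthesized-speech-rate intervention is warranted."""
    pair = tuple(sorted((scene["source_lang"], scene["target_lang"])))
    if pair in SIMILAR_SATURATION_PAIRS:
        return False  # similar "saturation": little duration skew expected
    if scene.get("speaker_rate_wpm", 150) < 110:
        return False  # a slow speaker leaves time for translation + synthesis
    if scene.get("mean_pause_sec", 0.0) > 1.0:
        return False  # long inter-sentence pauses absorb synthesis delay
    return True
```

In practice such a rule would be tuned per deployment; the point is only that language pair, speaker information, and pronunciation features each contribute a cheap test.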
Therefore, before deciding to intervene in the synthesized speech rate, whether to continue is determined against established criteria drawn from the overall factors of the current scenario (such as the relationship between the two languages involved, the speaker's personal information, and the acoustic characteristics of the current sentence). If this step finds that the current scenario needs speech rate intervention, the following processing is executed:
Step S2: obtain in real time the first duration of the current source sentence, and estimate the second duration of the audio to be synthesized from the corresponding translated text.
Step S3: calculate the time difference between the first duration and the second duration, and accumulate the cumulative time difference of the interpretation session so far.
The durations and the time difference can be calculated with various existing techniques, for example by obtaining the timestamps of the current source sentence (its start and end times) and deriving the first duration from them. Obtaining the second duration is similar, with one caveat: in this embodiment the duration of the synthesized audio of the translated text is estimated, i.e., the second duration is not the duration of the final output audio, but a predicted initial duration of audio synthesized from the current translated text. The prediction can be made in several ways, for example by fitting an initial stretch of speech and computing its end point, or by using a pretrained duration prediction model that takes features such as the language type and the current translated text as input and outputs the initial duration directly.
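As a hedged sketch of step S2, the following assumes sentence timestamps are available and substitutes a crude characters-per-second heuristic for the duration prediction model mentioned above; the rate table is an assumption.

```python
# Assumed synthesis speeds (characters per second); stand-in for a real
# duration prediction model, which the patent leaves open.
SYNTH_RATE_CPS = {"en": 14.0, "zh": 5.5}

def first_duration(sent_start: float, sent_end: float) -> float:
    """First duration: length of the current source sentence from timestamps."""
    return sent_end - sent_start

def predict_second_duration(translation: str, target_lang: str) -> float:
    """Second duration: rough estimate of the synthesized translation audio."""
    return len(translation) / SYNTH_RATE_CPS[target_lang]
```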
Regarding the calculation of the time difference and the accumulation of the cumulative time difference: let the first duration of the current, N-th source sentence be V_N, and the estimated second duration of the N-th translation be S_N. The time difference of the current N-th sentence is then:

S_N - V_N

Without any intervention, the time the speaker takes to finish the first N-1 sentences is:

Σ_{n=1}^{N-1} V_n

while the output audio time of the first N-1 translations produced by the machine interpretation is:

Σ_{n=1}^{N-1} S_n

The time offset, i.e. the cumulative time difference, is therefore:

Σ_{n=1}^{N-1} (S_n - V_n)
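In code, the per-sentence difference and the running offset follow directly from these formulas; a minimal sketch:

```python
def time_difference(v_n: float, s_n: float) -> float:
    """Per-sentence difference S_n - V_n (positive: translation runs longer)."""
    return s_n - v_n

def cumulative_difference(v: list[float], s: list[float]) -> float:
    """Cumulative offset over the sentences seen so far."""
    return sum(s_i - v_i for v_i, s_i in zip(v, s))
```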
Next, Step S4: adjust the translation strategy and/or determine the speech rate adjustment gain parameter in real time, according to the relationship between the time difference and/or the cumulative time difference and the corresponding preset tolerance.
For example, based on the application scenario and experience, the tolerance for the time offset is preset to σ (i.e., a duration deviation within this range is considered acceptable; note that separate tolerances can be set for the per-sentence time difference and the cumulative time difference). Taking the cumulative time difference as the example (the comparison for the per-sentence time difference is the same and is not repeated here), two relationships are possible:

(1) Σ_{n=1}^{N-1} (S_n - V_n) ≤ σ

(2) Σ_{n=1}^{N-1} (S_n - V_n) > σ

How each case is handled is described below.
As for the means of adjusting the speech rate, this embodiment essentially proposes two approaches at different levels: one intervenes directly in the translated text; the other leaves the text untouched and attaches a speech rate change coefficient to the current translated text.
Specifically, adjusting the translation strategy means changing the generated current translated text. For example, the current source sentence may be translated a second time, preferably using words and/or grammar different from the previous translated text, so that the length of the re-translated text changes. This is easy to understand: the second translation changes the saturation of the translated text. In practice the content may be shortened or expanded as the interpretation scenario requires, and the modification may cover the whole text or only part of it (for example, rendering a proper noun in an abbreviated form); the invention imposes no limitation here. It should be pointed out, however, that the second translation can be triggered according to the relationships above; for example, when the cumulative time difference exceeds σ, the second translation is started and a new translated text is obtained under the different translation strategy.
Determining the speech rate adjustment gain parameter can be understood as changing the speech rate of the output audio of the current translation by a coefficient (a multiple, a ratio, or the like). In practice, corpora can be built for different scenarios for training and delay analysis in order to find an optimal multiple t (the preset gain parameter), where optimal means the second duration comes as close as possible to the first duration. For example, suppose N corpus samples are collected for a given simultaneous interpretation scenario, the speaking time of the speaker in sample i is T_i, and the synthesized audio time of the corresponding translation is S_i. Then the smallest t satisfying the following objective is the optimal speech rate multiple for adjusting the synthesized speech rate in that scenario:

Σ_{i=1}^{N} (S_i / t - T_i) ≤ 0

min t
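The sketch below is one reading of this objective, assuming the sped-up duration of a translation scales as S_i / t; the search procedure and the cap are our own assumptions.

```python
# One reading of the objective: find the smallest rate multiple t (capped
# for safety) such that the total sped-up translation duration no longer
# exceeds the total source-speech duration over the corpus.
def optimal_rate_multiple(T: list[float], S: list[float],
                          step: float = 0.01, t_max: float = 4.0) -> float:
    t = 1.0
    while t < t_max and sum(s / t - v for v, s in zip(T, S)) > 0:
        t += step
    return round(t, 2)
```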
The optimal value of t is called a preset gain parameter because the final speech rate gain still has to be selected according to the relationship between the time difference (or cumulative time difference) and the tolerance. For example, when the cumulative time difference is less than or equal to σ (the per-sentence time difference is handled identically and is not repeated), 1 may be used as the final speech rate gain, meaning the speech rate is left unchanged. If the cumulative time difference is greater than σ, either the preset gain parameter or the ratio may be selected as the final speech rate adjustment gain parameter, based on the ratio of the cumulative time difference to the preset tolerance.
The ratio here may be the multiple relationship between the actual deviation and the ideal (tolerated) deviation. Specifically, when selecting, if the ratio of the actual deviation to the ideal deviation exceeds the pre-obtained optimal multiple t (the preset gain parameter), that ratio may be used as the final speech rate adjustment gain parameter; if the ratio is less than or equal to the preset gain parameter, the preset gain parameter may be used instead. Of course, the rule for determining the final gain can be chosen as needed, and in some embodiments may be the opposite of the selection just described; the invention is not limited in this respect. Note also that during the subsequent synthesis, the final speech rate adjustment gain parameter may act on the whole of the current translation or only part of it, for example attaching the rate parameter only to the first or second half of "I would like to make one point clear before I move on to the next point".
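A compact sketch of this selection rule (function and variable names are ours, not the patent's):

```python
# Final gain selection: within tolerance, leave the rate alone; beyond it,
# use the larger of the deviation ratio and the preset multiple t.
def select_rate_gain(cum_diff: float, sigma: float, preset_t: float) -> float:
    if cum_diff <= sigma:
        return 1.0  # within tolerance: speech rate unchanged
    ratio = cum_diff / sigma  # actual deviation vs. tolerated deviation
    return ratio if ratio > preset_t else preset_t
```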
Finally, Step S5: perform speech synthesis according to the translated text obtained after adjusting the translation strategy and/or according to the speech rate adjustment gain parameter.
The synthesis method itself can follow the prior art; the point here is that the object of synthesis may be the new translated text produced by re-translation, and/or that the speech rate adjustment gain parameter may be merged into the synthesis operation. Once synthesis is complete, the audio can be streamed to the listening terminals of the participants, which is not the focus of the invention and is not elaborated.
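To make the data flow concrete, a minimal sketch with a hypothetical TTS interface; the `rate` keyword is an assumed parameter of this illustrative engine, not a specific product's API.

```python
# Step S5 sketch: synthesize the (possibly re-translated) text with the
# selected speech rate gain merged into the synthesis call.
def synthesize_output(tts, translation: str, rate_gain: float) -> bytes:
    # rate_gain > 1.0 speeds up the output to claw back accumulated delay.
    return tts.synthesize(text=translation, rate=rate_gain)
```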
It should be added that, building on the foregoing embodiments, the invention further proposes incorporating pronunciation adjustment parameters at the synthesis stage to improve the listening experience more effectively. That is, in a preferred embodiment, the synthesis method may further include: determining pronunciation adjustment parameters of the current output audio according to the speech rate adjustment gain parameter, or according to that parameter together with the current simultaneous interpretation scene information, and applying these parameters when synthesizing the current translated text (including a re-translated new text). The synthesized output audio then not only has reduced delay but is also more natural acoustically and closer to the speaker's own delivery.
It is well known that machine synthesis differs from real speech in naturalness, but this preferred solution is designed not merely to improve the synthesized audio in general; it also ties into the delay-handling concept above. The inventors considered that speech rate adjustment may make the inherent unnaturalness of synthesized audio more prominent (for example, when the rate is pushed too fast or too slow, and especially where grammar differs between languages, or where speakers differ in personal information or pronunciation acoustics), and designed this solution accordingly.
Specifically, the speech rate adjustment gain parameter obtained in the foregoing embodiments may be used on its own, or combined with the current simultaneous interpretation scene information; two aspects of the scene information are described here by way of illustration.
(1) The loudness adjustment parameter of the current output audio can be obtained from the loudness of the current source sentence.
For example, the speaker may stress key passages of the original speech by raising the volume, so the speaker's personalized information can be exploited to recover expressive information closer to the pre-translation source sentence. Specifically, with reference to the embodiment shown in fig. 2, obtaining the loudness adjustment parameter of the current output audio according to the loudness of the current source sentence includes:
Step S10: continuously obtain the original volume value of each audio frame of the current source sentence;
Step S20: calculate the volume difference of each audio frame based on the original volume values;
Step S30: determine the volume adjustment parameter of the current output audio by combining the preset default loudness of the current output audio with the volume differences.
In practice, the volume of each frame of the current source sentence can be obtained continuously with an existing volume-calculation method and denoted d0, d1, ..., dn; the difference between each frame and the previous one is then 0, d1 - d0, d2 - d1, ..., dn - d(n-1).
The default volume of machine-synthesized audio is usually preset for normal conditions. Denoting this default volume by s, the volume adjustment parameters of the current output audio can be taken, frame by frame, as:

s, s + (d1 - d0), s + (d2 - d1), ..., s + (dn - d(n-1))

That is, each frame of the synthesized audio is assigned an updated volume value, so that this frame-level intervention keeps the synthesized audio's volume dynamics consistent with the actual source volume and makes the output audio track the speaker's delivery more closely.
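A minimal sketch of this frame-level update, under one reading of the formula above (each synthesized frame is offset by the corresponding frame-to-frame change of the source volume):

```python
# Offset the default synthesis volume s by each frame-to-frame volume change
# of the source sentence, so the output inherits the speaker's dynamics.
def frame_volumes(source_vols: list[float], s: float) -> list[float]:
    out = [s]  # first frame: the difference is defined as 0
    for prev, cur in zip(source_vols, source_vols[1:]):
        out.append(s + (cur - prev))
    return out
```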
(2) Emotion analysis may be performed on the current translated text in combination with the current source sentence, and the mood adjustment parameters of the current output audio obtained from the result.
Beyond the volume dimension, the synthesis of the translated text can be tuned dynamically using existing sentiment analysis and keyword extraction techniques. Specifically, sentiment analysis yields the emotion type of the sentence, and keyword extraction yields its key words. To improve the listener's perception, the mood adjustment parameters may include, but are not limited to, pitch and stress markers, used as follows: 1. if the emotion is positive, the pitch of the synthesized speech is raised appropriately; 2. if the emotion is negative, the pitch is lowered appropriately; 3. a stress marker is added to each extracted keyword, indicating that the corresponding place in the translation is emphasized.
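A sketch of these rules; the sentiment label and keyword list are assumed to come from any off-the-shelf models, and the pitch offsets are illustrative assumptions.

```python
# Map an emotion label and extracted keywords to synthesis adjustments.
def mood_parameters(sentiment: str, keywords: list[str]) -> dict:
    pitch_shift = {"positive": +2, "negative": -2}.get(sentiment, 0)  # semitones
    return {
        "pitch_shift": pitch_shift,   # raise for positive, lower for negative
        "stress_words": list(keywords),  # marked for emphasis in synthesis
    }
```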
Finally, the mood adjustment parameters act on the synthesis stage together with the speech rate adjustment parameters obtained earlier, to compensate for and improve the listening quality of the output audio.
In summary, the idea of the invention is to introduce a speech rate adjustment factor when synthesizing audio, shortening as far as possible the delay between the synthesized audio produced by machine simultaneous interpretation and the original speech. Concretely, starting from the current interpretation scenario, it is first determined on predetermined rules whether the speech rate of the synthesized audio needs adjusting. When intervention is deemed necessary, the duration of the current source sentence and the predicted duration of the corresponding translated text are obtained in real time; the difference between the two and the cumulative time difference of the session so far are computed; and then, considering the relationship between the time difference and/or cumulative time difference of the current sentence and the respective preset tolerances, the translation strategy is dynamically adjusted and/or a speech rate adjustment gain parameter is determined. Two adjustment means are involved: one intervenes directly in the translated text; the other leaves the text untouched and attaches a speech rate change coefficient to the current translated text. Finally, speech synthesis is performed according to these two different approaches. By using specific scene information as the condition that triggers adjustment of the synthesized audio, and applying interventions at different levels, the invention achieves dynamic speech rate adjustment of the current output audio, thereby alleviating the delay of machine simultaneous interpretation and effectively improving its output quality.
Corresponding to the above embodiments and preferred solutions, the invention further provides an embodiment of an apparatus for dynamically synthesizing machine simultaneous interpretation output audio; as shown in fig. 3, it may specifically include the following components:
a speech rate adjustment decision module 1, configured to determine, according to collected current simultaneous interpretation scene information and a predetermined rule, whether the speech rate of the synthesized audio needs to be adjusted;
an audio duration calculation module 2, configured to, when the output of the speech rate adjustment decision module is yes, obtain in real time a first duration of the current source sentence and predict a second duration of the audio synthesized from the corresponding translated text;
a delay calculation module 3, configured to calculate the time difference between the first duration and the second duration, and to accumulate the cumulative time difference of the interpretation session so far;
a speech rate adjustment parameter determination module 4, configured to adjust the translation strategy in real time and/or determine a speech rate adjustment gain parameter according to the relationship between the time difference and/or the cumulative time difference and the corresponding preset tolerance;
and a speech synthesis module 5, configured to perform speech synthesis according to the adjusted translated text and/or the speech rate adjustment gain parameter.
In at least one possible implementation, the current simultaneous interpretation scene information includes one or more of:
the current source language type and the translation direction;
personalized information of the speaker; and
pronunciation features of the current source sentence.
In at least one possible implementation, the speech rate adjustment parameter determination module includes a translation strategy adjustment unit, which specifically includes:
a secondary translation component, configured to perform a second translation of the current source sentence, using words and/or grammar different from the previous translated text so as to change the length of the re-translated text.
In at least one possible implementation, the speech rate adjustment parameter determination module specifically includes:
a corpus acquisition unit, configured to collect corresponding corpora in advance according to the simultaneous interpretation scenario;
a gain parameter unit, configured to perform delay analysis with the corpora and determine a preset gain parameter;
and a speech rate adjustment parameter selection unit, configured to select either the preset gain parameter or the ratio as the speech rate adjustment gain parameter, based on the ratio of the time difference and/or the cumulative time difference to the preset tolerance.
In at least one possible implementation, the apparatus further includes:
a pronunciation adjustment parameter determination module, configured to determine pronunciation adjustment parameters of the current output audio according to the speech rate adjustment gain parameter, or according to the speech rate adjustment gain parameter together with the current simultaneous interpretation scene information;
the speech synthesis module being further configured to synthesize the current translated text in combination with the pronunciation adjustment parameters.
In at least one possible implementation, the pronunciation adjustment parameter determination module includes:
a loudness adjustment unit, configured to obtain a loudness adjustment parameter of the current output audio according to the loudness of the current source sentence; and/or
a mood adjustment unit, which specifically includes:
an emotion analysis component, configured to perform emotion analysis on the current translated text in combination with the current source sentence;
and a mood adjustment parameter acquisition component, configured to obtain the mood adjustment parameters of the current output audio according to the emotion analysis result.
In at least one possible implementation, the loudness adjustment unit includes:
a source volume acquisition component, configured to continuously obtain the original volume value of each audio frame of the current source sentence;
a source volume difference calculation component, configured to calculate the volume difference of each audio frame based on the original volume values;
and a volume adjustment parameter determination component, configured to determine the volume adjustment parameter of the current output audio by combining the preset default loudness of the current output audio with the volume differences.
It should be understood that the division of components in the apparatus shown in fig. 3 is merely a logical division; in actual implementation they may be wholly or partly integrated into one physical entity or kept physically separate. All of these components may be implemented as software invoked by a processing element, or entirely in hardware, or some in software invoked by a processing element and some in hardware. For example, a module may be a separately established processing element, or may be integrated into a chip of the electronic device; the other components are implemented similarly. In addition, all or some of the components may be integrated together or implemented independently. In implementation, each step of the above method, or each component above, can be completed by an integrated logic circuit in hardware within a processor element or by instructions in the form of software.
For example, the above components may be one or more integrated circuits configured to implement the above methods, such as one or more Application-Specific Integrated Circuits (ASICs), one or more digital signal processors (DSPs), or one or more Field-Programmable Gate Arrays (FPGAs). For another example, these components may be integrated together and implemented in the form of a System-on-a-Chip (SoC).
In view of the foregoing examples and their preferred embodiments, those skilled in the art will appreciate that, in practice, the invention may rest on a variety of carriers, illustrated schematically below:
(1) A device for dynamically synthesizing machine simultaneous interpretation output audio, which may include:
one or more processors, a memory, and one or more computer programs, wherein the one or more computer programs are stored in the memory and comprise instructions which, when executed by the device, cause the device to perform the steps/functions of the foregoing embodiments or of equivalent implementations.
Fig. 4 is a schematic structural diagram of an embodiment of such a device, which may be an electronic device or a circuit device built into an electronic device. The electronic device may be a PC, a server, a translator device, a recording pen, a mobile intelligent terminal (mobile phone, tablet, reader, watch, wristband, glasses, and the like), a microphone, an earphone, and so on. This embodiment does not limit the specific form of the device.
As shown in fig. 4, the device 900 includes a processor 910 and a memory 930, which can communicate with each other and transfer control and/or data signals over an internal connection path. The memory 930 stores a computer program, and the processor 910 calls and runs that program from the memory 930. The processor 910 and the memory 930 may be combined into a single processing apparatus or, more commonly, kept as separate components; the processor 910 executes the program code stored in the memory 930 to implement the functions described above. In specific implementations, the memory 930 may be integrated into the processor 910 or be independent of it.
In addition, to further enhance its functionality, the device 900 may include one or more of an input unit 960, a display unit 970 (which may include a display screen), an audio circuit 980 (which may further include a speaker 982, a microphone 984, and the like), a camera 990, and a sensor 901.
The device 900 may further include a power supply 950 for supplying power to the various devices or circuits within it.
It should be understood that the device 900 shown in fig. 4 can implement each process of the methods provided by the foregoing embodiments; the operations and/or functions of its components serve to implement the corresponding flows of the method embodiments above. Reference is made to the foregoing description of the method and apparatus embodiments; detailed description is omitted here where appropriate to avoid repetition.
The processor 910 of the device 900 may be a system-on-a-chip (SoC); it may include a Central Processing Unit (CPU) and may further include other types of processors, such as a Graphics Processing Unit (GPU). The various processors or processing units within the processor 910 may cooperate to implement the preceding method flow, and the corresponding software programs for each part may be stored in the memory 930.
(2) A readable storage medium on which a computer program (or the above apparatus) is stored; when executed, the program causes a computer to perform the steps/functions of the foregoing embodiments or of equivalent implementations.
In the several embodiments provided by the invention, any function, if implemented as a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium. On this understanding, the parts of the technical solution of the invention that substantially contribute over the prior art may be embodied in the form of a software product, as described below.
(3) A computer program product (which may include the above apparatus); when run on a terminal device, it causes the terminal device to perform the method for dynamically synthesizing machine simultaneous interpretation output audio of the foregoing embodiments or an equivalent implementation.
From the above description of the embodiments, it is clear to those skilled in the art that all or part of the steps of the above methods can be implemented by software plus a necessary general-purpose hardware platform. On this understanding, the computer program product may include, but is not limited to, an APP. The device/terminal may be a computer device whose hardware structure may specifically include at least one processor, at least one communication interface, at least one memory, and at least one communication bus, the processor, communication interface, and memory communicating with one another over the bus. The processor may be a CPU, a DSP, a microcontroller, or a digital signal processor, and may further include a GPU, an embedded Neural-network Processing Unit (NPU), and an Image Signal Processor (ISP); it may further include an ASIC, or one or more integrated circuits configured to implement embodiments of the invention. The processor can run one or more software programs, which may be stored in a storage medium such as the memory. The memory/storage medium may include non-volatile memories such as non-removable magnetic disks, USB flash drives, removable hard disks, and optical disks, as well as Read-Only Memory (ROM) and Random Access Memory (RAM).
In the embodiments of the present invention, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association between objects and indicates that three relationships are possible: for example, "A and/or B" can mean A alone, A and B together, or B alone, where A and B may each be singular or plural. The character "/" generally indicates an "or" relationship between the objects it joins. "At least one of the following" and similar expressions refer to any combination of the listed items, including any combination of single or plural items. For example, "at least one of a, b, and c" can denote: a; b; c; a and b; a and c; b and c; or a, b, and c; where each of a, b, and c may itself be singular or plural.
Those of skill in the art will appreciate that the various modules, units, and method steps described in the embodiments disclosed in this specification can be implemented as electronic hardware, computer software, or a combination of the two. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In addition, the embodiments in this specification are described in a progressive manner, and for the same or similar parts among the embodiments, reference may be made from one to another. In particular, for the embodiments of devices, apparatuses, and the like, since they are substantially similar to the method embodiments, reference may be made to the relevant descriptions of the method embodiments. The above-described embodiments of devices, apparatuses, and the like are merely illustrative; modules and units described as separate components may or may not be physically separate, and may be located in one place or distributed across multiple places, for example, on nodes of a system network. Some or all of the modules and units can be selected according to actual needs to achieve the purpose of the embodiment, and this can be understood and carried out by those skilled in the art without inventive effort.
The structure, features, and effects of the present invention have been described in detail with reference to the embodiments shown in the drawings, but the above embodiments are merely preferred embodiments of the invention. It should be understood that the technical features of the above embodiments and their preferred modes can be reasonably combined and configured into various equivalent schemes by those skilled in the art without departing from or altering the design concept and technical effects of the invention. Therefore, the invention is not limited to the embodiments shown in the drawings, and all modifications and equivalent embodiments conceived in accordance with the idea of the invention fall within its scope, as long as they do not depart from the spirit of the description and drawings.

Claims (10)

1. A dynamic synthesis method of machine simultaneous interpretation output audio is characterized by comprising the following steps:
judging whether the speech rate of the synthesized audio needs to be adjusted according to the collected current simultaneous interpretation scene information and a preset rule;
if so, acquiring in real time a first duration of the current acoustic sentence, and predicting a second duration of the audio to be synthesized from the translation text corresponding to the current acoustic sentence;
calculating the time difference between the first duration and the second duration, and counting the accumulated time difference of the current simultaneous interpretation stage;
adjusting a translation strategy and/or determining a speech rate adjustment gain parameter in real time according to the relation between the time difference and/or the accumulated time difference and a corresponding preset tolerance;
and performing speech synthesis processing according to the translation text obtained after adjusting the translation strategy and/or according to the speech rate adjustment gain parameter.
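To make the flow of claim 1 concrete, the Python sketch below traces the same five steps. It is illustrative only: every name in it (should_adjust, estimate_tts_duration, SessionState, the 4.5 characters-per-second rate, the 2-second tolerance) is an assumption introduced here, not something the claims define.

```python
from dataclasses import dataclass

CHARS_PER_SECOND = 4.5  # assumed average speaking rate of the target-language TTS voice


@dataclass
class SessionState:
    accumulated_diff: float = 0.0  # cumulative (synthesis minus source) delay, seconds


def should_adjust(scene_info: dict) -> bool:
    """Preset rule on the scene information (placeholder logic)."""
    return scene_info.get("translation_direction") in {"zh-en", "en-zh"}


def estimate_tts_duration(text: str) -> float:
    """Predict the 'second duration': how long the synthesized translation will run."""
    return len(text) / CHARS_PER_SECOND


def process_sentence(source_duration: float, translation: str,
                     scene_info: dict, state: SessionState,
                     tolerance: float = 2.0):
    """Return (text to synthesize, speech-rate gain) for one source sentence."""
    if not should_adjust(scene_info):
        return translation, 1.0  # no speech-rate intervention in this scene

    second_duration = estimate_tts_duration(translation)
    time_diff = second_duration - source_duration  # second duration vs. first duration
    state.accumulated_diff += time_diff            # running delay of this session

    if abs(state.accumulated_diff) <= tolerance:
        # Within tolerance: keep the text and only apply a rate gain
        # (convention assumed here: gain > 1.0 means the TTS engine speaks faster).
        return translation, max(second_duration / max(source_duration, 1e-6), 1.0)

    # Beyond tolerance: intervene in the translation itself (the claim 3 branch);
    # a real system would request a shorter re-translation here.
    return translation, 1.0
```

In an actual deployment the returned gain would be handed to the TTS engine's speed control, and the over-tolerance branch would re-invoke the translation engine with a length constraint, as claim 3 describes.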
2. The method of machine simultaneous interpretation output audio dynamic synthesis according to claim 1, wherein the current simultaneous interpretation scene information includes one or more of:
the current original language type and the translation direction;
personalized information of the speaker; and
pronunciation-related features of the current acoustic sentence.
3. The method of machine simultaneous interpretation output audio dynamic synthesis of claim 1, wherein the adjusting the translation strategy comprises:
performing a secondary translation of the current acoustic sentence, in which the length of the re-translated text is changed by using words and/or grammatical structures different from those of the previous translation.
4. The method of claim 1, wherein determining the speech rate adjustment gain parameter comprises:
retrieving a corresponding corpus in advance according to the simultaneous interpretation scene;
performing time-delay analysis on the corpus and determining a preset gain parameter;
and, based on the proportional relation between the time difference and/or the accumulated time difference and the preset tolerance, selecting either the preset gain parameter or that proportional relation as the speech rate adjustment gain parameter.
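A minimal sketch of the selection step of claim 4, assuming the preset gain parameter was produced beforehand by time-delay analysis of a scene-matched corpus; the exact switch-over condition below is our guess, since the claim only ties the choice to the proportional relation.

```python
def select_rate_gain(time_diff: float, accumulated_diff: float,
                     tolerance: float, preset_gain: float) -> float:
    """Choose the speech rate adjustment gain parameter (illustrative only)."""
    # Proportional relation between the (accumulated) time difference and tolerance.
    ratio = max(abs(time_diff), abs(accumulated_diff)) / tolerance
    # Assumed rule: a mild overrun keeps the corpus-derived preset gain,
    # while a large overrun uses the ratio itself as the gain.
    return preset_gain if ratio <= 1.0 else ratio
```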
5. The method of machine simultaneous interpretation output audio dynamic synthesis according to any one of claims 1 to 4, further comprising:
determining a pronunciation adjustment parameter of the current output audio according to the speech rate adjustment gain parameter, or according to the speech rate adjustment gain parameter together with the current simultaneous interpretation scene information;
and synthesizing the current translation text in combination with the pronunciation adjustment parameter.
6. The method of claim 5, wherein determining the pronunciation adjustment parameter of the current output audio based on the current simultaneous interpretation scene information comprises:
obtaining a loudness adjustment parameter of the current output audio according to the loudness of the current acoustic sentence; and/or
performing emotion analysis on the current translation text in combination with the current acoustic sentence;
and obtaining a tone adjustment parameter of the current output audio according to the emotion analysis result.
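For the emotion branch of claim 6, one plausible (entirely hypothetical) realization maps the label produced by the emotion analysis to a pitch offset for the synthesizer; neither the label set nor the offsets are prescribed by the patent.

```python
# Hypothetical emotion labels and pitch offsets (in semitones), invented for illustration.
EMOTION_TO_PITCH_SHIFT = {
    "excited": +2.0,
    "neutral": 0.0,
    "solemn": -1.5,
}


def tone_adjustment(emotion_label: str) -> float:
    """Tone adjustment parameter of the current output audio from an emotion label."""
    return EMOTION_TO_PITCH_SHIFT.get(emotion_label, 0.0)
```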
7. The method of claim 6, wherein obtaining the loudness adjustment parameter of the current output audio according to the loudness of the current acoustic sentence comprises:
continuously acquiring the original volume value of each audio frame of the current acoustic sentence;
calculating a volume difference value for each audio frame based on the original volume values;
and determining the volume adjustment parameter of the current output audio by combining the preset default loudness of the current output audio with the volume difference values.
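The per-frame loudness computation of claim 7 can be pictured with the NumPy sketch below; the additive way the volume difference is combined with the default loudness is an assumption, as the claim does not fix the combination rule.

```python
import numpy as np


def frame_levels_db(frames: np.ndarray) -> np.ndarray:
    """RMS level of each audio frame in dB; frames has shape [n_frames, frame_len]."""
    rms = np.sqrt(np.mean(frames.astype(np.float64) ** 2, axis=1))
    return 20.0 * np.log10(rms + 1e-12)  # epsilon avoids log(0) on silent frames


def volume_adjustment(frames: np.ndarray, default_loudness_db: float = -20.0) -> np.ndarray:
    """Per-frame target level for the output audio (assumed additive combination)."""
    levels = frame_levels_db(frames)   # original volume value of each frame
    diff = levels - levels.mean()      # volume difference value per frame
    return default_loudness_db + diff  # output follows the source loudness contour
```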
8. A device for dynamically synthesizing machine simultaneous interpretation output audio, comprising:
a speech rate adjustment decision module for judging whether the speech rate of the synthesized audio needs to be adjusted according to the collected current simultaneous interpretation scene information and a predetermined rule;
an audio duration calculation module for, when the output of the speech rate adjustment decision module is yes, acquiring in real time a first duration of the current acoustic sentence and predicting a second duration of the audio to be synthesized from the translation text corresponding to the current acoustic sentence;
a delay calculation module for calculating the time difference between the first duration and the second duration and counting the accumulated time difference of the current simultaneous interpretation stage;
a speech rate adjustment parameter determination module for adjusting a translation strategy in real time and/or determining a speech rate adjustment gain parameter according to the relation between the time difference and/or the accumulated time difference and a corresponding preset tolerance;
and a speech synthesis module for performing speech synthesis processing according to the translation text obtained after adjusting the translation strategy and/or according to the speech rate adjustment gain parameter.
9. The device for machine simultaneous interpretation output audio dynamic synthesis according to claim 8, further comprising:
a pronunciation adjustment parameter determination module, configured to determine a pronunciation adjustment parameter of the current output audio according to the speech rate adjustment gain parameter, or according to the speech rate adjustment gain parameter together with the current simultaneous interpretation scene information;
wherein the speech synthesis module is further configured to synthesize the current translation text in combination with the pronunciation adjustment parameter.
10. A machine simultaneous interpretation output audio dynamic synthesis apparatus, comprising:
one or more processors, a memory, and one or more computer programs, wherein the one or more computer programs are stored in the memory and comprise instructions which, when executed by the apparatus, cause the apparatus to perform the machine simultaneous interpretation output audio dynamic synthesis method of any one of claims 1 to 7.
CN202011105784.6A 2020-10-15 2020-10-15 Method, device and equipment for dynamically synthesizing simultaneous interpretation output audio of machine Active CN112233649B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011105784.6A CN112233649B (en) 2020-10-15 2020-10-15 Method, device and equipment for dynamically synthesizing simultaneous interpretation output audio of machine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011105784.6A CN112233649B (en) 2020-10-15 2020-10-15 Method, device and equipment for dynamically synthesizing simultaneous interpretation output audio of machine

Publications (2)

Publication Number Publication Date
CN112233649A true CN112233649A (en) 2021-01-15
CN112233649B CN112233649B (en) 2024-04-30

Family

ID=74119136

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011105784.6A Active CN112233649B (en) 2020-10-15 2020-10-15 Method, device and equipment for dynamically synthesizing simultaneous interpretation output audio of machine

Country Status (1)

Country Link
CN (1) CN112233649B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2852438A1 (en) * 2003-03-13 2004-09-17 France Telecom Voice messages translating system for use in multi-lingual audio-conference, has temporizing unit to temporize messages such that terminals except terminal which transmits current message, receive endings in speakers language
JP2006031448A (en) * 2004-07-16 2006-02-02 Fuji Xerox Co Ltd Automatic translation apparatus and automatic translation system
CN101155291A (en) * 2006-09-27 2008-04-02 株式会社东芝 Apparatus, method and computer program product for translating speech, and terminal that outputs translated speech
US20120221320A1 (en) * 2011-02-28 2012-08-30 Ricoh Company, Ltd. Translation support apparatus, translation delivery period setting method, and storage medium
CN104462069A (en) * 2013-09-18 2015-03-25 株式会社东芝 Speech translation apparatus and speech translation method
CN109582982A (en) * 2018-12-17 2019-04-05 北京百度网讯科技有限公司 Method and apparatus for translated speech
CN110211570A (en) * 2019-05-20 2019-09-06 北京百度网讯科技有限公司 Simultaneous interpretation processing method, device and equipment
CN110705317A (en) * 2019-08-28 2020-01-17 科大讯飞股份有限公司 Translation method and related device
CN111144138A (en) * 2019-12-17 2020-05-12 Oppo广东移动通信有限公司 Simultaneous interpretation method and device and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114327357A (en) * 2022-01-05 2022-04-12 郑州市金水区正弘国际小学 Language learning auxiliary method, electronic equipment and storage medium
CN114327357B (en) * 2022-01-05 2024-02-02 郑州市金水区正弘国际小学 Language learning assisting method, electronic equipment and storage medium
CN116645954A (en) * 2023-07-27 2023-08-25 广东保伦电子股份有限公司 IP broadcasting system adopting AI (analog input) sound
CN116645954B (en) * 2023-07-27 2023-11-17 广东保伦电子股份有限公司 IP broadcasting system adopting AI (analog input) sound

Also Published As

Publication number Publication date
CN112233649B (en) 2024-04-30

Similar Documents

Publication Publication Date Title
US10930263B1 (en) Automatic voice dubbing for media content localization
KR20220004737A (en) Multilingual speech synthesis and cross-language speech replication
CN111276120B (en) Speech synthesis method, apparatus and computer-readable storage medium
US7490042B2 (en) Methods and apparatus for adapting output speech in accordance with context of communication
WO2014192959A1 (en) Technology for responding to remarks using speech synthesis
CN111489424A (en) Virtual character expression generation method, control method, device and terminal equipment
JP2023022150A (en) Bidirectional speech translation system, bidirectional speech translation method and program
EP3824461B1 (en) Method and system for creating object-based audio content
KR20200056261A (en) Electronic apparatus and method for controlling thereof
CN110379411B (en) Speech synthesis method and device for target speaker
WO2021212954A1 (en) Method and apparatus for synthesizing emotional speech of specific speaker with extremely few resources
CN112233649A (en) Method, device and equipment for dynamically synthesizing machine simultaneous interpretation output audio
KR20200145776A (en) Method, apparatus and program of voice correcting synthesis
CN113808576A (en) Voice conversion method, device and computer system
JP2014062970A (en) Voice synthesis, device, and program
CN116912375A (en) Facial animation generation method and device, electronic equipment and storage medium
CN113948062B (en) Data conversion method and computer storage medium
CN115376486A (en) Speech synthesis method, device, computer equipment and storage medium
CN115171645A (en) Dubbing method and device, electronic equipment and storage medium
CN114974218A (en) Voice conversion model training method and device and voice conversion method and device
CN114333758A (en) Speech synthesis method, apparatus, computer device, storage medium and product
Hanumante et al. English text to multilingual speech translator using android
JP6424419B2 (en) Voice control device, voice control method and program
CN114446304A (en) Voice interaction method, data processing method and device and electronic equipment
CN113299270B (en) Method, device, equipment and storage medium for generating voice synthesis system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant