CN112233649B - Method, device and equipment for dynamically synthesizing simultaneous interpretation output audio of machine


Info

Publication number
CN112233649B
CN112233649B (application CN202011105784.6A)
Authority
CN
China
Prior art keywords
current
output audio
audio
translation
simultaneous interpretation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011105784.6A
Other languages
Chinese (zh)
Other versions
CN112233649A (en)
Inventor
王兆育
苏文畅
国丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Tingjian Technology Co ltd
Original Assignee
Anhui Tingjian Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Tingjian Technology Co ltd filed Critical Anhui Tingjian Technology Co ltd
Priority to CN202011105784.6A priority Critical patent/CN112233649B/en
Publication of CN112233649A publication Critical patent/CN112233649A/en
Application granted granted Critical
Publication of CN112233649B publication Critical patent/CN112233649B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 - Voice editing, e.g. manipulating the voice of the synthesiser

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method, a device and equipment for dynamically synthesizing machine simultaneous interpretation output audio. Starting from the current simultaneous interpretation scene, the method first determines, based on set rules, whether the speech speed of the synthesized audio needs adjustment. When intervention is deemed necessary, the duration of the current acoustic sentence and the expected duration of its synthesized translation are obtained in real time, and the difference between the two as well as the accumulated time difference over the current interpretation session is computed. The relation between the current time difference and/or the accumulated time difference and the corresponding preset tolerances is then examined so as to dynamically adjust the translation strategy and/or determine a speech speed adjustment gain parameter. This covers two adjustment factors: directly intervening in the translated text, or leaving the text untouched and attaching a speech speed change coefficient to it. Finally, speech synthesis of the interpretation audio is completed using these different adjustment factors. The invention realizes dynamic speech speed adjustment of the current output audio, alleviates the machine simultaneous interpretation delay, and effectively improves the machine interpretation output effect.

Description

Method, device and equipment for dynamically synthesizing simultaneous interpretation output audio of machine
Technical Field
The invention relates to the field of simultaneous interpretation, in particular to a method, a device and equipment for dynamically synthesizing simultaneous interpretation output audio of a machine.
Background
Against the background of economic globalization, international and multilingual communication is frequent, and simultaneous interpretation is today a popular mode of translation for international conferences. Demand for simultaneous interpretation at international conferences and at large formal and informal meetings is huge, while the professional and academic requirements on practitioners are high, output efficiency is relatively low, and the workload is heavy, so manual simultaneous interpretation is expensive and mismatched with market supply and demand. In addition, because of the high entry threshold, senior simultaneous interpreters need a solid language foundation, mature conference experience and the like, and are difficult to train quickly in a short period.
With the continuous development of intelligent speech technology, speech recognition and machine translation have made great progress, so the field is now trying to build machine simultaneous interpretation systems from speech transcription, machine translation and speech synthesis technology to address the severe shortage of high-level human interpreters in the current market.
However, simultaneous interpretation applications built on speech recognition, machine translation and speech synthesis still face many urgent problems for the audiences of the numerous conference scenarios; for example, speech recognition is easily affected by the environment, so recognition accuracy is low and the machine translation result is consequently unsatisfactory.
In addition, the synthesized speech of the machine translation is not smooth and can be intermittent when output as audio. In particular, the invention focuses on the fact that a large delay arises between the synthesized interpretation audio and the original speech at the conference site. This delay accumulates with the length of the talk; when it becomes obvious, listeners of the interpretation can hardly respond to the speaker in time, the speaker's content and emotion become disconnected from the audience's reactions and interaction, the effect of the conference speech is greatly reduced, and participants struggle to obtain a good listening experience.
Disclosure of Invention
In view of the foregoing, the present invention provides a method, an apparatus and a device for dynamically synthesizing machine simultaneous interpretation output audio, together with a corresponding computer readable storage medium and computer program product, which dynamically adjust and synthesize the translation output audio for the specific problems of specific simultaneous interpretation scenes, so as to improve the machine interpretation output effect.
The technical scheme adopted by the invention is as follows:
in a first aspect, the present invention provides a method for dynamically synthesizing machine simultaneous interpretation output audio, comprising:
judging whether the speech speed of the synthesized audio needs to be adjusted according to the acquired current simultaneous interpretation scene information and the established rules;
if so, acquiring a first duration of the current original sound sentence in real time, and estimating a second duration of the audio synthesized from the translated text corresponding to the current original sound sentence;
calculating the time difference between the first duration and the second duration, and counting the accumulated time difference of the current interpretation stage;
adjusting a translation strategy in real time and/or determining a speech speed adjustment gain parameter according to the relation between the time difference and/or the accumulated time difference and the corresponding preset tolerance;
and performing speech synthesis processing according to the translated text after translation strategy adjustment and/or the speech speed adjustment gain parameter.
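The claimed steps can be sketched as a small per-sentence control loop. The following is an illustration only: the characters-per-second duration estimator, the tolerance value, and all names (`process_sentence`, `chars_per_second`) are assumptions, not part of the claimed method:

```python
def process_sentence(acoustic_duration_s, translation_text, state,
                     tolerance_s=2.0, chars_per_second=14.0):
    """One pass over a single sentence (sketch of the claimed steps)."""
    # Estimate the "second duration" of the synthesized translation audio
    # with a crude characters-per-second heuristic (illustrative only).
    estimated_s = len(translation_text) / chars_per_second
    # Per-sentence time difference, and the accumulated difference so far.
    diff = estimated_s - acoustic_duration_s
    state["accumulated"] = state.get("accumulated", 0.0) + diff
    if state["accumulated"] <= tolerance_s:
        gain = 1.0  # within tolerance: leave the speech speed unchanged
    else:
        # Speed up so this sentence's audio matches the acoustic duration.
        gain = estimated_s / acoustic_duration_s
    return diff, state["accumulated"], gain
```

Attaching `gain` to the current translated text as a speech speed change coefficient, rather than editing the text, corresponds to the second of the two adjustment means described in the method.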
In at least one possible implementation, the current simultaneous interpretation scene information includes one or more of the following:
the current original language type and translation direction;
Personalized information of the speaker; and
Pronunciation level features of the current acoustic statement.
In at least one possible implementation, the adjusting the translation strategy includes:
performing secondary translation on the current original sound sentence, where the length of the translated text after the secondary translation is changed by using words and/or grammar different from the previous translated text.
In at least one possible implementation manner, the determining the speech rate adjustment gain parameter includes:
corresponding corpus is collected in advance according to the simultaneous interpretation scene;
performing delay analysis by using the corpus, and determining a preset gain parameter;
and selecting the preset gain parameter or the proportional relation as the speech speed adjustment gain parameter based on the proportional relation between the time difference and/or the accumulated time difference and the preset tolerance.
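One plausible way to derive the preset gain parameter from the pre-collected corpus is to pick the multiple t that makes the total durations coincide; the text does not fix the exact objective, so this closed-form choice is only a sketch:

```python
def preset_gain(speak_times, synth_times):
    """Optimal speed-up multiple t over a corpus (sketch): dividing
    every synthesized duration by t = total_synth / total_speak makes
    the total synthesized audio time equal the total speaking time."""
    return sum(synth_times) / sum(speak_times)
```

For example, a corpus where every 10-second utterance synthesizes to 15 seconds of translation audio yields t = 1.5.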
In at least one possible implementation, the method further includes:
determining pronunciation adjustment parameters of the current output audio according to the speech speed adjustment gain parameter, or according to the speech speed adjustment gain parameter and the current simultaneous interpretation scene information;
and performing speech synthesis on the current translated text in combination with the pronunciation adjustment parameters.
In at least one possible implementation, determining the pronunciation adjustment parameters of the current output audio according to the current simultaneous interpretation scene information includes:
obtaining a loudness adjustment parameter of the current output audio according to the loudness of the current original sound sentence; and/or
performing emotion analysis on the current translated text in combination with the current original sound sentence;
and obtaining a mood adjustment parameter of the current output audio according to the emotion analysis result.
In at least one possible implementation, obtaining the loudness adjustment parameter of the current output audio according to the loudness of the current original sound sentence includes:
continuously obtaining an original volume value for each audio frame of the current original sound sentence;
calculating a volume difference value for each audio frame based on the original volume values;
and determining the volume adjustment parameter of the current output audio by combining the preset default loudness of the current output audio with the volume difference values.
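The three listed steps might look as follows; using the sentence's mean frame volume as the reference for the per-frame difference is an assumption, since the claim does not name the reference:

```python
def loudness_adjustment(frame_volumes_db, default_loudness_db):
    """Map each frame's deviation from the sentence's mean volume onto
    the preset default loudness of the output audio (sketch)."""
    mean_db = sum(frame_volumes_db) / len(frame_volumes_db)
    # Volume difference value of each audio frame.
    diffs = [v - mean_db for v in frame_volumes_db]
    # Default output loudness offset by each frame's difference.
    return [default_loudness_db + d for d in diffs]
```

A sentence whose frames sit at 60 dB and 70 dB, against a default output loudness of 65 dB, keeps the same 10 dB swing in the output.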
In a second aspect, the present invention provides a machine simultaneous interpretation output audio dynamic synthesis apparatus, comprising:
the speech speed adjustment decision module, used for judging whether the speech speed of the synthesized audio needs to be adjusted according to the acquired current simultaneous interpretation scene information and the established rules;
the audio duration calculation module, used for acquiring the first duration of the current original sound sentence in real time and, when the output of the speech speed adjustment decision module is yes, estimating the second duration of the audio synthesized from the translated text corresponding to the current original sound sentence;
the delay calculation module, used for calculating the time difference between the first duration and the second duration and counting the accumulated time difference of the current interpretation stage;
the speech speed adjustment parameter determining module, used for adjusting a translation strategy in real time and/or determining a speech speed adjustment gain parameter according to the relation between the time difference and/or the accumulated time difference and the corresponding preset tolerance;
and the speech synthesis module, used for performing speech synthesis processing according to the translated text after translation strategy adjustment and/or the speech speed adjustment gain parameter.
In at least one possible implementation, the current simultaneous interpretation scene information includes one or more of the following:
the current original language type and translation direction;
Personalized information of the speaker; and
Pronunciation level features of the current acoustic statement.
In at least one possible implementation, the speech speed adjustment parameter determining module includes a translation strategy adjusting unit, which specifically includes:
the secondary translation component, used for performing secondary translation on the current original sound sentence, where the length of the translated text after the secondary translation is changed by using words and/or grammar different from the previous translated text.
In at least one possible implementation, the speech speed adjustment parameter determining module specifically includes:
the corpus acquisition unit, used for gathering corresponding corpora in advance according to the simultaneous interpretation scene;
the gain parameter unit, used for performing delay analysis with the corpora and determining a preset gain parameter;
and the speech speed adjustment parameter selection unit, used for selecting the preset gain parameter or the proportional relation as the speech speed adjustment gain parameter based on the proportional relation between the time difference and/or the accumulated time difference and the preset tolerance.
In at least one possible implementation, the apparatus further includes:
the pronunciation adjustment parameter determining module, used for determining pronunciation adjustment parameters of the current output audio according to the speech speed adjustment gain parameter, or according to the speech speed adjustment gain parameter and the current simultaneous interpretation scene information;
the speech synthesis module being further used for synthesizing the current translated text in combination with the pronunciation adjustment parameters.
In at least one possible implementation manner, the pronunciation adjustment parameter determining module includes:
the loudness adjusting unit, used for obtaining the loudness adjustment parameter of the current output audio according to the loudness of the current original sound sentence; and/or
the mood adjusting unit, which specifically includes:
the emotion analysis component, used for performing emotion analysis on the current translated text in combination with the current original sound sentence;
and the mood adjustment parameter acquisition component, used for obtaining the mood adjustment parameter of the current output audio according to the emotion analysis result.
In at least one possible implementation, the loudness adjusting unit includes:
the original volume acquisition component, used for continuously acquiring an original volume value for each audio frame of the current original sound sentence;
the volume difference calculation component, used for calculating a volume difference value for each audio frame based on the original volume values;
and the volume adjustment parameter determining component, used for determining the volume adjustment parameter of the current output audio by combining the preset default loudness of the current output audio with the volume difference values.
In a third aspect, the present invention provides a machine simultaneous interpretation output audio dynamic synthesis apparatus comprising:
one or more processors, a memory and one or more computer programs, wherein the memory may employ a non-volatile storage medium, the one or more computer programs are stored in the memory, and the one or more computer programs comprise instructions which, when executed by the device, cause the device to perform the method of the first aspect or any possible implementation of the first aspect.
In a fourth aspect, the present invention provides a computer readable storage medium having a computer program stored therein, which when run on a computer causes the computer to perform at least the method as in the first aspect or any of the possible implementations of the first aspect.
In a fifth aspect, the invention also provides a computer program product for performing at least the method of the first aspect or any of the possible implementations of the first aspect, when the computer program product is executed by a computer.
In at least one possible implementation manner of the fifth aspect, the relevant program related to the product may be stored in whole or in part on a memory packaged with the processor, or may be stored in part or in whole on a storage medium not packaged with the processor.
The core of the invention is to introduce a speech speed adjustment factor when synthesizing audio, so as to shorten as much as possible the delay between the machine-synthesized interpretation audio and the original speech. The specific scheme is as follows: starting from the current simultaneous interpretation scene, whether the speech speed of the synthesized audio needs to be adjusted is first determined based on set rules. When intervention is considered necessary, the duration of the current original sound sentence and the expected duration of the corresponding translated text are obtained in real time, and the difference between the two as well as the accumulated time difference up to the current stage is computed. The relation between the time difference and/or the accumulated time difference of the current sentence and the respective preset tolerances is then examined, so as to dynamically adjust the translation strategy and/or determine the speech speed adjustment gain parameter. Two adjustment means are included: one directly intervenes in the translated text; the other leaves the translated text untouched and attaches a speech speed change coefficient to the current translated text. Finally, speech synthesis processing is performed according to these two different means. The invention combines specific scene information as the condition for triggering adjustment of the synthesized audio and realizes dynamic speech speed adjustment of the current output audio through intervention at different levels, thereby alleviating the machine simultaneous interpretation delay and effectively improving the output effect.
Further, considering the listening experience of the participants, other embodiments of the present invention also adjust the output audio at the pronunciation level; on the one hand this yields an expression closer to the original speech, and on the other hand it assists the speech speed adjustment strategy by mitigating the poor listening experience that speech speed adjustment may cause.
Drawings
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of an embodiment of a method for dynamic synthesis of machine simultaneous interpretation output audio provided by the present invention;
FIG. 2 is a flow chart of an embodiment of a method of obtaining loudness adjustment parameters provided by the present invention;
FIG. 3 is a block diagram of an embodiment of a machine simultaneous interpretation output audio dynamic synthesis device provided by the present invention;
Fig. 4 is a schematic diagram of an embodiment of a machine simultaneous interpretation output audio dynamic synthesis device provided by the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
Before describing the specific scheme, the design process of the invention is first introduced. As described above, machine simultaneous interpretation is limited by the interpretation scene and the technologies it adopts, and the delay problem is obvious. The simultaneous interpretation scene is characterized by translated audio being output in real time while the speaker continuously produces sentences, rather than consecutive interpretation in which each sentence is translated after it is finished; this is where the delay arises. The inventors' analysis attributes the more significant delay of the synthesized output audio mainly to language differences, the speaker's personal characteristics, and the mode of expression. The difference between languages is the critical factor: different languages have different verbosity. When the speaker's speech is transcribed into an original text and the machine translates it into a translated text, the length of the translated text is usually inconsistent with that of the original, so the duration of the synthesized translation audio differs from that of the original sentence. For example, a Chinese sentence that takes 10 seconds to read may take about 15 seconds as machine-synthesized English audio at a similar rhythm, and this deviation accumulates over time. After accessing a machine interpretation system, users are therefore likely to find it hard to keep up with the speaker's rhythm, which is clearly an unfavorable experience for both speaker and listeners.
For example, suppose the current original sound sentence is "I want to first clarify one point and then speak about the next part", and the Chinese sentence takes 4 seconds at a normal, natural speech speed, while the English translation is "I would like to make one point clear before I move on to the next point", whose synthesized audio output takes 6 seconds. There is thus a 2-second difference (called the time difference in the invention) between the original audio and the target translation audio, and by the Nth sentence this offset between original sentences and their synthesized translations is obviously amplified as the speech continues and time passes.
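The arithmetic of this example, and how the offset grows with the number of sentences, can be checked directly (a trivial sketch; durations in seconds):

```python
def accumulated_delay(acoustic_durations, synth_durations):
    """Sum of per-sentence time differences over the sentences so far."""
    return sum(s - v for v, s in zip(acoustic_durations, synth_durations))

# The 4 s Chinese sentence whose English synthesis takes 6 s leaves a
# 2 s time difference; three such sentences already accumulate 6 s.
```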
From this analysis, the inventors expect that keeping the length of the machine-synthesized output audio as consistent as possible with (or relatively close to) the length of the speaker's original pronunciation can mitigate the delay effect to a large extent.
In view of this, the concept of the present invention is to match, within a limited time, the speech speed of the synthesized translation audio with that of the original pronunciation audio in the simultaneous interpretation scene. Specifically, an embodiment of a machine simultaneous interpretation output audio dynamic synthesis method is provided which, as shown in fig. 1, may include the following steps:
S1, judging whether the speech speed of the synthesized audio needs to be adjusted according to the acquired current simultaneous interpretation scene information and the established rules.
This embodiment starts from the specific situation of the current machine interpretation scene and first evaluates, as a whole, whether the speech speed of the synthesized audio needs intervention. This step is designed because, as mentioned above, the method analyzed the various factors causing the delay problem at the outset; an automatic decision can therefore be made from the real scene first, and if no intervention is needed, unnecessary operations that might themselves produce unexpected delay effects are avoided. The specific current machine interpretation scene information may be determined according to actual needs, mainly around the foregoing aspects; for example, as set forth in some preferred embodiments of the present invention, the current simultaneous interpretation scene information includes one or more of the following:
(1) The current original language type and translation direction.
The original language and target language in the interpretation scene determine the difference in verbosity. Under the same language family and branch, the verbosity of the original language is close to that of the target language, so no obvious difference in pronunciation duration arises; translating among languages such as English, French and German therefore causes little time difference. Translation between languages of different families, however, may result in significant time differences, such as, but not limited to, the Chinese-English interpretation scene mentioned above.
(2) Personalized information of the speaker.
Different speakers have personalized characteristics such as age, education, social status, gender, speech style, gestures and expressions, which differentiate the interpretation effect. For example, an elderly speaker may speak slowly and accompany the speech with gestures, leaving the machine sufficient translation and synthesis time so that the machine's audio output can keep pace with the speaker's speech speed.
(3) Pronunciation level features of the current acoustic statement.
From the acoustic perspective, pronunciation level features may refer to prosody, intonation, emotion, timbre, pauses and so on. Different pronunciation features also cause the original pronunciation duration to differ; for example, an original sound sentence may lengthen its final syllables, or contain more and longer pauses between sentences, so that the duration of the machine-synthesized audio can remain similar to that of the original sound sentence.
Before deciding to intervene in the synthesized speech speed, the method thus starts from the overall factors of the current interpretation scene and determines, against established criteria (such as the relation between the two current languages, the speaker's personal information, the acoustic characteristics of the current sentence, etc.), whether to continue. If this step judges that the current interpretation scene requires speech speed intervention, the following processing can be executed:
S2, acquiring a first duration of the current original sound sentence in real time, and estimating a second duration of the audio synthesized from the translated text corresponding to the current original sound sentence;
S3, calculating the time difference between the first duration and the second duration, and counting the accumulated time difference of the current interpretation stage.
The durations and the time difference can be calculated with reference to various prior arts; for example, from the time stamps of the current original sound sentence (including its start and end times) the first duration can be computed, and the second duration can be obtained in a similar manner. Note, however, that this embodiment proposes estimating the duration of the synthesized audio of the translated text: the second duration is not the duration of the final output audio, but a predicted initial duration of audio synthesized from the current translated text. The prediction can be made in several ways in practice, for example by synthesizing an initial stretch of speech and computing its end point, or by using a pre-trained duration prediction model that takes features such as the language type and the current translated text as input and directly outputs the initial duration.
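Of the two prediction routes, the duration-model route can be crudely approximated with a per-language average speaking rate; the rate table below is illustrative, not measured data, and the function name is an assumption:

```python
def estimate_second_duration(translation_text, lang, chars_per_second=None):
    """Predict the initial duration of audio synthesized from the
    current translated text using assumed average speaking rates."""
    rates = chars_per_second or {"en": 14.0, "zh": 4.0}  # assumed values
    return len(translation_text) / rates[lang]
```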
Regarding the calculation of the time difference and the statistics of the accumulated time difference, let the first duration of the current Nth original sound sentence be V_N and the estimated second duration of its translation be S_N. The time difference of the current Nth sentence is then:

S_N - V_N

Without any adjustment, the time taken by the speaker for the first N-1 sentences is:

V_1 + V_2 + ... + V_(N-1)

and the output audio time of the first N-1 translated sentences after machine synthesis is:

S_1 + S_2 + ... + S_(N-1)

The offset between the two, i.e. the accumulated time difference, is:

(S_1 - V_1) + (S_2 - V_2) + ... + (S_(N-1) - V_(N-1))
S4, adjusting the translation strategy in real time and/or determining the speech speed adjustment gain parameter according to the relation between the time difference and/or the accumulated time difference and the corresponding preset tolerance.
For example, based on different application scenarios and experience, a tolerance σ is preset for the time offset (a duration deviation within this range is considered reasonable; note that separate tolerances can be set for the accumulated time difference and the per-sentence time difference). Two relations are then possible (the accumulated time difference is taken as the example here; the per-sentence time difference is treated analogously and not repeated):

(1) the accumulated time difference is less than or equal to σ;

(2) the accumulated time difference is greater than σ.
In addition, as means of adjusting the speech speed, this embodiment essentially proposes two ways of different dimensions: one directly intervenes in the translated text; the other leaves the translated text untouched and attaches a speech speed change coefficient to the current translated text.
In particular, adjusting the translation strategy means changing the generated current translated text. For example, the current original sound sentence may be translated a second time, preferably using words and/or grammar different from the previous translated text so that the length of the retranslated text changes; put simply, the secondary translation changes the verbosity of the translation. In actual operation the translated content may be shortened or expanded according to the needs of the interpretation scene, and the translated text may be modified in whole or in part, for example by rendering a proper noun in shorthand form, which the invention does not limit. It should be pointed out that the secondary translation can be triggered based on the relations above; for example, when the accumulated time difference is greater than σ, the secondary translation is started and a new translated text is obtained according to a different translation strategy.
As for determining the speech speed adjustment gain parameter, it is understood as a factor (multiple, ratio value, etc.) by which the speech speed of the output audio of the current translation is adjusted. In actual operation, different corpora can be built for different scenes to perform test training and delay analysis and to find the optimal multiple t (the preset gain parameter), where optimality means that the second time length becomes as close as possible to the first time length. For example, based on a simultaneous interpretation scene, N corpora are collected; the speaking time of the speaker in each corpus is T_i, and the synthesized audio time of the corresponding translation is S_i. The t value satisfying the following objective function is then the best speech speed multiple for synthesized-speech adjustment in that scene:
min_t Σ_{i=1}^{N} |T_i − S_i / t|
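Under the assumption that speeding playback up by a factor t shortens a synthesized duration S_i to S_i/t, the objective can be minimized by a simple search. The sketch below uses a hypothetical grid of candidate multiples and made-up corpus durations; it only illustrates the objective, not the patent's training procedure:

```python
def best_speed_multiple(speaker_times, synth_times, candidates):
    """Return the t among `candidates` minimizing sum_i |T_i - S_i / t|."""
    def cost(t):
        return sum(abs(T - S / t) for T, S in zip(speaker_times, synth_times))
    return min(candidates, key=cost)

# Hypothetical corpus: each translation synthesizes ~25% longer than the speech.
T = [10.0, 8.0, 12.0]   # speaking time T_i of each corpus item (seconds)
S = [12.5, 10.0, 15.0]  # synthesized audio time S_i of each translation (seconds)
t = best_speed_multiple(T, S, [1 + 0.05 * k for k in range(11)])
assert abs(t - 1.25) < 1e-6  # S_i / 1.25 matches T_i exactly in this toy corpus
```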
The reason the optimal t value is referred to as a preset gain parameter is that the final speech speed gain still needs to be selected in combination with the relation between the time difference or the accumulated time difference and the tolerance. For example, when the accumulated time difference is less than or equal to σ (the time difference case is the same and is not repeated), 1 can be used as the final speech speed gain, indicating that the speech speed is left unchanged. If the accumulated time difference is greater than σ, either the preset gain parameter or a proportional relation can be selected as the final speech speed adjustment gain parameter, based on the proportional relation between the accumulated time difference and the preset tolerance.
The proportional relation may be a multiple relation between the actual deviation and the ideal deviation. When selecting between the two, if the ratio of the actual deviation to the ideal deviation exceeds the previously obtained optimal multiple t (the preset gain parameter), that ratio may be used as the final speech speed adjustment gain parameter; if the ratio is less than or equal to the preset gain parameter, the preset gain parameter may be used as the final speech speed adjustment gain parameter. Of course, the rule for determining the final speech speed adjustment gain parameter may be set as required, and in some embodiments may be the opposite of the selection described above; the invention is not limited in this respect. In the subsequent synthesis operation, the final speech speed adjustment gain parameter may also act on the whole or a part of the current translation; for example, the speech speed adjustment parameter may be attached only to the first half or the second half of the sentence "I would like to make one point clear before I move on to the next point".
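The selection rule described above can be sketched as follows. All names are hypothetical, and (as the text notes) some embodiments may use the opposite rule; this is one possible reading, not the definitive implementation:

```python
def final_speed_gain(accumulated_diff, sigma, preset_gain, ideal_deviation):
    """Pick the final speech speed adjustment gain parameter.

    accumulated_diff <= sigma : no change (gain 1).
    accumulated_diff >  sigma : use the actual/ideal deviation ratio if it
                                exceeds the preset gain parameter t,
                                otherwise the preset gain parameter itself.
    """
    if accumulated_diff <= sigma:
        return 1.0
    ratio = accumulated_diff / ideal_deviation  # actual vs. ideal deviation
    return ratio if ratio > preset_gain else preset_gain

# Within tolerance: speech speed left unchanged.
assert final_speed_gain(0.5, sigma=1.0, preset_gain=1.25, ideal_deviation=1.0) == 1.0
# Ratio (2.0) exceeds the preset gain (1.25): the ratio wins.
assert final_speed_gain(2.0, sigma=1.0, preset_gain=1.25, ideal_deviation=1.0) == 2.0
# Ratio (1.1) below the preset gain: the preset gain wins.
assert final_speed_gain(1.1, sigma=1.0, preset_gain=1.25, ideal_deviation=1.0) == 1.25
```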
And finally, step S5: speech synthesis processing is performed according to the translated text after the translation strategy adjustment and/or the speech speed adjustment gain parameter.
The synthesis method used here follows the prior art; the emphasis is that the object of synthesis may be the new translated text obtained after the secondary translation, and/or that the speech speed adjustment gain parameter may be incorporated during the synthesis operation. When the synthesis operation is completed, the synthesized audio can be played to the listening terminals of the participating users of the simultaneous interpretation; this is not the focus of the invention and is not repeated.
It should be noted that, based on the foregoing embodiments and in order to further improve the listening experience, the present invention may further incorporate pronunciation adjustment parameters in the synthesis stage of the output audio. That is, in a preferred embodiment of the present invention, the foregoing synthesis method may further include: determining the pronunciation adjustment parameters of the current output audio according to the speech speed adjustment gain parameter, or according to the speech speed adjustment gain parameter together with the current simultaneous interpretation scene information, and combining the pronunciation adjustment parameters when synthesizing the current translated text (including the new translated text), so that the synthesized output audio not only has reduced delay but is also more natural in acoustic performance and closer to the pronunciation characteristics of the speaker.
The difference in naturalness between machine synthesis and real speech is well known in the art, but the aim of this preferred solution is not only to improve the effect of the synthesized audio but also to tie into the previously mentioned handling of delay. The inventors observed that, for synthesized audio after speech speed adjustment, the unnatural listening impression already present in machine synthesis can be prominently amplified (for example, speech that is excessively fast or slow, especially when combined with grammatical differences between language types, or with the personalized information of the speaker or various differences in the acoustic characteristics of pronunciation), and therefore designed this solution.
Specifically, the speech speed adjustment gain parameter obtained in the foregoing embodiments may be used, or the speech speed adjustment gain parameter may be combined with the current simultaneous interpretation scene information; two aspects of the current simultaneous interpretation scene information are described below as illustrations.
(1) The loudness adjustment parameter of the current output audio may be obtained from the loudness of the current original sound sentence.
For example, the speaker may emphasize key points in the original speech with a stronger tone and raised volume, so that by using such personalized information of the speaker, expression closer to the original sound sentence before translation can be obtained. Specifically, in connection with the embodiment shown in fig. 2, obtaining the loudness adjustment parameter of the current output audio from the loudness of the current original sound sentence includes:
Step S10, continuously obtaining an original volume value of each audio frame of the current original sound sentence;
step S20, calculating the volume difference value of each audio frame based on the original volume value;
step S30, determining the volume adjustment parameter of the current output audio by combining the preset default loudness of the current output audio and the volume difference value.
In actual operation, for example, the volume of each frame of the current original sound sentence may be continuously obtained through an existing volume calculation method and represented by d0, d1, …, dn; the difference between each frame and the previous frame is then calculated as 0, d1−d0, d2−d1, …, dn−d(n−1).
Machine-synthesized audio usually presets an original default volume value according to normal conditions. Assuming this default volume is represented by s, the volume adjustment parameters of the current output audio may be determined as: s, s+(d1−d0), s+(d2−d1), …, s+(dn−d(n−1)). That is, each frame of synthesized audio is given an updated volume value, so that by intervening at the frame level the synthesized audio stays relatively consistent with the actual volume, and the output audio conforms relatively more closely to the volume variation of the speaker.
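The frame-level volume update can be sketched as follows. The function name is hypothetical, and the per-frame loudness measurement is assumed to come from an existing volume-calculation routine:

```python
def frame_volumes(original_volumes, default_volume):
    """Give each synthesized frame the default volume s shifted by the
    frame-to-frame difference of the original speech: s + (d_i - d_{i-1}).
    The first frame keeps the default volume (difference 0)."""
    diffs = [0.0] + [b - a for a, b in zip(original_volumes, original_volumes[1:])]
    return [default_volume + d for d in diffs]

d = [60.0, 62.0, 61.0, 65.0]  # measured per-frame volume of the original sentence
assert frame_volumes(d, 55.0) == [55.0, 57.0, 54.0, 59.0]
```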
(2) Emotion analysis may be carried out on the current translated text in combination with the current original sound sentence, and the mood adjustment parameters of the current output audio are obtained from the emotion analysis result.
Besides the volume adjustment dimension, the synthesis effect of the translated text can be dynamically adjusted by combining existing emotion analysis and keyword extraction techniques. Specifically, the emotion type of a sentence can be obtained after emotion analysis, and the keywords of the sentence are extracted using a keyword extraction technique. To improve the audience's listening experience, the mood adjustment parameters may be, but are not limited to being, marked with pitch and stress as follows: 1. if the emotion is positive, appropriately raise the pitch of the synthesized speech; 2. if the emotion is negative, appropriately lower the pitch of the synthesized speech; 3. add a stress mark to the extracted keywords to indicate that the corresponding position of the translation is emphasized.
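A sketch of producing such mood marks is given below. All names and the pitch step are hypothetical, and the emotion analysis and keyword extraction themselves are assumed to be provided by existing components:

```python
def mood_marks(emotion, keywords, tokens, pitch_step=0.1):
    """Return (pitch_shift, stressed_flags) for a translated sentence.
    Positive emotion raises the pitch, negative emotion lowers it;
    tokens matching an extracted keyword receive a stress mark."""
    if emotion == "positive":
        pitch = pitch_step
    elif emotion == "negative":
        pitch = -pitch_step
    else:
        pitch = 0.0
    stressed = [tok in keywords for tok in tokens]
    return pitch, stressed

pitch, stressed = mood_marks("positive", {"point"},
                             ["make", "one", "point", "clear"])
assert pitch == 0.1
assert stressed == [False, False, True, False]
```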
Finally, the mood adjustment parameters and the aforementioned loudness adjustment parameters act together in the synthesis processing stage, compensating and improving the listening effect of the output audio.
In summary, the concept of the present invention is to introduce a speech speed adjustment factor into the audio synthesis process so as to shorten as much as possible the delay between the synthesized audio and the original speech in machine simultaneous interpretation. Specifically, the invention first judges, based on a predetermined rule, whether the speech speed of the synthesized audio needs to be adjusted. When intervention is considered necessary, it acquires in real time the duration of the current original sound sentence and the expected duration of the corresponding translated text, and calculates the difference between the two as well as the accumulated time difference of the simultaneous interpretation up to the current stage. It then examines the relation between the time difference and/or the accumulated time difference of the current sentence and the corresponding preset tolerance, thereby dynamically adjusting the translation strategy and/or determining the speech speed adjustment gain factor. The invention uses specific scene information as the condition for triggering adjustment of the synthesized audio and realizes dynamic speech speed adjustment of the current output audio through intervention at different levels, thereby alleviating the delay problem of machine simultaneous interpretation and effectively improving its output effect.
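The overall flow summarized above can be condensed into a minimal end-to-end sketch. All names are hypothetical and each stage merely stands in for the corresponding step of the method:

```python
def process_sentence(orig_duration, est_synth_duration, accumulated_diff,
                     sigma, preset_gain):
    """Steps S3-S4 in miniature: compute the per-sentence time difference,
    update the accumulated time difference, and pick a speech speed gain."""
    diff = est_synth_duration - orig_duration  # step S3: per-sentence difference
    accumulated_diff += diff                   # running total for this session
    if accumulated_diff <= sigma:              # step S4: within tolerance
        gain = 1.0
    else:                                      # step S4: intervene
        gain = max(preset_gain, accumulated_diff / sigma)
    return accumulated_diff, gain

# Translation synthesizes 2 s longer than the 10 s original; tolerance is 1 s.
acc, gain = process_sentence(10.0, 12.0, 0.0, sigma=1.0, preset_gain=1.25)
assert acc == 2.0 and gain == 2.0
```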
Corresponding to the above embodiments and preferred solutions, the present invention further provides an embodiment of a machine simultaneous interpretation output audio dynamic synthesis device, as shown in fig. 3, which may specifically include the following components:
The speech speed adjusting decision module 1 is used for judging whether the speech speed of the synthesized audio needs to be adjusted according to the acquired current simultaneous interpretation scene information and the established rules;
the audio time length calculation module 2 is used for acquiring the first time length of the current original sound sentence in real time and estimating the second time length of the synthesized audio of the translated text corresponding to the current original sound sentence, when the output of the speech speed adjusting decision module is yes;
The delay calculation module 3 is used for calculating the time difference between the first time length and the second time length and counting the accumulated time difference of the current simultaneous transmission stage;
The speech speed adjusting parameter determining module 4 is used for adjusting the translation strategy in real time and/or determining the speech speed adjustment gain parameter according to the relation between the time difference and/or the accumulated time difference and the corresponding preset tolerance;
And the voice synthesis module 5 is used for carrying out voice synthesis processing according to the translated text after the translation strategy adjustment and/or the speech speed adjustment gain parameter.
In at least one possible implementation thereof, the current simultaneous interpretation scene information includes one or more of:
the current original language type and translation direction;
Personalized information of the speaker; and
Pronunciation level features of the current acoustic statement.
In at least one possible implementation manner, the speech speed adjusting parameter determining module includes a translation strategy adjusting unit, which specifically comprises:
The secondary translation component is used for carrying out secondary translation on the current original sound sentence: the length of the translated text after the second translation is changed by using words and/or grammar that are different from the previous translated text.
In at least one possible implementation manner, the speech speed adjusting parameter determining module specifically includes:
the corpus acquisition unit is used for gathering corresponding corpus in advance according to the simultaneous interpretation scene;
The gain parameter unit is used for carrying out delay analysis by utilizing the corpus and determining a preset gain parameter;
And the speech speed adjustment parameter selection unit is used for selecting the preset gain parameter or the proportional relation as the speech speed adjustment gain parameter based on the proportional relation between the time difference and/or the accumulated time difference and the preset tolerance.
In at least one possible implementation manner, the apparatus further includes:
The pronunciation adjustment parameter determining module is used for determining the pronunciation adjustment parameters of the current output audio according to the speech speed adjustment gain parameter, or according to the speech speed adjustment gain parameter and the current simultaneous interpretation scene information;
The voice synthesis module is also used for synthesizing the current translation text by combining the pronunciation adjustment parameters.
In at least one possible implementation manner, the pronunciation adjustment parameter determining module includes:
the loudness adjusting unit is used for obtaining loudness adjusting parameters of the current output audio according to the loudness of the current original sound statement; and/or
The mood adjusting unit, the mood adjusting unit specifically includes:
the emotion analysis component is used for carrying out emotion analysis on the text of the current translation by combining the current original sound sentence;
and the mood adjusting parameter acquisition component is used for acquiring mood adjusting parameters of the current output audio according to the emotion analysis result.
In at least one possible implementation thereof, the loudness adjustment unit includes:
An original sound volume acquisition component for continuously acquiring an original sound volume value of each audio frame of the current original sound sentence;
an original sound volume difference calculation component for calculating a sound volume difference value of each of the audio frames based on the original sound volume value;
And the volume adjustment parameter determining component is used for determining the volume adjustment parameter of the current output audio by combining the preset default loudness of the current output audio and the volume difference value.
It should be understood that the division of the components in the machine simultaneous interpretation output audio dynamic synthesis apparatus shown in fig. 3 is merely a division of logical functions; in actual implementation they may be fully or partially integrated into one physical entity or physically separated. These components may all be implemented as software invoked by a processing element, all in hardware, or partly as software invoked by a processing element and partly in hardware. For example, some of the above modules may be separately established processing elements or may be integrated in a chip of the electronic device, and the other components are implemented similarly. In addition, all or part of the components can be integrated together or implemented independently. In implementation, each step of the above method or each of the above components may be completed by an integrated logic circuit of hardware in a processor element or by instructions in the form of software.
For example, the above components may be one or more integrated circuits configured to implement the above methods, such as one or more Application Specific Integrated Circuits (ASICs), one or more Digital Signal Processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs). As another example, these components may be integrated together and implemented in the form of a System-on-a-Chip (SoC).
In view of the foregoing examples and their preferred embodiments, those skilled in the art will appreciate that in practice the present invention is applicable to a variety of embodiments, and the present invention is schematically illustrated by the following carriers:
(1) A machine simultaneous interpretation output audio dynamic synthesis device may include:
One or more processors, memory, and one or more computer programs, wherein the one or more computer programs are stored in the memory, the one or more computer programs comprising instructions, which when executed by the device, cause the device to perform the steps/functions of the foregoing embodiments or equivalent implementations.
Fig. 4 is a schematic structural diagram of an embodiment of a machine simultaneous interpretation output audio dynamic synthesis device provided by the present invention, where the device may be an electronic device or a circuit device built in the electronic device. The electronic device may be a PC, a server, a translator, a recording pen, a mobile intelligent terminal (mobile phone, tablet, reader, watch, bracelet, glasses, etc.), a microphone, a headset, an earphone, etc. The specific form of the machine simultaneous interpretation output audio dynamic synthesis device in this embodiment may not be limited.
As particularly shown in fig. 4, the machine simultaneous interpretation output audio dynamic synthesis apparatus 900 includes a processor 910 and a memory 930. The processor 910 and the memory 930 may communicate with each other via an internal connection path to transfer control and/or data signals; the memory 930 is configured to store a computer program, and the processor 910 is configured to call and execute the computer program from the memory 930. The processor 910 and the memory 930 may be combined into a single processing device or, more commonly, be components independent of each other, with the processor 910 executing the program code stored in the memory 930 to implement the functions described above. In a particular implementation, the memory 930 may also be integrated in the processor 910 or be separate from the processor 910.
In addition, to further improve the functionality of the machine simultaneous interpretation output audio dynamic synthesis device 900, the device 900 may further comprise one or more of an input unit 960, a display unit 970, audio circuitry 980, a camera 990, a sensor 901, etc., which may further comprise a speaker 982, a microphone 984, etc. Wherein the display unit 970 may include a display screen.
Further, the machine simultaneous interpretation output audio dynamic synthesis apparatus 900 described above may further include a power supply 950 for supplying power to various devices or circuits in the apparatus 900.
It should be appreciated that the machine simultaneous interpretation output audio dynamic synthesis apparatus 900 shown in fig. 4 is capable of implementing the respective processes of the method provided by the foregoing embodiments. The operations and/or functions of the various components in the device 900 may be respectively for implementing the corresponding flows in the method embodiments described above. Reference is specifically made to the foregoing descriptions of embodiments of methods, apparatuses and so forth, and detailed descriptions thereof are appropriately omitted for the purpose of avoiding redundancy.
It should be appreciated that the processor 910 in the machine simultaneous interpretation output audio dynamic synthesis apparatus 900 shown in fig. 4 may be a system-on-a-chip (SoC), and may include a Central Processing Unit (CPU) and further include other types of processors, for example a Graphics Processing Unit (GPU), as described further below.
In general, portions of the processors or processing units within the processor 910 may cooperate to implement the preceding method flows, and corresponding software programs for the portions of the processors or processing units may be stored in the memory 930.
(2) A readable storage medium having stored thereon a computer program or the above-mentioned means, which when executed, causes a computer to perform the steps/functions of the foregoing embodiments or equivalent implementations.
In several embodiments provided by the present invention, any of the functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, certain aspects of the present invention, in essence, or the part contributing to the prior art, may be embodied in the form of a software product as described below.
(3) A computer program product (which may comprise the apparatus described above) which, when run on a terminal device, causes the terminal device to perform the machine simultaneous interpretation output audio dynamic synthesis method of the preceding embodiment or equivalent implementation.
From the above description of embodiments, it will be apparent to those skilled in the art that all or part of the steps of the above-described methods may be implemented by software plus a necessary general-purpose hardware platform. Based on such understanding, the above computer program product may include, but is not limited to, an APP; in connection with the foregoing, the device/terminal may be a computer device, and the hardware structure of the computer device may specifically further include: at least one processor, at least one communication interface, at least one memory, and at least one communication bus, where the processor, the communication interface, and the memory communicate with each other through the communication bus. The processor may be a Central Processing Unit (CPU), a DSP, a microcontroller, or a digital signal processor, and may further include a GPU, an embedded Neural-network Processing Unit (NPU), and an Image Signal Processor (ISP); the processor may further include an ASIC or one or more integrated circuits configured to implement the embodiments of the present invention. In addition, the processor may have the function of running one or more software programs, which may be stored in a storage medium such as the memory. The aforementioned memory/storage medium may include nonvolatile memory, such as a non-removable magnetic disk, a USB flash disk, a removable hard disk, or an optical disk, as well as Read-Only Memory (ROM) and Random Access Memory (RAM).
In the embodiments of the present invention, "at least one" means one or more, and "a plurality" means two or more. "and/or", describes an association relation of association objects, and indicates that there may be three kinds of relations, for example, a and/or B, and may indicate that a alone exists, a and B together, and B alone exists. Wherein A, B may be singular or plural. The character "/" generally indicates that the context-dependent object is an "or" relationship. "at least one of the following" and the like means any combination of these items, including any combination of single or plural items. For example, at least one of a, b and c may represent: a, b, c, a and b, a and c, b and c or a and b and c, wherein a, b and c can be single or multiple.
Those of skill in the art will appreciate that the various modules, units, and method steps described in the embodiments disclosed herein can be implemented in electronic hardware, computer software, and combinations of electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Those skilled in the art may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
And, each embodiment in the specification is described in a progressive manner, and the same and similar parts of each embodiment are mutually referred to. In particular, for embodiments of the apparatus, device, etc., as they are substantially similar to method embodiments, the relevance may be found in part in the description of method embodiments. The above-described embodiments of apparatus, devices, etc. are merely illustrative, in which modules, units, etc. illustrated as separate components may or may not be physically separate, i.e., may be located in one place, or may be distributed across multiple places, e.g., nodes of a system network. In particular, some or all modules and units in the system can be selected according to actual needs to achieve the purpose of the embodiment scheme. Those skilled in the art will understand and practice the invention without undue burden.
The construction, features and effects of the present invention are described in detail according to the embodiments shown in the drawings, but the above is only a preferred embodiment of the present invention, and it should be understood that the technical features of the above embodiment and the preferred mode thereof can be reasonably combined and matched into various equivalent schemes by those skilled in the art without departing from or changing the design concept and technical effects of the present invention; therefore, the invention is not limited to the embodiments shown in the drawings, but is intended to be within the scope of the invention as long as changes made in the concept of the invention or modifications to the equivalent embodiments do not depart from the spirit of the invention as covered by the specification and drawings.

Claims (9)

1. A method for dynamically synthesizing machine simultaneous interpretation output audio, comprising:
Judging whether the speech speed of the synthesized audio needs to be regulated according to the acquired current simultaneous interpretation scene information and the established rule; the established rules are used to decide whether a speed adjustment is required from one or more of the following: the relation between the languages of the current original language type and the translation direction; personalized information of the speaker; the pronunciation level characteristics of the current acoustic statement;
If so, acquiring a first time length of the current original sound sentence in real time, and estimating a second time length after the audio is synthesized by the translated text corresponding to the current original sound sentence;
calculating the time difference between the first time length and the second time length, and counting the accumulated time difference of the current simultaneous transmission stage;
According to the time difference and/or the relation between the accumulated time difference and the corresponding preset tolerance, a translation strategy is adjusted in real time and/or a speech speed adjusting gain parameter is determined;
and performing voice synthesis processing according to the translated text after the translation strategy adjustment and/or the speed adjustment gain parameter.
2. The machine simultaneous interpretation output audio dynamic synthesis method as claimed in claim 1, wherein the adjusting the interpretation strategy includes:
Performing secondary translation on the current original sound sentence: the length of the translated text after the second translation is changed by using words and/or grammar that are different from the previous translated text.
3. The method of claim 1, wherein determining the speech rate adjustment gain parameter comprises:
corresponding corpus is collected in advance according to the simultaneous interpretation scene;
performing delay analysis by using the corpus, and determining a preset gain parameter;
And selecting the preset gain parameter or the proportional relation as the language speed adjusting gain parameter based on the time difference and/or the proportional relation of the accumulated time difference and the preset tolerance.
4. A machine simultaneous interpretation output audio dynamic synthesis method as claimed in any one of claims 1 to 3, characterized in that the method further comprises:
Determining pronunciation adjustment parameters of the current output audio according to the speech speed adjustment gain parameter, or according to the speech speed adjustment gain parameter and the current simultaneous interpretation scene information;
And combining the pronunciation adjusting parameters to synthesize the text of the current translation.
5. The method of dynamic synthesis of machine simultaneous interpretation output audio as claimed in claim 4, wherein determining the pronunciation adjustment parameters of the current output audio based on the current simultaneous interpretation scene information includes:
according to the loudness of the current acoustic statement, obtaining a loudness adjustment parameter of the current output audio; and/or
Carrying out emotion analysis on the text of the current translation by combining the current original sound sentence;
And obtaining the mood adjusting parameters of the current output audio according to the emotion analysis result.
6. The method of claim 5, wherein obtaining the loudness adjustment parameter of the current output audio based on the loudness of the current acoustic sentence comprises:
continuously obtaining an original sound value of each audio frame of the current original sound sentence;
calculating a volume difference value of each audio frame based on the original volume value;
and determining the volume adjustment parameter of the current output audio by combining the preset default loudness of the current output audio and the volume difference value.
7. A machine simultaneous interpretation output audio dynamic synthesis device, comprising:
a speech speed adjustment decision module, configured to judge, according to the acquired current simultaneous interpretation scene information and established rules, whether the speech speed of the synthesized audio needs to be adjusted, wherein the established rules decide whether a speed adjustment is required from one or more of the following: the relationship between the source language and the target language of the current translation direction; personalized information of the speaker; and pronunciation-level features of the current original sound sentence;
an audio duration calculation module, configured to, when the speech speed adjustment decision module outputs yes, acquire in real time a first duration of the current original sound sentence and estimate a second duration of the audio synthesized from the translation text corresponding to the current original sound sentence;
a delay calculation module, configured to calculate the time difference between the first duration and the second duration, and to accumulate the time differences over the current simultaneous interpretation session;
a speech speed adjustment parameter determination module, configured to adjust the translation strategy and/or determine a speech speed adjustment gain parameter in real time according to the relationship between the time difference and/or the accumulated time difference and the corresponding preset tolerance;
and a speech synthesis module, configured to perform speech synthesis according to the translation text after the translation strategy adjustment and/or the speech speed adjustment gain parameter.
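The delay-calculation and gain-determination modules of claim 7 can be sketched as one function (an illustrative sketch; the tolerance values, the 1.3 gain cap, and the proportional-gain rule are hypothetical choices, not specified by the patent):

```python
def speed_gain(first_dur, second_dur, cum_diff, sent_tol=0.5, cum_tol=2.0):
    """Decide a speech-rate gain for the next synthesis pass.
    first_dur: duration of the current original sound sentence (s);
    second_dur: estimated duration of the synthesized translation (s);
    cum_diff: accumulated lag of the current session so far (s)."""
    diff = second_dur - first_dur  # positive: translation runs long
    cum_diff += diff               # accumulated time difference of the session
    gain = 1.0
    if diff > sent_tol or cum_diff > cum_tol:
        # Speed up in proportion to the overrun, capped to keep speech natural.
        gain = min(second_dur / first_dur, 1.3)
    return gain, cum_diff
```

When neither tolerance is exceeded, the translation plays at normal rate; otherwise the gain shortens the synthesized audio toward the source duration.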
8. The machine simultaneous interpretation output audio dynamic synthesis device of claim 7, further comprising:
a pronunciation adjustment parameter determination module, configured to determine pronunciation adjustment parameters of the current output audio according to the speech speed adjustment gain parameter, or according to the speech speed adjustment gain parameter together with the current simultaneous interpretation scene information;
wherein the speech synthesis module is further configured to synthesize the current translation text in combination with the pronunciation adjustment parameters.
9. A machine simultaneous interpretation output audio dynamic synthesis apparatus, comprising:
one or more processors, a memory, and one or more computer programs, wherein the one or more computer programs are stored in the memory and comprise instructions which, when executed by the apparatus, cause the apparatus to perform the machine simultaneous interpretation output audio dynamic synthesis method of any one of claims 1 to 6.
CN202011105784.6A 2020-10-15 2020-10-15 Method, device and equipment for dynamically synthesizing simultaneous interpretation output audio of machine Active CN112233649B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011105784.6A CN112233649B (en) 2020-10-15 2020-10-15 Method, device and equipment for dynamically synthesizing simultaneous interpretation output audio of machine

Publications (2)

Publication Number Publication Date
CN112233649A CN112233649A (en) 2021-01-15
CN112233649B true CN112233649B (en) 2024-04-30

Family

ID=74119136

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011105784.6A Active CN112233649B (en) 2020-10-15 2020-10-15 Method, device and equipment for dynamically synthesizing simultaneous interpretation output audio of machine

Country Status (1)

Country Link
CN (1) CN112233649B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114327357B (en) * 2022-01-05 2024-02-02 郑州市金水区正弘国际小学 Language learning assisting method, electronic equipment and storage medium
CN116645954B (en) * 2023-07-27 2023-11-17 广东保伦电子股份有限公司 IP broadcasting system adopting AI (analog input) sound

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2852438A1 (en) * 2003-03-13 2004-09-17 France Telecom Voice messages translating system for use in multi-lingual audio-conference, has temporizing unit to temporize messages such that terminals except terminal which transmits current message, receive endings in speakers language
JP2006031448A (en) * 2004-07-16 2006-02-02 Fuji Xerox Co Ltd Automatic translation apparatus and automatic translation system
CN101155291A (en) * 2006-09-27 2008-04-02 株式会社东芝 Apparatus, method and computer program product for translating speech, and terminal that outputs translated speech
CN104462069A (en) * 2013-09-18 2015-03-25 株式会社东芝 Speech translation apparatus and speech translation method
CN109582982A (en) * 2018-12-17 2019-04-05 北京百度网讯科技有限公司 Method and apparatus for translated speech
CN110211570A (en) * 2019-05-20 2019-09-06 北京百度网讯科技有限公司 Simultaneous interpretation processing method, device and equipment
CN110705317A (en) * 2019-08-28 2020-01-17 科大讯飞股份有限公司 Translation method and related device
CN111144138A (en) * 2019-12-17 2020-05-12 Oppo广东移动通信有限公司 Simultaneous interpretation method and device and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012181571A (en) * 2011-02-28 2012-09-20 Ricoh Co Ltd Translation support device, translation delivery date setting method, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant