CN113921012A - Method, system, intelligent device and storage medium for recognizing synthetic speech - Google Patents
- Publication number
- CN113921012A (application CN202111208088.2A)
- Authority
- CN
- China
- Prior art keywords
- data
- emotion
- voice
- target
- character
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
Abstract
An embodiment of the invention discloses a method, a system, an intelligent device, and a storage medium for recognizing synthesized speech. The method includes: acquiring target speech data to be recognized, and obtaining target text data to be recognized from the target speech; acquiring speech emotion data for each pronunciation phoneme in the target speech data and text emotion data for each word in the target text data; and judging whether the speech emotion data and the text emotion data match, and if they match, judging that the target speech data to be recognized is non-synthesized speech data. The invention effectively improves the accuracy and reliability of synthesized-speech recognition.
Description
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a method, a system, an intelligent device, and a storage medium for recognizing synthesized speech.
Background
Speech synthesis, also known as text-to-speech (TTS), is a technology that uses computers and dedicated devices to simulate human beings and produce speech. One approach generates speech entirely by machine; another collects multiple recorded segments of a speaker's voice and intercepts and splices them according to the target text to generate new synthesized speech.
Existing approaches to judging whether speech data is synthesized rely on cues such as sentence continuity and whether pauses sound natural. As synthesized speech grows ever closer to real speech, telling genuine recordings from forged ones has become an important research problem, one that bears directly on the security of recognition technologies such as voiceprint recognition and voice unlocking.
Disclosure of Invention
In view of the above, it is necessary to provide a method, a system, an intelligent device and a storage medium for recognizing synthesized speech.
The technical scheme adopted by the invention to solve the above technical problem is as follows: a method for recognizing synthesized speech is provided, including: acquiring target speech data to be recognized, and obtaining target text data to be recognized from the target speech to be recognized; acquiring speech emotion data of each pronunciation phoneme in the target speech data to be recognized and text emotion data of each word in the target text data to be recognized; and judging whether the speech emotion data and the text emotion data match, and if they match, judging that the target speech data to be recognized is non-synthesized speech data.
Wherein the step of judging whether the speech emotion data and the text emotion data match includes: acquiring pronunciation emotion continuation data and/or pronunciation emotion turning data of two adjacent pronunciation phonemes, and text emotion continuation data and/or text emotion turning data of two adjacent words; and judging whether the pronunciation emotion continuation data matches the text emotion continuation data and/or whether the pronunciation emotion turning data matches the text emotion turning data.

Wherein, before the step of obtaining the speech emotion data of each pronunciation phoneme in the target speech data to be recognized, the method includes: acquiring at least one pronunciation phoneme in the target speech data to be recognized through a speech recognition technology.

Wherein, before the step of judging whether the speech emotion data and the text emotion data match, the method includes: acquiring the time dimension of the target speech to be recognized, and aligning the speech emotion data and the text emotion data in the time dimension.

Wherein the step of judging whether the speech emotion data and the text emotion data match includes: obtaining speech target emotion data of the target speech data to be recognized according to the speech emotion data, and obtaining text target emotion data of the target text data to be recognized according to the text emotion data; and judging whether the speech target emotion data matches the text target emotion data.

Wherein the steps of obtaining the speech target emotion data according to the speech emotion data and obtaining the text target emotion data according to the text emotion data include: obtaining a speech emotion weight of each item of speech emotion data, and obtaining a text emotion weight of each item of text emotion data.

Wherein the steps of obtaining the speech emotion weight of each item of speech emotion data and the text emotion weight of each item of text emotion data include: obtaining the speech emotion weight and the text emotion weight through an attention computation.
The technical scheme adopted by the invention to solve the above technical problem is as follows: a system for recognizing synthesized speech is provided, comprising: an obtaining module, configured to acquire target speech data to be recognized and obtain target text data to be recognized from the target speech to be recognized; an emotion module, configured to acquire speech emotion data of each pronunciation phoneme in the target speech data to be recognized and text emotion data of each word in the target text data to be recognized; and a judging module, configured to judge whether the speech emotion data and the text emotion data match, and if they match, to judge that the target speech data to be recognized is non-synthesized speech data.

The technical scheme adopted by the invention to solve the above technical problem is as follows: a smart device is provided, comprising: a memory in which a computer program is stored, and a processor coupled to the memory, the processor executing the computer program to implement the method described above.

The technical scheme adopted by the invention to solve the above technical problem is as follows: a storage medium is provided, which stores a computer program executable by a processor to implement the method described above.
The embodiment of the invention has the following beneficial effects:
target text data to be recognized is obtained from the target speech to be recognized, and speech emotion data of each pronunciation phoneme in the target speech data and text emotion data of each word in the target text data are acquired; whether the speech emotion data and the text emotion data match is then judged, and if they match, the target speech data to be recognized is judged to be non-synthesized speech data. Judging from the perspective of emotion whether speech has been synthesized improves the accuracy and reliability of the judgment.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings described below show only some embodiments of the present invention, and that those skilled in the art can derive other drawings from them without creative effort.
Wherein:
FIG. 1 is a flow chart illustrating a method for recognizing synthesized speech according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an embodiment of a system for recognizing synthesized speech according to the present invention;
FIG. 3 is a schematic structural diagram of an embodiment of a smart device provided by the present invention;
fig. 4 is a schematic structural diagram of an embodiment of a storage medium provided in the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. It is obvious that the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments derived by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a flowchart illustrating a method for recognizing synthesized speech according to an embodiment of the present invention. The method for recognizing synthesized speech provided by the invention includes the following steps:
s101: and acquiring target voice data to be recognized, and acquiring target character data to be recognized according to the target voice to be recognized.
In a specific implementation scenario, the target speech data to be recognized is acquired; it may be uploaded by a user, parsed out of video data, or captured by on-site recording. Text recognition is then performed on the target speech data to obtain the target text data to be recognized. Any existing method for obtaining text data from speech data may be used, and the details are not repeated here.
S102: acquire speech emotion data of each pronunciation phoneme in the target speech data to be recognized and text emotion data of each word in the target text data to be recognized.
In a specific implementation scenario, at least one pronunciation phoneme is extracted from the audio data by ASR (Automatic Speech Recognition), and speech emotion data is obtained for each pronunciation phoneme. Specifically, each pronunciation phoneme may be input into a pre-trained speech-emotion recognition network to obtain its speech emotion data. In other implementation scenarios, the tone of each pronunciation phoneme may be obtained, and the speech emotion data of the phoneme derived from that tone.
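The tone-based variant above can be sketched as follows. This is a toy stand-in for the pre-trained speech-emotion network, not the patent's actual model: the pitch thresholds, emotion labels, and the idea of using mean pitch alone are all illustrative assumptions.

```python
# Toy stand-in for the pre-trained speech-emotion recognition network.
# Each phoneme carries a hypothetical mean-pitch value (Hz) that we bucket
# into a coarse emotion label; a real system would classify acoustic features.

def phoneme_emotion(mean_pitch_hz: float) -> str:
    """Map a phoneme's mean pitch to a coarse emotion label (illustrative)."""
    if mean_pitch_hz >= 220.0:
        return "happy"      # raised pitch as a crude excitement proxy
    if mean_pitch_hz >= 140.0:
        return "neutral"
    return "sad"            # lowered pitch as a crude sadness proxy

def speech_emotion_data(phonemes):
    """phonemes: list of (symbol, mean_pitch_hz) -> list of (symbol, emotion)."""
    return [(sym, phoneme_emotion(pitch)) for sym, pitch in phonemes]

print(speech_emotion_data([("n", 230.0), ("i", 150.0), ("h", 120.0)]))
```

In practice the thresholds would be replaced by a trained classifier over spectral features; the function signature is the only part meant to mirror the description.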
The target text data to be recognized is split into at least one word, and text emotion data is obtained for each word according to its word sense. Further, context data may be obtained from the target text data, and the text emotion data of each word refined by combining it with that context.
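The word-level text-emotion step can be sketched like this. The lexicon and the neighbour-based smoothing are invented for illustration; the patent only says word sense and context are combined, without specifying how.

```python
# Illustrative word-level text-emotion lookup. LEXICON is a made-up stand-in
# for a word-sense emotion model; the neighbour fallback crudely mimics
# "combining context data" for words the lexicon does not cover.

LEXICON = {"great": "happy", "terrible": "sad", "fine": "neutral"}

def word_emotions(words, default="neutral"):
    """Return (word, emotion) pairs, labelling unknown words with the
    emotion of a labelled neighbour when one exists, else the default."""
    base = [LEXICON.get(w.lower(), None) for w in words]
    out = []
    for i, emo in enumerate(base):
        if emo is None:
            neighbours = [e for e in (base[i - 1] if i else None,
                                      base[i + 1] if i + 1 < len(base) else None)
                          if e is not None]
            emo = neighbours[0] if neighbours else default
        out.append((words[i], emo))
    return out

print(word_emotions(["what", "a", "great", "day"]))
```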
S103: judge whether the speech emotion data and the text emotion data match; if they match, execute step S104.
In a specific implementation scenario, the speech emotion data of each pronunciation phoneme is compared with the text emotion data of the corresponding word to determine whether the two match. Specifically, one may check whether the two are identical, or whether they belong to the same category (e.g., positive emotion or negative emotion). If they are identical or of the same category, the speech emotion data and the text emotion data are judged to match.
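The "identical or same category" check can be sketched directly; the polarity groupings below are an assumed taxonomy, since the patent names only positive and negative emotion as examples.

```python
# Sketch of the "same category" check: labels match if identical or if they
# share a polarity class. The grouping is illustrative, not from the patent.

POLARITY = {"happy": "positive", "calm": "positive",
            "sad": "negative", "angry": "negative", "neutral": "neutral"}

def pair_matches(speech_emo: str, text_emo: str) -> bool:
    """True if the phoneme-level and word-level emotions agree."""
    return speech_emo == text_emo or POLARITY.get(speech_emo) == POLARITY.get(text_emo)

assert pair_matches("happy", "happy")   # identical labels
assert pair_matches("happy", "calm")    # same positive category
assert not pair_matches("happy", "sad") # opposite polarity
```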
In other implementation scenarios, matching may also be judged statistically over the speech emotion data of several consecutive pronunciation phonemes and the text emotion data of the corresponding consecutive words. For example, the proportion of each kind of emotion within the speech emotion data and within the text emotion data is computed, the two sets of proportions are compared, and if every difference lies within a preset proportion-difference threshold, the speech emotion data and the text emotion data are judged to match.
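The proportion comparison above can be sketched as follows; the 0.2 tolerance is an assumed value for the patent's unspecified "preset proportion difference threshold".

```python
# Proportion-based matching: tally the share of each emotion label on both
# sides and accept if every share differs by no more than a tolerance.

from collections import Counter

def proportions(labels):
    """Map each emotion label to its fraction of the sequence."""
    total = len(labels)
    return {k: v / total for k, v in Counter(labels).items()}

def distributions_match(speech_labels, text_labels, tol=0.2):
    ps, pt = proportions(speech_labels), proportions(text_labels)
    return all(abs(ps.get(k, 0.0) - pt.get(k, 0.0)) <= tol
               for k in set(ps) | set(pt))

print(distributions_match(["happy", "happy", "sad", "happy"],
                          ["happy", "happy", "happy", "sad"]))  # True
```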
In other implementation scenarios, the emotions a normal person expresses in speech are coherent: they are progressive, gradual, or sustained, and even when, say, sadness is mixed with another feeling, the two blend rather than jumping abruptly from one extreme to another. Whether the speech emotion data and the text emotion data match can therefore be judged by whether the emotion continuation data and/or emotion turning data of two adjacent pronunciation phonemes match the text emotion continuation data and/or text emotion turning data of the two corresponding adjacent words. The pronunciation emotion continuation data and/or turning data of two adjacent phonemes, and the text emotion continuation data and/or turning data of two adjacent words, can be obtained from the speech emotion data of each pronunciation phoneme and the text emotion data of each word in the target text data to be recognized. When the speech emotion of two adjacent phonemes turns (or continues), one checks whether the text emotion of the two corresponding words likewise turns (or continues); if so, the pronunciation-side and text-side continuation data and/or turning data are judged to match.
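The continuation/turning comparison can be sketched as a check that the two sequences change in the same places. The one-to-one phoneme/word correspondence assumed here is a simplification; a real system would first align the units in time.

```python
# Transition-consistency sketch: wherever the speech emotion turns (changes)
# between adjacent units, the text emotion should turn as well, and wherever
# it continues, the text emotion should continue.

def turns(labels):
    """Boolean list: True where the emotion changes between neighbours."""
    return [a != b for a, b in zip(labels, labels[1:])]

def transitions_match(speech_labels, text_labels) -> bool:
    """True if both sequences turn and continue at the same positions."""
    return turns(speech_labels) == turns(text_labels)

print(transitions_match(["calm", "calm", "angry"],
                        ["neutral", "neutral", "angry"]))  # turn at same spot
```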
In other implementation scenarios, a speech emotion curve and a text emotion curve can be plotted from the speech emotion data of each pronunciation phoneme in the target speech data and the text emotion data of each word in the target text data, respectively. The two curves are overlaid, and if their overlapping area exceeds a preset overlap-area threshold, the speech emotion data and the text emotion data are judged to match.
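One way to realize the curve-overlap test: sample both curves on a shared time grid and take the area under their pointwise minimum as the overlap. The sampled representation and the 0.8 acceptance fraction are assumptions, since the patent specifies only "a preset overlap area threshold".

```python
# Curve-overlap sketch: each curve is a series of emotion intensities sampled
# on the same time grid; the overlap is the area under the pointwise minimum.

def overlap_fraction(speech_curve, text_curve):
    overlap = sum(min(a, b) for a, b in zip(speech_curve, text_curve))
    denom = max(sum(speech_curve), sum(text_curve))
    return overlap / denom if denom else 1.0

def curves_match(speech_curve, text_curve, threshold=0.8) -> bool:
    return overlap_fraction(speech_curve, text_curve) >= threshold

print(curves_match([0.9, 0.8, 0.3], [0.8, 0.7, 0.4]))  # overlap 0.9 -> True
```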
In other implementation scenarios, target emotion data can be obtained from the speech emotion data of each pronunciation phoneme in the target speech data and from the text emotion data of each word in the target text data. Specifically, the speech emotion data of each phoneme can be input into a pre-trained speech-target-emotion neural network to obtain speech target emotion data, and the text emotion data of each word input into a pre-trained text-target-emotion neural network to obtain text target emotion data. In real life, even when a speaker's words span several emotions, with emotional differences between individual sentences, the utterance as a whole carries a final target emotion: a sad speaker's overall emotion is sadness even if individual sentences include other feelings. Artificially synthesized speech, by contrast, expresses no actual emotion even when it speaks words of sadness; and spliced speech is cut from many contexts (happy, angry, calm, and so on) whose described emotions need not be sadness, so either no final overall emotion emerges, or the one that does is not sadness. Whether the speech emotion data and the text emotion data match can therefore be judged by whether the speech target emotion data matches the text target emotion data.
Further, a speech emotion weight is obtained for each item of speech emotion data and a text emotion weight for each item of text emotion data; each item of speech emotion data is multiplied by its weight and fed into the pre-trained speech-target-emotion neural network to obtain the speech target emotion data, and each item of text emotion data is multiplied by its weight and fed into the pre-trained text-target-emotion neural network to obtain the text target emotion data. The weights may be obtained through an attention computation. For example, when expressing genuine emotion, a speaker may first utter a few scene-setting sentences, which receive low weights, while the sentence that carries the real meaning receives a high weight. Weighting in this way recovers the user's real emotion more accurately. With synthesized speech, however, no real emotion can be recovered even with differing weights; and with spliced speech, the weighting enlarges the share of false emotion that does not match the emotion actually being expressed, so the speech can be confirmed as synthesized.
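The weighted aggregation can be sketched with a softmax over per-segment relevance scores standing in for the attention computation. The relevance scores, segment emotion scores, and labels are all illustrative; the patent's two target-emotion networks are assumed to be trained models.

```python
# Attention-style weighting sketch: each segment's emotion scores are weighted
# by a softmax over relevance scores, then summed into one target emotion.

import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def target_emotion(segment_emotions, relevance_scores):
    """segment_emotions: list of {label: score} dicts, one per segment.
    Returns the label with the highest attention-weighted total."""
    weights = softmax(relevance_scores)
    totals = {}
    for w, seg in zip(weights, segment_emotions):
        for label, score in seg.items():
            totals[label] = totals.get(label, 0.0) + w * score
    return max(totals, key=totals.get)

# The last segment carries the speaker's real meaning, so it scores highest.
segments = [{"neutral": 1.0}, {"neutral": 0.8, "sad": 0.2}, {"sad": 1.0}]
print(target_emotion(segments, [0.1, 0.1, 3.0]))  # prints "sad"
```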
In other implementation scenarios, note that the target speech data to be recognized has a time dimension while the target text data does not. To make the comparison of speech emotion data with text emotion data more accurate and reliable, the two are aligned along the time dimension, so that each comparison involves speech emotion data and text emotion data from the same moment.
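The alignment step can be sketched by giving each word a start/end time (assumed here to come from a forced-alignment step the patent does not detail) and pairing each phoneme-level emotion sample with the word whose span contains its timestamp.

```python
# Time-alignment sketch: pair each timestamped speech-emotion sample with the
# word whose (start, end) span contains it, so comparisons share a moment.

def align(speech_samples, word_spans):
    """speech_samples: [(time_s, emotion)]; word_spans: [(word, start, end)].
    Returns [(time_s, speech_emotion, word)] for samples inside some span."""
    out = []
    for t, emo in speech_samples:
        for word, start, end in word_spans:
            if start <= t < end:
                out.append((t, emo, word))
                break
    return out

print(align([(0.1, "happy"), (0.6, "sad")],
            [("hello", 0.0, 0.5), ("world", 0.5, 1.0)]))
```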
S104: judge that the target speech data to be recognized is non-synthesized speech data.
In a specific implementation scenario, if the speech emotion data and the text emotion data match, that is, the emotion derived from the speech agrees with the emotion derived from the text, the target speech data to be recognized is judged to be non-synthesized speech data. Artificially synthesized speech, by contrast, often carries no perceptible human emotion; and speech spliced from multiple segments may fail to match the emotion actually being expressed because the segments carry differing emotions (for example, segments collected with happy, hurt, and angry emotions spliced to voice text expressing sadness). In such cases the speech emotion data does not match the text emotion data, and the target speech data to be recognized is judged to be synthesized speech data.
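Steps S103 and S104 can be combined into one decision function. The agreement-ratio formulation and the 0.7 threshold are assumptions; the patent leaves the exact decision rule to the implementation.

```python
# End-to-end decision sketch for S103/S104: count how many aligned
# phoneme/word pairs agree in emotion and accept the audio as non-synthesized
# when the agreement ratio clears a threshold.

def is_non_synthesized(pairs, threshold=0.7) -> bool:
    """pairs: list of (speech_emotion, text_emotion) for aligned units."""
    if not pairs:
        return False  # nothing to compare; refuse to vouch for the audio
    agree = sum(1 for s, t in pairs if s == t)
    return agree / len(pairs) >= threshold

print(is_non_synthesized([("happy", "happy"), ("happy", "happy"),
                          ("sad", "happy")]))  # 2/3 < 0.7 -> False
```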
As can be seen from the above, in this embodiment target text data to be recognized is obtained from the target speech to be recognized, and speech emotion data of each pronunciation phoneme in the target speech data and text emotion data of each word in the target text data are acquired; whether the speech emotion data and the text emotion data match is then judged, and if they match, the target speech data to be recognized is judged to be non-synthesized speech data. Judging from the perspective of emotion whether speech has been synthesized improves the accuracy and reliability of the judgment.
Referring to fig. 2, fig. 2 is a schematic structural diagram of an embodiment of the system for recognizing synthesized speech provided by the present invention. The recognition system 10 for synthesized speech includes an obtaining module 11, an emotion module 12, and a judging module 13.
The obtaining module 11 is configured to acquire target speech data to be recognized and obtain target text data to be recognized from the target speech to be recognized. The emotion module 12 is configured to acquire speech emotion data of each pronunciation phoneme in the target speech data to be recognized and text emotion data of each word in the target text data to be recognized. The judging module 13 is configured to judge whether the speech emotion data and the text emotion data match, and if they match, to judge that the target speech data to be recognized is non-synthesized speech data.
The judging module 13 is further configured to acquire pronunciation emotion continuation data and/or pronunciation emotion turning data of two adjacent pronunciation phonemes, and text emotion continuation data and/or text emotion turning data of two adjacent words, and to judge whether the pronunciation emotion continuation data matches the text emotion continuation data and/or whether the pronunciation emotion turning data matches the text emotion turning data.

The emotion module 12 is further configured to acquire at least one pronunciation phoneme in the target speech data to be recognized through a speech recognition technology.

The judging module 13 is further configured to acquire the time dimension of the target speech to be recognized and to align the speech emotion data and the text emotion data in the time dimension.

The judging module 13 is further configured to obtain speech target emotion data of the target speech data to be recognized according to the speech emotion data, obtain text target emotion data of the target text data to be recognized according to the text emotion data, and judge whether the speech target emotion data matches the text target emotion data.

The judging module 13 is further configured to obtain a speech emotion weight of each item of speech emotion data and a text emotion weight of each item of text emotion data.

The judging module 13 is further configured to obtain the speech emotion weight and the text emotion weight through an attention computation.
As can be seen from the above, in this embodiment the recognition system for synthesized speech obtains target text data to be recognized from the target speech to be recognized, and acquires speech emotion data of each pronunciation phoneme in the target speech data and text emotion data of each word in the target text data; it then judges whether the speech emotion data and the text emotion data match, and if they match, judges that the target speech data to be recognized is non-synthesized speech data. Judging from the perspective of emotion whether speech has been synthesized improves the accuracy and reliability of the judgment.
Referring to fig. 3, fig. 3 is a schematic structural diagram of an embodiment of an intelligent device provided by the present invention. The smart device 20 comprises a processor 21 and a memory 22, with the processor 21 coupled to the memory 22. The memory 22 stores a computer program, which the processor 21 executes in operation to implement the method shown in fig. 1. Details of the method can be found above and are not repeated here.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a storage medium according to an embodiment of the present invention. The storage medium 30 stores at least one computer program 31, which is executable by a processor to implement the method shown in fig. 1; details of the method can be found above and are not repeated here. In one embodiment, the computer-readable storage medium 30 may be a memory chip in a terminal, a hard disk, or another readable and writable storage device such as a removable hard disk, a flash drive, or an optical disc, and may also be a server, or the like.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by hardware under the instruction of a computer program, and the program can be stored in a non-volatile computer-readable storage medium; when executed, it can include the processes of the method embodiments described above. Any reference to memory, storage, a database, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, as long as a combination contains no contradiction, it should be considered within the scope of this specification.
The above examples express only several embodiments of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the application. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within its scope of protection. Therefore, the protection scope of this patent shall be subject to the appended claims.
Claims (10)
1. A method for recognizing synthesized speech, comprising:
acquiring target speech data to be recognized, and obtaining target text data to be recognized from the target speech to be recognized;

acquiring speech emotion data of each pronunciation phoneme in the target speech data to be recognized and text emotion data of each word in the target text data to be recognized;

and judging whether the speech emotion data and the text emotion data match, and if they match, judging that the target speech data to be recognized is non-synthesized speech data.
2. The method of claim 1, wherein the step of determining whether the speech emotion data and the text emotion data match comprises:
acquiring pronunciation emotion continuation data and/or pronunciation emotion turning data of two adjacent pronunciation phonemes, and text emotion continuation data and/or text emotion turning data of two adjacent words;

and judging whether the pronunciation emotion continuation data matches the text emotion continuation data and/or whether the pronunciation emotion turning data matches the text emotion turning data.
3. The method for recognizing synthesized speech according to claim 1, wherein, before the step of obtaining the speech emotion data of each pronunciation phoneme in the target speech data to be recognized, the method comprises:

acquiring at least one pronunciation phoneme in the target speech data to be recognized through a speech recognition technology.
4. The method of claim 1, wherein, before the step of judging whether the speech emotion data and the text emotion data match, the method comprises:

acquiring the time dimension of the target speech to be recognized, and aligning the speech emotion data and the text emotion data in the time dimension.
5. The method of claim 1, wherein the step of judging whether the speech emotion data and the text emotion data match comprises:

obtaining speech target emotion data of the target speech data to be recognized according to the speech emotion data, and obtaining text target emotion data of the target text data to be recognized according to the text emotion data;

and judging whether the speech target emotion data matches the text target emotion data.
6. The method for recognizing synthesized speech according to claim 5, wherein the step of acquiring the speech target emotion data of the target speech data to be recognized from the speech emotion data, and acquiring the text target emotion data of the target text data to be recognized from the text emotion data, comprises:
acquiring a speech emotion weight for each item of speech emotion data, and a text emotion weight for each item of text emotion data.
7. The method for recognizing synthesized speech according to claim 6, wherein the step of acquiring the speech emotion weight of each item of speech emotion data and the text emotion weight of each item of text emotion data comprises:
acquiring the speech emotion weights and the text emotion weights through an attention operation.
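Claims 5-7 together describe aggregating unit-level emotion data into an utterance-level "target emotion" via attention weights. A hedged sketch follows: each unit carries an emotion score vector, a dot-product attention against a query vector produces the weights, and the weighted sum gives the target. The dot-product form and the L1/peak match criterion are assumed stand-ins for the patent's unspecified attention operation and match test.

```python
# Sketch of claims 5-7: attention-weighted aggregation of unit-level
# emotion vectors into a single target emotion per modality.
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(emotion_vectors, query):
    """One weight per unit: softmax of dot products with a query vector."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in emotion_vectors]
    return softmax(scores)

def target_emotion(emotion_vectors, query):
    """Attention-weighted sum of the unit-level emotion vectors."""
    weights = attention_weights(emotion_vectors, query)
    dim = len(emotion_vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, emotion_vectors))
            for d in range(dim)]

def targets_match(speech_target, text_target, tol=0.3):
    """Match when both target vectors peak on the same emotion and differ
    by less than `tol` in L1 distance (assumed criterion)."""
    same_peak = (speech_target.index(max(speech_target))
                 == text_target.index(max(text_target)))
    l1 = sum(abs(a - b) for a, b in zip(speech_target, text_target))
    return same_peak and l1 < tol
```

In a trained system the query vector would itself be learned; here it is a free parameter so the weighting behavior can be inspected directly.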
8. A system for recognizing synthesized speech, comprising:
an acquisition module, configured to acquire target speech data to be recognized and to acquire target text data to be recognized from the target speech to be recognized;
an emotion module, configured to acquire speech emotion data of each pronunciation phoneme in the target speech data to be recognized, and text emotion data of each word in the target text data to be recognized; and
a judging module, configured to determine whether the speech emotion data matches the text emotion data and, if they match, to determine that the target speech data to be recognized is non-synthesized speech data.
9. A smart device, comprising: a memory having a computer program stored therein, and a processor coupled to the memory, the processor executing the computer program to implement the method according to any one of claims 1 to 7.
10. A storage medium, characterized in that it stores a computer program executable by a processor to implement the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111208088.2A CN113921012A (en) | 2021-10-18 | 2021-10-18 | Method, system, intelligent device and storage medium for recognizing synthetic speech |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113921012A true CN113921012A (en) | 2022-01-11 |
Family
ID=79241125
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111208088.2A Withdrawn CN113921012A (en) | 2021-10-18 | 2021-10-18 | Method, system, intelligent device and storage medium for recognizing synthetic speech |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113921012A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117059131A (en) * | 2023-10-13 | 2023-11-14 | 南京龙垣信息科技有限公司 | False audio detection method based on emotion recognition |
CN117059131B (en) * | 2023-10-13 | 2024-03-29 | 南京龙垣信息科技有限公司 | False audio detection method based on emotion recognition |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP3955246B1 (en) | Voiceprint recognition method and device based on memory bottleneck feature | |
CN112102815B (en) | Speech recognition method, speech recognition device, computer equipment and storage medium | |
AU2016216737B2 (en) | Voice Authentication and Speech Recognition System | |
US20160372116A1 (en) | Voice authentication and speech recognition system and method | |
US11900932B2 (en) | Determining a system utterance with connective and content portions from a user utterance | |
EP3588490A1 (en) | Speech conversion method, computer device, and storage medium | |
US8494853B1 (en) | Methods and systems for providing speech recognition systems based on speech recordings logs | |
EP1675102A2 (en) | Method for extracting feature vectors for speech recognition | |
CN112995754B (en) | Subtitle quality detection method and device, computer equipment and storage medium | |
CN111402865B (en) | Method for generating voice recognition training data and method for training voice recognition model | |
CN110689881A (en) | Speech recognition method, speech recognition device, computer equipment and storage medium | |
CN101887722A (en) | Rapid voiceprint authentication method | |
KR20160098910A (en) | Expansion method of speech recognition database and apparatus thereof | |
Agrawal et al. | Speech emotion recognition of Hindi speech using statistical and machine learning techniques | |
CN113921012A (en) | Method, system, intelligent device and storage medium for recognizing synthetic speech | |
Hu et al. | Fusion of global statistical and segmental spectral features for speech emotion recognition. | |
KR102113879B1 (en) | The method and apparatus for recognizing speaker's voice by using reference database | |
KR102442020B1 (en) | Method and apparatus for automatic proficiency evaluation of speech | |
KR101925248B1 (en) | Method and apparatus utilizing voice feature vector for optimization of voice authentication | |
CN113658599A (en) | Conference record generation method, device, equipment and medium based on voice recognition | |
Hanique et al. | Choice and pronunciation of words: Individual differences within a homogeneous group of speakers | |
CN109087651B (en) | Voiceprint identification method, system and equipment based on video and spectrogram | |
Chauhan et al. | Speech Summarization Using Prosodic Features and 1-D Convolutional Neural Network | |
CN112420054A (en) | Speech recognition system and method based on speaker vector multiplexing | |
CN111312216B (en) | Voice marking method containing multiple speakers and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | Application publication date: 20220111 |