CN116564290A - Multi-mode voice pause judging method and device - Google Patents
- Publication number
- CN116564290A (application number CN202310543706.1A)
- Authority
- CN
- China
- Prior art keywords
- information
- audio
- judged
- judging
- pause
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
Abstract
The application discloses a multi-modal voice pause judging method and device. The multi-modal voice pause judging method comprises the following steps: acquiring audio to be judged; acquiring text information according to the audio to be judged; acquiring mood information and speech rate information according to the audio to be judged; obtaining a trained pause judging model; extracting and fusing features of the mood information, the speech rate information and the text information, thereby obtaining a fusion feature; and inputting the fusion feature into the pause judging model so as to obtain a pause judging result. The multi-modal voice pause judging method provided by the application comprehensively judges, through the text information, the mood information and the speech rate information, whether a piece of speech is a meaningful, complete and correctly understood sentence, and solves the problem in the prior art that the user's meaning is not fully understood, resulting in invalid interaction and a relatively poor user experience.
Description
Technical Field
The present disclosure relates to the field of speech pause recognition technologies, and in particular, to a multi-modal speech pause judging method and a multi-modal speech pause judging device.
Background
Voice interaction in the intelligent cabin is the entry point of intelligent voice, and the completeness and fluency of semantic interaction are of central importance. In practice, recognition either lags after the user has finished speaking, or the sentence is cut off early when the user pauses, so the returned semantic understanding is incomplete or wrong. For example, a user may say: "play + (long silence) + Liu Dehua's song", or "navigate me to + (silence) + Peking University". Natural pauses occur in user speech; if voice endpointing splits "play" off as one semantic unit and sends it for recognition, the understanding is incomplete and the system does not know what to play, and likewise "navigate me to" is split off on its own. If the silence threshold is instead set long enough to avoid this, the interaction delay becomes too long and the response is sluggish. Statistics show that in 5%-10% of cases the user has not finished speaking, yet a reply is given once the premature semantic understanding completes; the user's meaning is not fully understood, the interaction is invalid, and the user experience is relatively poor.
The traditional scheme uses VAD to decide the endpoint from the tail-silence duration: the silence duration determines when the sentence is considered ended, and a fixed threshold is generally set for the tail silence, typically 300 ms-500 ms. If the silence exceeds the threshold, the user is considered to have finished. But if the user's natural pause exceeds the threshold, the endpoint is decided prematurely and the semantics are incomplete; and if a larger threshold is set, the interaction is delayed and sluggish, because the system must wait for a sufficiently long tail silence before deciding.
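The fixed tail-silence threshold scheme above can be sketched in a few lines. This is an illustrative simplification, not the patent's VAD: the per-frame energies, the 400 ms threshold, and the energy floor are made-up values.

```python
def tail_silence_endpoint(frames, threshold_ms=400, frame_ms=10, energy_floor=0.01):
    """Declare end-of-utterance once trailing silence exceeds threshold_ms.

    frames: list of per-frame energies (floats). Simplified hypothetical VAD.
    Returns the frame index where the endpoint fires, or None if still waiting.
    """
    needed = threshold_ms // frame_ms   # number of consecutive silent frames
    silent = 0
    for i, energy in enumerate(frames):
        if energy < energy_floor:
            silent += 1
            if silent >= needed:
                return i                # endpoint decided at this frame
        else:
            silent = 0                  # speech resumed; reset the counter
    return None

# A natural 600 ms thinking pause triggers a premature endpoint:
speech = [0.5] * 30                     # "play..."
pause = [0.0] * 60                      # 600 ms mid-sentence pause
rest = [0.5] * 50                       # "...Liu Dehua's song"
print(tail_silence_endpoint(speech + pause + rest))  # → 69 (inside the pause)
```

Lowering `threshold_ms` makes this premature cut more likely; raising it adds latency to every turn, which is exactly the trade-off the background describes.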
It is therefore desirable to have a solution that solves or at least alleviates the above-mentioned drawbacks of the prior art.
Disclosure of Invention
The present invention is directed to a multi-mode voice pause judging method, which solves at least one of the above-mentioned problems.
The invention provides the following scheme:
according to one aspect of the present invention, there is provided a multi-modal speech pause judging method, the multi-modal speech pause judging method including:
acquiring audio to be judged;
acquiring text information according to the audio to be judged;
acquiring mood information and speech rate information according to the audio to be judged;
obtaining a trained pause judging model;
extracting and fusing features of the mood information, the speech rate information and the text information, thereby obtaining a fusion feature;
and inputting the fusion feature into the pause judging model so as to obtain a pause judging result.
Optionally, the obtaining the mood information according to the audio to be judged includes:
obtaining a trained mood classification model;
extracting the mood characteristics of the audio to be judged;
and inputting the mood characteristics of the audio to be judged into the mood classification model so as to acquire mood information.
Optionally, the acquiring the speech rate information according to the audio to be judged includes:
acquiring a trained speech rate classification model;
extracting the speech rate characteristics of the audio to be judged;
and inputting the speech rate characteristics of the audio to be judged into the speech rate classification model so as to acquire the speech rate information.
Optionally, the acquiring text information according to the audio to be judged includes:
and identifying text information of the audio to be judged through ASR.
Optionally, before the obtaining the trained pause judging model, the multi-modal speech pause judging method further includes:
acquiring a pause judging rule database, wherein the pause judging rule database comprises at least one pause judging rule;
judging whether the acquired text information accords with one of the pause judging rules in the database; and if not,
obtaining the trained pause judging model.
Optionally, the mood information includes thinking mood information, normal mood information, and mood information that cannot be judged.
Optionally, the speech rate information includes fast speech rate information, medium speed speech rate information, and slow speech rate information.
The application also provides a multi-mode voice pause judging device, which comprises:
the audio acquisition module to be judged is used for acquiring the audio to be judged;
the text information acquisition module is used for acquiring text information according to the audio to be judged;
the mood information acquisition module is used for acquiring mood information according to the audio to be judged;
the speech rate information acquisition module is used for acquiring speech rate information according to the audio to be judged;
the pause judgment model acquisition module is used for acquiring a trained pause judgment model;
the fusion module is used for extracting and fusing features of the mood information, the speech rate information and the text information so as to acquire a fusion feature;
and the result acquisition module is used for inputting the fusion feature into the pause judging model so as to acquire a pause judging result.
The application also provides an electronic device, which comprises: the device comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus; the memory stores a computer program which, when executed by the processor, causes the processor to perform the steps of the multi-modal speech pause determination method as described above.
The present application also provides a computer readable storage medium storing a computer program executable by an electronic device, which when run on the electronic device causes the electronic device to perform the steps of the multi-modal speech pause judging method as described above.
The multi-modal voice pause judging method provided by the application comprehensively judges, through the text information, the mood information and the speech rate information, whether a piece of speech is a meaningful, complete and correctly understood sentence, and solves the problem in the prior art that the user's meaning is not fully understood, resulting in invalid interaction and a relatively poor user experience.
Drawings
FIG. 1 is a flow chart of a multi-modal speech pause determination method provided by one or more embodiments of the present invention.
Fig. 2 is a block diagram of an electronic device according to a multi-modal voice pause judging method according to one or more embodiments of the present invention.
Detailed Description
The following description of the embodiments of the present invention is made clearly and completely with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without making any inventive effort fall within the scope of the invention.
FIG. 1 is a flow chart of a multi-modal speech pause determination method provided by one or more embodiments of the present invention.
The multi-mode voice pause judging method shown in fig. 1 comprises the following steps:
step 1: acquiring audio to be judged;
step 2: acquiring text information according to the audio to be judged;
step 3: acquiring mood information and speech rate information according to the audio to be judged;
step 4: obtaining a trained pause judging model;
step 5: extracting and fusing features of the mood information, the speech rate information and the text information, thereby obtaining a fusion feature;
step 6: and inputting the fusion feature into the pause judging model so as to obtain a pause judging result.
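Steps 1-6 can be sketched as a small pipeline. This is a stand-in under stated assumptions: the three extractors return fixed hypothetical outputs, fusion is plain concatenation, and the final classifier is a placeholder threshold; the patent's trained models would replace all of them.

```python
def mood_features(audio):
    # Stand-in for the trained mood classifier's last-layer output.
    return [0.1, 0.8, 0.1]          # e.g. P(thinking), P(normal), P(undecidable)

def rate_features(audio):
    # Stand-in for the trained speech rate classifier's last-layer output.
    return [0.2, 0.7, 0.1]          # e.g. P(fast), P(medium), P(slow)

def text_features(text):
    # Stand-in for a text encoder (e.g. BERT); here, crude end-word cues.
    return [float(text.endswith(w)) for w in ("song", "university", "hello")]

def fuse(audio, text):
    # Step 5: feature extraction plus fusion by concatenation.
    return mood_features(audio) + rate_features(audio) + text_features(text)

def pause_model(fused):
    # Step 6 stand-in: a trained classifier would score the fused vector.
    return "paused" if sum(fused) / len(fused) > 0.3 else "not paused"

print(pause_model(fuse(None, "play a song")))   # complete command
print(pause_model(fuse(None, "navigate to")))   # mid-utterance head word
```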
For example, the pause judging model architecture of the present application may be as follows:
the reference model is bert (12 layer transducer), ALBERT tiny (4 layer transducer) was distilled for end use, or lstm model could be distilled.
The multi-modal voice pause judging method provided by the application comprehensively judges, through the text information, the mood information and the speech rate information, whether a piece of speech is a meaningful, complete and correctly understood sentence, and solves the problem in the prior art that the user's meaning is not fully understood, resulting in invalid interaction and a relatively poor user experience.
In this embodiment, the pause judging model adopts a BERT pre-training model.
In this embodiment, the obtaining the mood information according to the audio to be determined includes:
obtaining a trained mood classification model;
extracting the mood characteristics of the audio to be judged;
and inputting the mood characteristics of the audio to be judged into the mood classification model so as to acquire mood information.
Specifically, a mood model is trained on the audio to judge whether the speaker's mood is hesitant thinking or a normal mood, and this judgment is provided as a feature.
The mood model judges whether the speaker's mood is hesitant or normal: if the mood is thinking, the speech has not stopped and the system needs to wait; if it is not hesitant, the audio can be sent to the subsequent pause judging model, and the last layer of the mood model is extracted as a feature.
The mood model can use a deep-learning classification model. The existing vehicle-mounted corpus is labeled into three classes according to the speaker's mood (thinking mood information, normal mood information, and mood information that cannot be judged), and classification training is performed. The last layer serves as the feature extraction layer, and its output is used as the mood feature value, representing thinking mood, normal mood, or mood that cannot be judged.
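The "last layer as feature extraction layer" idea can be illustrated with a toy two-layer classifier whose hidden activations double as the mood feature vector. The weights and the 2-dimensional input are made up for illustration; a real model would be trained on labeled in-vehicle audio.

```python
import math

def mood_classifier(x, return_features=False):
    # Hypothetical weights; a trained model would learn these from data.
    W1 = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
    # Hidden (penultimate) layer with ReLU: this is the feature extraction layer.
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]
    if return_features:
        return hidden                       # mood feature vector used for fusion
    W2 = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 1.0]]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in W2]
    exps = [math.exp(v) for v in logits]    # softmax over the three mood classes
    probs = [e / sum(exps) for e in exps]
    labels = ["thinking", "normal", "undecidable"]
    return labels[probs.index(max(probs))]
```

The same vector serves two purposes: the softmax head yields the mood class, while `return_features=True` exposes the penultimate activations for the fusion step.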
In this embodiment, the obtaining the speech rate information according to the audio to be judged includes:
acquiring a trained speech rate classification model;
extracting the speech rate characteristics of the audio to be judged;
and inputting the speech rate characteristics of the audio to be judged into the speech rate classification model so as to acquire the speech rate information.
Specifically, a speech rate model is trained on the audio to judge whether the speech rate of the utterance is fast, medium or slow; this determines how long to wait within the specified time before deciding, for subsequent recognition, whether the speech has stopped, and its features are extracted.
The speech rate model can obtain the speech rate in a traditional way or with deep learning. If a piece of audio belongs to fast speech but the amount of recognized text in that time is small, it can be judged as not stopped; if enough text has been recognized and the speech rate is slow, the speech is judged to have stopped.
For the speech rate model, audio corpora at different rates and existing vehicle-mounted data can be collected and, based on how many characters of the sentence ASR recognizes per minute, labeled into the three speed classes. Three-class training is then performed, and the last layer serves as the feature extraction layer for extracting feature values.
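The rate-from-recognized-text idea reduces to characters per minute with two cut-offs. A minimal sketch follows; the thresholds are illustrative assumptions, not values from the patent.

```python
def speech_rate_class(char_count, duration_s, fast_cpm=300, slow_cpm=120):
    """Classify speech rate from the amount of ASR text in a time window.

    Thresholds (characters per minute) are hypothetical example values.
    """
    cpm = char_count / duration_s * 60.0
    if cpm >= fast_cpm:
        return "fast"
    if cpm <= slow_cpm:
        return "slow"
    return "medium"

# A fast speaker who has produced little text so far is likely mid-utterance:
print(speech_rate_class(12, 2.0))   # 360 cpm → fast
print(speech_rate_class(4, 2.0))    # 120 cpm → slow
```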
In this embodiment, the obtaining text information according to the audio to be determined includes:
and identifying text information of the audio to be judged through ASR.
In this embodiment, before the obtaining the trained pause judging model, the multi-modal speech pause judging method further includes:
acquiring a pause judging rule database, wherein the pause judging rule database comprises at least one pause judging rule;
judging whether the acquired text information accords with one of the pause judging rules in the database; and if not,
obtaining the trained pause judging model.
In this embodiment, the audio is sent to ASR to recognize the text, and a dedicated pause-rule database is set up, according to the characteristics of the rule system and of intelligent-cabin interaction, to perform rule-based pause judgment. If a rule yields a result and the result is that the speech has stopped, the result can be sent directly to the subsequent NLU interaction.
If no pause rule is matched, the judgment is made by the pause judging model. When the rule system cannot decide, model-based recognition is carried out: the speech rate features, mood features, acoustic features and text features are extracted, and the model makes a semantic judgment (if the speech rate is fast, the utterance is not considered finished). If the model judges that the speech has stopped, the result enters the subsequent NLU interaction; if not, the system continues to wait.
In this embodiment, rule matching is a prior-art rule matching method. For example, in the first step of semantic understanding, some templates serve as a first-pass match: since an unlimited number of templates cannot be enumerated, and too many templates make matching too slow, only commonly used templates are matched directly, and the remainder are handled by a second-pass semantic understanding. The pause judgment works the same way, so the common templates can be used directly for rule matching.
For example, head-word queries use exact text matching, such as: (1) I want to listen; (2) navigate to; (3) play.
Queries with specific patterns use regular-expression matching, such as: (call|dial) .+ (phone|WeChat).
The application also provides a multi-modal voice pause judging device, which comprises an audio acquisition module to be judged, a text information acquisition module, a mood information acquisition module, a speech rate information acquisition module, a pause judging model acquisition module, a fusion module and a result acquisition module,
the audio acquisition module to be judged is used for acquiring audio to be judged;
the text information acquisition module is used for acquiring text information according to the audio to be judged;
the mood information acquisition module is used for acquiring mood information according to the audio to be judged;
the speech rate information acquisition module is used for acquiring speech rate information according to the audio to be judged;
the pause judging model acquisition module is used for acquiring a trained pause judging model;
the fusion module is used for extracting and fusing features of the mood information, the speech rate information and the text information so as to acquire a fusion feature;
and the result acquisition module is used for inputting the fusion feature into the pause judging model so as to acquire a pause judging result.
Fig. 2 is a block diagram of an electronic device according to one or more embodiments of the present invention.
As shown in fig. 2, the present application further discloses an electronic device, including: the device comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus; the memory stores a computer program which, when executed by the processor, causes the processor to perform the steps of the multi-modal speech pause determination method.
The present application also provides a computer readable storage medium storing a computer program executable by an electronic device, which when run on the electronic device causes the electronic device to perform the steps of a multi-modal speech pause determination method.
The communication bus mentioned for the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be classified as an address bus, a data bus, a control bus, and so on. For ease of illustration, only one bold line is shown in the figures, but this does not mean that there is only one bus or one type of bus.
The electronic device includes a hardware layer, an operating system layer running on top of the hardware layer, and an application layer running on top of the operating system. The hardware layer includes hardware such as a central processing unit (CPU), a memory management unit (MMU) and a memory. The operating system may be any one or more computer operating systems that implement electronic device control via processes, such as a Linux, Unix, Android, iOS, or Windows operating system. In addition, in the embodiment of the present invention, the electronic device may be a handheld device such as a smart phone or a tablet computer, or an electronic device such as a desktop computer or a portable computer, which is not particularly limited in the embodiment of the present invention.
The execution body controlled by the electronic device in the embodiment of the invention can be the electronic device or a functional module in the electronic device, which can call a program and execute the program. The electronic device may obtain firmware corresponding to the storage medium, where the firmware corresponding to the storage medium is provided by the vendor, and the firmware corresponding to different storage media may be the same or different, which is not limited herein. After the electronic device obtains the firmware corresponding to the storage medium, the firmware corresponding to the storage medium can be written into the storage medium, specifically, the firmware corresponding to the storage medium is burned into the storage medium. The process of burning the firmware into the storage medium may be implemented by using the prior art, and will not be described in detail in the embodiment of the present invention.
The electronic device may further obtain a reset command corresponding to the storage medium, where the reset command corresponding to the storage medium is provided by the provider, and the reset commands corresponding to different storage media may be the same or different, which is not limited herein.
At this time, the storage medium of the electronic device is a storage medium in which the corresponding firmware is written, and the electronic device may respond to a reset command corresponding to the storage medium in which the corresponding firmware is written, so that the electronic device resets the storage medium in which the corresponding firmware is written according to the reset command corresponding to the storage medium. The process of resetting the storage medium according to the reset command may be implemented in the prior art, and will not be described in detail in the embodiments of the present invention.
For convenience of description, the above devices are described as being functionally divided into various units and modules. Of course, the functions of each unit, module, etc. may be implemented in one or more pieces of software and/or hardware when implementing the present application.
It will be understood by those skilled in the art that all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs unless defined otherwise. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
For simplicity of description, the methods are shown and described as a series of acts; however, those skilled in the art will understand and appreciate that the methods are not limited by the order of the acts, as some acts may occur in other orders or concurrently. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments, and the acts involved are not necessarily required by every embodiment of the invention.
From the above description of embodiments, it will be apparent to those skilled in the art that the present application may be implemented in software plus a necessary general purpose hardware platform. Based on such understanding, the technical solutions of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions to cause a computer device (which may be a personal computer, a server or a network device, etc.) to perform the methods described in the embodiments or some parts of the embodiments of the present application.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.
Claims (10)
1. The multi-modal voice pause judging method is characterized by comprising the following steps of:
acquiring audio to be judged;
acquiring text information according to the audio to be judged;
acquiring mood information and speech rate information according to the audio to be judged;
obtaining a trained pause judging model;
extracting and fusing features of the mood information, the speech rate information and the text information, thereby obtaining a fusion feature;
and inputting the fusion feature into the pause judging model so as to obtain a pause judging result.
2. The method for multi-modal speech pause determination as in claim 1, wherein said obtaining mood information from said audio to be determined comprises:
obtaining a trained mood classification model;
extracting the mood characteristics of the audio to be judged;
and inputting the mood characteristics of the audio to be judged into the mood classification model so as to acquire mood information.
3. The multi-modal speech pause judging method according to claim 2, wherein the acquiring speech rate information according to the audio to be judged includes:
acquiring a trained speech rate classification model;
extracting the speech rate characteristics of the audio to be judged;
and inputting the speech rate characteristics of the audio to be judged into the speech rate classification model so as to acquire the speech rate information.
4. The multi-modal speech pause judging method according to claim 3, wherein the acquiring text information according to the audio to be judged includes:
and identifying text information of the audio to be judged through ASR.
5. The multi-modal speech pause judging method of claim 4, wherein prior to said obtaining a trained pause judging model, said multi-modal speech pause judging method further comprises:
acquiring a pause judging rule database, wherein the pause judging rule database comprises at least one pause judging rule;
judging whether the acquired text information accords with one of the pause judging rules in the database; and if not,
obtaining the trained pause judging model.
6. The method of claim 5, wherein the mood information includes thinking mood information, normal mood information, and mood information that cannot be judged.
7. The multi-modal speech pause judging method as claimed in claim 6, wherein the speech rate information includes fast speech rate information, medium speech rate information and slow speech rate information.
8. A multi-modal speech pause judging device, characterized in that the multi-modal speech pause judging device comprises:
the audio acquisition module to be judged is used for acquiring the audio to be judged;
the text information acquisition module is used for acquiring text information according to the audio to be judged;
the mood information acquisition module is used for acquiring mood information according to the audio to be judged;
the speech rate information acquisition module is used for acquiring speech rate information according to the audio to be judged;
the pause judgment model acquisition module is used for acquiring a trained pause judgment model;
the fusion module is used for extracting and fusing features of the mood information, the speech rate information and the text information so as to acquire a fusion feature;
and the result acquisition module is used for inputting the fusion feature into the pause judging model so as to acquire a pause judging result.
9. An electronic device, the electronic device comprising: the device comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus; a computer program is stored in a memory, which when executed by a processor causes the processor to perform the steps of the multi-modal speech pause determination method according to any one of claims 1 to 7.
10. A computer readable storage medium storing a computer program executable by an electronic device, which when run on the electronic device causes the electronic device to perform the steps of the multi-modal speech pause determination method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310543706.1A CN116564290A (en) | 2023-05-15 | 2023-05-15 | Multi-mode voice pause judging method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116564290A true CN116564290A (en) | 2023-08-08 |
Family
ID=87501507
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310543706.1A Pending CN116564290A (en) | 2023-05-15 | 2023-05-15 | Multi-mode voice pause judging method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116564290A (en) |
- 2023-05-15: application CN202310543706.1A filed in China (CN); status: Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||