CN114582347A

CN114582347A - Method, apparatus, device and medium for determining speech semantics based on wake word speech rate

Info

Publication number: CN114582347A
Application number: CN202210225951.3A
Authority: CN
Inventors: 杨翠; 宋琪; 李霄寒
Original assignee: Unisound Intelligent Technology Co Ltd
Current assignee: Unisound Intelligent Technology Co Ltd
Priority date: 2022-03-07
Filing date: 2022-03-07
Publication date: 2022-06-03

Abstract

The application relates to a method, an apparatus, an electronic device and a storage medium for determining speech semantics based on wake word speech rate. The method comprises: determining that a device is in a to-be-woken state, acquiring a wake word, and judging from the wake word whether to wake the device; when the judgment result is that the wake word wakes the device, determining the total duration of the wake word; determining the rear endpoint duration of the current voice endpoint detection according to the total duration of the wake word; acquiring voice, and setting the cloud rear endpoint detection duration equal to the rear endpoint duration of the current voice endpoint detection; and sending the audio corresponding to the voice to the cloud, where the cloud determines the semantics corresponding to the voice according to the cloud rear endpoint detection duration. Because a user's speech rate is almost the same when speaking the wake word and when speaking the actual voice command, the application dynamically adjusts the cloud sentence-breaking node according to the speech rate of the wake word. Voice endpoints are thus determined dynamically, the sentence is broken at those endpoints, the semantics are then determined, and accuracy is improved.

Description

Method, apparatus, device and medium for determining speech semantics based on wake word speech rate

Technical Field

The present application relates to the field of speech semantics technology, and in particular to a method, an apparatus, a device, and a medium for determining speech semantics based on wake word speech rate.

Background

Patent WO2020024885A1 addresses a problem similar to that of the present application, but it analyses speech rate and dynamically sets sentence-breaking nodes mainly on the basis of the previous voice command. The biggest defect of that scheme is that, in real application scenarios, the previous and current voice commands cannot be guaranteed to come from the same user, and even two utterances by the same person may differ in speech rate. The speech rate of the previous utterance therefore cannot be used directly as the basis for judgment in the current voice interaction.

In voice interaction scenarios, speech rate habits differ from user to user, and even the same user speaks at different rates in different emotional states. The conventional approach sets a fixed pause-time threshold: for example, it judges whether a pause in the middle of a long utterance exceeds a preset threshold and, if so, breaks the sentence at that pause. A fixed threshold of this kind cannot adapt to these differences in speech rate.
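The conventional fixed-threshold approach described above can be illustrated with a minimal sketch. The word-timestamp input format, the function name `split_on_pauses`, and the example timings are all hypothetical and are not taken from the cited patent; the sketch only shows how one threshold value can split a slow speaker's command.

```python
from typing import List, Optional, Tuple

# Hypothetical input format: (text, start time in seconds, end time in seconds).
Word = Tuple[str, float, float]


def split_on_pauses(words: List[Word], pause_threshold_s: float) -> List[List[str]]:
    """Break a word sequence wherever the pause between two words exceeds the threshold."""
    sentences: List[List[str]] = []
    current: List[str] = []
    prev_end: Optional[float] = None
    for text, start, end in words:
        # A pause longer than the fixed threshold closes the current sentence.
        if prev_end is not None and start - prev_end > pause_threshold_s:
            sentences.append(current)
            current = []
        current.append(text)
        prev_end = end
    if current:
        sentences.append(current)
    return sentences


# A slow speaker pauses about 0.4 s after the first word of a single command.
slow = [("open", 0.0, 0.5), ("the", 0.9, 1.1), ("curtain", 1.2, 1.8)]
print(split_on_pauses(slow, 0.3))  # the command is cut in two
print(split_on_pauses(slow, 0.5))  # the command stays whole
```

With a 0.3 s threshold the command is broken after "open", while a 0.5 s threshold keeps it intact, which is the sensitivity to speaker pace that motivates a per-interaction threshold.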

Disclosure of Invention

In view of the above problems, the present application provides a method, an apparatus, a device, and a medium for determining speech semantics based on wake word speech rate.

In a first aspect, an embodiment of the present application provides a method for determining speech semantics based on wake word speech rate, including:

determining that a device is in a to-be-woken state, acquiring a wake word, and judging from the wake word whether to wake the device;

when the judgment result is that the wake word wakes the device, determining the total duration of the wake word;

determining the rear endpoint duration of the current voice endpoint detection according to the total duration of the wake word;

acquiring voice, and setting the cloud rear endpoint detection duration equal to the rear endpoint duration of the current voice endpoint detection;

and sending the audio corresponding to the voice to the cloud, where the cloud determines the semantics corresponding to the voice according to the cloud rear endpoint detection duration.

Further, in the method for determining speech semantics based on wake word speech rate, before determining that the device is in a to-be-woken state, the method further includes:

starting the application program corresponding to the device and initializing the voice-related engine.

Further, in the method for determining speech semantics based on wake word speech rate, after the cloud determines the semantics corresponding to the voice according to the cloud rear endpoint detection duration, the method further includes:

acquiring the semantics of the voice from the cloud.

Further, in the method for determining speech semantics based on wake word speech rate, judging from the wake word whether to wake the device includes:

judging whether the wake word is consistent with a preset wake word, and determining to wake the device when the judgment result is that the wake word is consistent with the preset wake word;

and not waking the device when the judgment result is that the wake word is inconsistent with the preset wake word.

Further, in the method for determining speech semantics based on wake word speech rate, determining the total duration of the wake word includes:

determining the start time and the end time of the wake word by means of the voice wake-up engine;

and determining the total duration of the wake word according to the start time and the end time of the wake word.

Further, in the method for determining speech semantics based on wake word speech rate, determining the rear endpoint duration of the current voice endpoint detection according to the total duration of the wake word includes:

acquiring a wake-up record of the wake word, and determining from the wake-up record the average duration of the wake word and the rear endpoint duration of voice endpoint detection corresponding to that average duration;

and determining the rear endpoint duration of the current voice endpoint detection according to the average duration of the wake word, the rear endpoint duration of voice endpoint detection corresponding to the average duration, and the total duration of the wake word.

Further, in the method for determining speech semantics based on wake word speech rate, the cloud determining the semantics of the voice according to the cloud rear endpoint detection duration includes:

determining the voice endpoint according to the cloud rear endpoint detection duration;

and determining the semantics of the voice by means of automatic speech recognition and natural language understanding.

In a second aspect, an embodiment of the present application further provides an apparatus for determining speech semantics based on wake word speech rate, including:

a first determination module, configured to determine that the device is in a to-be-woken state, acquire a wake word, and judge from the wake word whether to wake the device;

a second determination module, configured to determine the total duration of the wake word when the judgment result is that the wake word wakes the device;

a third determination module, configured to determine the rear endpoint duration of the current voice endpoint detection according to the total duration of the wake word;

an acquisition module and a setting module, configured to acquire voice and to set the cloud rear endpoint detection duration equal to the rear endpoint duration of the current voice endpoint detection;

a sending module and a fourth determination module, configured to send the audio corresponding to the voice to the cloud, where the cloud determines the semantics corresponding to the voice according to the cloud rear endpoint detection duration.

In a third aspect, an embodiment of the present application further provides an electronic device, including: a processor and a memory;

the processor is configured to execute any one of the above methods for determining speech semantics based on wake word speech rate by calling a program or instructions stored in the memory.

In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium storing a program or instructions that cause a computer to execute any one of the above methods for determining speech semantics based on wake word speech rate.

The embodiments of the application have the following advantages. The application relates to a method, an apparatus, an electronic device and a storage medium for determining speech semantics based on wake word speech rate. The method comprises: determining that a device is in a to-be-woken state, acquiring a wake word, and judging from the wake word whether to wake the device; when the judgment result is that the wake word wakes the device, determining the total duration of the wake word; determining the rear endpoint duration of the current voice endpoint detection according to the total duration of the wake word; acquiring voice, and setting the cloud rear endpoint detection duration equal to the rear endpoint duration of the current voice endpoint detection; and sending the audio corresponding to the voice to the cloud, where the cloud determines the semantics corresponding to the voice according to the cloud rear endpoint detection duration. Because a user's speech rate is almost the same when speaking the wake word and when speaking the actual voice command, the application dynamically adjusts the cloud sentence-breaking node according to the speech rate of the wake word. Voice endpoints are thus determined dynamically, the position of each sentence break is determined from those endpoints, the semantics are then determined, and sentence-breaking accuracy is improved.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application or in the conventional technology more clearly, the drawings used in the description of the embodiments or the conventional technology are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a first schematic diagram of a method for determining speech semantics based on wake word speech rate according to an embodiment of the present application;

Fig. 2 is a second schematic diagram of a method for determining speech semantics based on wake word speech rate according to an embodiment of the present application;

Fig. 3 is a third schematic diagram of a method for determining speech semantics based on wake word speech rate according to an embodiment of the present application;

Fig. 4 is a schematic diagram of an apparatus for determining speech semantics based on wake word speech rate according to an embodiment of the present application;

Fig. 5 is a schematic block diagram of an electronic device provided in an embodiment of the present application.

Detailed Description

In order to make the above objects, features and advantages of the present application easier to understand, embodiments of the present application are described in detail below with reference to the accompanying drawings. Numerous specific details are set forth in the following description to provide a thorough understanding of the present application. The application can, however, be implemented in many forms other than those described herein, and those skilled in the art can make similar modifications without departing from the spirit of the application; the application is therefore not limited to the specific embodiments disclosed below.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.

Fig. 1 is a first schematic diagram of a method for determining a speech semantic based on a wake word speed according to an embodiment of the present application.

In a first aspect, an embodiment of the present application provides a method for determining a speech semantic based on a speech rate of a wakeup word, which, with reference to fig. 1, includes:

S101: determining that the device is in a to-be-woken state, acquiring a wake word, and judging from the wake word whether to wake the device.

Specifically, in this embodiment of the application, the device may be any intelligent terminal device, such as a Tmall Genie. The to-be-woken state may be a state in which the intelligent terminal device is powered on and can be woken when it receives the corresponding wake word. The wake word may be set by the user according to personal preference or preset by the manufacturer, for example "Xiaodu" (小度). Whether to wake the device is judged from the wake word: for example, when the voice wake word spoken by the user is received as "Xiaodu", whether to wake the intelligent terminal device is judged from that wake word.

S102: when the judgment result is that the wake word wakes the device, determining the total duration of the wake word.

Specifically, in this embodiment of the application, when the judgment result is that the intelligent terminal device is woken by the wake word "Xiaodu" (小度), the total duration of the wake word is determined. Illustratively, this is the time from the start of the first character "Xiao" to the end of the last character "du". How the total duration of the wake word is determined is described below with specific steps.

S103: determining the rear endpoint duration of the current voice endpoint detection according to the total duration of the wake word.

Specifically, in this embodiment of the application, the rear endpoint duration is determined from, for example, the total duration of the wake word measured from the start of the character "Xiao" to the end of the character "du". How the rear endpoint duration of the current voice endpoint detection is determined is described below with specific steps.

S104: acquiring voice, and setting the cloud rear endpoint detection duration equal to the rear endpoint duration of the current voice endpoint detection.

Specifically, in this embodiment of the application, after the intelligent terminal device is woken by the wake word, it acquires the voice spoken by the user, such as "weather forecast for Beijing", "open the curtain", or "play the song 'Legend' by Wang Fei". After the voice is acquired, the cloud rear endpoint detection duration is set equal to the rear endpoint duration of the current voice endpoint detection.

It should be understood that a user's speech rate is almost the same when speaking the wake word and when speaking the actual voice command. Using the speech rate of the wake word to dynamically adjust the cloud sentence-breaking node therefore effectively addresses the inconsistency of speech rates among different users.

S105: sending the audio corresponding to the voice to the cloud, where the cloud determines the semantics corresponding to the voice according to the cloud rear endpoint detection duration.

Specifically, in this embodiment of the application, the audio corresponding to voices such as "weather forecast for Beijing", "open the curtain", or "play the song 'Legend' by Wang Fei" is sent to the cloud. The cloud first determines the sentence-breaking nodes and then determines the semantics corresponding to the voice from the voice after sentence breaking.
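As a rough illustration, steps S101 to S105 can be sketched as a single function. Everything named here (`WakeEvent`, `run_interaction`, the callables passed in, and the placeholder rule `0.3 * total`) is hypothetical and stands in for the wake-up engine, recorder and cloud service, none of which the application specifies at this level.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class WakeEvent:
    """Hypothetical result reported by a local voice wake-up engine."""
    text: str
    start_s: float
    end_s: float


def run_interaction(
    wake: WakeEvent,
    preset_wake_word: str,
    rear_endpoint_for: Callable[[float], float],
    record_audio: Callable[[], bytes],
    cloud_recognize: Callable[[bytes, float], str],
) -> Optional[str]:
    # S101: wake the device only if the wake word matches the preset one.
    if wake.text != preset_wake_word:
        return None
    # S102: total duration of the wake word.
    total_s = wake.end_s - wake.start_s
    # S103: rear endpoint duration for the current interaction.
    rear_endpoint_s = rear_endpoint_for(total_s)
    # S104: acquire the command audio; the cloud is given the same duration.
    audio = record_audio()
    # S105: the cloud breaks the sentence and returns the semantics.
    return cloud_recognize(audio, rear_endpoint_s)


# Stubs only: the scaling rule and the fake cloud response are placeholders.
result = run_interaction(
    WakeEvent("xiaodu", 0.0, 0.8),
    "xiaodu",
    rear_endpoint_for=lambda total: 0.3 * total,
    record_audio=lambda: b"\x00\x01",
    cloud_recognize=lambda audio, rear: f"semantics(rear_endpoint={rear:.2f}s)",
)
print(result)
```

Passing the collaborators in as callables keeps the sketch independent of any particular wake-up engine or cloud API.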

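Further, in the method for determining speech semantics based on wake word speech rate, before determining that the device is in a to-be-woken state, the method further includes:

starting the application program corresponding to the device and initializing the voice-related engine.

Specifically, in this embodiment of the application, before the intelligent terminal device is determined to be in the to-be-woken state, the application program corresponding to the intelligent terminal device is started and the voice-related engine is initialized, in preparation for waking the device with the wake word.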

Further, in the method for determining speech semantics based on wake word speech rate, after the cloud determines the semantics corresponding to the voice according to the cloud rear endpoint detection duration, the method further includes:

acquiring the semantics of the voice from the cloud.

Specifically, in this embodiment of the application, once the cloud has determined the sentence-breaking nodes and then the semantics corresponding to the voice after sentence breaking, the intelligent terminal device can obtain those semantics directly from the cloud.

Further, in the method for determining speech semantics based on wake word speech rate, judging from the wake word whether to wake the device includes:

judging whether the wake word is consistent with a preset wake word, and determining to wake the device when the judgment result is that the wake word is consistent with the preset wake word;

and not waking the device when the judgment result is that the wake word is inconsistent with the preset wake word.

Specifically, in this embodiment of the application, whether to wake the intelligent terminal device is judged from the received wake word "Xiaodu" (小度): if the preset wake word is also "Xiaodu", the intelligent terminal device is woken; if the preset wake word is a different phrase, such as "kid your good", the intelligent terminal device is not woken.

Fig. 2 is a schematic diagram illustrating a second method for determining speech semantics based on a wake word speed according to an embodiment of the present application.

Further, in the method for determining speech semantics based on wake word speech rate, determining the total duration of the wake word includes, with reference to Fig. 2, two steps S201 and S202:

S201: determining the start time and the end time of the wake word by means of the voice wake-up engine.

S202: determining the total duration of the wake word according to the start time and the end time of the wake word.

Specifically, in this embodiment of the application, after the intelligent terminal device is woken by the wake word "Xiaodu" (小度), the voice wake-up engine determines the start time and the end time of "Xiaodu", and the total duration of the wake word is then determined from those two times.

Fig. 3 is a third schematic diagram of a method for determining a speech semantic based on a word rate of a wakeup word according to an embodiment of the present application.

Further, in the method for determining speech semantics based on wake word speech rate, determining the rear endpoint duration of the current voice endpoint detection according to the total duration of the wake word includes:

acquiring a wake-up record of the wake word, and determining from the wake-up record the average duration of the wake word and the rear endpoint duration of voice endpoint detection corresponding to that average duration;

and determining the rear endpoint duration of the current voice endpoint detection according to the average duration of the wake word, the rear endpoint duration of voice endpoint detection corresponding to the average duration, and the total duration of the wake word.

Specifically, in this embodiment of the application, the wake-up record may be the history of the multiple times the wake word has woken the corresponding intelligent terminal device. From this history, the average duration of the wake word and the rear endpoint duration of voice endpoint detection corresponding to that average duration are determined. For example, if the average duration of the wake word "Xiaodu" (小度) is 1 s, the rear endpoint duration of voice endpoint detection corresponding to the average duration is 0.3 s, and the total duration of the current wake word "Xiaodu" is 0.8 s, then the rear endpoint duration of the current voice endpoint detection is calculated from these three values.
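The application names the three inputs to this calculation but does not state the formula. The sketch below assumes simple linear scaling (baseline rear endpoint duration multiplied by the ratio of current to average wake word duration); that rule, the function name `current_rear_endpoint_s`, and the sample history are assumptions made purely for illustration.

```python
from statistics import mean
from typing import Sequence


def current_rear_endpoint_s(
    wake_durations_s: Sequence[float],
    rear_endpoint_at_average_s: float,
    current_total_s: float,
) -> float:
    """Scale the baseline rear endpoint duration by current / average wake word duration.

    ASSUMPTION: linear scaling. The source lists the inputs but not the formula.
    """
    if not wake_durations_s:
        raise ValueError("wake-up record is empty")
    average_s = mean(wake_durations_s)
    if average_s <= 0 or current_total_s <= 0:
        raise ValueError("durations must be positive")
    return rear_endpoint_at_average_s * current_total_s / average_s


# Hypothetical wake-up record whose average duration is 1 s, as in the example above.
history = [0.9, 1.0, 1.1]
print(round(current_rear_endpoint_s(history, 0.3, 0.8), 3))
```

Under this assumed rule, the values from the example (1 s average, 0.3 s baseline, 0.8 s current) give a rear endpoint duration of about 0.24 s, so a faster speaker gets a shorter silence timeout.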

Further, in the method for determining speech semantics based on wake word speech rate, the cloud determining the semantics of the voice according to the cloud rear endpoint detection duration includes, with reference to Fig. 3, two steps S301 and S302:

S301: determining the voice endpoint according to the cloud rear endpoint detection duration.

S302: determining the semantics of the voice by means of automatic speech recognition and natural language understanding.

Specifically, in this embodiment of the application, the cloud sentence-breaking node is dynamically adjusted according to the wake word speech rate, which effectively addresses the inconsistency of speech rates among different users and realizes dynamic sentence breaking. The semantics of the voice are then determined using automatic speech recognition (ASR) and natural language understanding (NLU). The intelligent terminal device can obtain the semantics corresponding to the voice directly from the cloud and act on them, for example by opening a curtain or playing a song by Wang Fei.
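Step S301 can be pictured as a trailing-silence check over per-frame speech decisions. The sketch below assumes 10 ms frames and a precomputed list of speech/non-speech flags; the function name `find_speech_endpoint` and the frame layout are hypothetical, and the ASR and NLU stages of S302 are left out.

```python
from typing import Optional, Sequence


def find_speech_endpoint(
    is_speech: Sequence[bool],
    rear_endpoint_s: float,
    frame_s: float = 0.01,
) -> Optional[int]:
    """Return the index of the frame at which the utterance is declared finished.

    The endpoint is the first frame where the silence following speech has
    lasted rear_endpoint_s. Returns None if no endpoint has been reached.
    """
    needed = max(1, round(rear_endpoint_s / frame_s))
    seen_speech = False
    silent_run = 0
    for i, speech in enumerate(is_speech):
        if speech:
            seen_speech = True
            silent_run = 0
        elif seen_speech:
            silent_run += 1
            if silent_run >= needed:
                return i
    return None


# 0.5 s of speech, a 0.26 s pause, 0.4 s of speech, then 0.4 s of silence.
frames = [True] * 50 + [False] * 26 + [True] * 40 + [False] * 40
print(find_speech_endpoint(frames, 0.24))  # 73: the first pause ends the utterance
print(find_speech_endpoint(frames, 0.30))  # 145: the first pause is tolerated
```

The same audio is cut at the first pause with a 0.24 s rear endpoint duration but runs to the final silence with 0.30 s, which is how the duration chosen from the wake word moves the sentence-breaking node.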

Fig. 4 is a schematic diagram of an apparatus for determining speech semantics based on a wake word speed according to an embodiment of the present application.

In a second aspect, with reference to Fig. 4, an embodiment of the present application further provides an apparatus for determining speech semantics based on wake word speech rate, including:

The first determination module 401: configured to determine that the device is in a to-be-woken state, acquire a wake word, and judge from the wake word whether to wake the device.

Specifically, in this embodiment of the application, the device may be any intelligent terminal device, such as a Tmall Genie. The first determination module 401 determines that the device is in a to-be-woken state, which may be a state in which the intelligent terminal device is powered on and can be woken when it receives the corresponding wake word. The wake word may be set by the user according to personal preference or preset by the manufacturer, for example "Xiaodu" (小度). Whether to wake the device is judged from the wake word: for example, when the voice wake word spoken by the user is received as "Xiaodu", whether to wake the intelligent terminal device is judged from that wake word.

The second determination module 402: configured to determine the total duration of the wake word when the judgment result is that the wake word wakes the device.

Specifically, in this embodiment of the application, when the judgment result is that the intelligent terminal device is woken by the wake word "Xiaodu", the second determination module 402 determines the total duration of the wake word. Illustratively, this is the time from the start of the first character "Xiao" to the end of the last character "du". How the total duration of the wake word is determined is described in the specific steps of the method.

The third determination module 403: configured to determine the rear endpoint duration of the current voice endpoint detection according to the total duration of the wake word.

Specifically, in this embodiment of the application, the third determination module 403 determines the rear endpoint duration from, for example, the total duration of the wake word measured from the start of the character "Xiao" to the end of the character "du". How the rear endpoint duration of the current voice endpoint detection is determined is described in the specific steps of the method.

The acquisition module 404 and the setting module 405: configured to acquire voice and to set the cloud rear endpoint detection duration equal to the rear endpoint duration of the current voice endpoint detection.

Specifically, in this embodiment of the application, after the intelligent terminal device is woken by the wake word, the acquisition module 404 acquires the voice spoken by the user, such as "weather forecast for Beijing", "open the curtain", or "play the song 'Legend' by Wang Fei". After the voice is acquired, the setting module 405 sets the cloud rear endpoint detection duration equal to the rear endpoint duration of the current voice endpoint detection.

The sending module 406 and the fourth determination module 407: configured to send the audio corresponding to the voice to the cloud, where the cloud determines the semantics corresponding to the voice according to the cloud rear endpoint detection duration.

Specifically, in this embodiment of the application, the sending module 406 sends the audio corresponding to voices such as "weather forecast for Beijing", "open the curtain", or "play the song 'Legend' by Wang Fei" to the cloud. The fourth determination module 407 in the cloud first determines the sentence-breaking nodes and then determines the semantics corresponding to the voice from the voice after sentence breaking.

Fig. 5 is a schematic block diagram of an electronic device provided by an embodiment of the disclosure.

As shown in Fig. 5, the electronic device includes at least one processor 501, at least one memory 502, and at least one communication interface 503. The components of the electronic device are coupled together by a bus system 504. The communication interface 503 is used for information transmission with external devices. It is understood that the bus system 504 enables communication among the components. In addition to a data bus, the bus system 504 includes a power bus, a control bus, and a status signal bus. For clarity of illustration, the various buses are all labeled as bus system 504 in Fig. 5.

It will be appreciated that the memory 502 in this embodiment can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory.

In some embodiments, the memory 502 stores the following elements, executable units or data structures, or a subset or extended set thereof: an operating system and application programs.

The operating system includes various system programs, such as a framework layer, a core library layer, and a driver layer, and is used to implement various basic services and process hardware-based tasks. The application programs include various applications, such as a media player and a browser, and are used to implement various application services. A program implementing any one of the methods for determining speech semantics based on wake word speech rate provided by the embodiments of the present application may be included in an application program.

In this embodiment of the present application, the processor 501 calls a program or instructions stored in the memory 502, specifically a program or instructions stored in an application program, in order to execute the steps of the embodiments of the method for determining speech semantics based on wake word speech rate provided by the present application.

Any one of the methods for determining speech semantics based on wake word speech rate provided by the embodiments of the present application may be applied to, or implemented by, the processor 501. The processor 501 may be an integrated circuit chip with signal processing capability. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 501 or by instructions in the form of software. The processor 501 may be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. A general-purpose processor may be a microprocessor or any conventional processor.

The steps of any one of the methods for determining speech semantics based on wake word speech rate provided by the embodiments of the present application may be performed directly by a hardware decoding processor, or by a combination of hardware and software units in the decoding processor. The software units may be located in RAM, flash memory, ROM, PROM, EPROM, registers, or other storage media well known in the art. The storage medium is located in the memory 502; the processor 501 reads the information in the memory 502 and completes the steps of the method for determining speech semantics based on wake word speech rate in combination with its hardware.

Those skilled in the art will appreciate that, although some embodiments described herein include certain features that are included in other embodiments but not others, combinations of features of different embodiments are meant to be within the scope of the application and form different embodiments.

Those skilled in the art will appreciate that the description of each embodiment has a respective emphasis, and reference may be made to the related description of other embodiments for those parts of an embodiment that are not described in detail.

While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for determining speech semantics based on wake word speech rate, comprising:

and sending the audio corresponding to the voice to the cloud end, and determining the semantic meaning corresponding to the voice by the cloud end according to the detection duration of the rear endpoint of the cloud end.

2. The method for determining speech semantics based on wake word speech rate according to claim 1, wherein before determining that the device is in a to-be-woken state, the method further comprises:

starting an application program corresponding to the device and initializing a voice-related engine.

3. The method according to claim 1, wherein after the cloud determines the semantics corresponding to the voice according to the cloud rear endpoint detection duration, the method further comprises:

acquiring the semantics of the voice from the cloud.

4. The method for determining speech semantics of claim 1, wherein said judging from the wake word whether to wake the device comprises:

and not waking the device when the judgment result is that the wake word is inconsistent with the preset wake word.

5. The method for determining speech semantics based on wake word speech rate according to claim 1, wherein determining the total duration of the wake word comprises:

and determining the total duration of the wake word according to the start time and the end time of the wake word.

6. The method according to claim 1, wherein determining the rear endpoint duration of the current voice endpoint detection according to the total duration of the wake word comprises:

and determining the rear endpoint duration of the current voice endpoint detection according to the average duration of the wake word, the rear endpoint duration of voice endpoint detection corresponding to the average duration, and the total duration of the wake word.

7. The method of claim 1, wherein the cloud determining the semantics of the voice according to the cloud rear endpoint detection duration comprises:

8. An apparatus for determining speech semantics based on wake word speech rate, comprising:

an acquisition module and a setting module, configured to acquire voice and to set the cloud rear endpoint detection duration equal to the rear endpoint duration of the current voice endpoint detection;

a sending module and a fourth determination module, configured to send the audio corresponding to the voice to the cloud, where the cloud determines the semantics corresponding to the voice according to the cloud rear endpoint detection duration.

9. An electronic device, comprising: a processor and a memory;

the processor is configured to execute the method for determining speech semantics based on wake word speech rate according to any one of claims 1 to 7 by calling a program or instructions stored in the memory.

10. A computer-readable storage medium storing a program or instructions for causing a computer to perform the method for determining speech semantics based on wake word speech rate according to any one of claims 1 to 7.