CN115762502A - Wake-up-free voice recognition method, device, equipment and storage medium - Google Patents


Info

Publication number
CN115762502A
CN115762502A
Authority
CN
China
Prior art keywords: voice, event, information, determining, time
Prior art date
Legal status: Pending
Application number
CN202210887096.2A
Other languages
Chinese (zh)
Inventor
何川延
Current Assignee
Huizhou Desay SV Automotive Co Ltd
Original Assignee
Huizhou Desay SV Automotive Co Ltd
Application filed by Huizhou Desay SV Automotive Co Ltd filed Critical Huizhou Desay SV Automotive Co Ltd
Priority to CN202210887096.2A
Publication of CN115762502A

Landscapes

  • User Interface Of Digital Computer (AREA)

Abstract

The invention discloses a wake-up-free voice recognition method, device, equipment, and storage medium. The method comprises the following steps: acquiring visual information of a user through a visual sensor, and determining the gaze start-stop time corresponding to a gaze fixation event according to the visual information; acquiring voice information of the user through a voice sensor, and determining the voice detection start-stop time corresponding to a voice detection event according to the voice information; determining the time coincidence degree of the gaze fixation event and the voice detection event according to the gaze start-stop time and the voice detection start-stop time; and performing voice recognition on the voice information according to the time coincidence degree to determine a voice recognition result. A wake-up-free voice recognition method is thereby realized, improving the user experience.

Description

Wake-up-free voice recognition method, device, equipment and storage medium
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a wake-up-free speech recognition method, device, apparatus, and storage medium.
Background
With the rapid development of artificial intelligence technology, speech recognition plays an important role in the control field of artificial intelligence AI devices such as vehicle-mounted devices.
Currently, before an AI device can be controlled by voice, the user must first speak a wake-up word, or activate the voice function through an interface, key, or icon, before issuing a voice command to start the voice recognition and control functions.
However, when the user is unfamiliar with the voice function of the AI device, or when the voice instruction is urgent, this requirement degrades the user experience and runs counter to the original purpose of voice recognition.
Disclosure of Invention
The invention provides a wake-up-free voice recognition method, device, equipment, and storage medium, which solve the problem that the voice recognition function must be woken up before voice recognition, realize wake-up-free voice recognition, and improve the user experience.
According to an aspect of the present invention, there is provided a wake-up-free speech recognition method, including:
collecting visual information of a user through a visual sensor, and determining the gaze start-stop time corresponding to a gaze fixation event according to the visual information;
acquiring voice information of the user through a voice sensor, and determining the voice detection start-stop time corresponding to a voice detection event according to the voice information;
determining the time coincidence degree of the gaze fixation event and the voice detection event according to the gaze start-stop time and the voice detection start-stop time;
and performing voice recognition on the voice information according to the time coincidence degree to determine a voice recognition result.
According to another aspect of the present invention, there is provided a wake-up-free speech recognition apparatus, comprising:
a visual recognition module, configured to collect visual information of a user through a visual sensor and determine the gaze start-stop time corresponding to a gaze fixation event according to the visual information;
a voice detection module, configured to collect voice information of the user through a voice sensor and determine the voice detection start-stop time corresponding to a voice detection event according to the voice information;
a coincidence degree determining module, configured to determine the time coincidence degree of the gaze fixation event and the voice detection event according to the gaze start-stop time and the voice detection start-stop time;
and a recognition result determining module, configured to perform voice recognition on the voice information according to the time coincidence degree to determine a voice recognition result.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform a wake-up free speech recognition method according to any of the embodiments of the present invention.
According to another aspect of the present invention, there is provided a computer-readable storage medium storing computer instructions for causing a processor to implement a wake-free speech recognition method according to any embodiment of the present invention when executed.
According to the technical solution of the embodiment of the present invention, visual information of a user is collected through a visual sensor, and the gaze start-stop time corresponding to a gaze fixation event is determined according to the visual information; voice information of the user is collected through a voice sensor, and the voice detection start-stop time corresponding to a voice detection event is determined according to the voice information; the time coincidence degree of the gaze fixation event and the voice detection event is determined according to the gaze start-stop time and the voice detection start-stop time; and voice recognition is performed on the voice information according to the time coincidence degree to determine a voice recognition result. This solves the problem that the voice recognition function must be woken up before recognition, realizes wake-up-free voice recognition, and improves the user experience.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present invention, nor are they intended to limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flowchart of a wake-up free speech recognition method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a wake-up free speech recognition method according to a second embodiment of the present invention;
Figs. 3A-3C are schematic diagrams of the time coincidence of a gaze fixation event and a voice detection event;
FIG. 4 is a flowchart of another wake-free speech recognition method according to a second embodiment of the present invention;
fig. 5 is a schematic structural diagram of a wake-up free speech recognition apparatus according to a third embodiment of the present invention;
fig. 6 is a schematic structural diagram of an electronic device implementing a wake-up free speech recognition method according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in other sequences than those illustrated or described herein. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example one
Fig. 1 is a flowchart of a wake-up-free speech recognition method according to an embodiment of the present invention, where the method is applicable to a situation where a speech recognition function of an electronic device is started under a wake-up-free condition to perform speech recognition on a speech uttered by a user, and the method may be executed by a wake-up-free speech recognition device, where the wake-up-free speech recognition device may be implemented in a hardware and/or software manner, and the wake-up-free speech recognition device may be configured in the electronic device. As shown in fig. 1, the method includes:
and S110, acquiring visual information of the user through a visual sensor, and determining sight watching starting and stopping time corresponding to the sight watching event according to the visual information.
The visual sensor may be any device with an image capturing function, such as a camera. The visual information is information collected by the visual sensor that reflects where the user's gaze falls, and may include the area the gaze falls within, the gaze angle, and the like.
It should be noted that the information collected by the visual sensor may include other facial information, gesture information, or behavior information in addition to the visual information. Therefore, a series of preprocessing operations, such as detection, recognition, classification, screening, and filtering, must be performed on the information collected by the visual sensor to determine the visual information. The embodiment of the present invention does not limit the specific method of the above preprocessing operations.
The gaze fixation event refers to an event related to user gaze recognition, such as a fixation start event and a fixation end event corresponding to a target angle (i.e., a target area watched by the user) of the user's gaze. A gaze start event may be considered an event in which the gaze of the user starts to gaze at the target area; the gaze end event may be considered an event of the user ending the gaze target area. The target area may be an area where a control object corresponding to the voice information is located, or may be a preset area, such as a center control screen of the voice recognition device.
Specifically, visual identification is performed on visual information of the user acquired by a visual sensor, a sight gaze event, namely a gaze starting event and a gaze ending event of the user in a sight gaze target area, is determined, and sight gaze starting and ending times corresponding to the gaze starting event and the gaze ending event are obtained.
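As an illustrative sketch only (not part of the claimed method), the gaze start and gaze end events described above could be extracted from a stream of timestamped gaze-angle samples as follows; the sample format, the angle tolerance, and all names are assumptions for this example:

```python
from dataclasses import dataclass

@dataclass
class GazeEvent:
    start: float  # gaze start time (s), from the gaze start event
    end: float    # gaze end time (s), from the gaze end event

def gaze_events(samples, target_angle, tolerance=5.0):
    """Return intervals during which the gaze angle stays near target_angle.

    samples: list of (timestamp, gaze_angle) pairs in time order.
    """
    events, start = [], None
    for t, angle in samples:
        on_target = abs(angle - target_angle) <= tolerance
        if on_target and start is None:
            start = t                               # gaze start event
        elif not on_target and start is not None:
            events.append(GazeEvent(start, t))      # gaze end event
            start = None
    if start is not None:                           # gaze still ongoing at the last sample
        events.append(GazeEvent(start, samples[-1][0]))
    return events
```

A real system would apply the preprocessing (detection, classification, filtering) mentioned above before this step; the sketch assumes clean angle samples.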
And S120, acquiring voice information of the user through the voice sensor, and determining voice detection start-stop time corresponding to the voice detection event according to the voice information.
The voice sensor may be any device with a sound collection function, such as a microphone. The voice information is voice instruction information issued by the user and collected by the voice sensor, such as "call Wang XX" or "raise the air-conditioner temperature".
It should be noted that the audio collected by the voice sensor may include ambient noise or stretches of blank audio in addition to the voice instruction information issued by the user. Therefore, a series of preprocessing operations such as denoising, screening, and filtering need to be performed on the collected audio to determine the voice information; the voice information may also be classified to determine the control object it corresponds to. The embodiment of the present invention does not limit the specific method of the above preprocessing operations.
The voice detection event refers to an event formed by performing voice audio detection on received voice sent by a user, and may include, for example: a speech detection start event and a speech detection end event. A voice detection start event may be considered as an event in which a user is detected to start uttering voice audio; the voice detection end event may be considered as an event that detects that the user stops uttering voice audio.
Specifically, voice detection is performed on voice information of a user collected by a voice sensor, voice detection events, namely a voice detection start event and a voice detection end event, are determined, and voice detection start and stop times corresponding to the voice detection start event and the voice detection end event are obtained.
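For illustration only, a naive energy-based voice activity detector over fixed-length frames could produce the voice detection start and end events described above; a production system would use a proper VAD model, and the frame duration, energy threshold, and names here are assumed values:

```python
def voice_events(frame_energies, frame_dur=0.02, energy_threshold=0.01):
    """Return (start, end) speech intervals from per-frame RMS energies.

    frame_energies: list of non-negative floats, one per audio frame.
    """
    events, start = [], None
    for i, energy in enumerate(frame_energies):
        t = i * frame_dur
        if energy > energy_threshold and start is None:
            start = t                       # voice detection start event
        elif energy <= energy_threshold and start is not None:
            events.append((start, t))       # voice detection end event
            start = None
    if start is not None:                   # speech still ongoing at the last frame
        events.append((start, len(frame_energies) * frame_dur))
    return events
```

The "preset duration of silence" end-of-speech rule mentioned later in the text could be added by requiring several consecutive below-threshold frames before emitting the end event.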
And S130, determining the time coincidence degree of the gaze fixation event and the voice detection event according to the gaze start-stop time and the voice detection start-stop time.
The time coincidence degree of the sight-line watching event and the voice detection event is used for representing the coincidence degree of two events of sending a voice command and watching a sight line to a target area in time. For example, an event that the user utters a voice "open left window" and an event that the user's gaze is gazing at the left window, if the two events have a coincidence in time, it means that the user utters the voice "open left window" while the user's gaze is also gazing at the left window.
Specifically, according to the gaze start-stop time corresponding to the gaze fixation event and the voice detection start-stop time corresponding to the voice detection event, the time coincidence degree of the two events can be determined; the time coincidence degree reflects whether the user's voice and gaze are synchronous.
And S140, performing voice recognition on the voice information according to the time coincidence degree to determine a voice recognition result.
Specifically, if the time coincidence degree of the gaze fixation event and the voice detection event reaches a certain threshold, it may be determined that the voice information uttered by the user is consistent with the target area at which the gaze is directed, and that this voice information is the voice information to be recognized; the voice recognition function is therefore started, and voice recognition is performed on the voice information to determine the voice recognition result. If the time coincidence degree does not reach the threshold, the voice information uttered by the user is not consistent with the gazed target area and is not voice information that needs to be recognized, so the device can remain silent without feedback.
Therefore, when a user needs to send a voice instruction, the user does not need to send a preset awakening instruction first, and only needs to watch a control object of the voice instruction while sending the voice instruction, so that awakening-free voice recognition can be realized.
It should be noted that the wake-up-free speech recognition method provided by the embodiment of the present invention can be used in combination with any one or more existing speech recognition methods, enriching the speech recognition functions of the device, meeting user requirements in different usage scenarios, and improving the experience of voice-recognition control. Application scenarios of the method include, but are not limited to, the control of vehicle-mounted equipment; the method can also be used for the control of smart-home or intelligent terminal devices.
The embodiment of the present invention collects visual information of a user through a visual sensor and determines the gaze start-stop time corresponding to a gaze fixation event according to the visual information; collects voice information of the user through a voice sensor and determines the voice detection start-stop time corresponding to a voice detection event according to the voice information; determines the time coincidence degree of the gaze fixation event and the voice detection event according to the two start-stop times; and performs voice recognition on the voice information according to the time coincidence degree to determine a voice recognition result. This solves the problem that the voice recognition function must be woken up before recognition, realizes wake-up-free voice recognition based on visual and voice information, enhances the convenience of voice control, and improves the user experience.
Optionally, the determining, according to the visual information, the gaze fixation start-stop time corresponding to the gaze fixation event in step S110 includes:
s111, performing visual identification on visual information to determine a sight line watching event, wherein the sight line watching event comprises: when the sight angle of the user is a target angle, a corresponding gazing start event and a corresponding gazing end event are obtained, and the target angle is an angle of a control object corresponding to the voice information;
the user sight angle is an angle between a sight line of the user and a reference line, and the user sight angle can be used for representing an area watched by the sight line. The target angle is an angle between the execution object corresponding to the voice message and the base station line, and the target angle may be a specific value or a preset range. The target angle of each control object may be determined according to an actual scene, and the embodiment of the present invention is not limited.
For example, in a vehicle cabin scenario, with the direction from the driver's position toward the right window taken as the reference line: if classifying the voice information determines that the control object is the left window, the target angle is 180 degrees; if the control object is the vehicle-mounted host, the target angle is 15-30 degrees.
Specifically, the visual information is visually recognized to determine a user gaze angle, if the user gaze angle is a target angle corresponding to a control object corresponding to the voice information, the user is considered to be gazing at the control object corresponding to the voice information, an event that the user starts to gazing at the control object is determined as a gazing start event, and an event that the user finishes gazing at the control object is determined as a gazing end event.
And S112, acquiring the gazing starting time corresponding to the gazing starting event and the gazing ending time corresponding to the gazing ending event.
Specifically, a gazing start time corresponding to a gazing start event formed by a user starting to gaze the control object and a gazing end time corresponding to a gazing end event formed by a user ending to gaze the control object are obtained. The time period for which the user gazes at the control object may be determined from the gaze start time and the gaze end time.
Optionally, the determining, in step S120, the voice detection start-stop time corresponding to the voice detection event according to the voice information includes:
s121, carrying out voice detection on the voice information to determine a voice detection event, wherein the voice detection event comprises: a speech detection start event and a speech detection end event.
Specifically, when voice information uttered by the user is received, detection of the voice information begins: the event in which the user is detected to start uttering voice information is determined as the voice detection start event, and the event in which the voice information is detected to have stopped for a preset duration is determined as the voice detection end event.
And S122, acquiring voice detection starting time corresponding to the voice detection starting event and voice detection ending time corresponding to the voice detection ending event.
Specifically, the voice detection start time corresponding to a voice detection start event formed by detecting that a user starts to send voice information and the voice detection end time corresponding to a voice detection end event formed by detecting that a user ends to send voice information are obtained. The time period during which the user utters the voice can be determined based on the voice detection start time and the voice detection end time.
Example two
Fig. 2 is a flowchart of a wake-up free speech recognition method according to a second embodiment of the present invention, and this embodiment further details steps S130 and S140 in the above-described embodiment. As shown in fig. 2, the method includes:
s210, visual information of the user is collected through a visual sensor, and sight watching starting and stopping time corresponding to the sight watching event is determined according to the visual information.
S220, voice information of the user is collected through the voice sensor, and voice detection starting and stopping time corresponding to the voice detection event is determined according to the voice information.
And S230, determining the length of the coincidence time of the sight line watching event and the voice detection event according to the sight line watching start-stop time and the voice detection start-stop time.
Illustratively, as shown in Fig. 3A, the gaze start-stop times are Te1 to Te2 and the voice detection start-stop times are Tv1 to Tv2, with Te1 < Tv1 < Tv2 < Te2; the coincidence time length of the gaze fixation event and the voice detection event is Tv2-Tv1. As shown in Fig. 3B, the gaze start-stop times are Te3 to Te4 and the voice detection start-stop times are Tv1 to Tv2, with Tv1 < Te3 < Te4 < Tv2; the coincidence time length is Te4-Te3. As shown in Fig. 3C, the gaze start-stop times are Te5 to Te6 and Te7 to Te8, and the voice detection start-stop times are Tv1 to Tv2, with Te5 < Tv1 < Te6 < Te7 < Tv2 < Te8; the coincidence time length is (Te6-Tv1) + (Tv2-Te7).
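All three cases above reduce to summing the per-interval intersections of the gaze intervals with the voice interval. A minimal sketch (the interval representation and names are assumptions):

```python
def overlap_length(gaze_intervals, voice_interval):
    """Total time during which any gaze interval overlaps the voice interval.

    gaze_intervals: list of (start, end) pairs; voice_interval: one (start, end) pair.
    """
    v1, v2 = voice_interval
    total = 0.0
    for g1, g2 in gaze_intervals:
        # Intersection of [g1, g2] with [v1, v2]; zero if they are disjoint.
        total += max(0.0, min(g2, v2) - max(g1, v1))
    return total
```

With one gaze interval fully covering the voice interval this yields Tv2-Tv1 (Fig. 3A); with the gaze inside the voice interval it yields Te4-Te3 (Fig. 3B); with two partial gaze intervals it yields (Te6-Tv1) + (Tv2-Te7) (Fig. 3C).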
S240, determining the voice time length according to the voice detection start-stop time, and determining the ratio of the coincidence time length to the voice time length as the time coincidence degree.
Illustratively, as shown in Figs. 3A and 3B, the voice time length is the difference between the voice detection end time and the voice detection start time, i.e., Tv2-Tv1. For a continuous gaze without a break, the time coincidence degree of the gaze fixation event and the voice detection event shown in Fig. 3A is (Tv2-Tv1)/(Tv2-Tv1) = 100%, and that shown in Fig. 3B is (Te4-Te3)/(Tv2-Tv1). For a gaze with a break, the time coincidence degree shown in Fig. 3C is ((Te6-Tv1) + (Tv2-Te7))/(Tv2-Tv1).
And S250, if the time coincidence degree is greater than a first preset threshold value, performing voice recognition on the voice information to determine a voice recognition result.
The first preset threshold may be understood as an expected time overlap ratio, and may be set according to an actual requirement, or may be set by a user according to an actual usage scenario. The embodiments of the present invention are not limited thereto.
Specifically, if the time coincidence degree is greater than the first preset threshold, it is determined that the voice information sent by the user and the control object of the sight direction noted by the sight line are consistent, indicating that the voice information sent by the user needs to be recognized. Thus, speech recognition is performed on the speech information to determine a speech recognition result. The speech recognition result can be the semantic intention of the user, and can also be semantic recognition feedback information, such as 'not clear the user intention, please say once again' and the like.
For example, if the first preset threshold is set to 80%: the time coincidence degree of the gaze fixation event and the voice detection event shown in Fig. 3A is 100%, so voice recognition is performed on the voice information to determine a voice recognition result; the time coincidence degree shown in Fig. 3B is, say, 70%, so the voice information is not recognized, silence is maintained, and further voice and visual input is awaited.
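The gating described in S230-S250 can be sketched as follows: the time coincidence degree is the overlap time divided by the voice duration, compared against the first preset threshold (0.8 here, matching the 80% example; all names are illustrative):

```python
def should_recognize(gaze_intervals, voice_interval, threshold=0.8):
    """Return (coincidence_degree, recognize?) for one voice utterance."""
    v1, v2 = voice_interval
    # Coincidence time length: sum of gaze/voice interval intersections.
    overlap = sum(max(0.0, min(g2, v2) - max(g1, v1))
                  for g1, g2 in gaze_intervals)
    coincidence = overlap / (v2 - v1)       # overlap time / voice time length
    return coincidence, coincidence > threshold
```

If the flag is False, the device keeps silent and waits for new input, as described above.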
The embodiment of the present invention collects visual information of a user through a visual sensor and determines the gaze start-stop time corresponding to a gaze fixation event according to the visual information; collects voice information of the user through a voice sensor and determines the voice detection start-stop time corresponding to a voice detection event according to the voice information; determines the coincidence time length of the two events according to the two start-stop times; determines the voice time length according to the voice detection start-stop time; determines the ratio of the coincidence time length to the voice time length as the time coincidence degree; and, if the time coincidence degree is greater than the first preset threshold, performs voice recognition on the voice information to determine a voice recognition result. A wake-up-free voice recognition method is thus realized based on visual and voice information, enhancing the convenience of voice control and improving the user experience.
Optionally, performing voice recognition on the voice information to determine a voice recognition result includes:
s231, carrying out voice recognition on the voice information to determine semantic intentions and determining semantic understanding indexes of the semantic intentions.
The semantic understanding index can be understood as the certainty of the semantic intention determined after voice recognition is performed on the user's voice information. Its value range may be 0-1: the higher the index, the more accurate the semantic understanding, and vice versa.
Specifically, any semantic recognition mode may be adopted to perform speech recognition on the speech information to determine the semantic intention, which is not limited in the embodiment of the present invention, for example, a pre-established semantic recognition model is used to perform semantic recognition, or deep learning is used to perform iterative training and recognition. After the semantic intention corresponding to the voice information is determined, the semantic understanding index is determined according to the semantic intention obtained through recognition, and the method for determining the semantic understanding index is not limited in the embodiment of the invention.
Illustratively, if the determined semantic intentions are unique, the semantic understanding index is 1, and if the determined semantic intentions are N, the semantic understanding index is 1/N.
Illustratively, if the voice message sent by the user is "to a place", the semantic intention is "navigate to a place", and if the query is unique to a place, the semantic understanding index is 1.
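The 1/N rule in the example above can be sketched directly; the candidate-intent list would come from an NLU component, which is outside this sketch:

```python
def semantic_understanding_index(candidate_intents):
    """1 for a unique intent, 1/N for N candidate intents, 0 for none."""
    n = len(candidate_intents)
    return 1.0 / n if n else 0.0
```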
S232, if the semantic understanding index is larger than or equal to a second preset threshold value, determining the semantic intention as a voice recognition result; and if the semantic understanding index is smaller than a second preset threshold value, using preset information corresponding to the semantic understanding index as a voice recognition result.
The second preset threshold may be set according to a requirement, and the embodiment of the present invention does not limit this.
Specifically, if the semantic understanding index is greater than or equal to the second preset threshold, the semantic intention obtained through recognition can be considered to be an accurate intention corresponding to the voice uttered by the user, and therefore, the semantic intention is determined as a voice recognition result. If the semantic understanding index is smaller than the second preset threshold, the semantic intention obtained by recognition can be considered to be ambiguous, and therefore the preset information corresponding to the semantic understanding index can be used as a voice recognition result.
For example, the preset information corresponding to each range of the semantic understanding index may be configured in advance: when the semantic understanding index falls in the range 0 to 0.3, the preset information is "Your instruction was not heard clearly, please say it again"; when the index falls in the range 0.3 to 0.7 (with the second preset threshold set to 0.7), the preset information asks the user whether the intention is the most frequently used intention or the most recent historical intention.
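Under those illustrative ranges (second preset threshold 0.7), selecting the voice recognition result could be sketched as follows; the prompt strings and the 0.3 boundary are taken from the example above, and everything else is an assumption.

```python
def select_recognition_result(semantic_intent, index, second_threshold=0.7):
    """Map the semantic understanding index to a voice recognition result."""
    if index >= second_threshold:
        # unambiguous: the semantic intention itself is the result
        return semantic_intent
    if index >= 0.3:
        # moderately ambiguous: confirm against the most frequently used
        # or most recent historical intention
        return "confirm: most frequently used or most recent intention?"
    # highly ambiguous: ask the user to repeat the instruction
    return "Your instruction was not heard clearly, please say it again"
```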
Optionally, after performing speech recognition on the speech information to determine a speech recognition result, the method further includes:
if the voice recognition result is the semantic intention, determining a control instruction according to the semantic intention;
and executing corresponding control operation according to the control instruction.
Specifically, after the semantic intention of the voice information uttered by the user is determined by performing speech recognition according to the voice information and the visual information, a control instruction is determined from the semantic intention, and the corresponding control operation is executed according to the control instruction, thereby realizing wake-up-free voice control.
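A sketch of this dispatch step is shown below; the intention strings and instruction names in the table are hypothetical placeholders, not values defined by the embodiment.

```python
# Hypothetical mapping from semantic intentions to control instructions.
COMMANDS = {
    "navigate to place A": "NAVIGATION_START",
    "open the window": "WINDOW_OPEN",
}

def dispatch(recognition_result):
    """Determine the control instruction for a recognition result; returns
    None when the result is preset prompt text rather than a semantic
    intention, so no control operation is performed."""
    instruction = COMMANDS.get(recognition_result)
    if instruction is not None:
        # A real head unit would drive the corresponding actuator here.
        return instruction
    return None
```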
As shown in fig. 4, the specific steps of the embodiment of the present invention are as follows. Visual information of the user is acquired through a visual sensor, and voice information of the user is acquired through a voice sensor. A sight line watching event is determined according to the visual information, and a voice detection event is determined according to the voice information. It is then judged whether the sight angle in the sight line watching event is the angle of the execution object corresponding to the voice information: if so, the sight line watching start-stop time corresponding to the sight line watching event and the voice detection start-stop time corresponding to the voice detection event are acquired; if not, no operation is executed. After the sight line watching start-stop time and the voice detection start-stop time are acquired, the time overlap ratio of the sight line watching event and the voice detection event is determined, and it is judged whether the time overlap ratio is greater than a first preset threshold: if so, wake-up-free voice recognition is started, voice recognition is performed on the voice information to determine the semantic intention, and the semantic understanding index of the semantic intention is determined; if not, no operation is executed. After the semantic understanding index is determined, it is judged whether the semantic understanding index is greater than or equal to a second preset threshold: if so, the control instruction corresponding to the semantic intention is determined and the control operation corresponding to the control instruction is executed, realizing wake-up-free voice control; if not, no operation is executed.
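The gating logic of this flow can be condensed into a small sketch; the threshold values (0.6 and 0.7) and the placeholder prompt string are assumptions, since both thresholds are left configurable in this embodiment.

```python
def wake_free_decision(time_overlap, candidate_intents,
                       first_threshold=0.6, second_threshold=0.7):
    """Gate wake-up-free recognition on the gaze/voice time overlap, then
    resolve the result from the semantic understanding index (1/N)."""
    if time_overlap <= first_threshold:
        return None                       # no wake-up: perform no operation
    if not candidate_intents:
        return None                       # nothing recognized
    index = 1.0 / len(candidate_intents)  # 1 for a unique intention
    if index >= second_threshold:
        return candidate_intents[0]       # the intention is the result
    return "preset prompt for an ambiguous intention"
```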
EXAMPLE III
Fig. 5 is a schematic structural diagram of a wake-up free speech recognition device according to a third embodiment of the present invention. As shown in fig. 5, the apparatus includes: a visual recognition module 310, a voice detection module 320, an overlap ratio determination module 330, and a recognition result determination module 340;
the visual identification module 310 is configured to acquire visual information of a user through a visual sensor, and determine a sight watching start-stop time corresponding to a sight watching event according to the visual information;
the voice detection module 320 is configured to acquire voice information of a user through a voice sensor, and determine a voice detection start-stop time corresponding to a voice detection event according to the voice information;
an overlap ratio determining module 330, configured to determine a time overlap ratio between the gaze fixation event and the voice detection event according to the gaze fixation start-stop time and the voice detection start-stop time;
and the recognition result determining module 340 is configured to perform speech recognition on the speech information according to the time overlap ratio to determine a speech recognition result.
Optionally, the visual recognition module 310 is specifically configured to:
performing visual recognition on the visual information to determine a gaze event, the gaze event comprising: a gaze start event and a gaze end event corresponding to the user sight angle being the target angle; the target angle is the angle of a control object corresponding to the voice information;
and acquiring the gazing starting time corresponding to the gazing starting event and the gazing ending time corresponding to the gazing ending event.
Optionally, the voice detecting module 320 is specifically configured to:
performing voice detection on the voice information to determine a voice detection event, where the voice detection event includes: a voice detection start event and a voice detection end event;
and acquiring voice detection starting time corresponding to the voice detection starting event and voice detection ending time corresponding to the voice detection ending event.
Optionally, the coincidence degree determining module 330 is specifically configured to:
determining the coincidence time length of the sight-line watching event and the voice detection event according to the sight-line watching starting and ending time and the voice detection starting and ending time;
determining the voice time length according to the voice detection start-stop time;
and determining the ratio of the coincidence time length to the voice time length as the time coincidence degree.
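The ratio computed by this module can be sketched directly from interval arithmetic; timestamps are assumed to be seconds on a shared clock.

```python
def time_overlap_ratio(gaze_start, gaze_end, voice_start, voice_end):
    """Overlap duration of the gaze and voice intervals divided by the
    voice (speech) duration, i.e. the time coincidence degree."""
    overlap = max(0.0, min(gaze_end, voice_end) - max(gaze_start, voice_start))
    speech = voice_end - voice_start
    return overlap / speech if speech > 0 else 0.0

# Gaze from t=1.0 s to t=4.0 s with speech from t=2.0 s to t=5.0 s
# overlaps for 2.0 s of a 3.0 s utterance, a ratio of about 0.67.
```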
Optionally, the recognition result determining module 340 is specifically configured to:
and if the time coincidence degree is greater than a first preset threshold value, performing voice recognition on the voice information to determine a voice recognition result.
Optionally, the recognition result determining module 340 includes:
the semantic intention determining unit is used for carrying out voice recognition on the voice information to determine a semantic intention;
an understanding index determining unit, configured to determine a semantic understanding index of the semantic intention;
a first result determining module, configured to determine the semantic intention as the speech recognition result if the semantic understanding index is greater than or equal to a second preset threshold;
and the second result determining module is used for determining preset information corresponding to the semantic understanding index as a voice recognition result if the semantic understanding index is smaller than the second preset threshold.
Optionally, the method further includes:
the instruction determining module is used for determining a voice recognition result after performing voice recognition on the voice information, and determining a control instruction according to the semantic intention if the voice recognition result is the semantic intention;
and the execution module is used for executing corresponding control operation according to the control instruction.
The wake-up-free voice recognition device provided by the embodiment of the invention can execute the wake-up-free voice recognition method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
Example four
FIG. 6 illustrates a block diagram of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 6, the electronic device 10 includes at least one processor 11, and a memory communicatively connected to the at least one processor 11, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, and the like, wherein the memory stores a computer program executable by the at least one processor, and the processor 11 can perform various suitable actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from a storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data necessary for the operation of the electronic apparatus 10 may also be stored. The processor 11, the ROM 12, and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.
A number of components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, or the like; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The processor 11 performs the various methods and processes described above, such as a wake-free voice recognition method.
In some embodiments, the wake-free speech recognition method may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the wake-free speech recognition method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the wake-free speech recognition method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for implementing the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the Internet.
The computing system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and overcomes the drawbacks of difficult management and weak service scalability found in traditional physical hosts and VPS (Virtual Private Server) services.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solution of the present invention can be achieved.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A wake-free speech recognition method, comprising:
the method comprises the steps that visual information of a user is collected through a visual sensor, and sight watching starting and stopping time corresponding to a sight watching event is determined according to the visual information;
acquiring voice information of a user through a voice sensor, and determining voice detection start-stop time corresponding to a voice detection event according to the voice information;
determining the time contact ratio of the sight line watching event and the voice detection event according to the sight line watching start-stop time and the voice detection start-stop time;
and performing voice recognition on the voice information according to the time contact ratio to determine a voice recognition result.
2. The method of claim 1, wherein determining a gaze start and stop time for a gaze event based on the visual information comprises:
performing visual recognition on the visual information to determine a gaze event, the gaze event comprising: a gaze start event and a gaze end event corresponding to the user sight angle being the target angle; the target angle is the angle of a control object corresponding to the voice information;
and acquiring the gazing starting time corresponding to the gazing starting event and the gazing ending time corresponding to the gazing ending event.
3. The method according to claim 1, wherein the determining a voice detection start-stop time corresponding to a voice detection event according to the voice information comprises:
performing voice detection on the voice information to determine a voice detection event, wherein the voice detection event comprises: a voice detection start event and a voice detection end event;
and acquiring voice detection starting time corresponding to the voice detection starting event and voice detection ending time corresponding to the voice detection ending event.
4. The method of claim 1, wherein determining a time overlap ratio of the gaze event and the voice detection event based on the gaze start-stop time and the voice detection start-stop time comprises:
determining the length of the coincidence time of the sight line watching event and the voice detection event according to the sight line watching start-stop time and the voice detection start-stop time;
determining the voice time length according to the voice detection start-stop time;
and determining the ratio of the coincidence time length to the voice time length as the time coincidence degree.
5. The method of claim 1, wherein the determining a speech recognition result by performing speech recognition on the speech information according to the time overlapping ratio comprises:
and if the time coincidence degree is greater than a first preset threshold value, performing voice recognition on the voice information to determine a voice recognition result.
6. The method according to claim 1 or 5, wherein performing speech recognition on the speech information to determine a speech recognition result comprises:
performing voice recognition on the voice information to determine semantic intention;
determining a semantic understanding index of the semantic intent;
if the semantic understanding index is larger than or equal to a second preset threshold value, determining the semantic intention as the voice recognition result;
and if the semantic understanding index is smaller than the second preset threshold value, the preset information corresponding to the semantic understanding index is used as a voice recognition result.
7. The method of claim 6, after performing speech recognition on the speech information to determine a speech recognition result, further comprising:
if the voice recognition result is the semantic intention, determining a control instruction according to the semantic intention;
and executing corresponding control operation according to the control instruction.
8. A wake-free speech recognition device, comprising:
the visual identification module is used for acquiring visual information of a user through a visual sensor and determining sight watching starting and stopping time corresponding to a sight watching event according to the visual information;
the voice detection module is used for acquiring voice information of a user through a voice sensor and determining voice detection start-stop time corresponding to a voice detection event according to the voice information;
the contact ratio determining module is used for determining the time contact ratio of the sight line watching event and the voice detection event according to the sight line watching start-stop time and the voice detection start-stop time;
and the recognition result determining module is used for performing voice recognition on the voice information according to the time coincidence degree to determine a voice recognition result.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the wake-free speech recognition method of any of claims 1-7.
10. A computer-readable storage medium storing computer instructions for causing a processor to implement the wake-free speech recognition method of any one of claims 1-7 when executed.
CN202210887096.2A 2022-07-26 2022-07-26 Wake-up-free voice recognition method, device, equipment and storage medium Pending CN115762502A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210887096.2A CN115762502A (en) 2022-07-26 2022-07-26 Wake-up-free voice recognition method, device, equipment and storage medium


Publications (1)

Publication Number Publication Date
CN115762502A true CN115762502A (en) 2023-03-07

Family

ID=85349102


Country Status (1)

Country Link
CN (1) CN115762502A (en)

Similar Documents

Publication Publication Date Title
US11893311B2 (en) Virtual assistant configured to automatically customize groups of actions
CN107808670B (en) Voice data processing method, device, equipment and storage medium
CN113674746B (en) Man-machine interaction method, device, equipment and storage medium
CN111402877A (en) Noise reduction method, device, equipment and medium based on vehicle-mounted multi-sound zone
CN112634872A (en) Voice equipment awakening method and device
US20190180734A1 (en) Keyword confirmation method and apparatus
CN113138737A (en) Display control method, device, equipment, medium and program product for screen projection scene
CN112382285A (en) Voice control method, device, electronic equipment and storage medium
CN113674742A (en) Man-machine interaction method, device, equipment and storage medium
CN113626778B (en) Method, apparatus, electronic device and computer storage medium for waking up device
CN112767935B (en) Awakening index monitoring method and device and electronic equipment
CN111261143A (en) Voice wake-up method and device and computer readable storage medium
CN111580773B (en) Information processing method, device and storage medium
CN116824455A (en) Event detection method, device, equipment and storage medium
CN115762502A (en) Wake-up-free voice recognition method, device, equipment and storage medium
CN114898755B (en) Voice processing method and related device, electronic equipment and storage medium
CN112669837B (en) Awakening method and device of intelligent terminal and electronic equipment
CN114333017A (en) Dynamic pickup method and device, electronic equipment and storage medium
CN114399992A (en) Voice instruction response method, device and storage medium
CN113362845A (en) Method, apparatus, device, storage medium and program product for noise reduction of sound data
CN114356275B (en) Interactive control method and device, intelligent voice equipment and storage medium
EP4123639A2 (en) Wake-up control for a speech controlled device
CN114201225A (en) Method and device for awakening function of vehicle machine
CN114817474A (en) Function information query method and device, electronic equipment and storage medium
CN116954443A (en) Vehicle-mounted virtual image control method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination