WO2023231211A1

WO2023231211A1 - Voice recognition method and apparatus, electronic device, storage medium, and product

Info

Publication number: WO2023231211A1
Application number: PCT/CN2022/117333
Authority: WO
Inventors: 蒋磊; 蔡勇
Original assignee: 合众新能源汽车股份有限公司
Priority date: 2022-06-01
Filing date: 2022-09-06
Publication date: 2023-12-07
Also published as: CN115171678A

Abstract

A voice recognition method and apparatus, an electronic device, a storage medium, and a product. The method comprises: in response to voice information of a user in a vehicle, obtaining a facial image of the user; determining a current state of the user according to a facial feature on the facial image; when the current state of the user satisfies a set condition, recognizing the voice information to obtain a recognition result; and when the recognition result is an in-vehicle infotainment instruction, executing a corresponding operation according to the in-vehicle infotainment instruction.

Description

Speech recognition methods, devices, electronic equipment, storage media and products

This application requires the priority of the Chinese patent application submitted to the China Patent Office on June 1, 2022, with the application number 202210617455.2 and the invention title "Speech Recognition Methods, Devices, Electronic Equipment, Storage Media and Products", and its entire content has been approved. This reference is incorporated into this application.

Technical field

The present application relates to the field of speech understanding technology, and in particular to a speech recognition method, device, electronic equipment, computer-readable storage medium and computer program product.

Background technique

With the rapid development of smart cars, voice function is an important function of smart cars. The user needs to use the wake-up word every time he communicates with the car, for example, "The user says: Hello Nezha" to wake up the voice function of the car. Since the wake-up word has to be used every time, it will be more troublesome.

Based on this, a "wake-free" solution has been proposed in related technologies. However, in the "wake-free" solution, when the user speaks in the car, the car computer cannot accurately determine which words are "instructions to the car computer" , which words "are not instructions to the vehicle." This will cause a "false recall", causing the vehicle to incorrectly execute instructions, thus affecting the user experience.

Therefore, when detecting the voice of a user in the car, how to accurately identify which voices are instructions for the car and reduce the misoperation rate of the car is a technical problem that needs to be solved.

Overview

This application provides a speech recognition method, device, electronic equipment, computer-readable storage medium and computer program product, to at least solve the problem in related technologies that the vehicle-machine instructions cannot be accurately recognized due to the inability to accurately recognize the voice in the car, resulting in vehicle-machine execution errors. instructions, technical problems that increase the rate of misoperation. The technical solution of this application is as follows:

According to a first aspect of the embodiment of the present application, a speech recognition method is provided, including:

Respond to the voice information of the user in the car and obtain the facial image of the user;

Determine the current status of the user based on facial features on the facial image;

When the user's current status meets the set conditions, recognize the voice information and obtain a recognition result;

When the recognition result is a vehicle-machine instruction, the corresponding operation is performed according to the vehicle-machine instruction.

Optionally, the method also includes:

When the current status of the user does not meet the set conditions, the recognition of the voice information is refused.

Optionally, determining the current status of the user based on facial features on the facial image includes at least one of the following:

Obtain the information status of the vehicle, and determine that the vehicle-mounted Bluetooth phone is not turned on based on the information status and the facial features of the facial image, and determine that the user is in a non-calling state;

When it is determined based on the facial features of the facial image that the user's front face is looking in the direction of vehicle travel, it is determined that the user is in a state of facing forward;

When it is determined that the user's mouth is in an opening and closing state based on the facial features of the facial image, it is determined that the user is in a speaking state.

Optionally, when the current status of the user meets the set conditions, the voice information is recognized and the recognition result is obtained, including:

When the current state of the user is at least one of: the user is in a non-phone state, the user is in a state of facing forward, and the user is in a speaking state, it is determined that the user satisfies the set condition;

The voice information is recognized and a recognition result is obtained.

Optionally, the recognition of the voice information to obtain the recognition result includes:

Perform local speech-to-text conversion processing on the voice information to obtain converted text information; or

Send the voice information to the cloud, and the cloud performs voice-to-text conversion processing to obtain text information;

Receive the converted text information sent by the cloud.

Optionally, when the recognition result is a vehicle-machine instruction, performing corresponding operations according to the vehicle-machine instruction includes:

The recognition result is judged by a trained vehicle-machine command recognition model, and the recognition result is a vehicle-machine command; wherein the trained vehicle-machine command recognition model is based on multiple histories of human-vehicle interaction A model obtained by learning and training audio pairs, text pairs, scenes and keywords;

Perform corresponding operations according to the obtained vehicle and machine instructions.

According to a second aspect of the embodiment of the present application, a speech recognition device is provided, including:

The acquisition module is used to respond to the voice information of the user in the car and obtain the facial image of the user;

a determining module configured to determine the current status of the user based on facial features on the facial image;

A recognition module, used to recognize the voice information and obtain a recognition result when the current status of the user meets the set conditions;

An execution module, when the recognition result is a vehicle-machine instruction, executes the corresponding operation according to the vehicle-machine instruction.

Optionally, the device also includes:

A recognition rejection module is configured to reject recognition of the voice information when the user's current status does not meet the set conditions.

Optionally, the determination module includes at least one of the following modules:

A first determination module configured to determine that the user is in a non-phone state when determining that the vehicle-mounted Bluetooth phone is not turned on based on the obtained information status of the vehicle and the facial features of the facial image;

a second determination module, configured to determine that the user is in a state of facing forward when it is determined based on the facial features of the facial image that the user's front face is looking in the direction of vehicle travel;

The third determination module is configured to determine that the user is in a speaking state when it is determined that the user's mouth is in an opening and closing state based on the facial features of the facial image.

Optionally, the identification module includes:

The first judgment module is used to determine that the set conditions are met when the current state of the user is at least one of: the user is not on the phone, the user is in a state of facing forward, and the user is in a speaking state. ;

A speech recognition module is used to recognize the speech information and obtain a recognition result.

Optionally, the speech recognition module includes: a speech conversion module; and/or a sending module and a receiving module, wherein,

The voice conversion module is used to perform local voice-to-text conversion processing on the voice information to obtain converted text information;

The sending module is used to send the voice information to the cloud, and the cloud performs voice-to-text conversion processing to obtain text information;

The receiving module is used to receive the converted text information sent by the cloud.

Optionally, the execution module includes:

The second judgment module is used to judge the recognition result through a trained vehicle-machine command recognition model, and obtain that the recognition result is a vehicle-machine command; wherein the trained vehicle-machine command recognition model is based on human and computer commands. A model obtained by learning and training multiple historical audio pairs, text pairs, scenes and keywords of vehicle-computer interaction;

An instruction execution module is used to execute corresponding operations according to the vehicle-machine instructions obtained by the second judgment module.

According to a third aspect of the embodiment of the present application, an electronic device is provided, including:

processor;

memory for storing instructions executable by the processor;

Wherein, the processor is configured to execute the instructions to implement the speech recognition method as described above.

According to a fourth aspect of an embodiment of the present application, a computer-readable storage medium is provided, which when instructions in the computer-readable storage medium are executed by a processor of an electronic device, enables the electronic device to perform speech recognition as described above. method.

According to a fifth aspect of the embodiments of the present application, a computer program product is provided, including a computer program or instructions that implement the speech recognition method as described above when executed by a processor.

The technical solutions provided by the embodiments of the present application at least bring the following beneficial effects:

In the embodiment of the present application, in response to the voice information of the user in the car, the user's facial image is obtained; the current state of the user is determined based on the facial features on the facial image; and the user's current state satisfies the set conditions When, the voice information is recognized to obtain a recognition result; when the recognition result is a vehicle-machine instruction, the corresponding operation is performed according to the vehicle-machine instruction. That is to say, in the embodiment of the present application, the current status of the user is determined based on the facial features on the facial image, and the voice information is recognized based on the user's current status, so that it can be accurately determined which voice information is from the vehicle. Commands, which voice messages are not vehicle-machine instructions, improve the efficiency of the vehicle-machine accurately executing vehicle-machine instructions, reduce the rate of vehicle-machine misoperation, and also improve the user experience.

It should be understood that the above general description and the following detailed description are only exemplary and explanatory, and do not limit the present application.

The above description is only an overview of the technical solutions of the present application. In order to have a clearer understanding of the technical means of the present application, they can be implemented according to the content of the description, and in order to make the above and other purposes, features and advantages of the present application more obvious and understandable. , the specific implementation methods of the present application are specifically listed below.

Description of the drawings

The drawings herein are incorporated into the specification and constitute a part of the specification, illustrate embodiments consistent with the present application, and are used together with the description to explain the principles of the present application, and do not constitute undue limitations on the present application. In order to more clearly explain the embodiments of the present application or the technical solutions in the prior art, the following will briefly introduce the drawings that need to be used in the description of the embodiments or the prior art. Obviously, the drawings in the following description These are some embodiments of the present application. For those of ordinary skill in the art, other drawings can be obtained based on these drawings without exerting creative efforts.

Figure 1 is a flow chart of a speech recognition method provided by an embodiment of the present application.

Figure 2 is a flow chart of an application example of a speech recognition method provided by an embodiment of the present application.

Figure 3 is a block diagram of a speech recognition device provided by an embodiment of the present application.

Figure 4 is another block diagram of a speech recognition device provided by an embodiment of the present application.

Figure 5 is a block diagram of a determination module provided by an embodiment of the present application.

Figure 6 is a block diagram of an identification module provided by an embodiment of the present application.

Figure 7 is a block diagram of an execution module provided by an embodiment of the present application.

Figure 8 is a block diagram of an electronic device provided by an embodiment of the present application.

Figure 9 is a block diagram of a device for speech recognition provided by an embodiment of the present application.

A detailed description

In order to enable ordinary people in the art to better understand the technical solutions of the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below in conjunction with the accompanying drawings.

It should be noted that the terms "first", "second", etc. in the description and claims of this application and the above-mentioned drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It is to be understood that the data so used are interchangeable under appropriate circumstances so that the embodiments of the application described herein can be practiced in sequences other than those illustrated or described herein. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the appended claims.

In recent years, research on computer vision, deep learning, machine learning, image processing, image recognition and other technologies based on artificial intelligence has made important progress. Artificial Intelligence (AI) is an emerging science and technology that studies and develops theories, methods, technologies and application systems for simulating and extending human intelligence. The subject of artificial intelligence is a comprehensive subject, involving many types of technologies such as chips, big data, cloud computing, Internet of Things, distributed storage, deep learning, machine learning, neural networks, etc. Computer vision, as an important branch of artificial intelligence, specifically allows machines to recognize the world. Computer vision technology usually includes face recognition, live body detection, fingerprint recognition and anti-counterfeiting verification, biometric recognition, face detection, pedestrian detection, target detection, pedestrian Recognition, image processing, image recognition, image semantic understanding, image retrieval, text recognition, video processing, video content recognition, behavior recognition, 3D reconstruction, virtual reality, augmented reality, simultaneous localization and mapping (SLAM), computational photography, robotics Navigation and positioning technologies. With the research and progress of artificial intelligence technology, this technology has been applied in many fields, such as security, urban management, traffic management, building management, park management, face traffic, face attendance, logistics management, warehousing management, robots , intelligent marketing, computational photography, mobile imaging, cloud services, smart home, wearable devices, driverless driving, autonomous driving, smart medical care, face payment, face unlocking, fingerprint unlocking, ID verification, smart screen, smart TV, Cameras, mobile Internet, webcasting, beauty, cosmetics, medical cosmetology, intelligent temperature measurement and other fields.

Figure 1 is a flow chart of a speech recognition method provided by an embodiment of the present application. As shown in Figure 1, the speech recognition method includes the following steps:

Step 101: Respond to the voice message of the user in the car and obtain the user's facial image.

Step 102: Determine the current status of the user based on facial features on the facial image.

Step 103: When the user's current status meets the set conditions, recognize the voice information and obtain a recognition result.

Step 104: When the recognition result is a vehicle-machine instruction, perform the corresponding operation according to the vehicle-machine instruction.

The speech recognition method described in this application can be applied to vehicle-machine terminals, etc., and is not limited here. The vehicle-machine terminal implementation equipment can be electronic equipment such as smart car-machine, vehicle-machine platform, etc., which is not limited here.

The specific implementation steps of a speech recognition method provided by an embodiment of the present application will be described in detail below with reference to Figure 1 .

In step 101, in response to the voice information of the user in the car, the facial image of the user is obtained.

In this step, when the user in the car speaks, the car terminal can detect the user's voice information through the microphone on the vehicle. At this time, the face of the user in the car can be obtained through the image collection device (such as a camera, etc.) on the vehicle. Image, the facial image can be a single frame image or multiple frame images. Wherein, the image acquisition device can be set at a position aimed at the driver, so that the image acquisition device can clearly capture the driver's facial image.

In step 102, the current status of the user is determined based on facial features on the facial image.

In this step, the acquired facial image is recognized, facial feature points on the facial image are obtained, and the current state of the user's face is determined based on the facial feature points. Among them, facial image recognition uses computer image processing technology to extract facial feature points from facial images, such as whether the eyes are open, whether the mouth is open, etc.

Afterwards, determining the current user's facial state based on facial features may include at least one of the following, but is not limited to this:

1) Obtain the information status of the vehicle, and determine that the user is in a non-phone state when it is determined that the vehicle-mounted Bluetooth phone is not turned on based on the information status and the facial features of the facial image.

That is to say, first obtain the information status of the vehicle, such as whether the Bluetooth phone in the car (i.e., the car Bluetooth phone) is turned on, etc., and then combine the facial features in the facial image (such as whether the mouth is open and closed, etc.) to determine the user Are you on the phone? For example, if the car Bluetooth phone is in the open state and the user's mouth is opening and closing, it is determined that the user is making a call at this time; otherwise, it is determined that the user is in a non-call state, for example, the car Bluetooth phone is in a non-open state. And the user's mouth is in an open and closed state, then it is determined that the user is in a speaking state, not that the user is in a non-calling state; of course, if the car Bluetooth phone is in a non-open state and the user's mouth is in a closed state, then Make sure the user is not speaking, is in a quiet state, etc.

2) When it is determined that the user's eyes are looking in the vehicle traveling direction based on the facial features of the facial image, it is determined that the user is in a state of looking forward.

In this step, multi-angle face recognition technology can be used to determine whether the user's front face is looking in the direction of the vehicle. If so, it is determined that the user is facing forward. Otherwise, it is determined that the user is not facing forward. The state of looking forward. That is to say, it is determined whether the user's front face is looking within 90 degrees of the vehicle's driving direction. If so, it is determined that the front face is facing forward.

In this embodiment, the multi-angle face recognition technology is a branch of the multi-pose face recognition technology. Among them, a deep learning multi-angle face recognition algorithm includes: first, constructing a deep learning training data set; second, training a deep face classifier; finally, applying the classifier for face detection. The specific implementation process is a well-known technology in this field and will not be described again.

In other words, this algorithm takes the side image of the face as input and the corresponding frontal image of the face as the output. The supervised model learns the mapping from the side image of the face in different poses to the frontal image, thus increasing the effectiveness in recognition. facial information. Of course, in practical applications, it is not limited to this. For example, the trained face angle classification model can also be used to determine the frontal angle of whether the user is looking forward. If it is judged that the user's front face is within the range of 90 degrees forward, all Make sure the user is facing forward.

3) When it is determined that the user's mouth is in an opening and closing state based on the facial image, it is determined that the user is in a speaking state.

In this step, it is determined according to the facial features of the facial image whether the user's mouth is in an open and closed state. If so, it is determined that the user is in a speaking state; otherwise, it is determined that the user is in a non-speaking state.

Specifically, the lip movement feature extraction algorithm (or lip movement model) can be used to determine whether the user opens his mouth, thereby determining whether the user has lip movement. Of course, speaking user identification technology based on lip movement can also be used to extract information that reflects both the physiological characteristics of the speaker's mouth and the behavioral characteristics of the speaker's lip movements from the image sequence of the speaking user through discrete cosine transformation. Visual features. Based on these visual features, a static and dynamic hybrid model is established for the speaking user to determine whether the user has lip movements. The specific process is a familiar technique to those skilled in the art and will not be described in detail here.

In step 103, when the current status of the user meets the set conditions, the voice information is recognized and a recognition result is obtained.

In this step, after determining the current status of the user, it is necessary to determine whether the current status of the user satisfies the set conditions. If satisfied, the step of identifying the voice information is performed. Otherwise, the step of identifying the voice information is rejected. The voice information is recognized, that is, the recognition is refused. Among them, the setting conditions include at least one of the following: the user is in a non-phone state, the user is in a state of facing forward, and the user is in a speaking state. When it is determined that the user's current status satisfies at least one of the above set conditions, the recognition of the voice information can be performed. The best way for this embodiment is to satisfy all the above setting conditions.

In another embodiment, the voice information is recognized and the recognition results are obtained, including:

In one case, the voice information is subjected to local voice-to-text conversion processing to obtain converted text information.

Another situation is to send the voice information to the cloud, and the cloud performs voice-to-text conversion processing to obtain text information; and receives the converted text information sent by the cloud.

The specific speech-to-text conversion process is a familiar technology to those skilled in the art and will not be described in detail here.

In step 104, when the recognition result is a vehicle-machine instruction, the corresponding operation is performed according to the vehicle-machine instruction.

After the voice information is recognized, the obtained recognition result is input into the trained vehicle-to-machine command model to determine whether the recognition result is a vehicle-to-machine command. Among them, the trained vehicle-machine command recognition model is a model obtained by learning and training based on multiple historical audio pairs, text pairs, scenes and keywords of human-vehicle interaction.

Optionally, in one embodiment, the vehicle-machine command recognition model is trained in advance, wherein the input for training the vehicle-machine command recognition model usually selects historical dialogue audio of multiple human-vehicle-machine (referred to as human-machine) interactions, etc. , for example, select the audio and text of 10 rounds of human-computer interaction dialogues, and confirm whether the user speaks to the vehicle and the computer (or gives the vehicle and computer commands to the vehicle and the computer) and record the results of each round of recognition; its vehicle-to-machine command recognition model The output results include: 1 means speaking to the vehicle and the computer, that is, the vehicle and computer commands; 0 does not speak to the car, that is, it is not a vehicle and computer command; of course, it can also be set to: 0 means speaking to the vehicle and the computer, 1 does not speak to the vehicle and the computer, etc. , there is no limitation in this embodiment.

In this embodiment, the training of the vehicle-machine command recognition model is to allow the vehicle-machine command model to learn more vehicle-machine commands from it, thereby improving the accuracy of training the vehicle-machine command recognition model.

In one case, this embodiment selects a large number of data groups for learning. Each group of data includes: historical audio and current audio to learn which type of audio speaks to the vehicle, that is, the vehicle instructions issued to the vehicle. .

In another case, this embodiment can also learn which texts are command words from the text. If the command words are not rich enough, use the historical results as the input this time, thereby improving the accuracy of the vehicle-machine command recognition model training. .

In the embodiment of the present application, in response to the voice information of the user in the car, the user's facial image is obtained; the current state of the user is determined based on the facial features on the facial image; and the user's current state satisfies the set conditions When, the voice information is recognized to obtain a recognition result; when the recognition result is a vehicle-machine instruction, the corresponding operation is performed according to the vehicle-machine instruction. That is to say, in the embodiment of the present application, the current status of the user is determined based on the facial features on the facial image, and the voice information is recognized based on the user's current status, so that it can be accurately determined which voice information is from the vehicle. Instructions, which voice messages are not vehicle-machine instructions, improve the efficiency of the vehicle-machine executing accurate vehicle-machine instructions, reduce the rate of vehicle-machine misoperation, and also improve the user experience.

Please also refer to Figure 2, which is an application example diagram of a speech recognition method provided by an embodiment of the present application. The method is applied to a vehicle-machine terminal. The method includes:

Step 201: Detect the voice information of the user in the car;

In this step, when a user speaks in the car, the car terminal detects the user's voice information.

Step 202: Obtain the user's facial image;

Step 203: Determine the current status of the user based on the facial features on the facial image;

For example, the current state of the user includes: the user is in a non-phone state, the user is in a state of facing forward, and the user is in a speaking state, but in practical applications, it is not limited to this.

Step 204: Determine whether the user is currently on the phone. If not, perform step 205; otherwise, perform step 210:

Step 205: Determine whether the current state of the user is facing the direction of the vehicle. If so, perform step 206; otherwise, perform step 210:

Step 206: Determine whether the current state of the user is in an open and closed state. If so, perform step 207; otherwise, perform step 210:

Step 207: Recognize the voice information and obtain the recognition result;

Step 208: Determine whether the recognition result is a vehicle-machine command. If so, execute step 209; otherwise, execute step 211;

Step 209: Perform corresponding operations according to the vehicle and machine instructions;

Step 210: Refuse to recognize the voice information, that is, reject recognition.

Step 211: Refuse to execute the identification result.

Of course, the recognition result can also be deleted or ignored.

In this embodiment, the details of the implementation process of each step can be found in the implementation process of the above corresponding embodiment, which will not be described again here.

In the embodiment of the present application, the current status of the user is determined based on the facial features on the facial image, and the voice information is recognized based on the user's current status, so that it can be accurately determined which voice information is a vehicle-machine command and which voice information is The information is not a vehicle-machine instruction, that is, multi-mode (such as visual and audio, etc.) is used to determine whether the voice message is a vehicle-machine instruction, which improves the efficiency of the vehicle-machine executing accurate vehicle-machine instructions and reduces the probability of "false recall" of the vehicle-machine. It also improves user experience. That is to say, the embodiment of the present application uses the visual system and voice system in the car to reduce the "false recall" rate of the car and improve the user experience.

It should be noted that for the sake of simple description, method embodiments are expressed as a series of action combinations. However, those skilled in the art should know that this implementation disclosure is not limited by the described action sequence, because according to In this application, certain steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in the specification are preferred embodiments, and the actions involved are not necessarily necessary for this application.

Figure 3 is a block diagram of a speech recognition device provided by an embodiment of the present application. Referring to Figure 3, the device includes: an acquisition module 301, a determination module 302, an identification module 303 and an execution module 304, where,

The acquisition module 301 is used to respond to the voice information of the user in the car and acquire the facial image of the user;

The determination module 302 is used to determine the current status of the user according to the facial features on the facial image;

The recognition module 303 is used to recognize the voice information and obtain a recognition result when the user's current status meets the set conditions;

The execution module 304, when the recognition result is a vehicle-machine instruction, performs corresponding operations according to the vehicle-machine instruction.

Optionally, in another embodiment, based on the above embodiment, the device further includes: a rejection identification module 401, the structural block diagram of which is shown in Figure 4, wherein,

The recognition rejection module 401 is used to reject recognition of the voice information when the user's current status does not meet the set conditions.

Optionally, in another embodiment, based on the above embodiment, the determination module 302 includes at least one of the following modules: a first determination module 501, a second determination module 502 and a third determination module. 503, the structural block diagram of which is shown in Figure 5, in which this embodiment includes all modules at the same time as an example:

The first determination module 501 is used to determine that the user is in a non-phone state when determining that the vehicle-mounted Bluetooth phone is not turned on based on the information status of the vehicle and the facial features of the facial image;

The second determination module 502 is configured to determine that the user is in a state of facing forward when it is determined based on the facial features of the facial image that the user's front face is looking in the direction of vehicle travel;

The third determination module 503 is configured to determine that the user is in a speaking state when it is determined that the user's mouth is in an opening and closing state based on the facial features of the facial image.

Optionally, in another embodiment, based on the above embodiment, the recognition module 303 includes: a first judgment module 601 and a speech recognition module 602, whose structural block diagram is shown in Figure 6, where ,

The first judgment module 601 is used to judge that when the current state of the user is at least one of: the user is in a non-phone state, the user is in a state of facing forward, and the user is in a speaking state, the condition is satisfied. set conditions;

The speech recognition module 602 is used to recognize the speech information and obtain a recognition result.

Optionally, in another embodiment, based on the above embodiment, the speech recognition module includes: a speech conversion module; and/or a sending module and a receiving module, wherein,

Optionally, in another embodiment, based on the above embodiment, the execution module 304 includes: a second judgment module 701 and an instruction execution module 702, whose structural block diagram is shown in Figure 7, where ,

The second judgment module 701 is used to judge the recognition result through a trained vehicle-machine command recognition model, and obtain that the recognition result is a vehicle-machine command; wherein the trained vehicle-machine command recognition model is based on A model obtained by learning and training multiple historical audio pairs, text pairs, scenes and keywords of human-vehicle interaction;

The instruction execution module 702 is used to execute corresponding operations according to the vehicle-machine instructions obtained by the second judgment module 701 .

Regarding the devices in the above embodiments, the specific manner in which each module performs operations has been described in detail in the embodiments related to the method, and will not be described in detail here.

The device embodiments described above are only illustrative. The modules described as separate components may or may not be physically separated. The components shown as modules may or may not be physical modules, that is, they may be located in One place, or it can be distributed across multiple networks. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment. Persons of ordinary skill in the art can understand and implement the method without any creative effort.

Optionally, this embodiment of the present application also provides an electronic device, including:

processor;

memory for storing instructions executable by the processor;

Optionally, an embodiment of the present application also provides a computer-readable storage medium. When instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device can perform the speech recognition method as described above. Alternatively, the computer-readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.

Optionally, this embodiment of the present application further provides a computer program product, including a computer program or instructions, which implement the speech recognition method as described above when executed by a processor.

FIG. 8 is a block diagram of an electronic device 800 provided by an embodiment of the present application. For example, the electronic device 800 may be a mobile terminal or a server. In the embodiment of this application, the electronic device 800 is a mobile terminal as an example for description. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, etc.

Referring to FIG. 8 , the electronic device 800 may include one or more of the following components: a processing component 802 , a memory 804 , a power component 806 , a multimedia component 808 , an audio component 810 , an input/output (I/O) interface 812 , and a sensor component 814 , and communication component 816.

Processing component 802 generally controls the overall operations of electronic device 800, such as operations associated with display, phone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to complete all or part of the steps of the above method. Additionally, processing component 802 may include one or more modules that facilitate interaction between processing component 802 and other components. For example, processing component 802 may include a multimedia module to facilitate interaction between multimedia component 808 and processing component 802.

Memory 804 is configured to store various types of data to support operations at device 800 . Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, etc. Memory 804 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EEPROM), Programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.

Power supply component 806 provides power to various components of electronic device 800 . Power supply components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to electronic device 800 .

Multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide action. In some embodiments, multimedia component 808 includes a front-facing camera and/or a rear-facing camera. When the device 800 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each front-facing camera and rear-facing camera can be a fixed optical lens system or have a focal length and optical zoom capabilities.

Audio component 810 is configured to output and/or input audio signals. For example, audio component 810 includes a microphone (MIC) configured to receive external audio signals when electronic device 800 is in operating modes, such as call mode, recording mode, and voice recognition mode. The received audio signal may be further stored in memory 804 or sent via communication component 816 . In some embodiments, audio component 810 also includes a speaker for outputting audio signals.

The I/O interface 812 provides an interface between the processing component 802 and a peripheral interface module, which may be a keyboard, a click wheel, a button, etc. These buttons may include, but are not limited to: Home button, Volume buttons, Start button, and Lock button.

Sensor component 814 includes one or more sensors for providing various aspects of status assessment for electronic device 800 . For example, the sensor component 814 can detect the open/closed state of the device 800, the relative positioning of components, such as the display and keypad of the electronic device 800. The sensor component 814 can also detect the electronic device 800 or a component of the electronic device 800. changes in position, the presence or absence of user contact with the electronic device 800 , the orientation or acceleration/deceleration of the electronic device 800 and changes in the temperature of the electronic device 800 . Sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. Sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

Communication component 816 is configured to facilitate wired or wireless communication between electronic device 800 and other devices. The electronic device 800 can access a wireless network based on a communication standard, such as WiFi, a carrier network (such as 2G, 3G, 4G or 5G), or a combination thereof. In one exemplary embodiment, the communication component 816 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communications component 816 also includes a near field communications (NFC) module to facilitate short-range communications. For example, the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.

In embodiments, electronic device 800 may be configured by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gates Array (FPGA), controller, microcontroller, microprocessor or other electronic components are implemented for executing the speech recognition method shown above.

In an embodiment, a computer-readable storage medium is also provided, such as a memory 804 including instructions, and the instructions can be executed by the processor 820 of the electronic device 800 to complete the speech recognition method shown above. For example, the non-transitory computer-readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.

In an embodiment, a computer program product is also provided. When the instructions in the computer program product are executed by the processor 820 of the electronic device 800, the electronic device 800 performs the speech recognition method shown above.

Figure 9 is a block diagram of a device 900 for speech recognition provided by an embodiment of the present application. For example, device 900 may be provided as a server. Referring to Figure 9, apparatus 900 includes a processing component 922, which further includes one or more processors, and memory resources represented by memory 932 for storing instructions, such as application programs, executable by processing component 922. The application program stored in memory 932 may include one or more modules, each corresponding to a set of instructions. Furthermore, the processing component 922 is configured to execute instructions to perform the above-described method.

Device 900 may also include a power supply component 926 configured to perform power management of device 900, a wired or wireless network interface 950 configured to connect device 900 to a network, and an input-output (I/O) interface 958. Device 900 may operate based on an operating system stored in memory 932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ or the like.

Other embodiments of the present application will be readily apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of this application that follow the general principles of this application and include common knowledge or customary technical means in the technical field that are not disclosed in this application. . It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.

It is to be understood that the present application is not limited to the precise structures described above and illustrated in the accompanying drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims

A speech recognition method, characterized by including:

Respond to the voice information of the user in the car and obtain the facial image of the user;

Determine the current status of the user based on facial features on the facial image;

When the user's current status meets the set conditions, recognize the voice information and obtain a recognition result;

When the recognition result is a vehicle-machine instruction, the corresponding operation is performed according to the vehicle-machine instruction.
The speech recognition method according to claim 1, characterized in that the method further includes:

When the current status of the user does not meet the set conditions, the recognition of the voice information is refused.
The speech recognition method according to claim 1 or 2, characterized in that determining the current state of the user based on facial features on the facial image includes at least one of the following:

Obtain the information status of the vehicle, and when it is determined that the vehicle-mounted Bluetooth phone is not turned on based on the information status and the facial features of the facial image, it is determined that the user is in a non-phone state;

When it is determined based on the facial features of the facial image that the user's front face is looking in the direction of vehicle travel, it is determined that the user is in a state of facing forward;

When it is determined that the user's mouth is in an opening and closing state based on the facial features of the facial image, it is determined that the user is in a speaking state.
The speech recognition method according to claim 3, characterized in that, when the current state of the user meets the set conditions, the speech information is recognized and the recognition result is obtained, including:

When the current state of the user is at least one of: the user is in a non-phone state, the user is in a state of facing forward, and the user is in a speaking state, it is determined that the user satisfies the set condition;

The voice information is recognized and a recognition result is obtained.
The speech recognition method according to claim 4, wherein the step of identifying the speech information to obtain a recognition result includes:

Perform local speech-to-text conversion processing on the voice information to obtain converted text information; or

Send the voice information to the cloud, and the cloud performs voice-to-text conversion processing to obtain text information;

Receive the converted text information sent by the cloud.
The speech recognition method according to claim 4, characterized in that when the recognition result is a vehicle-machine instruction, performing corresponding operations according to the vehicle-machine instruction includes:

The recognition result is judged by a trained vehicle-machine command recognition model, and the recognition result is a vehicle-machine command; wherein the trained vehicle-machine command recognition model is based on multiple histories of human-vehicle interaction A model obtained by learning and training audio pairs, text pairs, scenes and keywords;

Perform corresponding operations according to the obtained vehicle and machine instructions.
A speech recognition device, characterized by including:

The acquisition module is used to respond to the voice information of the user in the car and obtain the facial image of the user;

a determining module configured to determine the current status of the user based on facial features on the facial image;

A recognition module, used to recognize the voice information and obtain a recognition result when the current status of the user meets the set conditions;

An execution module, when the recognition result is a vehicle-machine instruction, executes the corresponding operation according to the vehicle-machine instruction.
An electronic device, characterized by including:

processor;

memory for storing instructions executable by the processor;

Wherein, the processor is configured to execute the instructions to implement the speech recognition method according to any one of claims 1 to 6.
A computer-readable storage medium, characterized in that, when the instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is capable of executing the method described in any one of claims 1 to 6 Speech recognition methods.
A computer program product, including a computer program or instructions, characterized in that when the computer program or instructions are executed by a processor, the speech recognition method according to any one of claims 1 to 6 is implemented.