CN117116268A - Speech recognition method, device, electronic equipment and readable storage medium - Google Patents

Speech recognition method, device, electronic equipment and readable storage medium

Info

Publication number
CN117116268A
Authority
CN
China
Prior art keywords
seat
target
information
voice recognition
target vehicle
Prior art date
Legal status
Pending
Application number
CN202311041475.0A
Other languages
Chinese (zh)
Inventor
任廷志
沈启函
胡宸
刘群
Current Assignee
Chongqing Seres New Energy Automobile Design Institute Co Ltd
Original Assignee
Chongqing Seres New Energy Automobile Design Institute Co Ltd
Priority date
Filing date
Publication date
Application filed by Chongqing Seres New Energy Automobile Design Institute Co Ltd filed Critical Chongqing Seres New Energy Automobile Design Institute Co Ltd
Priority to CN202311041475.0A
Publication of CN117116268A


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 Execution procedure of a spoken command
    • G10L 15/24 Speech recognition using non-acoustical features
    • G10L 15/26 Speech to text systems
    • G10L 15/28 Constructional details of speech recognition systems
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/59 Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions

Abstract

The application relates to the technical field of automobiles, and provides a voice recognition method, a voice recognition device, an electronic device and a readable storage medium. The method comprises the following steps: acquiring image information in a target vehicle when the door opening and closing state or gear information of the target vehicle changes; determining, according to the image information, target seat information indicating the seats occupied by persons in the target vehicle; determining a target voice recognition algorithm corresponding to the target seat information according to a preset correspondence between seating position information and voice recognition algorithms; and recognizing the voice in the target vehicle according to the target voice recognition algorithm. The embodiment of the application solves the problem of wasted vehicle power in the related art.

Description

Speech recognition method, device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of automotive technologies, and in particular, to a method and apparatus for voice recognition, an electronic device, and a readable storage medium.
Background
With the development of new energy vehicle technology, voice recognition has become an indispensable function in the automotive field. In practical applications it replaces the cumbersome steps of traditional mechanical operation, which greatly improves driving safety.
A typical new energy vehicle is fitted with a plurality of pickup microphones. Sound from the occupants is collected by all of the pickup microphones, and the signals they receive are processed in real time in the background by a voice recognition system to perform recognition. However, this kind of voice recognition method consumes a large amount of power.
Disclosure of Invention
In view of the above, embodiments of the present application provide a voice recognition method, apparatus, electronic device, and readable storage medium, so as to solve the problem of high power consumption of the voice recognition methods used in vehicles in the related art.
In a first aspect of an embodiment of the present application, a method for speech recognition is provided, including:
acquiring image information in a target vehicle under the condition that the door opening and closing state or gear information of the target vehicle is changed;
determining target seat information on which a person in the target vehicle sits according to the image information;
determining a target voice recognition algorithm corresponding to the target seat information according to a preset corresponding relation between the riding position information and the voice recognition algorithm;
and recognizing the voice in the target vehicle according to the target voice recognition algorithm.
In a second aspect of an embodiment of the present application, there is provided a device for speech recognition, including:
the acquisition module is used for acquiring image information in the target vehicle under the condition that the door opening and closing state or gear information of the target vehicle changes;
a first determining module for determining target seat information on which a person in the target vehicle sits according to the image information;
the second determining module is used for determining a target voice recognition algorithm corresponding to the target seat information according to a preset corresponding relation between the riding position information and the voice recognition algorithm;
and the voice recognition module is used for recognizing the voice in the target vehicle according to the target voice recognition algorithm.
In a third aspect of the embodiments of the present application, there is provided an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.
In a fourth aspect of the embodiments of the present application, there is provided a readable storage medium storing a computer program which, when executed by a processor, performs the steps of the above method.
Compared with the prior art, the embodiment of the application has the beneficial effects that:
When the door opening and closing state or gear information of the target vehicle changes, image information in the target vehicle is acquired. Because the image information is refreshed each time the door state or gear changes, the occupant distribution determined from it remains accurate. In addition, the target seat information of occupied seats is determined from the image information, and the target voice recognition algorithm corresponding to that seat information is determined from the preset correspondence between seating position information and voice recognition algorithms, so that the algorithm actually used is adapted to the occupant distribution. This guarantees voice recognition quality while saving power, and solves the problem of high power consumption caused by running the same voice recognition algorithm regardless of how many occupants are in the vehicle.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments or the description of the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for speech recognition according to an embodiment of the present application;
FIG. 2 is a schematic diagram of the operation of a method for speech recognition according to an embodiment of the present application;
FIG. 3 is a flow chart of another method for speech recognition according to an embodiment of the present application;
FIG. 4 is an internal block diagram of a speech recognition module according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a voice recognition device according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
The terms first, second and the like in the description and in the claims, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate, such that embodiments of the application may be practiced otherwise than as specifically illustrated and described herein, and that the objects identified by "first," "second," etc. are generally of the same type and are not limited to the number of objects, such as the first object may be one or more. Furthermore, in the description and claims, "and/or" means at least one of the connected objects, and the character "/", generally means that the associated object is an "or" relationship.
Furthermore, it should be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises that element.
The following describes a method, apparatus, electronic device and readable storage medium for voice recognition according to embodiments of the present application in detail with reference to the accompanying drawings.
Fig. 1 is a flowchart of a method for speech recognition according to an embodiment of the present application. As shown in fig. 1, the method includes:
step 101, when the door opening/closing state or the gear information of the target vehicle is changed, acquiring image information in the target vehicle.
The change in the door opening and closing state of the target vehicle includes switching the door from the closed state to the open state, or switching from the open state to the closed state.
A change in the gear information of the target vehicle includes, for example, the target vehicle shifting from the park position to the forward position, or from the neutral position to the forward position.
The door opening and closing state and gear information of the target vehicle can be monitored in real time through the controller area network (Controller Area Network, CAN) bus.
The image information may be an overall panoramic view of the interior of the target vehicle or an image of the seat area, and may be acquired by a camera installed in the target vehicle.
By acquiring the image information whenever the door opening and closing state or gear information of the target vehicle changes, it can be determined whether the occupancy of any seat has changed, including a seat changing from occupied to empty. Because the image information is acquired once on every such change, it always reflects the current situation in the vehicle.
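As an illustrative, non-limiting sketch of this trigger logic, the following Python fragment polls hypothetical CAN accessors (read_door_states, read_gear) and a hypothetical camera interface, and refreshes the cabin image only when the door state or gear changes; none of these interface names come from the patent itself.

```python
import time

def watch_cabin_state(can_bus, camera, on_image, poll_interval=0.1):
    """Capture a cabin image whenever the door state or gear changes (step 101)."""
    last_doors = can_bus.read_door_states()  # hypothetical accessor, e.g. a tuple of open/closed flags
    last_gear = can_bus.read_gear()          # hypothetical accessor, e.g. "P", "N", "D"
    while True:
        doors = can_bus.read_door_states()
        gear = can_bus.read_gear()
        if doors != last_doors or gear != last_gear:
            # A door opened/closed or the gear changed, so the occupant layout may
            # have changed: acquire fresh image information.
            on_image(camera.capture())
            last_doors, last_gear = doors, gear
        time.sleep(poll_interval)
```

In a production controller this polling loop would more likely be replaced by event-driven CAN message callbacks, but the trigger condition is the same.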
And 102, determining target seat information on which a person in the target vehicle sits according to the image information.
Specifically, the target seat information indicates seats in which a person in the target vehicle sits, including a main driver's seat, a co-driver's seat, and rear left-side seats, rear right-side seats, rear middle seats, and the like in the target vehicle.
And the target seat information of the personnel in the target vehicle is determined through the image information, namely the personnel distribution condition in the vehicle is determined, so that the accuracy of the determined personnel distribution condition is ensured.
The present embodiment may also determine the target seat information on which the person in the target vehicle sits by an infrared sensor installed in the target vehicle or by a pressure sensor installed on the target vehicle seat, and is not particularly limited herein.
And step 103, determining a target voice recognition algorithm corresponding to the target seat information according to the corresponding relation between the preset riding position information and the voice recognition algorithm.
In particular, a speech recognition algorithm extracts speech features from the voice signal, recognizes and understands them, and converts the audio signal into a corresponding text (or other) representation.
The speech recognition algorithm may include a single-microphone voice algorithm, a double-microphone voice algorithm, a four-microphone voice algorithm, and the like. In addition, the speech recognition algorithm may be implemented with hidden Markov models (Hidden Markov Models, HMMs), deep neural networks (Deep Neural Networks, DNNs), or connectionist temporal classification with deep neural networks (CTC-DNN), which is not specifically limited here.
A correspondence between seating position information and voice recognition algorithms is preset. For example, if the only occupied seat is the main driving position, voice recognition can be performed with the algorithm corresponding to the main driving position. If another algorithm were used instead, the voice data collected by the pickup microphones at the remaining positions of the target vehicle would be far less useful than the data collected by the pickup microphone at the main driving position: it would not improve the recognition result, but it would increase the occupancy of the central processing unit (Central Processing Unit, CPU) and therefore the energy consumption of the whole vehicle. Performing voice recognition with the algorithm corresponding to the main driving position therefore guarantees the recognition result while reducing the vehicle's energy consumption.
In this way, the target voice recognition algorithm corresponding to the target seat information is adapted to the occupant distribution, which guarantees voice recognition quality, saves power, and avoids the increase in processor energy consumption caused by an unreasonable choice of voice recognition algorithm.
Step 104, recognizing the voice in the target vehicle according to the target voice recognition algorithm.
Specifically, the voice in the target vehicle is recognized through the target voice recognition algorithm. Because this algorithm is adapted to the occupant distribution, recognition quality is guaranteed while power is saved.
When recognizing the voice in the target vehicle, the voice signal may be preprocessed, for example, noise reduction, enhancement, equalization, and the like, so as to improve accuracy and reliability of the voice recognition.
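As one hedged example of such preprocessing, the sketch below applies a pre-emphasis filter and peak normalization to a mono signal with NumPy; these are generic enhancement steps chosen for illustration, not the specific noise-reduction or equalization method of this application.

```python
import numpy as np

def preprocess_speech(signal: np.ndarray, pre_emphasis: float = 0.97) -> np.ndarray:
    """Generic enhancement: pre-emphasis to boost high frequencies, then peak normalization."""
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
    peak = np.max(np.abs(emphasized))
    return emphasized / peak if peak > 0 else emphasized
```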
In this way, this embodiment acquires the image information once each time the door opening and closing state or gear information changes and determines the occupant distribution in the vehicle from it, ensuring the accuracy of the determined distribution. In addition, the target seat information of occupied seats is determined from the image information, and the target voice recognition algorithm corresponding to that seat information is determined from the preset correspondence between seating position information and voice recognition algorithms, so that the algorithm in use is adapted to the occupant distribution. This guarantees recognition quality while saving power, and solves the problem of high power consumption caused by running the same voice recognition algorithm in the vehicle regardless of how many occupants are present.
In some embodiments, after the image information in the target vehicle is acquired, if it is determined from the image information that no person occupies a seat, the pickup microphone corresponding to that seat may be controlled to be turned off, and its power supply may be cut off.
Specifically, if it is determined that no occupant is present in a seat based on the image information, the pickup microphone corresponding to the seat is turned off and the power supply to the pickup microphone is cut off.
For example, assuming that the target vehicle has one pickup microphone at the main driver's seat and one at the co-driver's seat, if the image information shows that the main driver's seat is occupied and the co-driver's seat is empty, the pickup microphone corresponding to the co-driver's seat is turned off and its power supply is cut off.
The seat of the vehicle may be any seat of the target vehicle, and the present embodiment is not particularly limited.
By turning off the pickup microphones corresponding to unoccupied seats and cutting off their power supply, the microphones that need to be turned off are determined from the occupant distribution. This reduces the number of powered devices in the target vehicle and thereby saves electricity, solving the problem that keeping all pickup microphones on wastes the target vehicle's power.
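A minimal sketch of this control step is given below; the seat names and the mic_controller interface (turn_off, cut_power) are assumptions made for illustration only.

```python
def shut_down_unused_microphones(occupied_seats, mic_controller,
                                 all_seats=("main_driver", "co_driver",
                                            "rear_left", "rear_middle", "rear_right")):
    """Turn off, and cut the power supply of, the pickup microphone of every unoccupied seat."""
    for seat in all_seats:
        if seat not in occupied_seats:
            mic_controller.turn_off(seat)   # stop collecting audio for this seat
            mic_controller.cut_power(seat)  # cut the microphone's power supply to save energy
```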
In some embodiments, before determining the target seat information on which the person in the target vehicle sits according to the image information, the method further includes:
detecting whether face images exist in the space ranges corresponding to the seats according to the image information;
if the face image does not exist in the space range corresponding to the seat, determining that no person is taken on the seat; if the face images exist in the space range corresponding to the seat, acquiring a plurality of face images corresponding to different moments of the seat, and determining that the seat is occupied with a person under the condition that the plurality of face images are different.
Specifically, whether a face image exists in the spatial range corresponding to a seat can be determined by checking for facial features with a trained image recognition model. If no face image exists in the spatial range corresponding to the seat, it can be directly determined that no occupant is in the seat.
In addition, in order to ensure that the person on the seat is not a model or a photo, when the face image exists in the space range corresponding to the seat according to the image information, a plurality of face images corresponding to different moments of the seat can be acquired, and whether the face image is the model or the photo can be detected through the plurality of face images corresponding to different moments of the same seat. For example, if it is determined that there is a change in face characteristics such as blink degree and mouth angle opening/closing degree of a plurality of face images, it is determined that the person on the seat is not a model or a photograph, but a real person.
In this embodiment, whether each seat is occupied is determined by detecting, from the image information, whether a face image exists in the spatial range corresponding to that seat, and whether the face shows motion changes is checked across several face images of the same seat taken at different moments, so as to determine whether the person on the seat is a model or a photograph. This enables accurate identification of whether a person is in the seat and avoids false recognition caused by a model or photograph placed on the seat.
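The occupancy check described above can be sketched as follows; detect_face is a hypothetical helper standing in for the trained image recognition model, returning a small feature vector (e.g. eye openness, mouth opening) or None when no face is found in the seat's region.

```python
def seat_is_occupied(frames, seat_region, detect_face, min_change=1e-3):
    """Decide, from several frames taken at different moments, whether a real person is in the seat."""
    features = [detect_face(frame, seat_region) for frame in frames]
    if any(f is None for f in features):
        return False  # no face in the seat's spatial range: treat the seat as empty
    # A photo or a model yields (almost) identical features in every frame, whereas a
    # real person shows changes such as blinking or mouth movement between frames.
    first = features[0]
    return any(sum(abs(a - b) for a, b in zip(feat, first)) > min_change
               for feat in features[1:])
```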
In addition, the present embodiment can control the pickup microphone corresponding to the target seat information to be turned on according to the target seat information. Specifically, in some embodiments, the front seat of the target vehicle corresponds to at least one pickup microphone, and the rear seat of the target vehicle corresponds to at least two pickup microphones:
after determining the target seat information of the person in the target vehicle according to the image information, the method further comprises at least one of the following steps:
firstly, controlling to start only a pickup microphone corresponding to a main driving position when the target seat information is the main driving position only; the pickup microphone corresponding to the main driving position is the pickup microphone closest to the main driving position in at least one pickup microphone corresponding to the front seat.
Specifically, when the image information indicates that the target seat information is only the main driving position, the pickup microphone corresponding to the main driving position can be controlled to be turned on.
For example, if the target vehicle is provided with only two pickup microphones and both are located in the front row, the one closest to the main driving position is turned on; if the target vehicle has two pickup microphones, one in the front row and one in the rear row, the pickup microphone corresponding to the front row is turned on; if the target vehicle is provided with four pickup microphones, the one closest to the main driving position is turned on.
Thus, only the pickup microphone corresponding to the main driving position is turned on, which ensures that the sound in the vehicle can still be picked up accurately and avoids the waste of power caused by turning on all the pickup microphones.
and secondly, when the target seat information is the main driver seat and the auxiliary driver seat, controlling to start all pickup microphones corresponding to the front seats.
Specifically, when the image information indicates that the target seat information is the main driver's seat and the co-driver's seat, all pickup microphones corresponding to the front seats can be controlled to be turned on.
For example, as one example, if the target vehicle is provided with only two pickup microphones and both are located in the front row, both front-row pickup microphones are turned on; if the target vehicle has two pickup microphones, one in the front row and one in the rear row, the pickup microphone corresponding to the front row is turned on; if the target vehicle is provided with four pickup microphones, the two pickup microphones corresponding to the front seats are turned on.
And thirdly, when the target seat information is the main driving seat and the rear-row side seat, controlling to start all pickup microphones corresponding to the front-row seats and the pickup microphones corresponding to the rear-row side seat.
Specifically, when the image information indicates that the target seat information is the main driving position and the rear-row-side seat, it is possible to control turning on all of the pickup microphones corresponding to the front-row seats and the pickup microphones corresponding to the rear-row-side seats.
For example, as one example, if the target vehicle is provided with only two pickup microphones and both are located in the front row, both front-row pickup microphones are turned on; if the target vehicle has two pickup microphones, one in the front row and one in the rear row, all the pickup microphones are turned on; if the target vehicle is provided with four pickup microphones, the two pickup microphones corresponding to the front seats and the microphone corresponding to the occupied rear-row side seat are turned on.
Fourth, when the target seat information is the main driving seat and the seats on two sides of the rear row, all pickup microphones corresponding to the front row of seats and all pickup microphones corresponding to the rear row of seats are controlled to be started.
Specifically, when the image information indicates that the target seat information is the main driving seat and the rear-row both-side seats, it is possible to control turning on all pickup microphones corresponding to the front-row seats and all pickup microphones corresponding to the rear-row seats.
For example, as one example, if the target vehicle is configured with only two pickup microphones, control turns on the two pickup microphones in the target vehicle; if the target vehicle is provided with only four pickup microphones, the control turns on the four pickup microphones in the target vehicle.
In this way, the pickup microphones corresponding to the target seat information are turned on according to the target seat information, which makes reasonable use of the pickup microphones in the target vehicle, ensures accurate pickup of the sound in the vehicle, and avoids the power loss caused by turning on every pickup microphone in the target vehicle.
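The four cases above can be summarized in a single selection routine; the sketch below assumes a mic_layout mapping of microphone ids to "front" or "rear" rows and illustrative seat names, and approximates "the microphone closest to the main driving position" by the first front-row entry.

```python
def microphones_to_enable(occupied_seats, mic_layout):
    """Return the ids of the pickup microphones to turn on for the given occupancy pattern."""
    front = [m for m, row in mic_layout.items() if row == "front"]
    rear = [m for m, row in mic_layout.items() if row == "rear"]
    seats = set(occupied_seats)
    rear_occupied = seats & {"rear_left", "rear_middle", "rear_right"}

    if seats == {"main_driver"}:
        return front[:1]         # only one front microphone (nearest the main driving position)
    if not rear_occupied:
        return front             # main driver and co-driver: all front-row microphones
    if len(rear_occupied) == 1:
        return front + rear[:1]  # one rear-side occupant: front microphones plus one rear microphone
    return front + rear          # both rear sides occupied: all microphones
```

For instance, with a two-microphone front-row layout such as {"m1": "front", "m2": "front"}, a driver-only occupancy returns ["m1"], matching the first case described above.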
Further, in some embodiments, the speech recognition algorithms include a single-microphone speech algorithm, a double-microphone speech algorithm, and a quad-microphone speech algorithm;
The method for determining the target voice recognition algorithm corresponding to the target seat information according to the corresponding relation between the preset riding position information and the voice recognition algorithm comprises at least one of the following steps:
according to the corresponding relation, under the condition that the target seat information is only the main driving position, the target voice recognition algorithm is determined to be a single microphone voice algorithm.
Specifically, the single microphone voice algorithm refers to an algorithm for recognizing based on voice signals collected by a single pickup microphone, and voice analysis and recognition are mainly performed by means of audio information collected by the single pickup microphone.
If the current image information indicates that the target seat information is only the main driving position, the target voice recognition algorithm can be determined to be a single-microphone voice algorithm, namely, the collected voice data is controlled to be subjected to voice recognition by adopting the single-microphone voice algorithm.
And secondly, determining that the target voice recognition algorithm is a double-microphone voice algorithm under the condition that the target seat information is a main driver seat and a co-driver seat according to the corresponding relation.
Specifically, the double microphone voice algorithm refers to an algorithm for recognition based on voice signals collected by two pickup microphones. The double-microphone voice recognition algorithm utilizes the audio signals collected by the two pickup microphones, and provides more sound source localization and noise reduction capability through the information such as time difference and sound intensity difference between the pickup microphones.
If the current image information indicates that the target seat information is the main driver position and the auxiliary driver position, the target voice recognition algorithm can be determined to be a double-microphone voice algorithm, namely, the collected voice data is controlled to be subjected to voice recognition by adopting the double-microphone voice algorithm.
Thirdly, according to the corresponding relation, under the condition that the target seat information is a main driving seat and a seat on at least one side of a rear seat, the target voice recognition algorithm is determined to be a four-microphone voice algorithm.
Specifically, the four-microphone voice algorithm is an algorithm for identifying based on voice signals collected by four pickup microphones, and the four-microphone voice algorithm can locate the sound source position more accurately and provide stronger noise reduction and echo cancellation capabilities through the audio signals collected by the four pickup microphones, so that the performance and accuracy of voice identification are improved.
If the current image information indicates that the target seat information is the main driving seat and at least one seat of the rear seat, the target voice recognition algorithm is determined to be a four-microphone voice algorithm, and the collected voice data can be subjected to voice recognition by adopting the four-microphone voice algorithm.
It should be noted that, compared with the double-microphone and four-microphone voice algorithms, the single-microphone voice algorithm has the lowest computational load and the least power consumption, while the four-microphone voice recognition algorithm has the highest computational load and the most power consumption.
For example, taking queries per second (Queries Per Second, QPS) as an indicator, where QPS represents the number of voice recognition requests the system can process per second and is typically used to measure the processing capacity and performance of a voice recognition system: the single-microphone voice algorithm typically has no additional processing requirement, the double-microphone voice algorithm needs to process 1-3 required items, and the four-microphone voice algorithm needs to process 4-10. It can be seen that the single-microphone voice algorithm has the lowest computational load, the four-microphone voice algorithm has the highest, and the double-microphone voice algorithm lies in between.
In this way, this embodiment determines the target voice recognition algorithm as described above, so that the voice recognition algorithm used corresponds to the number and positions of the occupants in the target vehicle, which reduces CPU load and achieves the purpose of saving electricity.
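A compact sketch of this correspondence is shown below; the seat and algorithm identifiers are illustrative, and the fallback for occupancy patterns not listed in the description is an assumption.

```python
def target_recognition_algorithm(occupied_seats):
    """Map the target seat information to the preset speech recognition algorithm."""
    seats = set(occupied_seats)
    rear = {"rear_left", "rear_middle", "rear_right"}
    if seats == {"main_driver"}:
        return "single_mic"  # lowest computational load and power consumption
    if seats == {"main_driver", "co_driver"}:
        return "dual_mic"
    if "main_driver" in seats and seats & rear:
        return "quad_mic"    # strongest localization and noise reduction, highest load
    return "quad_mic"        # assumed fallback for combinations not covered above
```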
Additionally, in some embodiments, each pickup microphone in the target vehicle corresponds to one pickup microphone data; the identifying the voice in the target vehicle according to the target voice identification algorithm comprises the following steps:
acquiring voice data in the target vehicle, wherein the voice data comprises first data and second data, the first data is data picked up by a pickup microphone in an on state, and the second data is zero corresponding to the pickup microphone in an off state;
And recognizing the voice in the target vehicle through the target voice recognition algorithm according to the voice data.
Specifically, the pickup microphone data is sound signal data collected by the pickup microphone, and when the vehicle starts the voice recognition function, the pickup microphone can collect the voice input in the vehicle and convert the voice input into a digital signal so as to perform voice recognition processing. For example, the driver's instructions or the passenger's dialogue content, by analyzing and processing these pickup microphone data, the system can implement voice command control, telephone interaction, navigation instructions, etc.
When the pickup microphone is turned off, if no processing is performed, the voice recognition unit may receive some environmental noise or other non-voice signals, causing a recognition error by the voice recognition unit, so that voice data corresponding to the pickup microphone in the turned-off state may be set to zero; that is, the first data is the data picked up by the pickup microphone in the on state, and the second data is zero corresponding to the pickup microphone in the off state.
In this embodiment, the voice data includes first data and second data: the first data is the data picked up by the pickup microphones in the on state, and the second data, corresponding to the pickup microphones in the off state, is zero. Setting the data of the pickup microphones corresponding to unoccupied seats to 0 simulates silent voice input on those channels, which effectively reduces the system's recognition error rate and avoids affecting the voice recognition function.
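The sketch below illustrates how the two kinds of data can be assembled into one multi-channel buffer; the microphone identifiers are assumptions, and each active microphone is assumed to have already delivered frame_len samples.

```python
import numpy as np

def assemble_voice_data(all_mic_ids, captured, frame_len):
    """Stack one channel per microphone: real samples for microphones that are on (first data),
    all-zero channels for microphones that are off (second data)."""
    channels = []
    for mic_id in all_mic_ids:
        if mic_id in captured:
            channels.append(np.asarray(captured[mic_id], dtype=np.float32))
        else:
            channels.append(np.zeros(frame_len, dtype=np.float32))  # simulated silence
    return np.stack(channels)  # shape: (number_of_microphones, frame_len)
```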
In addition, in some embodiments, after the target voice recognition algorithm recognizes the voice in the target vehicle, a feedback message input by the user may also be received, where the feedback message is used to indicate satisfaction of the user with respect to the voice recognition result; and updating the corresponding relation between the riding position information and the voice recognition algorithm according to the satisfaction degree.
Specifically, after the user performs voice interaction with the voice interaction system in the vehicle, feedback information may be input, where the feedback information is used to indicate satisfaction of the user with respect to the voice recognition result, for example, if the user is inaccurate with respect to the voice recognition result, the user may input dissatisfaction.
After the vehicle receives the feedback information, the corresponding relation between the riding position information and the voice recognition algorithm can be updated according to the satisfaction degree, so that the updated voice recognition algorithm can more accurately recognize the voice in the vehicle, and the accuracy of voice recognition is improved.
Therefore, the corresponding relation between the riding position information and the voice recognition algorithm is updated according to the satisfaction degree, so that the voice recognition algorithm is updated according to the user requirement, and the accuracy of voice recognition is improved.
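One possible update policy is sketched below: when the user reports dissatisfaction, the entry for the current occupancy pattern is stepped up to a more capable multi-microphone algorithm. The application does not prescribe a specific update rule, so this escalation order is purely an illustrative assumption.

```python
def update_correspondence(correspondence, occupied_seats, current_algorithm, satisfied):
    """Adjust the seating-position-to-algorithm mapping from a user's feedback message."""
    upgrade = {"single_mic": "dual_mic", "dual_mic": "quad_mic"}  # assumed escalation order
    key = frozenset(occupied_seats)
    if not satisfied and current_algorithm in upgrade:
        correspondence[key] = upgrade[current_algorithm]  # use a stronger algorithm next time
    return correspondence
```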
The following describes the operation of the module of a method for speech recognition according to an embodiment of the present application with reference to fig. 2, where fig. 2 shows:
firstly, a vehicle body signal monitoring module can detect the door opening and closing state or gear information of a target vehicle in real time, and when detecting that the door opening and closing state or the gear information of the target vehicle changes, the vehicle body signal monitoring module sends a detection signal to an image recognition module.
Then, after receiving the signal, the image recognition module recognizes the image information acquired by the camera installed in the vehicle, and sends the recognition result to the voice recognition module and the system driving module.
Finally, according to the recognition result of the image recognition module, the system driving module controls the voice recognition module to analyze and process the sound data acquired by the pickup microphone hardware with the voice algorithm corresponding to the image information, and at the same time controls the corresponding pickup microphones to be turned off and sets their data to 0.
A method for speech recognition according to an embodiment of the present application will be described with reference to fig. 3. As shown in fig. 3, the method includes:
First, the door opening and closing state and gear information of the target vehicle are monitored in real time through the CAN bus. When either changes, image information in the target vehicle is acquired, and whether a face image exists in the spatial range corresponding to each seat is detected from the image information to determine the target seat information of occupied seats. It should be noted that, in order to ensure the accuracy of image recognition, various riding postures and facial features of occupants can be extracted by an image recognition algorithm for training, so as to obtain an image recognition model used to recognize the image information, which improves the accuracy of image recognition.
Then, it is determined whether two pickup microphones or four pickup microphones are mounted in the target vehicle.
If it is detected that a passenger is in the co-driver's seat, that is, the image information indicates that a passenger is within the corresponding spatial range of the target vehicle, at least one pickup microphone corresponding to the front seats is turned on and the other pickup microphones are turned off. For example, if the target vehicle is provided with only two pickup microphones and both are located in the front row, both front-row pickup microphones are turned on and any remaining pickup microphones are turned off; if the target vehicle has two pickup microphones, one in the front row and one in the rear row, the pickup microphone corresponding to the front row is turned on and the other is turned off; if the target vehicle is provided with four pickup microphones, the two pickup microphones corresponding to the front seats are turned on, the remaining pickup microphones are turned off, and the voice recognition controller is controlled to perform voice recognition with the double-microphone voice algorithm.
If the image information indicates that only the main driving position is occupied within the spatial range of the target vehicle, the pickup microphone corresponding to the main driving position is turned on and the remaining pickup microphones are turned off. For example, if the target vehicle is provided with only two pickup microphones and both are located in the front row, the one closest to the main driving position is turned on; if the target vehicle has two pickup microphones, one in the front row and one in the rear row, the pickup microphone corresponding to the front row is turned on and the other is turned off; if the target vehicle is provided with four pickup microphones, the pickup microphone closest to the main driving position is turned on, the remaining pickup microphones are turned off, and the voice recognition controller is controlled to perform voice recognition with the single-microphone voice algorithm.
Finally, the closed pickup microphone data is set to be 0, so that voice input data is simulated to be in a mute state, the error rate of system identification is effectively reduced, and the influence on a voice identification function is avoided.
An internal structure of a speech recognition module according to an embodiment of the present application will be described with reference to fig. 4, and as shown in fig. 4, the structure includes:
the voice recognition module internally comprises six parts, namely audio data acquisition, voice activity detection, noise reduction, natural language understanding, automatic voice recognition and sound source positioning.
Acquiring audio data refers to acquiring real-time audio input from an input source (e.g., a pickup microphone, an audio file, etc.) for subsequent speech recognition processing.
Voice activity detection (Voice Activity Detection, VAD) refers to the process of identifying active speech segments (periods that contain valid speech) and non-speech segments (periods that do not contain valid speech) in a speech signal. In speech recognition or speech processing, long audio input usually requires recognizing the speech portions for subsequent processing while ignoring background noise or silence, and voice activity detection determines when there is speech activity and when there is none. By accurately determining the periods of speech activity, the speech signal can be extracted and processed more effectively and the recognition accuracy and quality of the speech portions improved, while noise is suppressed, the amount of output data is reduced, and computational resources are saved.
Noise reduction (Noise reduction) refers to a process of reducing or removing Noise components in an audio signal. In speech recognition and audio processing, noise refers to unwanted ambient sound or other non-speech sound components that are not related to the signal of interest. Noise may be generated by various sources, such as background noise, electrical noise, wind noise, and the like. These noise reduction methods may be used alone or in combination, and the specific selection and application method depends on factors such as noise environment, application requirements, and performance requirements. Through the noise reduction processing, the quality, the definition and the understandability of the audio signal can be effectively improved, and the performance and the user experience of the voice related application are improved.
Natural language understanding (Natural Language Understanding, NLU) refers to the ability to enable a computer to understand and interpret natural language. It involves converting natural language text or speech input into a form the computer can understand and process, so as to extract intent, entities, relationships and other information from the text or speech; the goal of natural language understanding is to enable the computer to understand and process natural language input in a manner similar to humans. With effective natural language understanding technology, the computer can better understand and respond to users' language demands, providing a more intelligent and humanized interaction experience.
Automatic speech recognition (Automatic Speech Recognition, ASR) refers to an automated process that converts human speech input into a textual representation. It is a technique that converts sound signals into corresponding literal text by analyzing and decoding them.
Sound source localization refers to the process of determining the source location of sound in an audio signal. The method aims at determining the direction and the position of sound by analyzing the characteristics of time delay, intensity, frequency spectrum and the like of the sound signal.
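Chaining the six parts together can be sketched as below; every stage (vad, denoise, asr, nlu, localize) is a hypothetical callable standing in for the corresponding component, not an API defined by this application.

```python
def recognize_utterance(frames, vad, denoise, asr, nlu, localize):
    """Run one batch of audio frames through VAD, noise reduction, ASR, NLU and localization."""
    speech_frames = [f for f in frames if vad(f)]  # keep only frames with voice activity
    if not speech_frames:
        return None                                # nothing but silence or background noise
    clean = [denoise(f) for f in speech_frames]    # noise reduction
    text = asr(clean)                              # automatic speech recognition -> text
    intent = nlu(text)                             # natural language understanding -> intent/entities
    direction = localize(clean)                    # sound source localization -> speaker position
    return {"text": text, "intent": intent, "source_direction": direction}
```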
Any combination of the above optional solutions may be adopted to form an optional embodiment of the present application, and will not be repeated.
It should be understood that the sequence number of each step in the above embodiment does not mean the sequence of execution sequence, and the execution sequence of each process should be determined by the function and the internal logic of each process, and should not be construed as limiting the process in the embodiment of the present application.
The following are examples of the apparatus of the present application that may be used to perform the method embodiments of the present application. For details not disclosed in the embodiments of the apparatus of the present application, please refer to the embodiments of the method of the present application.
Fig. 5 is a device for speech recognition according to an embodiment of the present application, as shown in fig. 5, the device includes:
an obtaining module 501, configured to obtain image information in a target vehicle when a door opening and closing state or gear information of the target vehicle changes;
A first determining module 502, configured to determine, according to the image information, target seat information on which a person in the target vehicle sits;
a second determining module 503, configured to determine a target voice recognition algorithm corresponding to the target seat information according to a preset correspondence between the seating position information and the voice recognition algorithm;
and the voice recognition module 504 is configured to recognize voice in the target vehicle according to the target voice recognition algorithm.
In some embodiments, the first determining module is further configured to control to turn off a pickup microphone corresponding to a seat in the target vehicle if it is determined that there is no person on the seat based on the image information; and controlling to cut off the power supply of the pickup microphone.
In some embodiments, the first determining module is further configured to detect, according to the image information, whether a face image exists in a spatial range corresponding to each seat; if the face image does not exist in the space range corresponding to the seat, determining that no person is taken on the seat; if the face images exist in the space range corresponding to the seat, acquiring a plurality of face images corresponding to different moments of the seat, and determining that the seat is occupied with a person under the condition that the plurality of face images are different.
In some embodiments, the front seat of the target vehicle corresponds to at least one pickup microphone and the rear seat of the target vehicle corresponds to at least two pickup microphones:
the first determining module is specifically configured to perform at least one of the following:
when the target seat information is only the main driving position, controlling to only start a pickup microphone corresponding to the main driving position; wherein the pickup microphone corresponding to the main driving position is a pickup microphone closest to the main driving position among at least one pickup microphone corresponding to the front seat; if the target seat information is the main driver seat and the auxiliary driver seat, controlling to start all pickup microphones corresponding to the front seats; when the target seat information is the main driving position and the rear-row side seat, controlling to start all pickup microphones corresponding to the front-row seat and the pickup microphones corresponding to the rear-row side seat; and if the target seat information is the main driving position and the seats at the two sides of the rear row, controlling to start all pickup microphones corresponding to the front row of seats and all pickup microphones corresponding to the rear row of seats.
In some embodiments, the speech recognition algorithms include single-microphone speech algorithms, double-microphone speech algorithms, and quad-microphone speech algorithms; the second determining module is specifically configured to perform at least one of the following:
according to the corresponding relation, under the condition that the target seat information is only the main driving position, determining that the target voice recognition algorithm is a single microphone voice algorithm; according to the corresponding relation, under the condition that the target seat information is a main driver's seat and a co-driver's seat, determining that the target voice recognition algorithm is a double-microphone voice algorithm; and according to the corresponding relation, determining that the target voice recognition algorithm is a four-microphone voice algorithm under the condition that the target seat information is a main driving seat and at least one seat of a rear seat.
In some embodiments, each pickup microphone in the target vehicle corresponds to one pickup microphone data;
the voice recognition module is specifically configured to obtain voice data in the target vehicle, where the voice data includes first data and second data, the first data is data picked up by a pickup microphone in an on state, and the second data is zero corresponding to the pickup microphone in an off state; and recognizing the voice in the target vehicle through the target voice recognition algorithm according to the voice data.
In some embodiments, the second determining module is further configured to receive a feedback message input by a user, where the feedback message is used to indicate satisfaction of the user with respect to a speech recognition result; and updating the corresponding relation between the riding position information and the voice recognition algorithm according to the satisfaction degree.
The device provided by the embodiment of the application can realize all the method steps of the method embodiment and achieve the same technical effects, and is not described herein.
Fig. 6 is a schematic diagram of an electronic device 6 according to an embodiment of the present application. As shown in fig. 6, the electronic device 6 of this embodiment includes: a processor 601, a memory 602 and a computer program 603 stored in the memory 602 and executable on the processor 601. The steps of the various method embodiments described above are implemented by the processor 601 when executing the computer program 603. Alternatively, the processor 601, when executing the computer program 603, performs the functions of the modules/units of the apparatus embodiments described above.
The electronic device 6 may be a desktop computer, a notebook computer, a palm computer, a cloud server, or the like. The electronic device 6 may include, but is not limited to, a processor 601 and a memory 602. It will be appreciated by those skilled in the art that fig. 6 is merely an example of the electronic device 6 and is not limiting of the electronic device 6 and may include more or fewer components than shown, or different components.
The processor 601 may be a central processing unit (Central Processing Unit, CPU) or other general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like.
The memory 602 may be an internal storage unit of the electronic device 6, for example, a hard disk or a memory of the electronic device 6. The memory 602 may also be an external storage device of the electronic device 6, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Card (Flash Card) or the like, which are provided on the electronic device 6. The memory 602 may also include both internal and external storage units of the electronic device 6. The memory 602 is used to store computer programs and other programs and data required by the electronic device.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit.
The integrated modules/units may be stored in a readable storage medium if implemented in the form of software functional units and sold or used as stand-alone products. Based on such understanding, the present application may implement all or part of the flow of the method of the above embodiment, or may be implemented by a computer program to instruct related hardware, and the computer program may be stored in a readable storage medium, where the computer program may implement the steps of the method embodiments described above when executed by a processor. The computer program may comprise computer program code, which may be in source code form, object code form, executable file or in some intermediate form, etc. The readable storage medium may include: any entity or device capable of carrying computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content of the computer readable medium can be appropriately increased or decreased according to the requirements of the jurisdiction's jurisdiction and the patent practice, for example, in some jurisdictions, the computer readable medium does not include electrical carrier signals and telecommunication signals according to the jurisdiction and the patent practice.
The above embodiments are intended only to illustrate the technical solution of the present application, not to limit it. Although the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A speech recognition method, comprising:
acquiring image information of the interior of a target vehicle when the door open/closed state or the gear information of the target vehicle changes;
determining, according to the image information, target seat information indicating the seats occupied by persons in the target vehicle;
determining a target speech recognition algorithm corresponding to the target seat information according to a preset correspondence between seating position information and speech recognition algorithms; and
recognizing speech in the target vehicle according to the target speech recognition algorithm.
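The method of claim 1 can be read as a four-step pipeline. The following is a minimal Python sketch of that flow only; the helper names (capture_cabin_image, detect_occupied_seats), the seat labels and the data structures are illustrative assumptions, not part of the claim text.

```python
# Illustrative only: helper names, seat labels and data structures are assumptions.
from typing import Callable, Dict, FrozenSet, List, Optional


def on_vehicle_state_change(
    door_or_gear_changed: bool,
    capture_cabin_image: Callable[[], bytes],
    detect_occupied_seats: Callable[[bytes], FrozenSet[str]],
    seat_to_algorithm: Dict[FrozenSet[str], Callable[[List[float]], str]],
    audio_frames: List[float],
) -> Optional[str]:
    """Select and run a speech recognizer based on which seats are occupied."""
    if not door_or_gear_changed:
        return None                              # step 1 is triggered by a state change
    image = capture_cabin_image()                # step 1: acquire in-cabin image information
    occupied = detect_occupied_seats(image)      # step 2: target seat information
    recognizer = seat_to_algorithm[occupied]     # step 3: preset correspondence lookup
    return recognizer(audio_frames)              # step 4: recognize speech in the vehicle
```

A table keyed by frozensets of seat labels is only one simple way to express the preset correspondence recited in the claim.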
2. The method of claim 1, further comprising, after acquiring the image information in the target vehicle:
when it is determined from the image information that a seat in the target vehicle is unoccupied, controlling the pickup microphone corresponding to that seat to be turned off; and
controlling the power supply of that pickup microphone to be cut off.
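Claim 2 adds a shut-down branch for empty seats. A hedged sketch of that behaviour, where MicrophoneController is an assumed stand-in for whatever interface the vehicle actually exposes:

```python
from typing import Set


class MicrophoneController:
    """Illustrative stand-in for the vehicle's actual microphone control interface."""

    def turn_off(self, seat_id: str) -> None:
        print(f"pickup microphone for {seat_id}: switched off")

    def cut_power(self, seat_id: str) -> None:
        print(f"pickup microphone for {seat_id}: power supply cut")


def handle_seat(seat_id: str, occupied_seats: Set[str], mic: MicrophoneController) -> None:
    # Per claim 2: when the image information shows the seat is unoccupied,
    # switch off its pickup microphone and also cut the microphone's power supply.
    if seat_id not in occupied_seats:
        mic.turn_off(seat_id)
        mic.cut_power(seat_id)
```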
3. The method of claim 1, further comprising, before determining the target seat information of the persons in the target vehicle according to the image information:
detecting, according to the image information, whether a face image exists in the spatial range corresponding to each seat;
if no face image exists in the spatial range corresponding to a seat, determining that the seat is unoccupied; and
if a face image exists in the spatial range corresponding to a seat, acquiring a plurality of face images of that seat at different moments, and determining that the seat is occupied when the plurality of face images differ from one another.
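Claim 3 gates occupancy on two checks: a face must be detected in the seat's spatial range, and the face images taken at different moments must not all be identical (a static, unchanging image is not treated as a confirmed occupant). A sketch under those assumptions, with detect_face_in_region as a hypothetical detector and the region format assumed:

```python
from typing import Callable, List, Optional, Tuple
import numpy as np


def seat_is_occupied(
    frames: List[np.ndarray],                 # cabin images captured at different moments
    seat_region: Tuple[int, int, int, int],   # assumed (x, y, w, h) region for this seat
    detect_face_in_region: Callable[[np.ndarray, Tuple[int, int, int, int]], Optional[np.ndarray]],
) -> bool:
    """Return True only if a face is present in the seat region and changes over time."""
    faces = [detect_face_in_region(frame, seat_region) for frame in frames]
    if any(face is None for face in faces):
        return False                          # no face in the seat's spatial range: unoccupied
    # A live occupant moves between moments; identical face crops in every frame
    # are not treated as confirming occupancy.
    first = faces[0]
    return any(not np.array_equal(first, face) for face in faces[1:])
```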
4. The method of claim 1, wherein the front seats of the target vehicle correspond to at least one pickup microphone and the rear seats of the target vehicle correspond to at least two pickup microphones; and
after determining the target seat information of the persons in the target vehicle according to the image information, the method further comprises at least one of the following:
when the target seat information is only the driver's seat, controlling only the pickup microphone corresponding to the driver's seat to be turned on, wherein the pickup microphone corresponding to the driver's seat is the pickup microphone closest to the driver's seat among the at least one pickup microphone corresponding to the front seats;
when the target seat information is the driver's seat and the front passenger seat, controlling all pickup microphones corresponding to the front seats to be turned on;
when the target seat information is the driver's seat and one rear side seat, controlling all pickup microphones corresponding to the front seats and the pickup microphone corresponding to that rear side seat to be turned on; and
when the target seat information is the driver's seat and both rear side seats, controlling all pickup microphones corresponding to the front seats and all pickup microphones corresponding to the rear seats to be turned on.
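Claim 4 pairs each occupancy pattern with a set of microphones to switch on. A sketch of that dispatch, assuming seat labels 'driver', 'codriver', 'rear_left', 'rear_right' and an illustrative microphone wiring; the fallback branch is not specified by the claim:

```python
from typing import Dict, List, Set

# Illustrative wiring (assumed): which microphone IDs serve which seats.
FRONT_MICS: List[str] = ["front_mic_driver", "front_mic_codriver"]
REAR_MICS: Dict[str, str] = {"rear_left": "rear_mic_left", "rear_right": "rear_mic_right"}


def microphones_to_enable(occupied: Set[str]) -> Set[str]:
    """Map the occupied-seat pattern to the pickup microphones to switch on."""
    rear_occupied = {seat for seat in occupied if seat in REAR_MICS}
    if occupied == {"driver"}:
        return {"front_mic_driver"}                       # only the mic closest to the driver's seat
    if occupied == {"driver", "codriver"}:
        return set(FRONT_MICS)                            # all front-row microphones
    if "driver" in occupied and len(rear_occupied) == 1:
        rear_seat = next(iter(rear_occupied))
        return set(FRONT_MICS) | {REAR_MICS[rear_seat]}   # front mics plus that rear seat's mic
    if "driver" in occupied and len(rear_occupied) == 2:
        return set(FRONT_MICS) | set(REAR_MICS.values())  # all front and all rear microphones
    return set(FRONT_MICS)                                # fallback, not specified by the claim
```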
5. The method of claim 1, wherein the speech recognition algorithms comprise a single-microphone algorithm, a dual-microphone algorithm and a four-microphone algorithm; and
determining the target speech recognition algorithm corresponding to the target seat information according to the preset correspondence between seating position information and speech recognition algorithms comprises at least one of the following:
determining, according to the correspondence, that the target speech recognition algorithm is the single-microphone algorithm when the target seat information is only the driver's seat;
determining, according to the correspondence, that the target speech recognition algorithm is the dual-microphone algorithm when the target seat information is the driver's seat and the front passenger seat; and
determining, according to the correspondence, that the target speech recognition algorithm is the four-microphone algorithm when the target seat information is the driver's seat and at least one rear seat.
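Claim 5 collapses the correspondence of claim 1 into three branches. A sketch of that selection, with the returned strings standing in for the single-, dual- and four-microphone algorithms named in the claim:

```python
from typing import Set


def select_recognition_algorithm(occupied: Set[str]) -> str:
    """Choose the speech recognition algorithm from the occupied-seat pattern."""
    rear_seats = {"rear_left", "rear_right"}          # assumed seat labels
    if occupied == {"driver"}:
        return "single_mic_algorithm"                 # only the driver's seat is occupied
    if occupied == {"driver", "codriver"}:
        return "dual_mic_algorithm"                   # both front seats occupied
    if "driver" in occupied and occupied & rear_seats:
        return "four_mic_algorithm"                   # driver plus at least one rear seat
    return "single_mic_algorithm"                     # fallback, not specified by the claim
```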
6. The method of claim 1, wherein recognizing speech in the target vehicle according to the target speech recognition algorithm comprises:
acquiring voice data in the target vehicle, wherein the voice data comprises first data and second data, the first data being data picked up by pickup microphones in the on state, and the second data being zeros corresponding to pickup microphones in the off state; and
recognizing the speech in the target vehicle by the target speech recognition algorithm according to the voice data.
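Claim 6 keeps the recognizer's input layout fixed: channels belonging to switched-off microphones are filled with zeros rather than dropped. A sketch of assembling such a buffer, where read_channel is a hypothetical audio source assumed to return frame_len samples per microphone:

```python
from typing import Callable, List, Set
import numpy as np


def assemble_voice_data(
    mic_ids: List[str],
    enabled: Set[str],
    read_channel: Callable[[str], np.ndarray],   # assumed to return frame_len samples
    frame_len: int,
) -> np.ndarray:
    """Stack one row per microphone: picked-up samples when on, zeros when off."""
    rows = []
    for mic in mic_ids:
        if mic in enabled:
            rows.append(read_channel(mic))        # "first data": samples from an on-state microphone
        else:
            rows.append(np.zeros(frame_len))      # "second data": zeros for an off-state microphone
    return np.stack(rows)                         # shape: (number of microphones, frame_len)
```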
7. The method of claim 4, further comprising, after recognizing the speech in the target vehicle according to the target speech recognition algorithm:
receiving feedback information input by a user, the feedback information indicating the user's degree of satisfaction with a speech recognition result; and
updating the correspondence between the seating position information and the speech recognition algorithms according to the degree of satisfaction.
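Claim 7 does not fix the update rule. One possible reading, sketched below purely as an assumption, keeps a satisfaction history per seat pattern and swaps in an alternative algorithm when recent feedback is consistently poor; the threshold and the alternatives table are not claim language.

```python
from collections import defaultdict
from typing import Dict, FrozenSet


class CorrespondenceUpdater:
    """Adjust the seat-pattern -> algorithm correspondence from user satisfaction feedback."""

    def __init__(self, mapping: Dict[FrozenSet[str], str], alternatives: Dict[str, str]):
        self.mapping = mapping            # current correspondence: seat pattern -> algorithm name
        self.alternatives = alternatives  # assumed: a fallback algorithm for each algorithm name
        self.history = defaultdict(list)  # (seat pattern, algorithm) -> satisfaction scores

    def record_feedback(self, pattern: FrozenSet[str], satisfaction: float) -> None:
        algorithm = self.mapping[pattern]
        scores = self.history[(pattern, algorithm)]
        scores.append(satisfaction)
        # Assumed rule: if the last three interactions score below 0.5 on average,
        # switch this seat pattern to the alternative algorithm.
        if len(scores) >= 3 and sum(scores[-3:]) / 3 < 0.5:
            self.mapping[pattern] = self.alternatives.get(algorithm, algorithm)
```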
8. A speech recognition apparatus, comprising:
an acquisition module, configured to acquire image information of the interior of a target vehicle when the door open/closed state or the gear information of the target vehicle changes;
a first determining module, configured to determine, according to the image information, target seat information indicating the seats occupied by persons in the target vehicle;
a second determining module, configured to determine a target speech recognition algorithm corresponding to the target seat information according to a preset correspondence between seating position information and speech recognition algorithms; and
a speech recognition module, configured to recognize speech in the target vehicle according to the target speech recognition algorithm.
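Claim 8 restates the method of claim 1 as four cooperating modules. A minimal structural sketch, with each module reduced to a callable attribute; this is an illustrative composition, not the application's actual implementation:

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional


@dataclass
class SpeechRecognitionApparatus:
    """Four modules mirroring claim 8: acquire, determine seats, select algorithm, recognize."""
    acquire_image: Callable[[], bytes]                                           # acquisition module
    determine_seats: Callable[[bytes], FrozenSet[str]]                           # first determining module
    select_algorithm: Callable[[FrozenSet[str]], Callable[[List[float]], str]]   # second determining module
    audio_source: Callable[[], List[float]]                          # feeds the speech recognition module

    def run(self, door_or_gear_changed: bool) -> Optional[str]:
        if not door_or_gear_changed:
            return None
        seats = self.determine_seats(self.acquire_image())
        recognizer = self.select_algorithm(seats)
        return recognizer(self.audio_source())                       # speech recognition module
```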
9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 7 when executing the computer program.
10. A readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
CN202311041475.0A 2023-08-17 2023-08-17 Speech recognition method, device, electronic equipment and readable storage medium Pending CN117116268A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311041475.0A CN117116268A (en) 2023-08-17 2023-08-17 Speech recognition method, device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311041475.0A CN117116268A (en) 2023-08-17 2023-08-17 Speech recognition method, device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN117116268A true CN117116268A (en) 2023-11-24

Family

ID=88795816

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311041475.0A Pending CN117116268A (en) 2023-08-17 2023-08-17 Speech recognition method, device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN117116268A (en)

Similar Documents

Publication Publication Date Title
CN110070868B (en) Voice interaction method and device for vehicle-mounted system, automobile and machine readable medium
CN108122556B (en) Method and device for reducing false triggering of voice wake-up instruction words of driver
CN109767769B (en) Voice recognition method and device, storage medium and air conditioner
EP1933303B1 (en) Speech dialog control based on signal pre-processing
CN112397065A (en) Voice interaction method and device, computer readable storage medium and electronic equipment
KR102487160B1 (en) Audio signal quality enhancement based on quantitative signal-to-noise ratio analysis and adaptive wiener filtering
JP6977004B2 (en) In-vehicle devices, methods and programs for processing vocalizations
CN110556103A (en) Audio signal processing method, apparatus, system, device and storage medium
US20160140959A1 (en) Speech recognition system adaptation based on non-acoustic attributes
CN110120217B (en) Audio data processing method and device
CN110673096B (en) Voice positioning method and device, computer readable storage medium and electronic equipment
GB2522506A (en) Audio based system method for in-vehicle context classification
CN110402584A (en) Interior call control apparatus, cabincommuni cation system and interior call control method
CN114973209A (en) Method, device, equipment, medium and vehicle for recognizing emotion of driver
CN117116268A (en) Speech recognition method, device, electronic equipment and readable storage medium
CN110737422B (en) Sound signal acquisition method and device
CN109243457B (en) Voice-based control method, device, equipment and storage medium
US11580958B2 (en) Method and device for recognizing speech in vehicle
CN114550720A (en) Voice interaction method and device, electronic equipment and storage medium
CN113771703A (en) Automobile copilot seat adjusting method and system
CN113504891A (en) Volume adjusting method, device, equipment and storage medium
CN113535308A (en) Language adjusting method, language adjusting device, electronic equipment and medium
CN117370961B (en) Vehicle voice interaction method and system
CN111354353A (en) Voice data processing method and device
US20230290342A1 (en) Dialogue system and control method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination