CN117995188A

CN117995188A - Method, apparatus and storage medium for performing operations

Info

Publication number: CN117995188A
Application number: CN202410197689.5A
Authority: CN
Inventors: 于劲道; 钱伟
Original assignee: Chery Automobile Co Ltd
Current assignee: Chery Automobile Co Ltd
Priority date: 2024-02-22
Filing date: 2024-02-22
Publication date: 2024-05-07

Abstract

The present disclosure provides a method, an apparatus, and a storage medium for performing an operation, and belongs to the technical field of speech processing. In the method, a sound collection component collects the user's voice, a noise reduction chip performs noise reduction processing on the voice, and a processor performs speech recognition on the denoised voice and performs the corresponding operation according to the recognition result. Because the noise reduction chip reduces the noise in the user's voice before speech recognition, the accuracy of speech recognition is effectively improved, which in turn improves the accuracy with which the vehicle-mounted terminal performs operations.

Description

Method, apparatus and storage medium for performing operations

Technical Field

The present disclosure relates to the field of speech processing technologies, and in particular, to a method, an apparatus, and a storage medium for performing operations.

Background

Vehicles generally provide a voice processing function: the user simply speaks a request, and the vehicle-mounted terminal recognizes the user's voice, derives an operation instruction, and executes it, so that the user can control and operate the vehicle by voice. However, speech processing performed in the presence of noise may produce inaccurate results, which in turn may cause operations to be performed inaccurately.

Disclosure of Invention

To solve the problems in the related art, the present disclosure provides a method, an apparatus, and a storage medium for performing operations. The technical solutions are as follows:

In a first aspect, a method for performing an operation is provided. The method is applied to a vehicle-mounted terminal that includes a processor, a noise reduction chip, and a sound collection component, and includes the following steps:

The sound collection component collects first audio data and sends the first audio data to the noise reduction chip;

The noise reduction chip performs noise reduction processing on the first audio data to obtain second audio data, and sends the second audio data to the processor;

The processor performs voice recognition on the second audio data, and performs a first operation based on a voice recognition result.

In one possible implementation, before the sound collection component collects the first audio data, the method further includes:

The processor receives an activation instruction for the sound collection component and activates the sound collection component.

In one possible implementation, the processor performing speech recognition on the second audio data includes:

The processor inputs the second audio data into a speech recognition model to obtain the recognition result, output by the speech recognition model, corresponding to the second audio data.

In one possible implementation, after the processor performs voice recognition on the second audio data, the method further includes:

Acquiring indication information corresponding to the first operation; and

Outputting the indication information.

In one possible implementation, the vehicle-mounted terminal further includes a communication component;

The method further includes the following steps:

The processor acquires third audio data received by the communication component;

The processor sends the third audio data to the noise reduction chip;

The noise reduction chip performs noise reduction processing on the third audio data to obtain fourth audio data, and sends the fourth audio data to the processor;

The processor performs speech recognition on the fourth audio data, and performs a second operation based on the speech recognition result.

In one possible implementation, the speech recognition result is text information, and performing the first operation based on the speech recognition result includes:

Determining an operation identifier corresponding to the text information; and

Executing the first operation corresponding to the operation identifier.

In one possible implementation, the speech recognition result is an operation identifier, and performing the first operation based on the speech recognition result includes:

Executing the first operation corresponding to the operation identifier.

In a second aspect, there is provided an apparatus for performing an operation, the apparatus comprising a processor, a noise reduction chip, and a sound collection component;

The sound collection component is used for collecting first audio data and sending the first audio data to the noise reduction chip;

The noise reduction chip is used for carrying out noise reduction processing on the first audio data to obtain second audio data, and sending the second audio data to the processor;

The processor is used for carrying out voice recognition on the second audio data and executing a first operation based on a voice recognition result.

In a possible implementation manner, the processor is further configured to receive an activation instruction for the sound collection component, and activate the sound collection component.

In a possible implementation manner, the processor is configured to input the second audio data into a speech recognition model, and obtain a recognition result corresponding to the second audio data output by the speech recognition model.

In one possible implementation, the processor is configured to:

Acquire indication information corresponding to the first operation; and

Output the indication information.

In one possible implementation, the apparatus further includes a communication component;

The processor is configured to:

Acquire third audio data received by the communication component; and

Send the third audio data to the noise reduction chip;

The noise reduction chip is configured to perform noise reduction processing on the third audio data to obtain fourth audio data, and send the fourth audio data to the processor;

The processor is configured to perform speech recognition on the fourth audio data, and perform a second operation based on the speech recognition result.

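In one possible implementation, the speech recognition result is text information, and the processor is configured to:

Determine an operation identifier corresponding to the text information; and

Execute the first operation corresponding to the operation identifier.

In one possible implementation, the speech recognition result is an operation identifier, and the processor is configured to execute the first operation corresponding to the operation identifier.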

In a third aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores computer program code which, when executed by a computer device, performs the method provided in the first aspect and its possible implementations.

In a fourth aspect, a computer program product is provided. The computer program product includes computer program code which, when executed by a computer device, performs the method provided in the first aspect and its possible implementations.

With the method provided in the present disclosure, the sound collection component collects the user's voice, the noise reduction chip performs noise reduction processing on it, and the processor performs speech recognition on it and performs the corresponding operation according to the recognition result. Because the noise reduction chip reduces the noise in the user's voice before speech recognition, the accuracy of speech recognition is effectively improved, which in turn improves the accuracy with which the vehicle-mounted terminal performs operations.

Drawings

Fig. 1 is a schematic structural diagram of a vehicle-mounted terminal according to an embodiment of the present disclosure;

Fig. 2 is a flowchart of a method for performing an operation according to an embodiment of the present disclosure;

Fig. 3 is a schematic diagram of a processing flow for outputting indication information corresponding to a first operation according to an embodiment of the present disclosure;

Fig. 4 is a schematic diagram of a processing flow for performing noise reduction processing on audio data received by a communication component according to an embodiment of the present disclosure;

Fig. 5 is a schematic diagram of a processing flow for determining the wakefulness of a user according to an embodiment of the present disclosure;

Fig. 6 is a schematic diagram of a processing flow for performing a first operation according to an embodiment of the present disclosure;

Fig. 7 is a schematic structural diagram of an apparatus for performing an operation according to an embodiment of the present disclosure.

Detailed Description

An embodiment of the present disclosure provides a method for performing an operation. The method may be applied to a vehicle-mounted terminal which, as shown in Fig. 1, includes a processor 110, a noise reduction chip 120, a sound collection component 130, and a communication component 140.

The processor 110 may be a central processing unit (CPU), a system on chip (SoC), or the like, and may be configured to execute various instructions related to the method, such as an activation instruction for the sound collection component.

The noise reduction chip 120 may be a CPU, an SoC, or the like, and may be configured to execute various instructions related to the method, for example, performing noise reduction processing on the first audio data.

The sound collection component 130 may be a single microphone, a plurality of microphones distributed at different positions in the vehicle, an array microphone, an omnidirectional microphone, or the like. The sound collection component 130 may be configured to collect sound waves from the user or the environment and convert them into electrical signals for processing by the processor 110, for example, to collect the first audio data.

The communication component 140 may be a wired network connector, a wireless fidelity (Wi-Fi) module, a Bluetooth module, a cellular network communication module, or the like. The communication component 140 may be used to receive various instructions and data, for example, the third audio data.

The vehicle-mounted terminal may further include a display screen, an input component, an audio output component, and the like.

With the development of the Internet of Vehicles, speech recognition technology has been applied to vehicles. With this technology, a user can simply speak the operation to be performed; the vehicle-mounted terminal recognizes the voice and the target operation corresponding to it, and the processor then sends an instruction corresponding to the target operation to the vehicle controller, so that the vehicle is controlled by voice. For example, when the user wants to open the window, the user may say "open the window". The vehicle-mounted terminal then collects the voice, performs noise reduction and recognition on it, and opens the window once the corresponding target operation is recognized as "open the window".

An embodiment of the present disclosure provides a method for performing an operation, which is applied to a vehicle-mounted terminal. As shown in Fig. 2, the processing flow of the method may include the following steps:

201. The sound collection component collects first audio data and sends the first audio data to the noise reduction chip.

A user who wants to control the components of the vehicle can operate each component's keys manually, but doing so while driving distracts the user and makes driving unsafe. The user therefore typically controls the components by voice while driving.

Before the sound collection component collects the first audio data, the processor receives an activation instruction for the sound collection component and activates it. While the sound collection component is not activated, it cannot collect sound in the vehicle; once it is activated, it collects any sound present in the vehicle.

After collecting the first audio data, the sound collection component may send it to the noise reduction chip. Optionally, the first audio data may first be sent to the processor, which determines whether noise reduction is needed: if so, the processor sends the first audio data to the noise reduction chip; if not, the first audio data is not sent to the noise reduction chip.
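The disclosure leaves open how the processor decides whether audio needs noise reduction. As a minimal sketch, assuming a crude signal-to-noise estimate (loudest frames versus quietest frames) and a hypothetical 15 dB threshold, neither of which comes from the disclosure, the routing could look like this. `denoise` and `recognize` are hypothetical stand-ins for the noise reduction chip and the speech recognition step.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical threshold (not from the disclosure): audio whose estimated SNR
# is below this value is routed to the noise reduction chip.
SNR_THRESHOLD_DB = 15.0


def frame_rms(samples: Sequence[float], frame_len: int) -> List[float]:
    """Split samples into non-overlapping frames and return each frame's RMS."""
    rms = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return rms


def estimate_snr_db(samples: Sequence[float], frame_len: int = 160) -> float:
    """Crude SNR: mean RMS of the loudest 10% of frames vs the quietest 10%."""
    rms = sorted(frame_rms(samples, frame_len))
    if not rms:
        return 0.0  # too short to judge: treat as noisy
    k = max(1, len(rms) // 10)
    noise = sum(rms[:k]) / k
    speech = sum(rms[-k:]) / k
    if noise == 0.0:
        return math.inf  # perfectly silent background: treat as clean
    return 20.0 * math.log10(speech / noise)


def needs_noise_reduction(samples: Sequence[float],
                          threshold_db: float = SNR_THRESHOLD_DB) -> bool:
    return estimate_snr_db(samples) < threshold_db


def route(samples: Sequence[float], denoise: Callable, recognize: Callable):
    """Processor-side routing: denoise only when needed, then recognize."""
    audio = denoise(samples) if needs_noise_reduction(samples) else samples
    return recognize(audio)
```

A flat, noise-like signal yields an estimate near 0 dB and is sent to the chip, while speech bursts over a quiet background skip it.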

Optionally, the processor may preprocess the first audio data, for example, by removing noise or enhancing the speech signal.

202. The noise reduction chip performs noise reduction processing on the first audio data to obtain second audio data, and sends the second audio data to the processor.

The noise reduction processing uses digital signal processing algorithms such as filtering, spectral subtraction, and noise estimation. Each frame of the first audio data is processed: spectral analysis is performed on the frame and the noise components are removed, yielding the second audio data and completing the noise reduction processing. After completing the noise reduction processing, the noise reduction chip may send the resulting second audio data to the processor.
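Spectral subtraction, one of the algorithms named above, can be sketched as follows. This is a minimal illustration, not the chip's actual algorithm: it assumes non-overlapping rectangular frames and a noise spectrum estimated from the first few frames (assumed to contain no speech). A real noise reduction chip would typically use overlapping windowed frames and an FFT; the naive O(n²) DFT only keeps the example dependency-free, and `spectral_subtraction` and its parameters are hypothetical names.

```python
import cmath
import math
from typing import List, Sequence


def dft(frame: Sequence[float]) -> List[complex]:
    """Naive discrete Fourier transform of one frame."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum: Sequence[complex]) -> List[float]:
    """Inverse DFT, keeping only the real part of each output sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def spectral_subtraction(samples: Sequence[float], frame_len: int = 64,
                         noise_frames: int = 4, floor: float = 0.0) -> List[float]:
    """Frame-wise magnitude spectral subtraction; a trailing partial frame is dropped."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    if not frames:
        return []
    spectra = [dft(f) for f in frames]

    # Noise estimation: average magnitude spectrum of the leading frames.
    m = min(noise_frames, len(spectra))
    noise_mag = [sum(abs(s[k]) for s in spectra[:m]) / m for k in range(frame_len)]

    cleaned: List[float] = []
    for spec in spectra:
        out_spec = []
        for k, x in enumerate(spec):
            # Subtract the noise magnitude, never going below floor * |x|.
            mag = max(abs(x) - noise_mag[k], floor * abs(x))
            out_spec.append(cmath.rect(mag, cmath.phase(x)))  # keep the noisy phase
        cleaned.extend(idft(out_spec))
    return cleaned
```

Because each bin's magnitude can only shrink, the energy of noise-only frames drops, while a signal preceded by silent frames passes through unchanged.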

203. The processor performs speech recognition on the second audio data and performs a first operation based on the speech recognition result.

The processor inputs the second audio data into the speech recognition model to obtain the recognition result, output by the speech recognition model, corresponding to the second audio data.

The speech recognition model may be trained as follows:

A plurality of sets of sample data are acquired, each set including at least one sample voice and a reference operation. For each set of sample data, the at least one sample voice is input into the speech recognition model to be trained to generate a predicted operation, a loss value is determined based on the reference operation and the predicted operation, and the parameters of the model to be trained are adjusted based on the loss value. This parameter adjustment is repeated with the plurality of sets of sample data until a training end condition is met, and the adjusted model is determined as the trained speech recognition model. The training end condition may be, for example, reaching a specified number of training iterations, or the loss value remaining below a preset threshold for N consecutive iterations.
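The training loop and its two end conditions can be sketched as below. This is only an illustration under stated assumptions: a one-feature linear model trained by stochastic gradient descent on squared error stands in for the speech recognition model, each `(feature, reference)` pair stands in for a set of sample data, and `train`, `max_iters`, `loss_threshold`, and `patience` (the N in the text) are hypothetical names.

```python
from typing import List, Tuple


def train(samples: List[Tuple[float, float]], lr: float = 0.05, max_iters: int = 1000,
          loss_threshold: float = 1e-4, patience: int = 5):
    """Toy training loop: returns (w, b, iterations_run, end_condition)."""
    w, b = 0.0, 0.0   # parameters of the stand-in model y = w * x + b
    below = 0         # consecutive iterations with loss under the threshold
    for it in range(1, max_iters + 1):
        total = 0.0
        for x, y_ref in samples:           # one pass over all sample sets
            y_pred = w * x + b             # prediction of the model being trained
            err = y_pred - y_ref
            total += err * err             # squared-error loss vs the reference
            w -= lr * 2.0 * err * x        # parameter adjustment based on the loss
            b -= lr * 2.0 * err
        loss = total / len(samples)
        below = below + 1 if loss < loss_threshold else 0
        if below >= patience:              # loss below threshold N times in a row
            return w, b, it, "loss_threshold"
    return w, b, max_iters, "max_iters"    # specified number of iterations reached
```

For example, on pairs sampled from y = 2x + 1 the loop stops on the loss condition with w close to 2 and b close to 1, while `max_iters=3` ends on the iteration limit.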

In addition, before speech recognition, the processor may perform a security verification on the second audio data to reject illegitimate audio data. If the second audio data is found not to pass the security verification, it may be deleted and prompt information output; the prompt information may be broadcast through a speaker or displayed as text on the vehicle.

In one possible implementation, after the processor performs speech recognition on the second audio data, the indication information corresponding to the first operation may be output through the processing flow shown in Fig. 3, which includes the following steps:

301. Acquire indication information corresponding to the first operation.

After the processor performs speech recognition on the second audio data and obtains the first operation, it may acquire the indication information corresponding to the first operation in order to confirm whether the recognition result matches the operation the user intended. For example, if the first operation is "open window", the corresponding indication information is "open window".

302. Output the indication information.

The indication information may be broadcast through a speaker, displayed as text on the vehicle, or both. For example, if speech recognition on the second audio data yields the first operation "open window", the speaker may broadcast "open window". On hearing this indication information, the user can confirm that the operation matches the intended one, and the other occupants are also informed that the window has been opened.

Optionally, the indication information may be output before, after, or at the same time as the first operation is performed.
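The output channels and timing options described for steps 301 and 302 can be sketched as follows. The `speak`, `show`, and `execute` callables are hypothetical stand-ins for the speaker, the in-vehicle display, and the component that performs the first operation; using a thread pool for the simultaneous case is an assumption made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def output_indication(operation: str, speak: Callable[[str], None],
                      show: Callable[[str], None], via_speaker: bool = True,
                      via_display: bool = False) -> str:
    """Step 301: derive the indication information; step 302: output it."""
    message = operation  # e.g. first operation "open window" -> indication "open window"
    if via_speaker:
        speak(message)
    if via_display:
        show(message)
    return message


def perform_with_indication(execute: Callable[[], None], indicate: Callable[[], None],
                            timing: str = "before") -> None:
    """Output the indication before, after, or concurrently with the first operation."""
    if timing == "before":
        indicate()
        execute()
    elif timing == "after":
        execute()
        indicate()
    elif timing == "simultaneous":
        with ThreadPoolExecutor(max_workers=2) as pool:
            futures = [pool.submit(indicate), pool.submit(execute)]
            for future in futures:
                future.result()  # re-raise any exception from either task
    else:
        raise ValueError(f"unknown timing: {timing}")
```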

The vehicle-mounted terminal further includes a communication component. In one possible implementation, the noise reduction chip may also perform noise reduction processing on audio data received by the communication component. The corresponding processing flow, shown in Fig. 4, includes the following steps:

401. The processor acquires the third audio data received by the communication component.

When the user is not in the vehicle but wants to control it, the user can connect to the vehicle-mounted terminal through a mobile phone. The user's voice is sent to the processor through the communication component of the vehicle-mounted terminal, and after receiving it, the processor may send a reception success notification to the mobile phone, thereby enabling remote control of the vehicle. For example, the microphone of the user's mobile phone collects the third audio data "close window" and sends it to the processor through the communication component, and the processor acquires the third audio data "close window" received by the communication component.

402. The processor sends the third audio data to the noise reduction chip.

The processor sends the received third audio data to the noise reduction chip for noise reduction processing. Optionally, the processor may first determine whether the third audio data requires noise reduction: if so, it sends the third audio data to the noise reduction chip; if not, the third audio data need not be sent to the noise reduction chip.

403. The noise reduction chip performs noise reduction processing on the third audio data to obtain fourth audio data, and sends the fourth audio data to the processor.

The noise reduction processing uses digital signal processing algorithms such as filtering, spectral subtraction, and noise estimation. Each frame of the third audio data is processed: spectral analysis is performed on the frame and the noise components are removed, yielding the fourth audio data and completing the noise reduction processing. After completing the noise reduction processing, the noise reduction chip may send the resulting fourth audio data to the processor.

404. The processor performs speech recognition on the fourth audio data and performs a second operation based on the speech recognition result.

The processor inputs the fourth audio data into the speech recognition model to obtain the recognition result, output by the speech recognition model, corresponding to the fourth audio data.

In addition, before speech recognition, the processor may perform a security verification on the fourth audio data to reject illegitimate audio data. If the fourth audio data is found not to pass the security verification, it may be deleted and prompt information output; the prompt information may be broadcast through the speaker of the mobile phone or displayed as text on the mobile phone.
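Steps 401 to 404, together with the reception notification to the mobile phone and the optional security verification, can be sketched as a single handler. All names here (`RemoteAudio`, `handle_remote_audio`, and the injected callables) are hypothetical, and because the disclosure does not say what the security verification checks, `verify` is left abstract.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class RemoteAudio:
    phone_id: str          # identifies the user's mobile phone
    samples: List[float]   # third audio data received via the communication component


def handle_remote_audio(msg: RemoteAudio,
                        notify_phone: Callable[[str, str], None],
                        denoise: Callable[[List[float]], List[float]],
                        verify: Callable[[List[float]], bool],
                        recognize: Callable[[List[float]], str],
                        execute: Callable[[str], None]) -> Optional[str]:
    notify_phone(msg.phone_id, "received")      # 401: reception success notification
    fourth = denoise(msg.samples)               # 402/403: noise reduction chip
    if not verify(fourth):                      # optional security verification
        # The audio is dropped (never recognized) and the phone shows a prompt.
        notify_phone(msg.phone_id, "verification failed")
        return None
    result = recognize(fourth)                  # 404: speech recognition
    execute(result)                             # 404: second operation
    return result
```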

In one possible implementation, the speech recognition result may take several forms, which are described in detail below.

Form one: the voice recognition result is text information.

An operation identifier corresponding to the text information is determined, and the first operation corresponding to the operation identifier is executed.

A correspondence table between text information and operation identifiers may be stored in advance, in which one or more pieces of text information may correspond to one operation identifier. For example, as shown in Table 1, the operation identifier corresponding to "temperature reduction" is "air conditioner", and "window opening" and "window closing" both correspond to "window".

For example, if the text information is "open window", the operation identifier found in Table 1 is "window"; then, based on the action "open", the first operation is determined to be "open window".

TABLE 1

Form two: the voice recognition result is an operation identifier.

When the speech recognition result is an operation identifier, the first operation corresponding to the operation identifier can be executed directly. For example, if the voice is "open window" and the operation identifier is "window", the first operation is determined to be "open window" based on the action "open".
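Both forms can be sketched with a lookup table plus an action keyword. Table 1's contents are not reproduced in the text, so the entries below are assumptions based only on the examples given ("temperature reduction", "window opening", "window closing", "open window"); the action-keyword list and all function names are likewise hypothetical.

```python
from typing import Optional, Tuple

# Hypothetical stand-in for Table 1: text information -> operation identifier.
TEXT_TO_OPERATION_ID = {
    "temperature reduction": "air conditioner",
    "window opening": "window",
    "window closing": "window",
    "open window": "window",
}

# Hypothetical action keywords used to turn an identifier into a concrete operation.
ACTION_KEYWORDS = {"open": "open", "opening": "open", "close": "close",
                   "closing": "close", "reduction": "lower"}


def parse_action(text: str) -> Optional[str]:
    """Return the action of the first known keyword in the text, if any."""
    for word in text.lower().split():
        if word in ACTION_KEYWORDS:
            return ACTION_KEYWORDS[word]
    return None


def operation_from_text(text: str) -> Optional[Tuple[Optional[str], str]]:
    """Form one: text information -> (action, operation identifier)."""
    op_id = TEXT_TO_OPERATION_ID.get(text.lower())
    if op_id is None:
        return None  # no matching entry in the table
    return parse_action(text), op_id


def operation_from_identifier(op_id: str, voice_text: str) -> Tuple[Optional[str], str]:
    """Form two: the recognizer already returned the identifier; add the action."""
    return parse_action(voice_text), op_id
```

For instance, `operation_from_text("open window")` gives `("open", "window")`, matching the example in the text.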

In one possible implementation, the wakefulness of the user may be determined based on the audio data through the processing flow shown in Fig. 5, which includes the following steps:

501. The processor inputs the second audio data into a wakefulness recognition model to obtain the wakefulness, output by the wakefulness recognition model, corresponding to the second audio data.

The processor can determine the user's current wakefulness from features of the user's voice such as intonation. Wakefulness is positively correlated with how safely the user drives: the higher the wakefulness, the safer the driving. The processor inputs the second audio data into the wakefulness recognition model to obtain the wakefulness corresponding to the second audio data. The wakefulness recognition model may be trained as follows:

A plurality of sets of sample data are acquired, each set including at least one sample voice and a reference wakefulness. For each set of sample data, the at least one sample voice is input into the wakefulness recognition model to be trained to generate a predicted wakefulness, a loss value is determined based on the reference wakefulness and the predicted wakefulness, and the parameters of the model to be trained are adjusted based on the loss value. This parameter adjustment is repeated with the plurality of sets of sample data until a training end condition is met, and the adjusted model is determined as the trained wakefulness recognition model. The training end condition may be, for example, reaching a specified number of training iterations, or the loss value remaining below a preset threshold for N consecutive iterations.

502. When the wakefulness falls below a wakefulness threshold, prompt information is output.

The prompt information may be an alarm sound, music playback, a voice prompt that the wakefulness has fallen below the wakefulness threshold, or the like.
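The decision in step 502 can be sketched as a threshold check on the model output. Because the text states that higher wakefulness means safer driving, this sketch assumes the prompt is triggered when the score is below the threshold. The model callable, the 0.5 threshold, and the prompt list are hypothetical stand-ins.

```python
from typing import Callable, Optional, Sequence

# Hypothetical prompt options, in the order listed in the description.
PROMPTS = ("alarm", "music", "voice prompt: wakefulness below threshold")


def check_wakefulness(audio: Sequence[float],
                      model: Callable[[Sequence[float]], float],
                      threshold: float = 0.5,
                      prompt_index: int = 0) -> Optional[str]:
    """Return the prompt to output, or None if the driver seems alert enough."""
    wakefulness = model(audio)   # stand-in for the wakefulness recognition model
    if wakefulness < threshold:  # assumption: low wakefulness triggers the prompt
        return PROMPTS[prompt_index]
    return None
```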

In one possible implementation, the first operation may be performed based on sound source position information and the operation identifier through the processing flow shown in Fig. 6, which includes the following steps:

601. Determine sound source position information based on the volumes collected by the different microphones in the sound collection component.

In addition to determining the first operation from the operation identifier alone, the first operation may be determined by combining the sound source position information with the operation identifier. A plurality of microphones may be arranged in the vehicle, for example, one near each seat. When the sound collection component collects sound, the position of the microphone that collects the sound at the highest volume may be determined as the sound source position, and the sound source position information may include the seat position corresponding to that microphone. For example, if the microphone that collects the voice at the highest volume corresponds to the driver's seat, the sound source position information is "driver's seat". Alternatively, the sound source position information may include the position identifier of the seat corresponding to the microphone, where each seat has a different position identifier; a correspondence table between seats and position identifiers may be stored in advance, as shown in Table 2.

TABLE 2

Seat	Position identifier
Driver's seat	A
Front passenger seat	B
……	……

For example, if the sound source position information is "A", Table 2 shows that the sound source of the voice is the driver's seat.

602. Perform the first operation based on the sound source position information and the operation identifier.

After the sound source position information is obtained, the first operation can be determined in combination with the operation identifier. For example, if the voice is "open window", the operation identifier is "window", and the sound source position information is "front passenger seat", the first operation is "open the front passenger window".

Optionally, when no sound source position information is recognized, the first operation may be performed on all components of the vehicle corresponding to the operation identifier. For example, if the voice is "open windows", the operation identifier is "window", and no sound source position information such as "front passenger seat" is available, the first operation is "open all windows".
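Steps 601 and 602, including the fallback when no position is available, can be sketched as follows. The one-microphone-per-seat layout follows the example in the text; the volume readings, the dictionary standing in for Table 2 (driver's seat = A, front passenger seat = B), and the function names are hypothetical.

```python
from typing import Dict, Optional, Tuple

# Stand-in for Table 2: seat -> position identifier.
SEAT_TO_POSITION_ID: Dict[str, str] = {"driver's seat": "A", "front passenger seat": "B"}
POSITION_ID_TO_SEAT = {pid: seat for seat, pid in SEAT_TO_POSITION_ID.items()}


def locate_sound_source(mic_volumes: Dict[str, float]) -> Optional[str]:
    """Step 601: position identifier of the seat whose microphone is loudest."""
    if not mic_volumes:
        return None
    loudest_seat = max(mic_volumes, key=lambda seat: mic_volumes[seat])
    return SEAT_TO_POSITION_ID.get(loudest_seat)


def scoped_operation(action: str, op_id: str,
                     position_id: Optional[str]) -> Tuple[str, str, str]:
    """Step 602: restrict the operation to the source seat, else apply it to all."""
    seat = POSITION_ID_TO_SEAT.get(position_id) if position_id else None
    return (action, op_id, seat if seat is not None else "all")
```

With the front passenger microphone loudest, `scoped_operation("open", "window", "B")` targets only that seat's window; with no position available it falls back to `"all"`.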

In the embodiments of the present disclosure, the sound collection component collects the user's voice, the noise reduction chip performs noise reduction processing on it, and the processor performs speech recognition on it and performs the corresponding operation according to the recognition result. Because the noise reduction chip reduces the noise in the user's voice before speech recognition, the accuracy of speech recognition is effectively improved, which in turn improves the accuracy with which the vehicle-mounted terminal performs operations.

Any combination of the above optional solutions may be used to form optional embodiments of the present disclosure, which are not described in detail here.

An embodiment of the present disclosure further provides an apparatus for performing operations. As shown in Fig. 7, the apparatus includes a sound collection component 710, a noise reduction chip 720, and a processor 730;

The sound collection component 710 is configured to collect first audio data and send the first audio data to the noise reduction chip. Specifically, it may implement the processing functions of step 201 above and other implicit steps.

The noise reduction chip 720 is configured to perform noise reduction processing on the first audio data to obtain second audio data, and send the second audio data to the processor. Specifically, it may implement the processing functions of step 202 above and other implicit steps.

The processor 730 is configured to perform speech recognition on the second audio data and perform a first operation based on the speech recognition result. Specifically, it may implement the processing functions of step 203 above and other implicit steps.

In one possible implementation, the processor 730 is further configured to receive an activation instruction for the sound collection component and activate the sound collection component. Specifically, it may implement the processing functions of step 201 above and other implicit steps.

In one possible implementation, the processor 730 is configured to input the second audio data into the speech recognition model to obtain the recognition result, output by the speech recognition model, corresponding to the second audio data. Specifically, it may implement the processing functions of step 203 above and other implicit steps.

In one possible implementation, the processor 730 is configured to acquire indication information corresponding to the first operation and output the indication information. Specifically, it may implement the processing functions of steps 301 and 302 above and other implicit steps.

In one possible implementation, the apparatus further includes a communication component 740;

The processor 730 is configured to acquire third audio data received by the communication component and send the third audio data to the noise reduction chip. Specifically, it may implement the processing functions of steps 401 and 402 above and other implicit steps.

The noise reduction chip 720 is configured to perform noise reduction processing on the third audio data to obtain fourth audio data, and send the fourth audio data to the processor. Specifically, it may implement the processing functions of step 403 above and other implicit steps.

The processor 730 is configured to perform speech recognition on the fourth audio data and perform a second operation based on the speech recognition result. Specifically, it may implement the processing functions of step 404 above and other implicit steps.

In one possible implementation, the speech recognition result is text information, and the processor 730 is configured to determine an operation identifier corresponding to the text information and execute the first operation corresponding to the operation identifier. Specifically, it may implement the processing functions of form one above and other implicit steps.

In one possible implementation, the speech recognition result is an operation identifier, and the processor 730 is configured to execute the first operation corresponding to the operation identifier. Specifically, it may implement the processing functions of form two above and other implicit steps.

An embodiment of the present disclosure further provides a computer-readable storage medium. The computer-readable storage medium may be any usable medium accessible to a computing device, or a data storage device, such as a data center, that integrates one or more usable media. The usable medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a DVD), a semiconductor medium (e.g., a solid-state disk), or the like. The computer-readable storage medium includes instructions that direct a computing device to perform the method for performing operations described above.

Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or replace some of their technical features with equivalents, and such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method of performing an operation, the method being applied to a vehicle-mounted terminal including a processor, a noise reduction chip, and a sound collection component, the method comprising:

2. The method of claim 1, wherein prior to the sound collection component collecting the first audio data, the method further comprises:

3. The method of claim 1, wherein the processor performing speech recognition on the second audio data comprises:

4. The method of claim 1, wherein after the processor performs speech recognition on the second audio data, the method further comprises:

Acquiring indication information corresponding to the first operation;

And outputting the indication information.

5. The method of claim 1, wherein the vehicle-mounted terminal further comprises a communication component;

The method further comprises the steps of:

the processor sends the third audio data to the noise reduction chip;

6. The method of claim 1, wherein the speech recognition result is text information;

Determining an operation identifier corresponding to the text information;

and executing the first operation corresponding to the operation identifier.

7. The method of claim 1, wherein the speech recognition result is an operation identifier;

and executing the first operation corresponding to the operation identifier.

8. An apparatus for performing an operation, the apparatus comprising a processor, a noise reduction chip, and a sound collection component;

9. The apparatus of claim 8, wherein the processor is further configured to receive an activation instruction for the sound collection component to activate the sound collection component.

10. A computer readable storage medium, characterized in that the computer readable storage medium stores computer program code which, when executed by a computer device, performs the method of any of the preceding claims 1-7.