CN109273003B - Voice control method and system for automobile data recorder

Info

Publication number: CN109273003B
Application number: CN201811380932.8A
Authority: CN
Inventors: 白生炜
Original assignee: Sipic Technology Co Ltd
Current assignee: Sipic Technology Co Ltd
Priority date: 2018-11-20
Filing date: 2018-11-20
Publication date: 2021-11-02
Anticipated expiration: 2038-11-20
Also published as: CN109273003A

Abstract

An embodiment of the invention provides a voice control method for an automobile data recorder. The method comprises: collecting sound in the vehicle in real time to generate corresponding audio; extracting Fbank features of the audio, analyzing the Fbank features with a built-in neural network, and determining the posterior probability that the audio hits each command word within the control command words; filtering the posterior probabilities of the command words within each control command word to determine the joint probability that the audio hits that control command word; taking the control command word with the maximum joint probability as the effective control command word; and obtaining a preset recognition threshold and, when the joint probability of the effective control command word reaches the preset recognition threshold, matching the audio to the effective control command word and executing the corresponding operation. An embodiment of the invention also provides a voice control system for an automobile data recorder. Because only Fbank features are extracted from the collected audio, the amount of computation is reduced, and because no decoding is performed, memory and hardware computing resources are saved.

Description

Voice control method and system for automobile data recorder

Technical Field

The invention relates to the field of intelligent voice, in particular to a voice control method and system for a vehicle event data recorder.

Background

An automobile data recorder is an instrument that records images and sound while a vehicle is being driven. Once installed, it records video, images and sound over the whole journey and can provide evidence in the event of a traffic accident. With the development of voice technology, control of the automobile data recorder has gradually moved from touch screens and keys to voice control. Controlling the recorder by voice frees the driver's hands and keeps the driver's attention from being diverted, which is safer.

The automobile data recorder capable of being controlled by voice usually adopts a real-time voice decoding recognition mode to recognize and decode the collected audio, outputs characters corresponding to the voice spoken by a driver, compares the characters with command words, finally confirms a recognition result and further executes a corresponding control instruction.

In the process of implementing the invention, the inventor finds that at least the following problems exist in the related art:

Recognition by decoding is constrained by the characteristics of the decoding algorithm: the model actually used is large, and the computation and storage requirements are high, so the method must run on equipment with stronger processing performance and more storage space, which raises the cost. In addition, decoding is affected by in-vehicle noise and wind noise, which reduces recognition accuracy and degrades the user experience. Some automobile data recorders perform no voice recognition themselves and instead send the received audio signal over a wireless network to a cloud neural network for recognition, then act on the instruction that the cloud neural network returns. However, this approach requires a reliable wireless connection; if the network is delayed or unavailable, voice control cannot be achieved.

Disclosure of Invention

The invention aims to solve at least the following problems in the prior art: because of the characteristics of recognition by decoding, the automobile data recorder needs more storage space and higher processing performance, which raises its cost; recognition by decoding is disturbed by noise, so the accuracy of the recognition result is low; and voice control cannot be achieved when the network is delayed or unavailable.

In a first aspect, an embodiment of the present invention provides a voice control method for a vehicle event data recorder, including:

collecting sound in a vehicle in real time to generate corresponding audio;

extracting Fbank characteristics of the audio, analyzing the Fbank characteristics through a built-in neural network, and determining the posterior probability of each command word in the control command words hit by the audio;

processing the posterior probability of each command word in the control command words through filtering, and determining the joint probability of the audio hitting each control command word;

taking the control command word with the maximum joint probability as an effective control command word;

and acquiring a preset identification threshold, and when the joint probability of the effective control command words reaches the preset identification threshold, corresponding the audio to the effective control command words and executing the operation corresponding to the effective control command words.

In a second aspect, an embodiment of the present invention provides a voice control system for a vehicle event data recorder, including:

the sound acquisition program module is used for acquiring sound in the vehicle in real time and generating corresponding audio;

the command word posterior probability determining program module is used for extracting Fbank characteristics of the audio, analyzing the Fbank characteristics through a built-in neural network and determining the posterior probability of each command word in the control command words hit by the audio;

a joint probability determination program module for determining the joint probability of the audio hitting each control command word by filtering the posterior probability of each command word in the control command words;

an effective control command word determining program module, configured to use the control command word with the highest joint probability as an effective control command word;

and the control program module is used for acquiring a preset identification threshold, corresponding the audio frequency to the effective control command word when the joint probability of the effective control command word reaches the preset identification threshold, and executing the operation corresponding to the effective control command word.

In a third aspect, an electronic device is provided, comprising: at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the steps of the voice control method for an automobile data recorder of any embodiment of the invention.

In a fourth aspect, an embodiment of the present invention provides a storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the steps of the voice control method for a tachograph according to any embodiment of the present invention.

The embodiments of the invention have the following beneficial effects: Fbank features are extracted from the collected audio, converting the audio into feature vectors and reducing the amount of data actually fed into the neural network; because no decoding stage is used, memory and hardware computing resources are saved; and digital filtering yields a more stable output, which improves recognition accuracy. The neural network is deployed locally in the automobile data recorder and needs no network connection, so the recorder can be used in a wider range of scenarios, problems such as network speed are avoided, and the user experience is improved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

Fig. 1 is a flowchart of a voice control method for a vehicle event data recorder according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of a voice control system for an automobile data recorder according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 1 is a flowchart of a voice control method for a car recorder according to an embodiment of the present invention, which includes the following steps:

S11: collecting sound in the vehicle in real time to generate corresponding audio;

S12: extracting Fbank features of the audio, analyzing the Fbank features with a built-in neural network, and determining the posterior probability that the audio hits each command word within the control command words;

S13: filtering the posterior probabilities of the command words within the control command words, and determining the joint probability that the audio hits each control command word;

S14: taking the control command word with the maximum joint probability as the effective control command word;

S15: obtaining a preset recognition threshold and, when the joint probability of the effective control command word reaches the preset recognition threshold, matching the audio to the effective control command word and executing the operation corresponding to the effective control command word.

In this embodiment, to cope with the complicated acoustic environment inside a vehicle, the neural network built into the automobile data recorder is trained in advance on actual in-vehicle recordings so that it covers most in-vehicle usage scenarios.

For step S11, after the vehicle is started, the automobile data recorder collects sound in the vehicle in real time, so that the user's voice can be captured at any time. Sound is picked up either at the position where the automobile data recorder is installed or through a dedicated additional microphone; the recorder or the additional microphone can be placed near the driver's head so that the collected sound is clearer and its quality is further improved. Corresponding audio is then generated from the collected sound.

For step S12, Fbank features are extracted from the audio generated in step S11. The extraction includes the following stages:

pre-emphasis, which compensates for the high-frequency part of the speech signal that is suppressed by the vocal cords and lips during sound production, and highlights the high-frequency formants;

framing, which divides the speech signal into frames, typically with a frame length of 20-40 ms and a frame shift of 10 ms (the specific values may be chosen as needed);

windowing, which applies a Hamming or Hanning window to each frame so that both ends of the frame decay to near 0;

STFT (short-time Fourier transform), which yields a spectral vector for each frame, the energy (amplitude) spectrum then being converted into a power spectrum;

Mel filtering, which passes the power spectrum through a Mel filter bank to obtain a spectrum that matches human hearing, usually followed by taking the logarithm to convert the values to dB;

DCT (discrete cosine transform), which yields the cepstral coefficients.

The Fbank features are then analyzed by the pre-trained built-in neural network to determine the posterior probability that the audio hits each command word within the control command words. For example, if the control command words include "play music", "next", "previous" and so on, the command words to be scored include "play", "music", "down", "up", "one", "first" and so on, and a posterior probability is determined for each of them.
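The feature extraction stages of step S12 can be sketched as follows. This is a minimal illustration in pure Python, not the patented implementation: the function names, the 25 ms frame length, the choice of 8 Mel filters and the 0.97 pre-emphasis coefficient are all assumptions made for the example. It uses a naive DFT where a real device would use an FFT, and it stops at the log Mel filter-bank energies (the values conventionally called Fbank features), so the final DCT stage is omitted.

```python
import cmath
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are evenly spaced on the Mel scale."""
    n_bins = n_fft // 2 + 1
    top = hz_to_mel(sample_rate / 2.0)
    mel_points = [i * top / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * mel_to_hz(m) / sample_rate)) for m in mel_points]
    bank = [[0.0] * n_bins for _ in range(n_filters)]
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):       # rising edge
            bank[m - 1][k] = (k - left) / (center - left)
        for k in range(center, right):      # falling edge
            bank[m - 1][k] = (right - k) / (right - center)
    return bank


def power_spectrum(frame, n_fft):
    """Naive DFT power spectrum; an FFT would be used in practice."""
    spectrum = []
    for k in range(n_fft // 2 + 1):
        acc = 0j
        for n, x in enumerate(frame):
            acc += x * cmath.exp(-2j * math.pi * k * n / n_fft)
        spectrum.append(abs(acc) ** 2 / n_fft)
    return spectrum


def fbank(signal, sample_rate, frame_ms=25, shift_ms=10, n_filters=8, preemph=0.97):
    """Return one list of log Mel filter-bank energies per frame."""
    if not signal:
        return []
    # Pre-emphasis: boost the high-frequency part of the signal.
    emphasized = [signal[0]] + [
        signal[i] - preemph * signal[i - 1] for i in range(1, len(signal))
    ]
    # Framing parameters; the FFT size is the next power of two.
    frame_len = sample_rate * frame_ms // 1000
    shift = sample_rate * shift_ms // 1000
    n_fft = 1
    while n_fft < frame_len:
        n_fft *= 2
    # Hamming window: both ends of each frame decay towards 0.
    window = [
        0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
        for n in range(frame_len)
    ]
    bank = mel_filterbank(n_filters, n_fft, sample_rate)
    features = []
    for start in range(0, len(emphasized) - frame_len + 1, shift):
        frame = [emphasized[start + n] * window[n] for n in range(frame_len)]
        power = power_spectrum(frame, n_fft)
        energies = [sum(w * p for w, p in zip(filt, power)) for filt in bank]
        # Logarithm of the Mel energies; the floor avoids log(0).
        features.append([math.log(max(e, 1e-10)) for e in energies])
    return features
```

Calling `fbank(samples, 8000)` on 0.1 s of audio sampled at 8 kHz yields 8 frames of 8 values each under these assumed settings. Each frame's vector is what would be fed to the neural network.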

For step S13, the posterior probabilities of the command words determined in step S12 are filtered to determine the joint probability that the audio hits each control command word; that is, the joint probability of a control command word is derived from the posterior probabilities of the command words it contains.

For step S14, because this method is used for voice control of an automobile data recorder, and the recorder always settles on one corresponding operation after audio recognition, the control command word with the highest joint probability is taken as the effective control command word.

For step S15, a preset recognition threshold is obtained. The higher the recognition threshold is set, the more reliable an effective control command word that reaches it. However, if the threshold is set too high, the joint probability of the effective control command word may never reach it and no control command can be recognized, so the recognition threshold can be adjusted to suit the situation. When the joint probability of the effective control command word is determined to reach the preset recognition threshold, the audio is matched to the effective control command word and the corresponding operation is executed.
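Steps S14 and S15 together amount to an arg-max followed by a threshold test. The sketch below illustrates this under assumptions: the function name `decide`, the dictionary input and the example probabilities are hypothetical and are not taken from the patent.

```python
def decide(joint_by_command, threshold):
    """Pick the control command word with the highest joint probability (S14)
    and accept it only if it reaches the recognition threshold (S15).

    joint_by_command maps each control command word to its joint probability.
    Returns the accepted control command word, or None if nothing qualifies.
    """
    if not joint_by_command:
        return None
    best = max(joint_by_command, key=joint_by_command.get)
    if joint_by_command[best] >= threshold:
        return best
    return None
```

With illustrative joint probabilities `{"next": 0.5355, "previous": 0.2295}`, a threshold of 0.5 accepts "next", while a threshold of 0.6 rejects every command. This is the trade-off described above: a higher threshold gives fewer false triggers but more missed commands.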

With this method, Fbank features are extracted from the collected audio, converting the audio into feature vectors and reducing the amount of data actually fed into the neural network. Because no decoding stage is used, memory and hardware computing resources are saved. The neural network is deployed locally on the automobile data recorder and needs no network connection, so the recorder can be used in a wider range of scenarios, problems such as network speed are avoided, and the user experience is improved.

As an implementation manner, in this embodiment, the filtering process includes digital filtering.

The digital filtering of the posterior probability of each command word within the control command word includes:

taking the maximum value of the posterior probability of each command word in each control command word as the effective posterior probability of that command word;

and multiplying together the effective posterior probabilities of the command words in each control command word to determine the joint probability of that control command word.

In this embodiment, once the neural network has output the posterior probability of each command word, only digital filtering of those per-word posterior probabilities is needed to obtain the probability of each control command word; that probability is then compared with the preset recognition threshold to decide which command, if any, to output. The digital filtering stage filters each command word output by the neural network: the posterior probability of each word can be averaged over a window of fixed length to avoid false recognition caused by spikes. The maximum of the posterior probability of each command word within the control command word is then found. Finally, the maxima are multiplied in the order in which the command words appear to obtain the joint probability of the control command word. For example, suppose the posterior probability of "down" is 70%, that of "up" is 30%, that of "one" is 85%, and that of "first" is 90%. "Next" consists of "down", "one" and "first", so its joint probability is 0.70 × 0.85 × 0.90 = 53.55%; "previous" consists of "up", "one" and "first", so its joint probability is 0.30 × 0.85 × 0.90 = 22.95%.
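The digital filtering described above (fixed-length averaging, per-word maximum, then a product) can be sketched as follows. The function names, the trailing-window form of the average and the default window length of 3 frames are assumptions for illustration. The sketch also simplifies by not checking that the per-word maxima occur in the spoken order.

```python
def moving_average(values, window):
    """Mean over a trailing window of fixed length (shorter at the start)."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def joint_probability(posteriors_by_word, command_words, window=3):
    """Joint probability of one control command word.

    posteriors_by_word maps each command word to its per-frame posteriors;
    command_words lists the words that make up the control command word.
    """
    joint = 1.0
    for word in command_words:
        smoothed = moving_average(posteriors_by_word[word], window)
        joint *= max(smoothed)  # effective posterior of this command word
    return joint
```

If the effective posteriors are 0.70, 0.85 and 0.90 for "down", "one" and "first", the product is 0.5355, matching the figure given for "next". The averaging step is what suppresses spikes: a single-frame posterior of 0.9 surrounded by zeros is reduced to 0.3 by a 3-frame window.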

With this method, digital filtering yields a more stable output, which improves recognition accuracy.

As an implementation manner, in this embodiment, the automobile data recorder collecting sound in the vehicle in real time and generating corresponding audio further includes:

collecting sound in the vehicle in real time, and generating corresponding audio when the sound in the vehicle reaches a preset sound pressure level.

In this embodiment, sound in the vehicle is collected in real time, but corresponding audio is generated only when the sound reaches the preset sound pressure level. The driver does not speak all the time and speaks to the recorder only when a function is needed, so audio need not be generated continuously. When the driver speaks, the sound pressure level changes, and corresponding audio is then generated for recognition.

As this embodiment shows, presetting a sound pressure level avoids recognition passes that serve no purpose, which further reduces the computational load on the automobile data recorder.
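A sound pressure level gate of this kind can be sketched as below. The sketch assumes that the microphone samples have already been calibrated to pressures in pascals, which a real recorder would have to arrange through microphone calibration. The function names and the 60 dB default are hypothetical; only the 20 µPa reference pressure is the standard convention for dB SPL in air.

```python
import math

REFERENCE_PRESSURE_PA = 20e-6  # standard reference pressure for dB SPL in air


def sound_pressure_level(samples_pa):
    """Sound pressure level in dB SPL of a block of pressure samples (Pa)."""
    if not samples_pa:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples_pa) / len(samples_pa))
    if rms == 0.0:
        return float("-inf")
    return 20.0 * math.log10(rms / REFERENCE_PRESSURE_PA)


def should_generate_audio(samples_pa, preset_spl_db=60.0):
    """Gate: generate audio for recognition only above the preset level."""
    return sound_pressure_level(samples_pa) >= preset_spl_db
```

Blocks that fail the gate are simply discarded, so feature extraction and the neural network run only while someone may be speaking.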

As an implementation manner, in this embodiment, when the joint probability of the valid control command words does not reach the preset recognition threshold, the control command words corresponding to the audio cannot be determined, and recognition failure information is fed back.

In this embodiment, when it is determined that the joint probability of the valid control command words spoken by the vehicle driver does not reach the preset recognition threshold, it is also difficult for the tachograph to determine the corresponding command. Therefore, the identification failure information is fed back to the vehicle driver to remind the vehicle driver.

With this method, the driver is notified when recognition fails, which improves the user experience.

Fig. 2 is a schematic structural diagram of a voice control system for a car recorder according to an embodiment of the present invention, which can execute the voice control method for a car recorder according to any of the above embodiments and is configured in a terminal.

The voice control system for the automobile data recorder provided by the embodiment comprises: a sound collection program module 11, a command word posterior probability determination program module 12, a joint probability determination program module 13, an effective control command word determination program module 14, and a control program module 15.

The sound collection program module 11 is configured to collect sound in the vehicle in real time and generate corresponding audio; the command word posterior probability determination program module 12 is configured to extract Fbank features of the audio, analyze the Fbank features with a built-in neural network, and determine the posterior probability that the audio hits each command word within the control command words; the joint probability determination program module 13 is configured to filter the posterior probabilities of the command words within the control command words and determine the joint probability that the audio hits each control command word; the effective control command word determination program module 14 is configured to take the control command word with the largest joint probability as the effective control command word; and the control program module 15 is configured to obtain a preset recognition threshold and, when the joint probability of the effective control command word reaches the preset recognition threshold, match the audio to the effective control command word and execute the corresponding operation.

Further, in the system, the filtering process includes digital filtering.

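Further, the joint probability determination program module is configured to: take the maximum value of the posterior probability of each command word in each control command word as the effective posterior probability of that command word, and multiply together the effective posterior probabilities of the command words in each control command word to determine the joint probability of that control command word.

Further, the sound collection program module is further configured to: collect sound in the vehicle in real time, and generate corresponding audio when the sound in the vehicle reaches a preset sound pressure level.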

Further, the system is also configured to:

and when the joint probability of the effective control command words does not reach the preset identification threshold, the control command words corresponding to the audio cannot be determined, and identification failure information is fed back.
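One way to picture how the five program modules cooperate is the sketch below. It is an assumption-laden illustration rather than the patented system: the class name, constructor parameters and callback style are hypothetical, the neural network is represented by an injected `score_words` callable, and the averaging part of the digital filtering is left out for brevity.

```python
import math


class VoiceController:
    """Wires the five program modules together (all names are hypothetical)."""

    def __init__(self, collect, score_words, commands, actions, threshold, on_failure):
        self.collect = collect          # module 11: returns collected audio
        self.score_words = score_words  # module 12: audio -> {word: [posteriors]}
        self.commands = commands        # {control command word: [command words]}
        self.actions = actions          # {control command word: callable}
        self.threshold = threshold      # preset recognition threshold
        self.on_failure = on_failure    # feeds back recognition failure

    def step(self):
        audio = self.collect()
        posteriors = self.score_words(audio)
        # Module 13: joint probability of each control command word.
        joint = {
            name: math.prod(max(posteriors[w]) for w in words)
            for name, words in self.commands.items()
        }
        # Module 14: effective control command word.
        best = max(joint, key=joint.get)
        # Module 15: threshold test, then execute or report failure.
        if joint[best] >= self.threshold:
            self.actions[best]()
            return best
        self.on_failure()
        return None
```

Injecting the modules as callables keeps the control flow testable without a microphone or a trained network; a stub `score_words` returning fixed posteriors is enough to exercise both the success path and the failure feedback path.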

The embodiment of the invention also provides a non-volatile computer storage medium storing computer-executable instructions that can execute the voice control method for an automobile data recorder in any of the above method embodiments.

As one embodiment, the non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:

collecting sound in a vehicle in real time to generate corresponding audio;

The non-volatile computer-readable storage medium may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the voice control method for an automobile data recorder in the embodiments of the present invention. One or more program instructions are stored in the non-volatile computer-readable storage medium and, when executed by a processor, perform the voice control method for an automobile data recorder in any of the above method embodiments.

The non-volatile computer-readable storage medium may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created through use of the voice control device for an automobile data recorder, and the like. Further, the non-volatile computer-readable storage medium may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the non-volatile computer-readable storage medium optionally includes memory located remotely from the processor, which may be connected to the voice control device for an automobile data recorder over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

An embodiment of the present invention further provides an electronic device, which includes: at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the steps of the voice control method for an automobile data recorder of any embodiment of the invention.

The client of the embodiment of the present application exists in various forms, including but not limited to:

(1) Mobile communication devices, which are characterized by mobile communication capabilities and primarily aim to provide voice and data communication. Such terminals include smart phones (e.g., iPhone), multimedia phones, feature phones, and low-end phones.

(2) Ultra-mobile personal computer devices, which belong to the category of personal computers, have computing and processing functions, and generally also support mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as the iPad.

(3) Portable entertainment devices, which can display and play multimedia content. Such devices include audio and video players (e.g., iPod), handheld game consoles, e-book readers, smart toys, and portable car navigation devices.

(4) Other electronic devices with processing functions.

In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A voice control method for a tachograph, comprising:

collecting sound in a vehicle in real time to generate corresponding audio;

extracting Fbank characteristics of the audio, analyzing the Fbank characteristics through a built-in neural network, and determining the posterior probability of each command word in the control command words hit by the audio; the built-in neural network is obtained by training in advance by using actual in-vehicle recording;

acquiring a preset identification threshold, and when the joint probability of the effective control command words reaches the preset identification threshold, corresponding the audio to the effective control command words and executing the operation corresponding to the effective control command words;

the filtering process includes: digital filtering;

2. The method of claim 1, wherein the tachograph captures sounds within the vehicle in real time, generating corresponding audio further comprises:

3. The method of claim 1, wherein the method further comprises:

4. A voice control system for a tachograph, comprising:

the command word posterior probability determining program module is used for extracting Fbank characteristics of the audio, analyzing the Fbank characteristics through a built-in neural network and determining the posterior probability of each command word in the control command words hit by the audio; the built-in neural network is obtained by training in advance by using actual in-vehicle recording;

the control program module is used for acquiring a preset identification threshold, corresponding the audio frequency to the effective control command words when the joint probability of the effective control command words reaches the preset identification threshold, and executing the operation corresponding to the effective control command words;

the filtering process includes: digital filtering;

the joint probability determination program module is to:

5. The system of claim 4, wherein the sound collection program module is further to:

6. The system of claim 4, wherein the system is further configured to: