CN111354371A - Method, device, terminal and storage medium for predicting running state of vehicle


Info

Publication number
CN111354371A
Authority
CN
China
Prior art keywords
vehicle
state
prediction
prediction model
target
Prior art date
Legal status
Granted
Application number
CN202010120671.7A
Other languages
Chinese (zh)
Other versions
CN111354371B (en)
Inventor
刘文龙
Current Assignee
Shanghai Jinsheng Communication Technology Co ltd
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Shanghai Jinsheng Communication Technology Co ltd
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Jinsheng Communication Technology Co ltd and Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202010120671.7A
Publication of CN111354371A
Priority to PCT/CN2021/074718 (published as WO2021169742A1)
Application granted
Publication of CN111354371B
Legal status: Active

Classifications

    • G10L 25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00, characterised by the type of extracted parameters
    • G06F 18/24: Pattern recognition; classification techniques
    • G06N 3/044: Neural networks; recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Neural networks; combinations of networks
    • G06Q 10/04: Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06Q 50/26: Government or public services
    • G10L 25/30: Speech or voice analysis techniques characterised by the analysis technique, using neural networks
    • G10L 25/51: Speech or voice analysis techniques specially adapted for particular use, for comparison or discrimination

Abstract

The embodiment of the application discloses a method, a device, a terminal and a storage medium for predicting the running state of a vehicle, belonging to the field of artificial intelligence. The method comprises the following steps: collecting environmental sound through a microphone while in a vehicle; performing feature extraction on the environmental sound to obtain environmental sound features, the environmental sound features comprising at least one of an energy feature and a time-frequency feature; inputting the environmental sound features into a preset prediction model to obtain a target prediction result output by the preset prediction model, the preset prediction model comprising at least one of a first prediction model and a second prediction model, the second prediction model being a classification model based on an RNN, and the target prediction result being the predicted running state of the vehicle; and determining the target running state of the vehicle according to the target prediction result. In the embodiment of the application, the running state is predicted from the characteristic sound changes of the vehicle, which improves the accuracy and effectiveness of running state prediction.

Description

Method, device, terminal and storage medium for predicting running state of vehicle
Technical Field
The embodiment of the application relates to the technical field of artificial intelligence, in particular to a method, a device, a terminal and a storage medium for predicting the running state of a vehicle.
Background
When riding public transport such as a subway, passengers must constantly pay attention to whether the current stop is their target station. The arrival reminder function reminds passengers to get off in time when the vehicle arrives at the target station.
In the related art, a terminal generally uses data collected by sensors (such as an acceleration sensor, a gravity sensor, and a magnetic sensor) to detect acceleration and deceleration and thereby infer whether a vehicle is entering or leaving a station, so as to determine whether the station where the terminal is currently located is the passenger's target station.
However, with this method, prediction is unreliable: the vehicle does not run at a constant speed during its journey, and the posture and movements of the passenger holding the terminal, as well as walking inside the vehicle, also affect the terminal's sensors. These factors can make the terminal's station entry and exit predictions inaccurate.
Disclosure of Invention
The embodiment of the application provides a method, a device, a terminal and a storage medium for predicting the running state of a vehicle. The technical scheme is as follows:
in one aspect, an embodiment of the present application provides a method for predicting an operating state of a vehicle, where the method includes:
collecting ambient sounds by a microphone while in a vehicle;
extracting the characteristics of the environmental sound to obtain the environmental sound characteristics of the environmental sound, wherein the environmental sound characteristics comprise at least one of energy characteristics and time-frequency characteristics;
inputting the environmental sound characteristics into a preset prediction model to obtain a target prediction result output by the preset prediction model, wherein the preset prediction model comprises at least one of a first prediction model and a second prediction model, the input of the first prediction model is the energy characteristics, the input of the second prediction model is the time-frequency characteristics, the second prediction model is a classification model adopting a Recurrent Neural Network (RNN), the target prediction result is a predicted operation state of the vehicle, and the operation state comprises at least one of an entering state, an exiting state and an inter-station driving state;
and determining the target running state of the vehicle according to the target prediction result.
In another aspect, an embodiment of the present application provides an apparatus for predicting an operating state of a vehicle, where the apparatus includes:
the acquisition module is used for collecting environmental sounds through the microphone when in a vehicle;
the characteristic extraction module is used for extracting the characteristics of the environmental sound to obtain the environmental sound characteristics of the environmental sound, wherein the environmental sound characteristics comprise at least one of energy characteristics and time-frequency characteristics;
the prediction module is used for inputting the environmental sound characteristics into a preset prediction model to obtain a target prediction result output by the preset prediction model, the preset prediction model comprises at least one of a first prediction model and a second prediction model, the input of the first prediction model is the energy characteristics, the input of the second prediction model is the time-frequency characteristics, the second prediction model is a classification model adopting a Recurrent Neural Network (RNN), the target prediction result is a predicted operation state of the vehicle, and the operation state comprises at least one of an entering state, an exiting state and an inter-station driving state;
and the state determining module is used for determining the target running state of the vehicle according to the target prediction result.
In another aspect, an embodiment of the present application provides a terminal, where the terminal includes a processor and a memory; the memory stores at least one instruction for execution by the processor to implement the method of predicting a vehicle operating condition of the above aspect.
In another aspect, the present embodiment provides a computer-readable storage medium, which stores at least one instruction for execution by a processor to implement the method for predicting the vehicle operating state according to the above aspect.
The technical scheme provided by the embodiment of the application has the beneficial effects that at least:
in the embodiment of the application, whether the vehicle enters or leaves a station is judged by collecting the environmental sound in real time and predicting the running state of the vehicle from the features of that sound. The terminal extracts energy features and time-frequency features from the collected environmental sound, inputs the resulting features into a preset prediction model, and uses at least one of two methods: detecting the energy change of the environmental sound, or classifying the environmental sound by its time-frequency features with a Recurrent Neural Network (RNN); this improves the accuracy of the prediction result. Because the sound change when a vehicle brakes on entering a station or starts on leaving one is pronounced and unaffected by factors such as the motion state of the terminal or other environmental sounds, predicting the running state from the vehicle's sound changes improves both the accuracy and the effectiveness of running state prediction.
Drawings
FIG. 1 is a flow chart illustrating a method of predicting an operating condition of a vehicle in accordance with an exemplary embodiment;
FIG. 2 is a flow chart illustrating a method of operating state prediction for a vehicle in accordance with another exemplary embodiment;
FIG. 3 is a block diagram illustrating the structure of an energy feature extraction module in accordance with an exemplary embodiment;
FIG. 4 is a block diagram illustrating the structure of the MFCC feature matrix extraction module in accordance with an exemplary embodiment;
FIG. 5 is a schematic diagram illustrating a process of computing a target feature matrix in accordance with an exemplary embodiment;
FIG. 6 is an environmental audio spectrum diagram illustrating an outbound status of a vehicle in accordance with an exemplary embodiment;
FIG. 7 is an environmental audio spectrum diagram illustrating a vehicle inbound situation in accordance with an exemplary embodiment;
FIG. 8 is a block diagram illustrating the structure of a second predictive model in accordance with an exemplary embodiment;
FIG. 9 is a flow chart illustrating a method of operating state prediction for a vehicle in accordance with another exemplary embodiment;
FIG. 10 is a schematic diagram illustrating a sliding window process, according to an exemplary embodiment;
FIG. 11 is a flow chart illustrating a method of operating state prediction for a vehicle in accordance with another exemplary embodiment;
fig. 12 is a block diagram showing the configuration of an operation state prediction apparatus of a vehicle according to an exemplary embodiment;
fig. 13 is a block diagram illustrating a structure of a terminal according to an exemplary embodiment.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Reference herein to "a plurality" means two or more. "And/or" describes an association between objects and indicates that three relationships are possible; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates an "or" relationship between the associated objects.
In the related art, a terminal generally uses data collected by sensors (such as an acceleration sensor, a gravity sensor, and a magnetic sensor) to detect acceleration and deceleration and thereby infer whether a vehicle is entering or leaving a station, so as to determine whether the station where the terminal is currently located is the passenger's target station.
However, when this method is adopted to predict whether the vehicle enters or leaves a station, several problems arise: the vehicle does not run at a constant speed and exhibits a certain amount of oscillation; the posture in which the user holds the terminal affects the acceleration direction recorded by the acceleration sensor; and if the user walks inside the vehicle, the acceleration recorded by the terminal includes the user's walking acceleration. These factors make it difficult to judge whether the vehicle is in an acceleration or deceleration state and cause inaccurate station entry and exit predictions.
In order to solve the above problem, an embodiment of the present application provides a method for predicting the running state of a vehicle. The method is used on a terminal with audio acquisition and processing functions, such as a smartphone, tablet computer, e-book reader, or wearable smart device. In a possible implementation manner, the method is implemented as an application program, or part of one, installed in the terminal; when the user boards a vehicle, the application can be started manually (or automatically) and then prompts the user with the current station.
Referring to fig. 1, a flow chart of a method for predicting a vehicle operating state according to an embodiment of the present application is shown. The embodiment takes an example that a prediction method of a vehicle running state is used for a terminal with audio acquisition and processing functions, and the method comprises the following steps:
in step 101, ambient sounds are collected by a microphone while in a vehicle.
When in a vehicle, the terminal starts the running state prediction function and collects the environmental sound in real time through the microphone.
In a possible implementation mode, when the method for predicting the running state of the vehicle is applied in a map navigation application program, the terminal acquires the user's position information in real time and starts the running state prediction function when the position information indicates that the user has entered a vehicle.
Optionally, when the user uses the payment application program to swipe a card to take the vehicle, the terminal confirms to enter the vehicle and starts the running state prediction function.
Optionally, in order to reduce power consumption of the terminal, the terminal may use a low power consumption microphone to perform real-time acquisition.
And 102, extracting the characteristics of the environmental sound to obtain the environmental sound characteristics of the environmental sound, wherein the environmental sound characteristics comprise at least one of energy characteristics and time-frequency characteristics.
Since the terminal cannot directly recognize the sound change condition of the vehicle in operation from the environmental sound, the acquired environmental sound needs to be preprocessed. In a possible implementation manner, the terminal converts the environmental sound acquired by the microphone in real time into audio data, and performs environmental sound feature extraction on the audio data to obtain digital features that can be recognized by the terminal.
When a vehicle enters a station and brakes, the energy of sound gradually changes from large to small, and the frequency of a sound signal also changes obviously; similarly, when the vehicle is accelerated when going out of the station, there is a gradual change process and frequency change of the energy of the sound from small to large. Therefore, in a possible implementation manner, after the terminal collects the environmental sound, at least one of the energy feature and the time-frequency feature of the environmental sound is extracted.
Step 103, inputting the environmental sound characteristics into a preset prediction model to obtain a target prediction result output by the preset prediction model, wherein the preset prediction model comprises at least one of a first prediction model and a second prediction model.
The input of the first prediction model is energy characteristics, the input of the second prediction model is time-frequency characteristics, the second prediction model is a classification model adopting RNN, the target prediction result is the predicted operation state of the vehicle, and the operation state comprises at least one of an entering state, an exiting state and an inter-station driving state.
In a possible implementation manner, the terminal takes a simple first prediction model with low power consumption as a preset prediction model, and obtains a target prediction result by using energy characteristics of environmental sounds.
Optionally, in order to improve the accuracy of the prediction result, the terminal uses a second prediction model with high accuracy but high power consumption as a preset prediction model, and inputs the time-frequency characteristics of the environmental sound into the RNN-based neural network model to obtain the target prediction result.
Optionally, in order to improve the accuracy of the prediction result while reducing terminal power consumption, the terminal is provided with both the first prediction model and the second prediction model: the first prediction model runs continuously while the terminal is in a vehicle, the second prediction model is started when the first prediction model judges that the vehicle may be in an inbound or outbound state, and the second prediction model is shut down after the target prediction result is obtained.
And 104, determining the target running state of the vehicle according to the target prediction result.
In one possible implementation mode, the terminal determines the target running state of the vehicle according to the target prediction result of the at least one preset prediction model. Optionally, an inbound or outbound prompt message is sent to the passenger when the vehicle is predicted to be about to enter or leave a station.
In summary, in the embodiment of the application, whether the vehicle enters or leaves a station is judged by collecting the environmental sound in real time and predicting the running state of the vehicle from the features of that sound. The terminal extracts energy features and time-frequency features from the collected environmental sound, inputs them into the preset prediction model, and uses at least one of two methods: detecting the energy change of the environmental sound, or classifying the environmental sound by its time-frequency features with the RNN; this improves the accuracy of the prediction result. Because the sound change when a vehicle brakes on entering a station or starts on leaving one is pronounced and unaffected by factors such as the motion state of the terminal or other environmental sounds, predicting the running state from the vehicle's sound changes improves both the accuracy and the effectiveness of the prediction.
The terminal must keep the microphone on throughout the journey to acquire the environmental sound, and must input the sound's features into the preset prediction model to predict the running state, so the preset prediction model is always in a working state. If only one preset prediction model is provided in the terminal, either the prediction accuracy is low or the terminal power consumption is high. Therefore, in a possible implementation mode, two preset prediction models are provided in the terminal: the simple, low-power model runs in real time, and the complex, higher-power but more accurate model is started when the prediction result of the former meets a condition, to further predict the running state of the vehicle.
Referring to fig. 2, a flow chart of a method for predicting a vehicle operating state according to another embodiment of the present application is shown. The embodiment takes an example that a prediction method of a vehicle running state is used for a terminal with audio acquisition and processing functions, and the method comprises the following steps:
in step 201, ambient sounds are collected by a microphone while in a vehicle.
The step 101 may be referred to in the implementation manner of the step 201, and this embodiment is not described herein again.
Step 202, extracting the features of the environmental sound to obtain the environmental sound features of the environmental sound, wherein the environmental sound features include at least one of energy features and time-frequency features.
In a possible implementation manner, when the operating state is predicted by using the first prediction model, the terminal first extracts the energy feature corresponding to the environmental sound, and step 202 includes the following steps one to four:
firstly, performing framing processing on audio data of environmental sound to obtain m audio frames, wherein m is an integer greater than or equal to 2.
Referring to fig. 3, a schematic diagram of an energy feature extraction module is shown. After the terminal acquires the audio data, the framing module 301 is first used to frame the audio data.
Because the terminal microphone collects the environmental sound in real time, the audio data is not stationary as a whole, although short local segments can be regarded as stationary, and the preset prediction model can only process stationary data. The terminal therefore first performs framing processing on the corresponding audio data to obtain the audio data of m different audio frames, where m is an integer greater than or equal to 2.
Illustratively, the terminal performs framing processing on the audio data of the environmental sound by taking 64ms as a frame, and each frame is not overlapped.
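For illustration, a minimal NumPy sketch of this framing step, assuming 16 kHz mono samples; the function name and sample rate are illustrative, not part of the patent:

```python
import numpy as np

def frame_audio(samples, sr=16000, frame_ms=64):
    """Split audio into non-overlapping frames, e.g. 64 ms (1024 samples at 16 kHz)."""
    frame_len = sr * frame_ms // 1000
    m = len(samples) // frame_len            # m complete audio frames
    return samples[:m * frame_len].reshape(m, frame_len)
```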
And secondly, calculating the energy value of each audio frame in a preset frequency band, wherein the preset frequency band is a low-frequency band lower than the preset frequency.
Optionally, because the sound signal when the vehicle runs is mainly a low-frequency signal, in order to reduce the influence of the rest high-frequency environmental sounds, the terminal only selects a signal of a low-frequency part for calculation when calculating the energy value, for example, 0 to 600 Hz.
In one possible implementation, the terminal performs fourier transform on each audio frame by using the fourier transform module 302, and the formula is as follows:
$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j\frac{2\pi}{N}kn}, \quad k = 0, 1, \dots, N-1$$
where N is the number of Fourier transform points, k is the frequency index of the Fourier transform, and x(n) is the audio data of the audio frame.
The terminal calculates the energy value of the Fourier-transformed audio frame by using the energy calculation module 303. Optionally, k represents the frequency information of the environmental sound, with larger k corresponding to higher frequency. The energy of the first Y frequency bins of each audio frame is calculated as:
$$E_t = \sum_{k=1}^{Y} \left|X_t(k)\right|^2$$
where t indexes the audio frames, T is the number of audio frames, and Y is the number of low-frequency bins.
And thirdly, calculating a first energy value according to the energy values corresponding to the previous m/2 audio frames.
In order to obtain the energy change condition of the preset frequency band in the environmental sound, the terminal needs to combine the audio frame energy values corresponding to part of the environmental sound, and compare the combined energy values of the parts.
Optionally, the terminal intercepts environmental sound of a certain duration for each prediction. For example, if the intercepted environmental sound has duration t, the energy feature includes a first energy value over the first t/2 of the sound and a second energy value over the last t/2, where t is a number greater than 0. After calculating the energy value of each audio frame, the terminal computes the first energy value from the m/2 audio frames corresponding to the first t/2 of the environmental sound.
The terminal calculates the first energy value by using the energy merging module 304 according to the following calculation formula:
$$E_1 = \sum_{t=1}^{m/2} E_t$$
and fourthly, calculating a second energy value according to the energy values corresponding to the m/2 audio frames.
Correspondingly, the terminal calculates a second energy value corresponding to m/2 audio frames by using the energy merging module 304, and the calculation formula is as follows:
$$E_2 = \sum_{t=m/2+1}^{m} E_t$$
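A sketch of steps two to four under the same assumptions (NumPy, 16 kHz; the 600 Hz cutoff follows the example above, and the helper name is illustrative):

```python
import numpy as np

def energy_features(frames, sr=16000, max_hz=600):
    """Per-frame low-band energy via FFT, then first/second half sums E1, E2."""
    n = frames.shape[1]                            # Fourier points per frame
    spectrum = np.fft.rfft(frames, axis=1)         # X_t(k) for each frame t
    y = int(max_hz * n / sr)                       # number of bins below max_hz
    e_t = np.sum(np.abs(spectrum[:, :y]) ** 2, axis=1)
    m = len(e_t)
    e1 = e_t[: m // 2].sum()                       # energy of first t/2 duration
    e2 = e_t[m // 2 :].sum()                       # energy of last t/2 duration
    return e1, e2
```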
step 203, inputting the energy characteristics into the first prediction model to obtain a first prediction result output by the first prediction model.
In one possible embodiment, the first prediction model predicts the vehicle running state based on the energy features of the environmental sound. By acquiring the energy features of the current environmental sound and comparing the energy change of the corresponding frequency band over a period of time, the running state of the vehicle can be predicted: when the energy grows from small to large, the vehicle may be accelerating out of the station; when the energy falls from large to small, the vehicle may be decelerating into the station; and when the energy varies little, with no consistent trend, the vehicle may be driving between stations.
Optionally, the first prediction model may instead use data such as the zero-crossing rate as the environmental sound feature. For example, one or more frequencies are selected from the sound signals recorded while the vehicle is driving and while it is stopped to serve as reference frequencies, i.e. zero values; audio frames whose signal frequency crosses this zero are identified, and whether the vehicle is entering or leaving the station is determined from the zero-crossing rate of the audio data of the current environmental sound.
In a possible embodiment, the duration of the ambient sound is t, the energy characteristic includes a first energy value of the ambient sound in the first t/2 duration and a second energy value of the ambient sound in the last t/2 duration, t is a number greater than 0, and step 203 includes the following steps one to four:
firstly, inputting the first energy value and the second energy value into a first prediction model.
When the terminal predicts by using the first prediction model, the energy characteristics are a first energy value and a second energy value, in order to reflect the energy change of low-frequency parts in environmental sound, the terminal firstly calculates the ratio of the second energy value to the first energy value, namely E2/E1, and if the ratio is greater than 1, the energy is increased; if the ratio is less than 1, it indicates a decrease in energy.
And secondly, outputting a first prediction result indicating that the vehicle is in a driving state in response to the ratio of the second energy value to the first energy value being smaller than or equal to a first threshold value and being larger than or equal to a second threshold value, wherein the second threshold value is smaller than the first threshold value.
When the ratio of the second energy value to the first energy value is in different ranges, the corresponding running states of the vehicles are different. For example, when the ratio fluctuates within a certain range, the vehicle is in a driving state, and when the ratio exceeds a certain threshold value, it indicates that the energy change of the low-frequency part of the ambient sound in the period of time is obvious and has a certain trend, and the vehicle may be in an inbound state or an outbound state.
Optionally, when the ratio of the second energy value to the first energy value is smaller than or equal to the first threshold and greater than or equal to the second threshold, it indicates that the speed of the vehicle does not change much at the time and the vehicle is running smoothly, and outputs a first prediction result indicating that the vehicle is in a running state, where the first prediction result is that the vehicle is in a station-to-station running state.
Illustratively, the first threshold is 1.2, the second threshold is 0.8, and if the ratio of the second energy value to the first energy value of the current environmental sound is 1.1, the output first prediction result is that the vehicle is in the inter-station driving state.
And thirdly, in response to the ratio of the second energy value to the first energy value being smaller than a second threshold value, outputting a first prediction result indicating that the vehicle is in the inbound state.
When the ratio of the second energy value to the first energy value is smaller than a second threshold value, the sound energy generated by the running of the vehicle is changed from large to small at the moment, and the change rate is large, and a first prediction result indicating that the vehicle is in the station entering state is output.
And fourthly, responding to the ratio of the second energy value to the first energy value being larger than a first threshold value, and outputting a first prediction result indicating that the vehicle is in the outbound state.
When the ratio of the second energy value to the first energy value is larger than a first threshold value, the sound energy generated by the running of the vehicle is changed from small to large at the moment, and the change rate is larger, and a first prediction result indicating that the vehicle is in an outbound state is output.
Illustratively, based on the example in step one, when the ratio of the second energy value to the first energy value is greater than 1.2, the vehicle is in an outbound state; and when the ratio of the second energy value to the first energy value is less than 0.8, the vehicle is in the station entering state.
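The decision logic of the first prediction model can be sketched as follows (the thresholds 1.2 and 0.8 are taken from the example above; the function name is illustrative):

```python
def first_model_predict(e1, e2, upper=1.2, lower=0.8):
    """Classify the running state from the low-band energy ratio E2/E1."""
    ratio = e2 / e1
    if ratio > upper:
        return "outbound"       # energy rising: accelerating out of the station
    if ratio < lower:
        return "inbound"        # energy falling: braking into the station
    return "inter-station"      # stable energy: driving between stations
```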
And step 204, responding to the first prediction result indicating that the vehicle is in the inter-station driving state, and determining the first prediction result as a target prediction result.
Since the first prediction model predicts only from the energy features of the environmental sound, its view of the sound change is not precise enough. However, when the first prediction result of the first prediction model indicates that the vehicle is in the inter-station driving state, the change of the environmental sound at that moment does not match the energy signature of an inbound or outbound state, so the first prediction result can be taken directly as the target prediction result; that is, the running state of the vehicle is the inter-station driving state.
Step 205, in response to the first prediction result indicating that the vehicle is in the inbound state or the outbound state, inputting the time-frequency characteristics into the second prediction model to obtain a second prediction result output by the second prediction model, and determining the second prediction result as a target prediction result.
In a possible implementation manner, when the first prediction result indicates that the vehicle is in an inbound or outbound state, a more accurate model is used for further prediction in order to ensure the accuracy of the result and avoid a wrong inbound/outbound prediction. The terminal inputs the time-frequency features of the environmental sound into the second prediction model, which predicts from those time-frequency features; a deep learning algorithm can be used to construct the second prediction model to improve the prediction accuracy.
And when the first prediction result is that the vehicle is in an inbound state or an outbound state, the terminal starts a second prediction model for prediction, the obtained second prediction result may be that the vehicle is in the inbound state, the outbound state or a driving state, and the terminal determines the second prediction result as a target prediction result.
And step 206, determining the target running state of the vehicle according to the target prediction result.
The step 206 may be implemented by referring to the step 104, and this embodiment is not described herein again.
In the embodiment of the application, two preset prediction models are used to predict the running state of the vehicle. The first prediction model has low power consumption and is easy to implement; it runs in real time and monitors the environmental sound features. The second prediction model has high accuracy but higher power consumption; it is started only when the first prediction model judges that the vehicle may be entering or leaving a station. This improves the accuracy of the prediction result while reducing the power consumption of the terminal.
When the second prediction model is used for predicting the running state of the vehicle, the terminal firstly extracts the time-frequency characteristics of the audio data of the environmental sounds, and then inputs the corresponding time-frequency characteristics into the second prediction model based on the RNN to obtain a second prediction result. In a possible implementation manner, after the terminal performs step 203, if the first prediction result is an inbound state or an outbound state, step 202 is performed again to perform time-frequency feature extraction, and when the environmental sound feature is a time-frequency feature, step 202 includes the following five to seven steps:
and fifthly, dividing the ambient sound into i sections, wherein i is an integer greater than or equal to 2.
In one possible implementation mode, in order to facilitate subsequent feature extraction and operation state prediction based on a preset prediction model, the terminal processes the collected continuous environmental sounds in segments. Illustratively, the terminal intercepts 10080ms of environment sound every time to perform feature extraction, and predicts the running state of the vehicle within 10080 ms.
Optionally, before extracting the time-frequency features of the environmental sound, the terminal decomposes the audio data of the environmental sound into i intervals, for example, the terminal decomposes 10080ms of audio data into 10 intervals of audio data, and the time length of each interval is 1008 ms.
And sixthly, performing Mel filtering on the audio data of each interval's environmental sound to obtain Mel-Frequency Cepstral Coefficient (MFCC) feature matrices for the i segments of audio data.
Referring to fig. 4, a block diagram of the MFCC feature matrix extraction module is shown. The audio data is first pre-emphasized by the pre-emphasis module 401. Pre-emphasis uses a high-pass filter, which passes signal components above a certain frequency and suppresses those below it, removing unwanted low-frequency interference such as speech, footsteps, and mechanical noise, and flattening the spectrum of the audio signal. The mathematical expression of the high-pass filter is:
$$H(z) = 1 - a z^{-1}$$
where a is a correction coefficient, generally between 0.95 and 0.97, and z is the complex variable of the z-transform.
The audio data after the noise removal is subjected to framing processing by the framing windowing module 402, so as to obtain audio data corresponding to different audio frames. Illustratively, in this embodiment, the audio data including 512 data points is divided into one frame, when the sampling frequency of the audio data is 16000Hz, the duration of the one frame of audio data is 32ms, and when the audio data is actually extracted, 16ms overlaps between two adjacent frames.
A discrete Fourier transform must be applied to the framed audio data during subsequent feature extraction, but a single frame of audio has no obvious periodicity; that is, its left and right ends are discontinuous, so the Fourier-transformed data deviates from the original data, and the more frames there are, the larger the accumulated error. To make the framed audio continuous, so that each frame exhibits the features of a periodic function, windowing is performed by the framing and windowing module 402.
In one possible implementation, a Hamming window is used to window the audio frames. The Hamming window function is multiplied with each frame of data, and the resulting audio data exhibits clear periodicity. The Hamming window has the functional form:
$$w(n) = 0.54 - 0.46 \cos\!\left(\frac{2\pi n}{M}\right), \quad 0 \le n \le M$$
where n is an integer ranging from 0 to M, and M is the number of Fourier transform points; illustratively, 512 data points are used as the Fourier transform points in this embodiment.
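A minimal sketch of the pre-emphasis and windowing steps described above (NumPy; note the denominator follows the patent's 0 to M range, whereas np.hamming uses M-1):

```python
import numpy as np

def pre_emphasis(x, a=0.97):
    """High-pass filter H(z) = 1 - a*z^-1, i.e. y[n] = x[n] - a*x[n-1]."""
    return np.append(x[0], x[1:] - a * x[:-1])

def hamming_window(m):
    """Hamming window w(n) = 0.54 - 0.46*cos(2*pi*n/m)."""
    n = np.arange(m)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / m)

# e.g. windowing one 512-point frame before the Fourier transform:
# frame = pre_emphasis(frame) * hamming_window(len(frame))
```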
Since the characteristics of an audio signal are difficult to obtain from its time-domain form, the time-domain signal usually needs to be converted into an energy distribution in the frequency domain for processing. The terminal therefore first inputs the audio frame data into the Fourier transform module 403 for Fourier transformation, and then feeds the transformed audio frame data into the energy spectrum calculation module 404 to calculate its energy spectrum. To convert the energy spectrum into a Mel spectrum that matches human auditory perception, the energy spectrum is input into the Mel filtering module for filtering, whose mathematical expression is:
$$\mathrm{Mel}(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right)$$
wherein f is a frequency point after Fourier transform.
After obtaining the Mel spectrum of the audio frame, the terminal takes its logarithm and applies a Discrete Cosine Transform (DCT) via the DCT module 406; the resulting DCT coefficients are the MFCC features.
And seventhly, combining the MFCC characteristic matrixes of the i sections of audio data to obtain time-frequency characteristics.
Illustratively, in the embodiment of the present application, 64-dimensional MFCC features are selected. When the terminal actually extracts features, the duration of the audio data in one interval is 1008ms, the duration of one audio frame is 32ms, and adjacent audio frames overlap by 16ms, so the time-frequency feature corresponding to each interval's audio data is an MFCC feature matrix of 62 × 64, and the time-frequency feature corresponding to each input audio clip is an MFCC feature tensor of 10 × 62 × 64.
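The whole extraction pipeline can be approximated with librosa's built-in MFCC extractor. This is a sketch under assumed parameters (librosa is not named in the patent), chosen so the shapes match the 10 × 62 × 64 tensor described above:

```python
import librosa
import numpy as np

def extract_time_freq_features(samples, sr=16000):
    """10080 ms clip -> 10 intervals of 1008 ms -> (10, 62, 64) MFCC tensor."""
    interval_len = int(sr * 1.008)                 # 16128 samples per interval
    intervals = samples[: 10 * interval_len].reshape(10, interval_len)
    features = []
    for seg in intervals:
        # 512-sample frames, 256-sample (16 ms) hop -> 62 frames x 64 coefficients
        mfcc = librosa.feature.mfcc(y=seg.astype(np.float32), sr=sr, n_mfcc=64,
                                    n_fft=512, hop_length=256, center=False)
        features.append(mfcc.T)                    # (62, 64) per interval
    return np.stack(features)                      # (10, 62, 64)
```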
After obtaining the time-frequency features of the environmental sound, the terminal performs step 205 and inputs the time-frequency features into the second prediction model to obtain the second prediction result. The second prediction model comprises a Gated Recurrent Unit (GRU) layer, an attention mechanism layer, a fully connected layer, and a classification layer. In one possible implementation, step 205 includes:
firstly, decomposing the time-frequency characteristics to obtain n time-frequency characteristic matrixes, wherein the dimensionalities of the time-frequency characteristic matrixes are the same.
Because the RNN is a neural network that processes sequence data, before the audio feature matrix is input to the second prediction model, the terminal first decomposes the time-frequency features obtained after the feature extraction to obtain n time-frequency feature matrices with the same dimensionality.
Illustratively, the terminal directly decomposes the time-frequency characteristics of 10080ms audio data into MFCC characteristic matrices of various intervals, which are 62 × 64 time-frequency characteristic matrices.
And secondly, extracting the characteristics of the n time-frequency characteristic matrixes through the GRU layer and the attention mechanism layer to obtain a target characteristic matrix.
Because sound is a time sequence feature, the terminal inputs the time-frequency feature matrix obtained by decomposition into the RNN model, and extracts correlation and effective information among different time-frequency feature matrices through a GRU layer and an attention mechanism layer in the model, thereby obtaining a target feature matrix.
In one possible embodiment, step two includes the following steps a to c:
a. and inputting the n time-frequency characteristic matrixes into the GRU layer to obtain candidate characteristic matrixes corresponding to the time-frequency characteristic matrixes.
In one possible embodiment, as shown in fig. 5, the first and second layers of the second prediction model are GRU layer 501 and GRU layer 502. The GRU is a commonly used gated recurrent neural network whose inputs are the input at the current time and the hidden state from the previous time; that is, the output y_t is influenced by the information at the current time t and at the preceding times. The terminal inputs the decomposed time-frequency feature matrices x_1 to x_t into the GRU to obtain the corresponding candidate feature matrices y_1 to y_t, where t is the number of time-frequency feature matrices.
b. Inputting the n candidate feature matrices into an attention mechanism layer to obtain matrix weights corresponding to the candidate feature matrices, and normalizing the matrix weights.
In one possible implementation, as shown in fig. 5, the third layer of the second prediction model is an attention mechanism layer 503, which determines the portions of the input that require attention and allocates the limited information processing resources to the important portions; mathematically, this is embodied as computing the weights α_t. After obtaining the candidate feature matrices from the GRU layers, the terminal uses the attention mechanism layer to compute the matrix weight of each candidate feature matrix according to:
$$e_t = \tanh(w_t y_t + b), \qquad \alpha_t = \frac{\exp(e_t)}{\sum_{t=1}^{T} \exp(e_t)}$$
where y_t is the candidate feature matrix output by the GRU, e_t is the weight corresponding to each candidate feature matrix y_t, α_t is the matrix weight obtained by normalizing e_t, and w_t and b are parameters of the weight calculation obtained through model training.
c. And determining a target characteristic matrix according to the candidate characteristic matrix and the matrix weight.
After the terminal calculates the matrix weight of the candidate feature matrix through the attention mechanism layer 503, the terminal performs weighted calculation on the candidate feature matrix to obtain a target feature matrix, and the calculation formula is as follows:
$$y = \sum_{t=1}^{T} \alpha_t y_t$$
where y is the target feature matrix, α_t is the matrix weight, y_t is the candidate feature matrix, and T is the total number of candidate feature matrices.
The target characteristic matrix integrates the characteristics of each frame of audio data in the current environmental sound, and the second prediction model utilizes the target characteristic matrix for identification, so that the running state of the vehicle can be accurately judged.
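A minimal Keras sketch of such an attention layer, implementing the e_t, α_t, and weighted-sum formulas above (TensorFlow is mentioned later for training; the layer and variable names are illustrative):

```python
import tensorflow as tf
from tensorflow.keras import layers

class AttentionPooling(layers.Layer):
    """e_t = tanh(w·y_t + b); alpha = softmax(e); y = sum_t alpha_t * y_t."""
    def build(self, input_shape):
        self.w = self.add_weight(name="w", shape=(input_shape[-1], 1))
        self.b = self.add_weight(name="b", shape=(1,), initializer="zeros")

    def call(self, y):                                # y: (batch, T, units)
        e = tf.tanh(tf.matmul(y, self.w) + self.b)    # (batch, T, 1)
        alpha = tf.nn.softmax(e, axis=1)              # normalize over time steps
        return tf.reduce_sum(alpha * y, axis=1)       # (batch, units)
```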
And thirdly, classifying the target feature matrix through a full connection layer and a classification layer to obtain a target prediction result.
In one possible implementation, after the attention mechanism layer 503, the second prediction model further includes two Fully Connected (FC) layers and a classification layer. After obtaining the target feature matrix from the weighted calculation, the terminal integrates and classifies its information through the FC layers and the classification layer, and outputs the final second prediction result.
Optionally, the classification layer classifies the target feature matrix by using a normalized exponential function (Softmax), and an output result indicates a corresponding vehicle operating state.
In one possible embodiment, the second prediction model is trained on samples by a gradient descent algorithm, with Focal Loss as the loss function. The model training process is as follows:
Firstly, the collected environmental sounds, covering the inbound, outbound, and inter-station driving states, are converted into spectrograms.
Referring to fig. 6 and fig. 7, which show the spectrograms of the outbound and inbound states of a vehicle respectively, the energy change of the low-frequency part can be seen clearly. In the spectrogram of the outbound state, the low-frequency energy grows from low to high, and the bottom region 601 changes from dark to bright; in the spectrogram of the inbound state, the low-frequency energy falls from high to low, and the bottom region 701 changes from bright to dark. In the spectrogram of the driving state, the low-frequency energy changes little and the bottom of the spectrogram is relatively uniform.
And secondly, extracting the characteristics of the collected environmental sounds.
The features of the pre-collected environmental sounds are extracted in the same way as the time-frequency features in the above embodiment, and the time-frequency feature matrix corresponding to each audio clip is used as a training sample: the first sample is sample environmental sound collected in the inbound state, with sample label 1; the second sample is sample environmental sound collected in the outbound state, with sample label 0; the third sample is sample environmental sound collected in the driving state, with sample label 2.
And thirdly, establishing an RNN model.
In a possible embodiment, the RNN model structure is as shown in fig. 8: the first GRU layer 801 and the second GRU layer 802 extract features from the input MFCC feature matrices and convert them into candidate feature matrices; the attention mechanism layer 803 computes the matrix weights of the candidate feature matrices and performs a weighted calculation to obtain the target feature matrix; the first fully-connected layer 804 and the second fully-connected layer 805 integrate the class-discriminative information in the target feature matrix; and finally the Softmax layer 806 classifies the integrated information to obtain the second prediction result.
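Under the same assumptions, the architecture of fig. 8 could be assembled as follows; the hidden sizes (128, 64, 32) are illustrative guesses, since the patent does not specify them. It reuses the AttentionPooling layer and imports sketched earlier:

```python
def build_second_model(t_steps=10, feat_dim=62 * 64, units=128):
    """Two GRU layers -> attention -> two FC layers -> Softmax over 3 states."""
    inputs = layers.Input(shape=(t_steps, feat_dim))   # one flattened MFCC matrix per interval
    x = layers.GRU(units, return_sequences=True)(inputs)   # GRU layer 801
    x = layers.GRU(units, return_sequences=True)(x)        # GRU layer 802
    x = AttentionPooling()(x)                              # attention layer 803
    x = layers.Dense(64, activation="relu")(x)             # FC layer 804
    x = layers.Dense(32, activation="relu")(x)             # FC layer 805
    # labels: 0 = outbound, 1 = inbound, 2 = inter-station driving
    outputs = layers.Dense(3, activation="softmax")(x)     # Softmax layer 806
    return tf.keras.Model(inputs, outputs)
```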
And fourthly, constructing a loss function of the model.
While a vehicle is running, the inbound and outbound states usually last only a few seconds, whereas the remaining environmental sound lasts several minutes, so the sample data is highly unbalanced. The following loss function is chosen to address this sample imbalance:
$$FL = -\alpha \left(1 - y'\right)^{\gamma} \log\left(y'\right)$$
where y' is the probability output by the classification model for the true label y of the training sample, and α and γ are manually tuned parameters used to adjust the relative weights of the three sample classes.
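A sketch of this loss for one-hot labels (α = 0.25 and γ = 2.0 are common defaults from the focal loss literature, not values given in the patent):

```python
def focal_loss(alpha=0.25, gamma=2.0):
    """FL = -alpha * (1 - y')^gamma * log(y'), evaluated on the true class."""
    def loss(y_true, y_pred):
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0)   # avoid log(0)
        return tf.reduce_sum(
            -alpha * y_true * tf.pow(1.0 - y_pred, gamma) * tf.math.log(y_pred),
            axis=-1)
    return loss
```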
And fifthly, importing training samples to perform model training.
In one possible implementation, the open source software library Tensorflow may be used to train the RNN classification model and employ the Focal loss and gradient descent algorithm until the model converges, at which time the model obtains the final network parameters.
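Training could then look like the following sketch, reusing the builders above (Adam stands in for the gradient descent algorithm; batch size and epoch count are illustrative, and train_x/train_y are placeholder names for the prepared sample tensors):

```python
model = build_second_model()
model.compile(optimizer="adam",
              loss=focal_loss(alpha=0.25, gamma=2.0),
              metrics=["accuracy"])
# train_x: (num_samples, 10, 62*64) MFCC tensors; train_y: one-hot labels
model.fit(train_x, train_y, batch_size=32, epochs=20)
```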
In a possible implementation manner, the second prediction model may also use other conventional machine learning classifiers or deep learning classification models, which is not limited in this embodiment.
In the embodiment of the application, the GRU layers of the RNN model extract features from the time-frequency feature matrices, and an attention mechanism is added to compute weights from the correlation and temporal characteristics among the time-frequency feature matrices; the target feature matrix is then obtained by weighted calculation. This improves the accuracy of the second prediction model and thereby the accuracy and timeliness of the vehicle running state prediction.
During an actual ride, abnormal conditions may cause the low-frequency part of the environmental sound to change markedly over some period even though the vehicle is merely driving between stations; if the running state were determined from a single prediction of the preset prediction model, the inbound/outbound prediction could be wrong. To further improve the accuracy of the prediction result, in a possible implementation, after collecting the environmental sound through the microphone, the terminal applies sliding-window processing to the collected sound to obtain the environmental sound in each audio window.
In another possible implementation, on the basis of fig. 2, as shown in fig. 9, the method further includes step 207 and step 208 to step 211:
and step 207, performing sliding window processing on the acquired environmental sounds to obtain the environmental sounds in each audio window.
Illustratively, in order to process the audio data of the environmental sound conveniently, the terminal divides the continuous audio data, as shown in fig. 10, performs sliding window processing on the audio data of the environmental sound with 10080ms as a window length and 2.5s as a step length to obtain three audio windows 1001, 1002, and 1003, and performs subsequent steps of extracting the characteristics of the environmental sound and predicting the model with the audio window as a unit.
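A sketch of this sliding-window segmentation (a NumPy-style sample array is assumed; names are illustrative):

```python
def sliding_windows(samples, sr=16000, window_ms=10080, step_s=2.5):
    """Yield overlapping audio windows: 10080 ms long, advancing 2.5 s each time."""
    win = sr * window_ms // 1000
    step = int(sr * step_s)
    for start in range(0, len(samples) - win + 1, step):
        yield samples[start : start + win]
```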
Optionally, on the basis of step 207, step 206 may be replaced by step 208:
and step 208, in response to that the target prediction results corresponding to the n consecutive audio windows are the same and the confidence of the target prediction results is higher than the confidence threshold, determining the operation state indicated by the target prediction results as the target operation state.
Since the step length of the sliding window is small, far shorter than the duration of an inbound, outbound, or driving phase, the target prediction results of at least two audio windows will indicate the inbound or outbound state when the vehicle actually enters or leaves a station. Therefore, to eliminate abnormal conditions, the terminal determines the running state indicated by the target prediction results as the target running state only when the target prediction results of n consecutive audio windows are the same and their confidence is above the confidence threshold, where the confidence is the probability value in the output of the second prediction model.
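The post-processing rule can be sketched as follows (n = 2 and the 0.8 confidence threshold are illustrative; the patent leaves both configurable):

```python
def confirm_state(window_results, n=2, conf_threshold=0.8):
    """Accept a state only when n consecutive windows agree with high confidence."""
    recent = window_results[-n:]                    # list of (state, confidence)
    states = {state for state, _ in recent}
    if len(states) == 1 and all(c > conf_threshold for _, c in recent):
        return recent[0][0]                         # confirmed target running state
    return None                                     # not yet confirmed
```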
And step 209, in response to the target running state being the inbound state or the outbound state, acquiring the average duration for which the vehicle maintains the inter-station driving state.
In one possible implementation mode, when the target running state is determined to be the inbound or outbound state, the terminal records the current time and recalculates the average duration for which the vehicle maintains the inter-station driving state. The average duration is the total time the vehicle has spent in the inter-station driving state since the terminal entered the vehicle, divided by the number of inter-station trips; the duration of each trip is obtained by subtracting the previous inbound time from the current inbound time, the previous outbound time from the current outbound time, or the previous outbound time from the current inbound time.
Illustratively, if the user enters the vehicle at 09:59, the terminal first detects the arrival of the vehicle at 10:10, and detects the arrival of the vehicle a second time at 10:18, then the first inter-station trip lasted 8 minutes. By analogy, the durations of the inter-station driving states are accumulated into a total and divided by the number of inter-station trips, yielding the average duration for which the vehicle maintains the inter-station driving state.
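Illustratively, the arrival-to-arrival variant of this computation could be sketched as follows (a non-limiting sketch; the helper name and the minutes-since-midnight timestamps are assumptions):

```python
def average_interstation_duration(arrival_times):
    """Average inter-station driving duration from successive arrivals.

    arrival_times: arrival timestamps in minutes since midnight. The
    description also allows departure-to-departure or departure-to-arrival
    differences; only the arrival-to-arrival case is sketched here.
    """
    trips = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    return sum(trips) / len(trips) if trips else None

# Example from the description: arrivals at 10:10 and 10:18 -> 8.0 minutes.
print(average_interstation_duration([10 * 60 + 10, 10 * 60 + 18]))
```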
Step 210: determine the next operating-state prediction interval according to the average duration and the current time.
In one possible implementation, the terminal determines the next operating-state prediction interval based on the average duration for which the vehicle maintains the inter-station driving state. For example, suppose the terminal calculates an average duration of 10 minutes. When the vehicle is detected entering or leaving a station, the time T1 at which the vehicle next arrives at a station, or the time T2 at which it next departs, is calculated from the current time and the average duration; a predetermined operating-state prediction duration is then appended after T1 or T2 to form the next prediction interval, or the span from T1 to T2 is selected as the next prediction interval.
Step 211: in response to reaching the next operating-state prediction interval, perform the step of collecting the environmental sound through the microphone.
When the number of stations passed by the vehicle reaches a threshold, the microphone can be turned off and turned on again when the next operating-state prediction interval is reached, thereby reducing the power consumption of the terminal.
Optionally, because the distances between stations differ, the durations for which the vehicle maintains the inter-station driving state also differ. To avoid the terminal missing the moment the vehicle enters or leaves a station, the next operating-state prediction interval is appropriately extended. For example, suppose the terminal measures the inter-station driving duration with arrival times as the reference, the current arrival time is 11:00, the calculated inter-station driving duration is 10 minutes, and the preset prediction duration is 3 minutes; the next prediction interval would then be 11:10 to 11:13. To avoid missing the vehicle's arrival or departure, the terminal advances the start time of this interval and delays its end time, for example setting the next prediction interval to 11:08 to 11:15: the terminal turns on the microphone at 11:08 to predict the operating state, and turns it off at 11:15 or immediately after the vehicle's arrival is detected.
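Illustratively, the interval computation of steps 209 to 211, including the widening margins, could be sketched as follows (a non-limiting sketch; the margin sizes are assumptions, chosen here to reproduce the 11:08 to 11:15 example):

```python
from datetime import datetime, timedelta

def next_prediction_interval(arrival, avg_minutes, predict_minutes=3,
                             lead_minutes=2, lag_minutes=2):
    """Compute the next operating-state prediction interval.

    Mirrors the worked example: arrival at 11:00 with a 10-minute average
    and a 3-minute prediction duration gives 11:10-11:13, widened by lead
    and lag margins to 11:08-11:15. Margin sizes are assumptions.
    """
    start = arrival + timedelta(minutes=avg_minutes)
    end = start + timedelta(minutes=predict_minutes)
    return (start - timedelta(minutes=lead_minutes),
            end + timedelta(minutes=lag_minutes))

begin, finish = next_prediction_interval(datetime(2020, 2, 26, 11, 0), 10)
print(begin.strftime("%H:%M"), finish.strftime("%H:%M"))  # 11:08 11:15
```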
In the embodiment of the present application, continuous environmental sound is subjected to sliding-window processing and the target prediction results output by the preset prediction model are post-processed (when the target prediction results corresponding to n consecutive audio windows are the same and the confidence of the target prediction result is higher than the confidence threshold, the operating state indicated by the target prediction result is determined as the target operating state), which further improves the accuracy of the prediction result and avoids the influence of abnormal conditions. In addition, when the number of stations passed by the vehicle reaches the threshold, the microphone is turned off and is restarted only within the next operating-state prediction interval to predict the operating state, which further reduces the power consumption of the terminal.
Referring to fig. 11, a flowchart of a method for predicting the operating state of a vehicle is shown, including the following steps:
Step 1101: input the environmental sound.
Step 1102: sliding-window processing.
Step 1103: energy feature extraction.
Step 1104: energy threshold judgment. When the judgment result is that the vehicle is in the inbound state or the outbound state, execute step 1105; otherwise, return to step 1101 and continue with the next environmental sound.
Step 1105: time-frequency feature extraction.
Step 1106: RNN classification. When the classification result is that the vehicle is in the inbound state or the outbound state, execute step 1107; otherwise, return to step 1101 and continue with the next environmental sound.
Step 1107: post-processing.
Step 1108: output the target operating state.
Referring to fig. 12, a block diagram of a vehicle operating state prediction apparatus according to an exemplary embodiment of the present application is shown. The apparatus may be implemented as all or a portion of the terminal in software, hardware, or a combination of both. The apparatus includes:
a collection module 1201, configured to collect environmental sound through a microphone when in a vehicle;
a feature extraction module 1202, configured to perform feature extraction on the environmental sound to obtain an environmental sound feature of the environmental sound, where the environmental sound feature includes at least one of an energy feature and a time-frequency feature;
a prediction module 1203, configured to input the environmental sound feature into a preset prediction model to obtain a target prediction result output by the preset prediction model, where the preset prediction model includes at least one of a first prediction model and a second prediction model, an input of the first prediction model is the energy feature, an input of the second prediction model is the time-frequency feature, the second prediction model is a classification model using a recurrent neural network RNN, the target prediction result is a predicted operation state of the vehicle, and the operation state includes at least one of an entry state, an exit state, and an inter-station driving state;
and a state determining module 1204, configured to determine a target operating state of the vehicle according to the target prediction result.
Optionally, the predicting module 1203 includes:
the first input unit is used for inputting the energy characteristics into the first prediction model to obtain a first prediction result output by the first prediction model;
a first determining unit, configured to determine the first prediction result as the target prediction result in response to the first prediction result indicating that the vehicle is in the inter-station driving state;
and a second determining unit, configured to, in response to the first prediction result indicating that the vehicle is in the inbound state or the outbound state, input the time-frequency feature into the second prediction model to obtain a second prediction result output by the second prediction model, and determine the second prediction result as the target prediction result.
Optionally, the second prediction model includes a gated recurrent unit (GRU) layer, an attention mechanism layer, a fully connected layer, and a classification layer;
the second determining unit is further configured to:
decomposing the time-frequency feature to obtain n time-frequency feature matrices, wherein the dimensions of the time-frequency feature matrices are the same;
extracting features of the n time-frequency feature matrices through the GRU layer and the attention mechanism layer to obtain a target feature matrix;
and classifying the target feature matrix through the fully connected layer and the classification layer to obtain the target prediction result.
Optionally, the second determining unit is further configured to:
inputting the n time-frequency feature matrices into the GRU layer to obtain a candidate feature matrix corresponding to each time-frequency feature matrix;
inputting the n candidate feature matrices into the attention mechanism layer to obtain a matrix weight corresponding to each candidate feature matrix, wherein the matrix weights are normalized;
and determining the target feature matrix according to the candidate feature matrices and the matrix weights, as sketched below.
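Illustratively, the second prediction model described above (GRU layer, attention mechanism layer, fully connected layer, and classification layer) could be sketched in PyTorch as follows (a non-limiting sketch; the framework choice, all dimensions, and the flattening of each feature matrix to a vector are assumptions):

```python
import torch
import torch.nn as nn

class SecondPredictionModel(nn.Module):
    """Sketch of the described structure: a GRU layer, an attention
    mechanism layer producing normalized matrix weights, a fully
    connected layer, and a classification layer. All sizes are
    illustrative assumptions, not values fixed by the patent."""

    def __init__(self, feat_dim=40, hidden_dim=64, num_states=3):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.attention = nn.Linear(hidden_dim, 1)  # scores each candidate matrix
        self.fc = nn.Linear(hidden_dim, num_states)

    def forward(self, x):
        # x: (batch, n, feat_dim) - the n time-frequency feature matrices,
        # each flattened to a vector for simplicity.
        candidates, _ = self.gru(x)                    # candidate feature matrices
        weights = torch.softmax(self.attention(candidates), dim=1)  # normalized matrix weights
        target = (weights * candidates).sum(dim=1)     # weighted sum -> target feature matrix
        return torch.softmax(self.fc(target), dim=-1)  # probabilities over the three states

model = SecondPredictionModel()
probs = model(torch.randn(2, 8, 40))  # batch of 2, n = 8 feature matrices
print(probs.shape)  # torch.Size([2, 3])
```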
Optionally, the duration of the environmental sound is t, the energy characteristic includes a first energy value of the environmental sound in a first t/2 duration and a second energy value of the environmental sound in a second t/2 duration, and t is a number greater than 0;
the first input unit is further configured to:
inputting said first energy value and said second energy value into said first predictive model;
in response to a ratio of the second energy value to the first energy value being greater than a first threshold, outputting the first prediction result indicating that the vehicle is in the outbound state;
in response to a ratio of the second energy value to the first energy value being less than a second threshold value, the second threshold value being less than the first threshold value, outputting the first prediction result indicating that the vehicle is in the inbound state;
outputting the first prediction result indicating that the vehicle is in the inter-station driving state in response to a ratio of the second energy value to the first energy value being less than or equal to the first threshold value and greater than or equal to the second threshold value (see the sketch below).
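Illustratively, this threshold rule of the first input unit could be sketched as follows (a non-limiting sketch; the threshold values are assumptions, the patent only requiring that the second threshold be less than the first):

```python
def first_prediction(first_energy, second_energy,
                     first_threshold=1.5, second_threshold=0.67):
    """Energy-ratio rule of the first prediction model.

    Threshold values are illustrative assumptions; only the ordering
    second_threshold < first_threshold is fixed by the description.
    """
    ratio = second_energy / first_energy
    if ratio > first_threshold:
        return "outbound"   # low-frequency energy rising: leaving the station
    if ratio < second_threshold:
        return "inbound"    # low-frequency energy falling: entering the station
    return "driving"        # otherwise: inter-station driving

print(first_prediction(1.0, 2.0))   # outbound
print(first_prediction(2.0, 1.0))   # inbound
print(first_prediction(1.0, 1.0))   # driving
```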
Optionally, the feature extraction module 1202 includes:
the framing processing unit is used for framing the audio data of the environmental sound to obtain m audio frames, wherein m is an integer greater than or equal to 2;
the first calculating unit is used for calculating the energy value of each audio frame in a preset frequency band, wherein the preset frequency band is a low-frequency band lower than a preset frequency;
the second calculating unit is used for calculating the first energy value according to the energy values corresponding to the first m/2 audio frames;
and the third calculating unit is used for calculating the second energy value according to the energy values corresponding to the last m/2 audio frames (see the sketch below).
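Illustratively, the framing and low-band energy computation could be sketched as follows (a non-limiting sketch; the frame length, cutoff frequency, FFT-based energy estimate, and handling of an odd frame count are assumptions):

```python
import numpy as np

def band_energy_halves(audio, sample_rate, frame_len=1024, cutoff_hz=500.0):
    """Frame the audio, compute each frame's energy below a preset cutoff
    frequency, and sum over the first and last halves of the frames.

    frame_len and cutoff_hz are illustrative assumptions; with an odd
    frame count m, the split at m // 2 is approximate.
    """
    m = len(audio) // frame_len
    frames = audio[:m * frame_len].reshape(m, frame_len)
    spectra = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    low = spectra[:, freqs < cutoff_hz].sum(axis=1)  # per-frame low-band energy
    first_energy = low[: m // 2].sum()   # first m/2 frames
    second_energy = low[m // 2:].sum()   # last m/2 frames
    return first_energy, second_energy

e1, e2 = band_energy_halves(np.random.randn(16000), 16000)
print(e1 > 0 and e2 > 0)  # True
```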
Optionally, the environmental sound feature is the time-frequency feature, and the feature extraction module 1202 includes:
the interval dividing unit is used for dividing the environment sound into i intervals, wherein i is an integer greater than or equal to 2;
the filtering processing unit is used for performing Mel filtering processing on the audio data of the environmental sound in each interval to obtain a Mel-frequency cepstrum coefficient (MFCC) feature matrix for each of the i sections of audio data;
and the merging unit is used for merging the MFCC feature matrices of the i sections of audio data to obtain the time-frequency feature (see the sketch below).
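Illustratively, the interval division, Mel filtering, and merging could be sketched as follows (a non-limiting sketch assuming the librosa library is available; the values of i and the number of MFCC coefficients are illustrative):

```python
import numpy as np
import librosa  # assumed available; any MFCC implementation would do

def time_frequency_feature(audio, sample_rate, i=4, n_mfcc=20):
    """Split the environmental sound into i intervals, compute an MFCC
    feature matrix per interval, and merge them into one time-frequency
    feature. i and n_mfcc are illustrative assumptions."""
    segments = np.array_split(audio, i)
    mfccs = [librosa.feature.mfcc(y=seg.astype(np.float32),
                                  sr=sample_rate, n_mfcc=n_mfcc)
             for seg in segments]
    return np.concatenate(mfccs, axis=1)  # merged MFCC feature matrices

feature = time_frequency_feature(np.random.randn(16000 * 4), 16000)
print(feature.shape)  # (20, total_frames)
```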
Optionally, the apparatus further comprises:
the sliding window processing module is used for performing sliding window processing on the acquired environmental sounds to obtain the environmental sounds in each audio window;
the state determining module 1204 includes:
and the state determining unit is used for determining the operating state indicated by the target prediction result as the target operating state in response to that the target prediction results corresponding to n consecutive audio windows are the same and the confidence coefficient of the target prediction results is higher than a confidence coefficient threshold value, wherein n is an integer greater than or equal to 2.
Optionally, the apparatus further comprises:
the acquisition module is used for acquiring, in response to the target operating state being the inbound state or the outbound state, the average duration for which the vehicle maintains the inter-station driving state;
the interval determining module is used for determining a next operation state prediction interval according to the average duration and the current moment;
and the function starting module is used for executing the step of collecting the environmental sound through the microphone in response to reaching the next operation state prediction interval.
Referring to fig. 13, a block diagram of a terminal 1300 according to an exemplary embodiment of the present application is shown. The terminal 1300 may be an electronic device in which an application is installed and run, such as a smart phone, a tablet computer, an electronic book, a portable personal computer, and the like. Terminal 1300 in the present application may include one or more of the following components: a processor 1320, a memory 1310, a screen 1330, and a microphone 1340.
Processor 1320 may include one or more processing cores. Processor 1320 connects various parts throughout terminal 1300 using various interfaces and lines, and performs the various functions of terminal 1300 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in memory 1310 and invoking data stored in memory 1310. Optionally, the processor 1320 may be implemented in hardware using at least one of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). Processor 1320 may integrate one or a combination of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, user interface, application programs, and the like; the GPU is responsible for rendering and drawing the content that the screen 1330 needs to display; and the modem is used to handle wireless communication. It can be appreciated that the modem may also be implemented by a separate communication chip without being integrated into the processor 1320.
The memory 1310 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). Optionally, the memory 1310 includes a non-transitory computer-readable medium. The memory 1310 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 1310 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, and an image playing function), instructions for implementing the above method embodiments, and the like; the operating system may be an Android system (including a system developed in depth on the basis of the Android system), an iOS system developed by Apple Inc. (including a system developed in depth on the basis of the iOS system), or another system. The data storage area may also store data created by terminal 1300 during use (e.g., phone book, audio and video data, chat log data), and the like.
The screen 1330 may be a capacitive touch display screen for receiving a user's touch operation on or near it using a finger, a stylus, or any other suitable object, as well as for displaying the user interfaces of various applications. The touch display screen is generally provided on the front panel of the terminal 1300. The touch display screen may be designed as a full screen, a curved screen, or an irregularly shaped screen; it may also be designed as a combination of a full screen and a curved screen, or a combination of an irregularly shaped screen and a curved screen, which is not limited in the embodiments of the present application.
The microphone 1340 may be a low power consumption microphone, and the microphone 1340 is used to collect environmental sounds when the terminal starts the station entering and exiting prediction function, or may be used to collect environmental sounds during voice communication. The microphone 1340 is generally disposed at an edge portion (e.g., lower edge) of one side of the display screen of the terminal, which is not limited in the embodiment of the present application.
In addition, those skilled in the art will appreciate that the structure of terminal 1300 illustrated in the above figures does not constitute a limitation on terminal 1300; a terminal may include more or fewer components than illustrated, combine certain components, or use a different arrangement of components. For example, the terminal 1300 further includes components such as a radio frequency circuit, a shooting component, a sensor, an audio circuit, a Wireless Fidelity (Wi-Fi) component, a power supply, and a Bluetooth component, which are not described herein again.
The embodiment of the present application further provides a computer-readable storage medium, which stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the method for predicting the vehicle operating state according to the above embodiments.
The embodiment of the present application further provides a computer program product, where at least one instruction is stored, and the at least one instruction is loaded and executed by the processor to implement the method for predicting the vehicle operating state according to the above embodiments.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in the embodiments of the present application may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable storage medium. Computer-readable storage media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (12)

1. A method for predicting an operating state of a vehicle, the method comprising:
collecting environmental sound through a microphone when in a vehicle;
extracting the characteristics of the environmental sound to obtain the environmental sound characteristics of the environmental sound, wherein the environmental sound characteristics comprise at least one of energy characteristics and time-frequency characteristics;
inputting the environmental sound characteristics into a preset prediction model to obtain a target prediction result output by the preset prediction model, wherein the preset prediction model comprises at least one of a first prediction model and a second prediction model, the input of the first prediction model is the energy characteristics, the input of the second prediction model is the time-frequency characteristics, the second prediction model is a classification model adopting a Recurrent Neural Network (RNN), the target prediction result is a predicted operation state of the vehicle, and the operation state comprises at least one of an entering state, an exiting state and an inter-station driving state;
and determining the target running state of the vehicle according to the target prediction result.
2. The method according to claim 1, wherein the inputting the environmental sound characteristic into a preset prediction model to obtain a target prediction result output by the preset prediction model comprises:
inputting the energy characteristics into the first prediction model to obtain a first prediction result output by the first prediction model;
in response to the first prediction result indicating that the vehicle is in the inter-station travel state, determining that the first prediction result is the target prediction result;
and in response to the first prediction result indicating that the vehicle is in the inbound state or the outbound state, inputting the time-frequency features into the second prediction model, obtaining a second prediction result output by the second prediction model, and determining the second prediction result as the target prediction result.
3. The method of claim 2, wherein the second prediction model comprises a gated recurrent unit (GRU) layer, an attention mechanism layer, a fully connected layer, and a classification layer;
the inputting the time-frequency characteristics into the second prediction model to obtain a second prediction result output by the second prediction model includes:
decomposing the time-frequency characteristics to obtain n time-frequency feature matrices, wherein the dimensions of the time-frequency feature matrices are the same;
extracting features of the n time-frequency feature matrices through the GRU layer and the attention mechanism layer to obtain a target feature matrix;
and classifying the target feature matrix through the fully connected layer and the classification layer to obtain the target prediction result.
4. The method of claim 3, wherein the extracting features of the n time-frequency feature matrices through the GRU layer and the attention mechanism layer to obtain a target feature matrix comprises:
inputting the n time-frequency feature matrices into the GRU layer to obtain a candidate feature matrix corresponding to each time-frequency feature matrix;
inputting the n candidate feature matrices into the attention mechanism layer to obtain a matrix weight corresponding to each candidate feature matrix, wherein the matrix weights are normalized;
and determining the target feature matrix according to the candidate feature matrices and the matrix weights.
5. The method according to any one of claims 2 to 4, wherein the duration of the environmental sound is t, the energy characteristics include a first energy value of the environmental sound in a first t/2 duration and a second energy value of the environmental sound in a second t/2 duration, t being a number greater than 0;
the inputting the energy characteristics into the first prediction model to obtain a first prediction result output by the first prediction model includes:
inputting said first energy value and said second energy value into said first predictive model;
in response to a ratio of the second energy value to the first energy value being greater than a first threshold, outputting the first prediction result indicating that the vehicle is in the outbound state;
in response to a ratio of the second energy value to the first energy value being less than a second threshold value, the second threshold value being less than the first threshold value, outputting the first prediction result indicating that the vehicle is in the inbound state;
outputting the first prediction result indicating that the vehicle is in the inter-station driving state in response to a ratio of the second energy value to the first energy value being less than or equal to the first threshold value and greater than or equal to the second threshold value.
6. The method according to claim 5, wherein the performing feature extraction on the environmental sound to obtain the environmental sound feature of the environmental sound comprises:
performing framing processing on the audio data of the environmental sound to obtain m audio frames, wherein m is an integer greater than or equal to 2;
calculating the energy value of each audio frame in a preset frequency band, wherein the preset frequency band is a low-frequency band lower than a preset frequency;
calculating the first energy value according to the energy values corresponding to the first m/2 audio frames;
and calculating the second energy value according to the energy values corresponding to the last m/2 audio frames.
7. The method according to any one of claims 1 to 4, wherein the environmental sound feature is the time-frequency feature, and the performing feature extraction on the environmental sound to obtain the environmental sound feature of the environmental sound includes:
dividing the environment sound into i intervals, wherein i is an integer greater than or equal to 2;
performing Mel filtering processing on the audio data of the environmental sound of each interval to obtain a Mel-frequency cepstrum coefficient (MFCC) feature matrix for each of the i sections of audio data;
and merging the MFCC feature matrixes of the i sections of audio data to obtain the time-frequency feature.
8. The method of any one of claims 1 to 4, wherein after the collecting environmental sound through the microphone, the method further comprises:
performing sliding window processing on the acquired environmental sounds to obtain the environmental sounds in each audio window;
the determining the target operation state of the vehicle according to the target prediction result comprises:
and in response to that the target prediction results corresponding to n consecutive audio windows are the same and the confidence of the target prediction results is higher than a confidence threshold, determining the operating state indicated by the target prediction results as the target operating state, wherein n is an integer greater than or equal to 2.
9. The method of any of claims 1 to 4, wherein after determining the target operating state of the vehicle based on the target prediction, the method further comprises:
in response to the target operating state being the inbound state or the outbound state, acquiring the average duration for which the vehicle maintains the inter-station driving state;
determining a next operation state prediction interval according to the average duration and the current moment;
and in response to reaching the next operation state prediction interval, performing the step of collecting environmental sound through the microphone.
10. An apparatus for predicting an operating state of a vehicle, the apparatus comprising:
the acquisition module is used for collecting environmental sound through the microphone when in a vehicle;
the characteristic extraction module is used for extracting the characteristics of the environmental sound to obtain the environmental sound characteristics of the environmental sound, wherein the environmental sound characteristics comprise at least one of energy characteristics and time-frequency characteristics;
the prediction module is used for inputting the environmental sound characteristics into a preset prediction model to obtain a target prediction result output by the preset prediction model, the preset prediction model comprises at least one of a first prediction model and a second prediction model, the input of the first prediction model is the energy characteristics, the input of the second prediction model is the time-frequency characteristics, the second prediction model is a classification model adopting a Recurrent Neural Network (RNN), the target prediction result is a predicted operation state of the vehicle, and the operation state comprises at least one of an entering state, an exiting state and an inter-station driving state;
and the state determining module is used for determining the target running state of the vehicle according to the target prediction result.
11. A terminal, characterized in that the terminal comprises a processor and a memory; the memory stores at least one instruction, and the at least one instruction is executed by the processor to implement the method for predicting an operating state of a vehicle according to any one of claims 1 to 9.
12. A computer-readable storage medium, wherein the storage medium stores at least one instruction, and the at least one instruction is executed by a processor to implement the method for predicting an operating state of a vehicle according to any one of claims 1 to 9.
CN202010120671.7A 2020-02-26 2020-02-26 Method, device, terminal and storage medium for predicting running state of vehicle Active CN111354371B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010120671.7A CN111354371B (en) 2020-02-26 2020-02-26 Method, device, terminal and storage medium for predicting running state of vehicle
PCT/CN2021/074718 WO2021169742A1 (en) 2020-02-26 2021-02-01 Method and device for predicting operating state of transportation means, and terminal and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010120671.7A CN111354371B (en) 2020-02-26 2020-02-26 Method, device, terminal and storage medium for predicting running state of vehicle

Publications (2)

Publication Number Publication Date
CN111354371A true CN111354371A (en) 2020-06-30
CN111354371B CN111354371B (en) 2022-08-05

Family

ID=71194007

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010120671.7A Active CN111354371B (en) 2020-02-26 2020-02-26 Method, device, terminal and storage medium for predicting running state of vehicle

Country Status (2)

Country Link
CN (1) CN111354371B (en)
WO (1) WO2021169742A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112225026A (en) * 2020-10-30 2021-01-15 江苏蒙哥马利电梯有限公司 Elevator maintenance method on demand based on acoustic signal
WO2021159987A1 (en) * 2020-02-11 2021-08-19 Oppo广东移动通信有限公司 Method and device for predicting operating state of vehicle, terminal, and storage medium
WO2021169742A1 (en) * 2020-02-26 2021-09-02 Oppo广东移动通信有限公司 Method and device for predicting operating state of transportation means, and terminal and storage medium
WO2021190145A1 (en) * 2020-03-25 2021-09-30 Oppo广东移动通信有限公司 Station identifying method and device, terminal and storage medium

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113807470B (en) * 2021-11-17 2022-02-25 腾讯科技(深圳)有限公司 Vehicle driving state determination method and related device
CN116052321B (en) * 2023-01-10 2023-07-14 中国人民解放军总医院第六医学中心 Intelligent outpatient service triage number calling method and device
CN116296377B (en) * 2023-05-10 2023-08-01 北京奔驰汽车有限公司 Industrial robot RV reducer fault prediction method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2264988A1 (en) * 2009-06-18 2010-12-22 Deutsche Telekom AG Method of detecting a current user activity and environment context of a user of a mobile phone using an accelerator sensor and a microphone, computer program product, and mobile phone
JP2014067092A (en) * 2012-09-24 2014-04-17 Toyota Motor Corp Vehicle detection device using traveling sound
WO2018155480A1 (en) * 2017-02-27 2018-08-30 ヤマハ株式会社 Information processing method and information processing device
CN108962243A (en) * 2018-06-28 2018-12-07 宇龙计算机通信科技(深圳)有限公司 arrival reminding method and device, mobile terminal and computer readable storage medium
CN110648553A (en) * 2019-09-26 2020-01-03 北京声智科技有限公司 Site reminding method, electronic equipment and computer readable storage medium
CN110660201A (en) * 2019-09-23 2020-01-07 Oppo广东移动通信有限公司 Arrival reminding method, device, terminal and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108305616B (en) * 2018-01-16 2021-03-16 国家计算机网络与信息安全管理中心 Audio scene recognition method and device based on long-time and short-time feature extraction
CN109036460B (en) * 2018-08-28 2020-01-07 百度在线网络技术(北京)有限公司 Voice processing method and device based on multi-model neural network
CN109360584A (en) * 2018-10-26 2019-02-19 平安科技(深圳)有限公司 Cough monitoring method and device based on deep learning
CN111325386B (en) * 2020-02-11 2023-07-07 Oppo广东移动通信有限公司 Method, device, terminal and storage medium for predicting running state of vehicle
CN111354371B (en) * 2020-02-26 2022-08-05 Oppo广东移动通信有限公司 Method, device, terminal and storage medium for predicting running state of vehicle

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2264988A1 (en) * 2009-06-18 2010-12-22 Deutsche Telekom AG Method of detecting a current user activity and environment context of a user of a mobile phone using an accelerator sensor and a microphone, computer program product, and mobile phone
JP2014067092A (en) * 2012-09-24 2014-04-17 Toyota Motor Corp Vehicle detection device using traveling sound
WO2018155480A1 (en) * 2017-02-27 2018-08-30 ヤマハ株式会社 Information processing method and information processing device
CN108962243A (en) * 2018-06-28 2018-12-07 宇龙计算机通信科技(深圳)有限公司 arrival reminding method and device, mobile terminal and computer readable storage medium
CN110660201A (en) * 2019-09-23 2020-01-07 Oppo广东移动通信有限公司 Arrival reminding method, device, terminal and storage medium
CN110648553A (en) * 2019-09-26 2020-01-03 北京声智科技有限公司 Site reminding method, electronic equipment and computer readable storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021159987A1 (en) * 2020-02-11 2021-08-19 Oppo广东移动通信有限公司 Method and device for predicting operating state of vehicle, terminal, and storage medium
WO2021169742A1 (en) * 2020-02-26 2021-09-02 Oppo广东移动通信有限公司 Method and device for predicting operating state of transportation means, and terminal and storage medium
WO2021190145A1 (en) * 2020-03-25 2021-09-30 Oppo广东移动通信有限公司 Station identifying method and device, terminal and storage medium
CN112225026A (en) * 2020-10-30 2021-01-15 江苏蒙哥马利电梯有限公司 Elevator maintenance method on demand based on acoustic signal
CN112225026B (en) * 2020-10-30 2022-05-24 江苏蒙哥马利电梯有限公司 Elevator maintenance method on demand based on acoustic signal

Also Published As

Publication number Publication date
CN111354371B (en) 2022-08-05
WO2021169742A1 (en) 2021-09-02

Similar Documents

Publication Publication Date Title
CN111354371B (en) Method, device, terminal and storage medium for predicting running state of vehicle
CN110660201B (en) Arrival reminding method, device, terminal and storage medium
CN111325386B (en) Method, device, terminal and storage medium for predicting running state of vehicle
CN107481718B (en) Audio recognition method, device, storage medium and electronic equipment
CN110880328B (en) Arrival reminding method, device, terminal and storage medium
CN108122556B (en) Method and device for reducing false triggering of voice wake-up instruction words of driver
CN107928673B (en) Audio signal processing method, audio signal processing apparatus, storage medium, and computer device
CN110600059B (en) Acoustic event detection method and device, electronic equipment and storage medium
US20220215853A1 (en) Audio signal processing method, model training method, and related apparatus
CN110853618A (en) Language identification method, model training method, device and equipment
CN110972112B (en) Subway running direction determining method, device, terminal and storage medium
US11670299B2 (en) Wakeword and acoustic event detection
WO2021115232A1 (en) Arrival reminding method and device, terminal, and storage medium
CN108899033B (en) Method and device for determining speaker characteristics
US11132990B1 (en) Wakeword and acoustic event detection
CN111862951B (en) Voice endpoint detection method and device, storage medium and electronic equipment
WO2023071768A1 (en) Station-arrival reminding method and apparatus, and terminal, storage medium and program product
US11776532B2 (en) Audio processing apparatus and method for audio scene classification
CN112397093B (en) Voice detection method and device
CN113330513A (en) Voice information processing method and device
CN117636909B (en) Data processing method, device, equipment and computer readable storage medium
CN112216286B (en) Voice wakeup recognition method and device, electronic equipment and storage medium
CN115132198B (en) Data processing method, device, electronic equipment, program product and medium
CN113724692B (en) Telephone scene audio acquisition and anti-interference processing method based on voiceprint features
CN118038863A (en) Awakening voice recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant