CN113984078A - Arrival reminding method, device, terminal and storage medium - Google Patents

Arrival reminding method, device, terminal and storage medium

Info

Publication number: CN113984078A
Authority: CN (China)
Prior art keywords: global, features, feature, sound, inertial sensor
Legal status: Granted; Active
Application number: CN202111249921.8A
Other languages: Chinese (zh)
Other versions: CN113984078B (granted publication)
Inventor: 刘文龙
Current Assignee: Shanghai Jinsheng Communication Technology Co., Ltd.
Original Assignee: Shanghai Jinsheng Communication Technology Co., Ltd.
Application filed by Shanghai Jinsheng Communication Technology Co., Ltd.; application granted as CN113984078B
Related application: PCT/CN2022/124453 (WO2023071768A1)

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/26 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network
    • G01C21/34 Route searching; Route guidance
    • G01C21/3453 Special cost functions, i.e. other than distance or default speed limit of road segments
    • G01C21/3461 Preferred or disfavoured areas, e.g. dangerous zones, toll or emission zones, intersections, manoeuvre types, segments such as motorways, toll roads, ferries
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/10 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration
    • G01C21/12 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning
    • G01C21/16 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning by integrating acceleration or speed, i.e. inertial navigation
    • G01C21/165 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 by using measurements of speed or acceleration executed aboard the object being navigated; Dead reckoning by integrating acceleration or speed, i.e. inertial navigation combined with non-inertial navigation instruments

Abstract

The embodiment of the application discloses an arrival reminding method and device, a terminal and a storage medium, belonging to the technical field of terminals. The method includes the following steps: acquiring environmental sound data in a target time period and inertial sensor data in the same target time period; performing feature extraction on the environmental sound data based on its time sequence to obtain a global sound feature; performing feature extraction on the inertial sensor data based on its time sequence to obtain a global inertial sensor feature; performing fusion processing on the global sound feature and the global inertial sensor feature based on a self-attention mechanism to obtain a fusion feature; acquiring traffic operation information based on the fusion feature; and performing arrival reminding based on the traffic operation information. This avoids the poor accuracy caused by external influences when the running state of the public transportation means is judged through a single-modality feature alone, and improves the accuracy of the arrival reminder.

Description

Arrival reminding method, device, terminal and storage medium
Technical Field
The present disclosure relates to the field of terminal technologies, and in particular, to an arrival reminding method and apparatus, a terminal, and a storage medium.
Background
At present, when taking public transportation means such as the subway, people need to pay attention to whether the current stop is the target station at which they need to get off. With the development of terminal technology, a terminal can provide an arrival reminding function that prompts the passenger to get off in time when the target station is reached.
In the related art, a terminal usually uses its embedded accelerometer to collect acceleration data and determines the acceleration state of the vehicle currently being ridden in real time from the values the accelerometer records. For example, if the terminal detects that the acceleration is greater than zero, the vehicle is determined to be in a starting stage; if the acceleration is less than zero, the vehicle is determined to be decelerating into a station. The terminal then combines the subway line map with the user's requirements to determine whether the user has arrived or needs to transfer, and issues an arrival or transfer reminder accordingly.
However, determining whether the subway has arrived by recording the acceleration direction with the terminal's accelerometer depends heavily on the attitude of the mobile phone, and it is difficult to accurately judge from the values recorded by the terminal's accelerometer whether the subway is accelerating or decelerating, so the arrival reminders are inaccurate.
Disclosure of Invention
The embodiment of the application provides an arrival reminding method, apparatus, terminal and storage medium, which can improve the accuracy of judging the running state of public transportation means and thereby improve the accuracy of the terminal's arrival reminders. The technical scheme is as follows:
in one aspect, an embodiment of the present application provides a station arrival reminding method, where the method includes:
acquiring environmental sound data in a target time period and inertial sensor data in the target time period;
based on the time sequence of the environmental sound data, carrying out feature extraction on the environmental sound data to obtain global sound features;
performing feature extraction on the inertial sensor data based on the time sequence of the inertial sensor data to obtain global inertial sensor features;
performing fusion processing on the global sound feature and the global inertial sensor feature based on a self-attention mechanism to obtain a fusion feature;
acquiring traffic operation information based on the fusion features; the traffic operation information is used for indicating the operation state of public transport means in the target time period;
and executing the arrival reminding based on the traffic operation information.
In another aspect, an embodiment of the present application provides an arrival reminding device, the device including:
the data acquisition module is used for acquiring environmental sound data in a target time period and inertial sensor data in the target time period;
the first feature extraction module is used for extracting features of the environmental sound data based on the time sequence of the environmental sound data to obtain global sound features;
the second feature extraction module is used for performing feature extraction on the inertial sensor data based on the time sequence of the inertial sensor data to obtain global inertial sensor features;
the feature fusion module is used for carrying out fusion processing on the global sound feature and the global inertial sensor feature based on a self-attention mechanism to obtain a fusion feature;
the information acquisition module is used for acquiring traffic operation information based on the fusion characteristics; the traffic operation information is used for indicating the operation state of public transport means in the target time period;
and the reminding module is used for executing the arrival reminding based on the traffic operation information.
In one possible implementation manner, the feature fusion module includes:
the feature splicing submodule is used for splicing the global sound feature and the global inertial sensor feature;
the weight obtaining submodule is used for processing the spliced global sound feature and global inertial sensor feature based on a self-attention mechanism to obtain respective attention weights of the global sound feature and the global inertial sensor feature;
and the feature fusion submodule is used for acquiring the fusion features based on the attention weights of the global sound features and the global inertial sensor features.
In one possible implementation manner, the global sound feature includes a global sound sub-feature corresponding to each of at least two time periods within the target time period; the global inertial sensor features comprise global inertial sensor sub-features corresponding to at least two time periods in the target time period respectively;
the feature splicing submodule comprises:
the splicing unit is used for splicing the global sound sub-features and the global inertial sensor sub-features corresponding to at least two time periods in the target time period; the dimension number of the global sound sub-features is the same as that of the global inertial sensor sub-features;
the weight obtaining submodule includes:
the weighting unit is used for processing the spliced global sound features and global inertial sensor features based on a self-attention mechanism to obtain respective attention weights of the global sound sub-features and the global inertial sensor sub-features;
the feature fusion submodule includes:
a fusion feature obtaining unit, configured to obtain the fusion feature based on the attention weight of each of the global sound sub-features and the attention weight of each of the global inertial sensor sub-features.
In one possible implementation manner, the fusion feature obtaining unit is configured to,
and carrying out weighted summation or weighted average on the global sound sub-features and the global inertial sensor sub-features based on the respective attention weights of the global sound sub-features and the respective attention weights of the global inertial sensor sub-features to obtain the fusion features.
In one possible implementation, the ambient sound data includes at least two audio data segments;
the first feature extraction module includes:
the first extraction submodule is used for respectively extracting audio features of at least two audio data segments to obtain respective Mel frequency cepstrum coefficient features of the at least two audio data segments;
the first local acquisition submodule is used for carrying out feature extraction on the Mel frequency cepstrum coefficient features of the at least two audio data segments to obtain the sound local features of the at least two audio data segments;
and the first global acquisition submodule is used for processing the respective sound local characteristics of the at least two audio data segments based on an attention mechanism according to the time domain sequence of the at least two audio data segments to obtain the global sound characteristics.
In one possible implementation manner, the first global obtaining sub-module includes:
the first weight acquisition unit is used for processing the local sound characteristics of the at least two audio data segments based on an attention mechanism according to the time domain sequence of the at least two audio data segments to obtain the attention weight of the at least two audio data segments;
and the first global acquisition unit is used for weighting the local sound characteristics of the at least two audio data segments based on the attention weights of the at least two audio data segments to obtain the global sound characteristics.
In one possible implementation, the inertial sensor data includes at least two sensor data segments;
the second feature extraction module includes:
the second local acquisition submodule is used for extracting the characteristics of at least two sensor data segments to obtain the respective sensor local characteristics of the at least two sensor data segments;
and the second global acquisition submodule is used for processing the local sensor characteristics of the at least two sensor data segments based on an attention mechanism according to the time domain sequence of the at least two sensor data segments to obtain the global inertial sensor characteristics.
In a possible implementation manner, the second global obtaining sub-module includes:
the second weight acquisition unit is used for processing the local sensor characteristics of the at least two sensor data segments based on an attention mechanism according to the time domain sequence of the at least two sensor data segments to obtain the attention weights of the at least two sensor data segments;
and the second global acquisition unit is used for weighting the local sensor characteristics of the at least two sensor data segments based on the attention weights of the at least two sensor data segments to obtain the global inertial sensor characteristics.
In one possible implementation, the traffic operation information is used to indicate whether an operation state of the public transportation means within the target time period is a stop state;
the reminding module comprises:
and the reminding sub-module is used for executing the arrival reminding under the condition that the traffic operation information indicates that the operation state of the public transport means in the target time period is a stop state.
In a possible implementation manner, the reminding sub-module includes:
a position acquisition unit configured to acquire a current position of a public transportation means in a case where the traffic operation information indicates that an operation state of the public transportation means within the target time period is a stopped state;
the reminding unit is used for executing arrival reminding under the condition that the current position of the public transport means is matched with the specified station on the target route; the designated station is a destination station or a transfer station on the target route.
In one possible implementation, the apparatus further includes:
the interface display module is used for displaying a route setting interface before acquiring the environmental sound data in a target time period and the inertial sensor data in the target time period;
and the target route obtaining module is used for predicting a route according to a starting station and a target station which are set in the route setting interface by the user or according to the historical movement track of the user to obtain the target route.
In a possible implementation manner, the reminding sub-module includes:
and the arrival reminding unit is used for executing the arrival reminding in the case that the traffic operation information acquired N consecutive times indicates that the operation state of the public transportation means is a stopped state.
In another aspect, an embodiment of the present application provides a terminal, where the terminal includes a processor and a memory; the memory has stored therein at least one computer instruction that is loaded and executed by the processor to implement the arrival alert method as described in the above aspect.
In another aspect, an embodiment of the present application provides a computer-readable storage medium, where at least one computer instruction is stored, and the computer instruction is loaded and executed by a processor to implement the arrival reminding method according to the above aspect.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the terminal reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the terminal executes the arrival reminding method provided in the various optional implementation modes of the above aspects.
The technical scheme provided by the embodiment of the application has the beneficial effects that at least:
the environment sound data and the inertial sensor data are collected in real time, time sequence related global feature extraction is carried out on the environment sound data and the inertial sensor data respectively, then fusion feature extraction is carried out by combining the relation among different modes based on the global sound feature and the global inertial sensor feature, the condition that the accuracy is poor due to external influence when the running state of the public transport means is judged only through a single mode feature is avoided, and the accuracy of the running state judgment of the public transport means is improved, so that the accuracy of the arrival reminding is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a diagram illustrating an application scenario in accordance with an exemplary embodiment;
FIG. 2 is a flow chart illustrating a method of arrival reminder according to an exemplary embodiment;
FIG. 3 is a flow chart illustrating a method of arrival reminder according to an exemplary embodiment;
FIG. 4 is a flow chart illustrating a method of arrival reminder according to another exemplary embodiment;
FIG. 5 is a flow chart of a Mel-frequency cepstrum coefficient extraction process according to the embodiment shown in FIG. 4;
FIG. 6 is a diagram of a classification model architecture according to the embodiment shown in FIG. 4;
FIG. 7 is a flowchart of an arrival judging method according to the embodiment shown in FIG. 4;
FIG. 8 is a block diagram of an arrival reminder apparatus according to an exemplary embodiment of the present application;
FIG. 9 is a block diagram illustrating a structure of a terminal according to an exemplary embodiment of the present application.
Specific embodiments of the present disclosure have been shown by the above drawings and are described in more detail below. These drawings and the written description are not intended to limit the scope of the disclosed concepts in any way, but rather to illustrate the concepts of the disclosure to those skilled in the art by reference to specific embodiments.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Reference herein to "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates an "or" relationship between the preceding and following associated objects.
The subsequent embodiments of the application provide an arrival reminding scheme that can remind a user of the arrival of a public transportation means at a station during daily rides.
Please refer to fig. 1, which illustrates a schematic diagram of an application scenario related to various embodiments of the present application. As shown in fig. 1, a microphone 101 and an inertial sensor 102 are built in a terminal 100. For example, the terminal 100 may be a smart phone, a tablet computer, an e-book reader, a personal portable computer, or the like.
Optionally, an application having a function of reminding the arrival may be installed in the terminal 100, and the application may perform the arrival reminding by combining the data collected by the microphone 101 and the inertial sensor 102.
For example, when the user carries the terminal 100 to take the public transportation means 120, if the application starts the arrival reminding function, the terminal 100 may collect the ambient sound data through the microphone 101 and collect the inertial sensor data through the inertial sensor 102, and the application determines whether to remind the user of arrival based on the ambient sound data and the inertial sensor data in combination with the route 140 of the public transportation means, and sends the arrival reminding to the user when determining that the arrival reminding needs to be performed.
Fig. 2 shows a flowchart of an arrival reminding method according to an exemplary embodiment of the present application. The method may be executed by a terminal having a sound collection function and an inertial sensor data collection function, for example, the terminal 100 in the application scenario shown in fig. 1. The arrival reminding method includes the following steps:
step 201, obtaining environmental sound data in a target time period and inertial sensor data in the target time period.
In the embodiment of the application, the terminal acquires environmental sound data in a target time period and inertial sensor data in the target time period.
For example, the terminal may collect the environmental sound data and the inertial sensor data according to a specified time period, each collection period collects the environmental sound data and the inertial sensor data within a specified time period, and the environmental sound data and the inertial sensor data within the target time period are data collected in one collection period, for example, the environmental sound data and the inertial sensor data collected in the latest collection period.
The environmental sound data can be collected through a microphone component of the terminal. An inertial sensor, which may also be referred to as an Inertial Measurement Unit (IMU), is a device that measures an object's three-axis attitude angles (or angular rates) and acceleration. Generally, an IMU includes three single-axis accelerometers and three single-axis gyroscopes: the accelerometers detect the acceleration signals of the object on three independent axes of the carrier coordinate system, and the gyroscopes detect the angular velocity signals of the carrier relative to the navigation coordinate system, so the IMU can measure the angular velocity and acceleration of the object in three-dimensional space.
Step 202, based on the time sequence of the environmental sound data, performing feature extraction on the environmental sound data to obtain global sound features.
In the embodiment of the application, when the terminal performs feature extraction on the environmental sound data based on the time sequence of the environmental sound data in the target time period, the global sound feature related to the time sequence can be obtained.
The global sound feature is obtained by global feature extraction based on the time sequence of the environmental sound data, so that the global sound feature has better representation on the environmental sound data.
And step 203, extracting the features of the inertial sensor data based on the time sequence of the inertial sensor data to obtain the global inertial sensor features.
In the embodiment of the application, when the terminal performs feature extraction on the inertial sensor data based on the time sequence of the inertial sensor data in the target time period, the global inertial sensor feature related to the time sequence can be obtained.
Similar to the global acoustic feature, the global inertial sensor feature is obtained by global feature extraction based on the time sequence of the inertial sensor data, and therefore the global inertial sensor feature has better representation of the inertial sensor data.
And 204, performing fusion processing on the global sound characteristic and the global inertial sensor characteristic based on a self-attention mechanism to obtain a fusion characteristic.
In the embodiment of the application, the terminal can perform feature fusion processing on the global features of the two modes, namely the global sound feature and the global inertial sensor feature, through a self-attention mechanism to obtain the feature fused by the two modes.
Because the self-attention mechanism can better extract the relationships between data of different modalities, the scheme shown in the embodiment of the application can exploit the relationship between the two modalities of data, namely the global sound feature and the global inertial sensor feature, thereby ensuring the feature fusion effect between the features of the two modalities and improving the accuracy of the traffic operation information subsequently acquired based on the fusion feature.
Step 205, acquiring traffic operation information based on the fusion features; the traffic operation information is used to indicate an operation state of the public transportation within the target time period.
In the embodiment of the present application, the above-described running state refers to a running state of the public transportation. For example, the operation state may include a constant speed driving state, a start acceleration state, a brake deceleration state, a parking state, and the like.
The terminal can predict the traffic operation information corresponding to the fusion characteristics by processing and analyzing the acquired fusion characteristics, so that the driving state of the public transport means in the target time period is determined.
And step 206, performing arrival reminding based on the traffic operation information.
In the embodiment of the application, the terminal can predict whether the terminal arrives at the station or not based on the running state indicated by the traffic running information, and determine whether to send the arrival reminding to the user or not based on the result of the arrival prediction.
Alternatively, the arrival reminding can be the arrival reminding of a target station (such as a destination station or a transfer station) in a route driven by the public transportation vehicle. That is, if the arrival or the future arrival of the public transportation is predicted based on the transportation operation information, an arrival reminder may be issued to the user.
For example, to prevent the user from missing the chance to get off because the interval between the terminal issuing the arrival reminder and the public transportation means closing its doors and departing for the next station is too short, the terminal may be set to issue an approaching-station prompt when arriving at the station before the target station, so that the user can prepare to get off in advance.
Optionally, the arrival reminding mode includes, but is not limited to: voice prompt, vibration prompt and interface prompt.
The station where the terminal is located can be determined in combination with the route map of the vehicle. For example, the terminal loads and stores in advance a route map of the vehicles in the current city, which includes the station information, transfer information, first and last train times, and maps near the stations of each route. Before the terminal starts to execute the arrival reminding method of the embodiment of the application, it may acquire the user's riding information, including the starting station, the target station, maps near the stations, the first and last train times, and the like.
To sum up, in the embodiment of the application, the environmental sound data and the inertial sensor data are collected in real time, time-sequence-related global feature extraction is performed on each of them, and a fusion feature is then extracted from the global sound feature and the global inertial sensor feature by combining the relationships between the features of the different modalities. This avoids the poor accuracy caused by external influences when the running state of the public transportation means is judged through a single-modality feature alone, improves the accuracy of the running-state judgment, and thereby improves the accuracy of the arrival reminder.
For example, taking the case that the traffic operation information is the start-stop state of the public transportation means, and the arrival reminder is issued at the destination station and transfer stations, the embodiment of the present application provides an arrival reminding method whose flow is shown in fig. 3. Before the terminal uses the arrival reminding function for the first time, step 301 is executed to store the public transportation route map. When the terminal starts the arrival reminding function, step 302 is executed first to determine the riding route. After the user boards the public transportation means, step 303 is executed: environmental sound is collected in real time through the microphone, and sensor data is collected through the terminal's inertial sensor. In step 304, the start-stop state of the public transportation means is judged from the collected environmental sound and sensor data, that is, whether the vehicle has stopped or is still running. If the vehicle is judged to be running, step 303 continues to be executed; if the vehicle is judged to have stopped, it is determined that it has entered a station, and step 305 is executed: combining the riding route and the number of stations already passed, it is judged whether this station is the destination station. If it is the destination station, step 306 is executed to send an arrival prompt. If it is not the destination station, step 307 is executed to judge whether it is a transfer station: if so, step 308 is executed to send a transfer prompt; if not, execution returns to step 303.
In this method, the start-stop state of the public transportation means is judged by combining the sound collected on the vehicle with the data collected by the terminal's inertial sensor, and the target route is then combined to determine the station at which the vehicle is located when it is in the stopped state, so as to issue the arrival reminder. This avoids inaccurate start-stop judgments caused by changes in terminal attitude when the judgment is made with inertial sensor data alone; the features of the two kinds of data complement each other, which improves the robustness of the start-stop judgment algorithm.
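To make the flow above concrete, the following Python sketch mirrors the FIG. 3 loop. Every name in it (collect_window, classify_stopped, station_at, notify, and the route object's attributes) is a hypothetical placeholder for illustration; the application defines the flow, not this interface.

```python
def arrival_reminder_loop(route, collect_window, classify_stopped,
                          station_at, notify):
    """Schematic of the FIG. 3 flow (steps 303-308); all callbacks are
    hypothetical placeholders. A practical version would also require N
    consecutive 'stopped' results before counting a station, as a later
    embodiment describes."""
    stations_passed = 0
    was_stopped = False
    while True:
        sound, imu = collect_window()           # step 303: one window of mic + IMU data
        stopped = classify_stopped(sound, imu)  # step 304: start-stop judgment
        if stopped and not was_stopped:         # newly stopped -> entered a station
            stations_passed += 1
            station = station_at(route, stations_passed)   # step 305
            if station == route.destination_station:
                notify("arrival reminder")      # step 306
                return
            if station in route.transfer_stations:
                notify("transfer reminder")     # steps 307-308
        was_stopped = stopped                   # otherwise keep collecting (step 303)
```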
Fig. 4 shows a flowchart of an arrival reminding method according to an exemplary embodiment of the present application. The method may be executed by a terminal having a sound collection function and an inertial sensor data collection function, for example, the terminal 100 in the application scenario shown in fig. 1. The arrival reminding method includes the following steps:
in step 401, a target route of a public transportation means is acquired.
In a possible implementation manner, the target route of the public transportation vehicle may be set by a user, for example, the terminal may display a route setting interface, and obtain the target route according to a starting station and a destination station set in the route setting interface by the user.
That is, in this embodiment of the application, the terminal may display a route setting interface on the application program interface, and generate a target route from the starting station to the destination station by receiving the starting station and the destination station set in the route setting interface by the user.
In a possible implementation manner, when the terminal displays the route setting interface, the position information of the user can be acquired in real time, and the starting station is determined according to the position information of the user at present.
Alternatively, the terminal may determine the starting station according to a selection operation of the starting station in the route setting interface by the user. Similarly, the terminal may determine the destination station according to a selection operation of the destination station by the user in the route setting interface.
After the start station and the target station are obtained, the terminal may obtain at least one route passing through the start station and the target station in sequence based on a prestored route map of the public transportation means, and determine the target route from the at least one route. The terminal can automatically recommend one of the at least one route and determine the automatically recommended route as the target route, or the terminal can display the at least one route on the interface and determine the target route by receiving the selection operation of the user.
For example, when the target route is determined based on automatic recommendation, the number of stops between the starting station and the target station in the at least one route may be obtained, and the route with the smallest number of stops may be determined as the target route, or the predicted travel time taken by the public transportation from the starting station to the target station through the at least one route may be obtained, and the route with the shortest predicted travel time may be determined as the target route.
Optionally, when the user uses the payment application to take a vehicle by swiping a card, the terminal may confirm that the user has entered or is about to enter the public transportation vehicle, and at this time, the arrival reminding function may be turned on.
That is, after the user starts the application program for realizing the arrival reminding, the starting station and the destination station can be input by means of manual input of the user, and an appropriate route is selected as the target route.
In another possible implementation manner, the terminal may also perform route prediction based on the behavior habits of the user to determine the target route.
The terminal predicts the route according to the historical movement track of the user and can obtain the target route. The historical movement track may be a movement track of the user within a specified time counted by the terminal.
That is, the terminal matches each trajectory in the historical movement tracks against the complete route map of the public transportation means to obtain the routes covered by each track on the complete route map. If a specified route appears among the covered routes and the proportion of its occurrences among all covered routes is greater than a specified threshold, the specified route is determined as the target route. For example, suppose there are 3 historical movement tracks: movement track A covers the route from station a to station b on the complete route map, movement track B covers the route from station b to station c, and movement track C covers the route from station a to station b. Because both movement track A and movement track C cover the route from station a to station b, and its proportion among all covered routes is greater than 1/2, the route from station a to station b is determined as the target route.
In addition, the route prediction is carried out according to the historical movement track of the user, and the route prediction can also be carried out in a machine learning model prediction mode.
For example, the historical movement trajectory is input into the target route prediction model, and the target route is output by the target route prediction model. The target route prediction model may be a neural network model trained based on historical movement trajectory samples and route labels.
For example, the terminal may obtain a movement trajectory of the user within a predetermined time (for example, a week or a month) before the current time, and obtain a target route of the public transportation means on which the user is to ride by performing statistical analysis or machine learning model prediction on the movement trajectory of the user.
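As a minimal sketch of the coverage-share matching described above (the function name and the route encoding are illustrative assumptions, not part of this application), the dominant route can be selected as follows:

```python
from collections import Counter

def predict_target_route(covered_routes, threshold=0.5):
    """covered_routes holds one covered-route identifier per historical
    movement track, e.g. ["a-b", "b-c", "a-b"] for tracks A, B and C.
    Returns the route whose share of all covered routes exceeds threshold,
    or None if no route is dominant enough."""
    counts = Counter(covered_routes)
    route, freq = counts.most_common(1)[0]
    return route if freq / len(covered_routes) > threshold else None

# Worked example from the description: tracks A and C both cover the route
# from station a to station b, giving a share of 2/3 > 1/2.
print(predict_target_route(["a-b", "b-c", "a-b"]))  # -> "a-b"
```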
Step 402, ambient sound data within a target time period and inertial sensor data within the target time period are acquired.
In the embodiment of the application, after the arrival reminding function is started by the application program, the terminal can acquire audio and sensor data in real time according to a certain period to obtain the environmental sound data in a target time period and the inertial sensor data in the target time period.
In a possible implementation manner, the terminal may perform the step of acquiring the traffic operation information once after acquiring the ambient sound data and the inertial sensor data within a target time period.
For example, the terminal may acquire the ambient sound data and the inertial sensor data acquired within 2s at a time as the data acquired within the target time period.
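Using the sampling rates given later in this embodiment (16 kHz for environmental sound, 200 Hz for the inertial sensor), the buffers for one 2 s target time period and their division into 200 ms segments can be sketched as follows; the array shapes are illustrative assumptions only:

```python
import numpy as np

AUDIO_SR = 16_000   # environmental sound sampling rate, Hz
IMU_SR = 200        # inertial sensor sampling rate, Hz
WINDOW_S = 2.0      # one target time period, seconds
SEGMENT_S = 0.2     # one data segment, 200 ms

# One acquisition period: 32000 audio samples and 400 IMU samples
# (6 IMU channels: 3-axis accelerometer + 3-axis gyroscope).
audio = np.zeros(int(AUDIO_SR * WINDOW_S))            # shape (32000,)
imu = np.zeros((int(IMU_SR * WINDOW_S), 6))           # shape (400, 6)

# Split each stream into 10 consecutive 200 ms segments.
audio_segments = audio.reshape(10, int(AUDIO_SR * SEGMENT_S))  # (10, 3200)
imu_segments = imu.reshape(10, int(IMU_SR * SEGMENT_S), 6)     # (10, 40, 6)
print(audio_segments.shape, imu_segments.shape)
```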
Step 403, performing audio feature extraction on the at least two audio data segments respectively to obtain respective mel-frequency cepstrum coefficient features of the at least two audio data segments.
In the embodiment of the application, the environmental sound data in the target time period includes at least two audio data segments, and the terminal performs audio feature extraction on each of the at least two audio data segments to obtain the mel-frequency cepstrum coefficient features corresponding to each of them.
In a possible implementation manner, since the terminal microphone collects the environmental sound data in real time, the data is not stationary as a whole, but a short portion of it can be regarded as stationary; therefore, the environmental sound data in the target time period is framed to obtain at least two audio data segments arranged in continuous time order within the target time period.
The sampling frequency of the environmental sound data can be 16kHz, the sampling frequency of the inertial sensor data can be 200Hz, and the sampling frequency of the environmental sound data is far higher than that of the inertial sensor data, so that the terminal can perform preliminary feature extraction on the environmental sound data, and the features of the environmental sound data in the target time period can be matched with the features of the inertial sensor data in the target time period in number.
In a possible implementation manner, the terminal may perform preliminary feature extraction on at least two audio data segments to obtain a preliminary audio feature, where the preliminary audio feature includes Mel-Frequency Cepstral Coefficients (MFCCs) of each audio data segment.
Fig. 5 is a flowchart of extraction of mel-frequency cepstrum coefficients according to an embodiment of the present application. As shown in fig. 5, the extraction of mel-frequency cepstrum coefficients may include the following steps:
First, the audio data segment is pre-emphasized by the pre-emphasis module 501. Pre-emphasis may use a high-pass filter, which passes only the signal components above a certain frequency and suppresses those below it, so as to remove unnecessary low-frequency interference such as human speech, footsteps and mechanical noise from the audio data segment and flatten its spectrum. The transfer function of the high-pass filter is:
H(z) = 1 − a·z⁻¹
where a is a correction coefficient, generally ranging from 0.95 to 0.97, and z is the z-transform variable; in the time domain, this corresponds to y(n) = x(n) − a·x(n−1) applied to the samples x(n) of the audio data segment.
The audio data segment with the noise removed is subjected to framing processing by the framing windowing module 502 to obtain audio data corresponding to different audio frames.
Illustratively, in this embodiment, audio data containing 512 data points may be taken as one frame; at a sampling frequency of 16 kHz, one frame of audio data lasts 32 ms. To avoid excessive change between two adjacent frames of data and to avoid losing data at the two ends of each audio frame after windowing, after each frame is taken, the window slides backwards by 16 ms before the next frame is taken, i.e., two adjacent frames of data overlap by 16 ms.
Discrete Fourier transform needs to be performed on the framed audio data during subsequent feature extraction, but one frame of audio data has no obvious periodicity, i.e., its left and right ends are discontinuous, so the Fourier transform introduces errors relative to the original data, and the more frames are divided, the larger the error. To make the framed audio data continuous, with each frame of audio data exhibiting the features of a periodic function, the scheme shown in the embodiment of the application performs windowing through the framing and windowing module 502.
In one possible implementation, the terminal may employ a Hamming window to window the audio frames: the audio data obtained by multiplying each frame of audio data by the Hamming window function has obvious periodicity. The Hamming window function has the form:
w(n) = 0.54 − 0.46·cos(2πn / M)
where n is an integer with a value range of 0 to M, and M is the number of Fourier transform points; illustratively, 512 data points are taken as the number of Fourier transform points in this embodiment.
In a possible implementation manner, since it is difficult to obtain the signal characteristics of an audio signal from its variation in the time domain, the time-domain signal usually needs to be converted into an energy distribution in the frequency domain for processing. The terminal therefore first inputs the audio frame data into the Fourier transform module 503 for Fourier transform, and then inputs the transformed audio frame data into the energy spectrum calculation module 504 to calculate its energy spectrum. In order to convert the energy spectrum into a mel spectrum conforming to human hearing, the energy spectrum needs to be input into the mel filtering processing module 505 for filtering; the mel scale on which the filtering is based is:
mel(f) = 2595 · log10(1 + f / 700)
where f is a frequency point after the Fourier transform.
After obtaining the mel spectrum of the audio frame, the terminal takes the logarithm of the mel spectrum and applies a Discrete Cosine Transform (DCT) through the DCT module 506; the resulting DCT coefficients are the MFCC features.
Illustratively, 64-dimensional MFCC features may be selected in the embodiment of the present application. When the terminal actually extracts the features, the input window length of the audio data may be selected as 200 ms, the duration of one frame of signal is 32 ms, and there is a 16 ms overlap between two adjacent frames of data, so each 200 ms input window corresponds to a generated feature matrix of 12 × 64.
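The following numpy sketch chains modules 501-506 together under the parameters above (16 kHz sampling, 512-point frames with a 16 ms hop, 64 mel filters). It is an illustrative reconstruction under those stated assumptions, not the reference implementation of this application:

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    # Mel scale used by module 505: mel(f) = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular mel filters mapping an n_fft-point power spectrum to n_filters bands."""
    hz = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * hz / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc(segment, sr=16_000, n_fft=512, hop=256, n_mels=64, a=0.97):
    """MFCC features for one 200 ms audio data segment (modules 501-506)."""
    # 501: pre-emphasis, y(n) = x(n) - a * x(n - 1)
    y = np.append(segment[0], segment[1:] - a * segment[:-1])
    # 502: framing (32 ms frames, 16 ms overlap) and Hamming windowing,
    #      w(n) = 0.54 - 0.46 * cos(2 * pi * n / M)
    window = 0.54 - 0.46 * np.cos(2.0 * np.pi * np.arange(n_fft) / n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    # 503-504: Fourier transform and energy (power) spectrum
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # 505: mel filtering, then take the logarithm
    log_mel = np.log(power @ mel_filterbank(n_mels, n_fft, sr).T + 1e-10)
    # 506: DCT of the log-mel spectrum; the DCT coefficients are the MFCCs
    return dct(log_mel, type=2, axis=1, norm="ortho")

features = mfcc(np.random.randn(3200))  # one 200 ms segment at 16 kHz
print(features.shape)  # (11, 64) without padding; the embodiment cites 12 x 64
```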
Step 404, performing feature extraction on the mel-frequency cepstrum coefficient features of the at least two audio data segments to obtain the local sound features of the at least two audio data segments.
In this embodiment of the application, at least two audio data segments are respectively subjected to the above-mentioned MFCC feature extraction to obtain respective corresponding MFCC features, and local feature extraction is respectively performed on the respective corresponding MFCC features of the at least two audio data segments to obtain respective corresponding sound local features of the at least two audio data segments.
In a possible implementation manner, the terminal may perform local feature extraction on the MFCC features corresponding to each audio data segment through the first convolutional neural network, so as to obtain the sound local features of each audio data segment.
The first convolutional neural network is a Convolutional Neural Network (CNN) used to extract the local features of each audio data segment; performing local feature extraction with a convolutional neural network can remove redundant feature information from the MFCC features of each audio data segment.
For example, if the target time period is 2s and each audio data segment is 200ms, the ambient sound data collected in the target time period may be divided into 10 audio data segments of 200ms, and the MFCC features extracted from each audio data segment are respectively subjected to local feature extraction by the first convolutional neural network to obtain the sound local features corresponding to each audio data segment.
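A minimal PyTorch sketch of such a first convolutional neural network follows; the layer sizes and the local feature dimension N = 128 are assumptions for illustration, since the application does not fix an architecture:

```python
import torch
import torch.nn as nn

class AudioLocalCNN(nn.Module):
    """Maps one segment's MFCC matrix (12 x 64, treated as a one-channel
    image) to an N-dimensional sound local feature; sizes are illustrative."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                        # 12 x 64 -> 6 x 32
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((3, 8)),           # -> 32 x 3 x 8
        )
        self.fc = nn.Linear(32 * 3 * 8, feat_dim)

    def forward(self, x):                           # x: (segments, 1, 12, 64)
        return self.fc(self.conv(x).flatten(1))     # -> (segments, feat_dim)

# 10 MFCC matrices from the 10 segments of a 2 s target time period:
mfcc_batch = torch.randn(10, 1, 12, 64)
local_sound_features = AudioLocalCNN()(mfcc_batch)  # shape (10, 128), i.e. 10 x N
```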
Step 405, processing the respective sound local features of the at least two audio data segments based on a self-attention mechanism according to the time domain sequence of the at least two audio data segments, and obtaining a global sound feature.
In the embodiment of the application, the terminal performs self-attention processing on the local sound features corresponding to the audio data segments according to the sequence from first to last in the time domain to obtain a global sound feature corresponding to the environmental sound data in the target time segment.
That is, the global sound feature is a global feature obtained by fusing the respective sound local features of the respective audio data segments based on the time sequence relationship between the respective audio data segments.
In a possible implementation manner, the terminal may process local sound features of at least two audio data segments according to a time-domain sequence of the at least two audio data segments based on a self-attention mechanism, so as to obtain attention weights of the at least two audio data segments; then, based on the attention weight of each of the at least two audio data segments, the local sound features of each of the at least two audio data segments are weighted to obtain global sound features.
The terminal can input the local sound features of at least two audio data segments into the first self-attention network according to the time domain sequence, and the first self-attention network extracts the time sequence related global features to obtain the global sound features.
Illustratively, if the target time period is 2 s and each audio data segment is 200 ms, the terminal inputs the sound local features corresponding to the 10 audio data segments output by the first convolutional neural network into the first self-attention network in time-domain order; the first self-attention network determines the attention weight corresponding to each audio data segment based on the self-attention mechanism, and weights the sound local features of the 10 audio data segments with their corresponding attention weights to obtain the global sound feature.
Based on the attention weight of each of the at least two audio data segments, the local sound features of each of the at least two audio data segments can be subjected to weighted summation processing or weighted splicing processing, so as to obtain the global sound features.
For example, taking three sound local features A, B and C, and assuming that the attention weights obtained by the self-attention mechanism are (0.2, 0.6, 0.2): if the 3 sound local features are weighted and summed with their corresponding attention weights, the global sound feature D = A × 0.2 + B × 0.6 + C × 0.2.
In another possible implementation manner, the terminal may also multiply and concatenate the attention weight corresponding to each audio data segment with the local sound feature corresponding to each audio data segment, so as to serve as the global sound feature.
For example, when the sound local features are A, B and C and the attention weights obtained by the self-attention mechanism are (0.2, 0.6, 0.2), the global sound feature D = (A × 0.2, B × 0.6, C × 0.2).
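Both variants can be sketched with a simple scoring network standing in for the first self-attention network. The linear scoring scheme here is an assumption; the application only states that a self-attention mechanism yields the per-segment attention weights:

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Scores each segment's local feature, softmax-normalizes the scores
    into attention weights, and combines the features; a minimal stand-in
    for the first (or second) self-attention network."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, feats):                     # feats: (segments, feat_dim)
        weights = torch.softmax(self.score(feats).squeeze(-1), dim=0)
        weighted = weights.unsqueeze(-1) * feats  # per-segment weighted features
        fused_sum = weighted.sum(dim=0)           # weighted-summation variant
        fused_concat = weighted.flatten()         # weighted-concatenation variant
        return weights, fused_sum, fused_concat

# The worked example above: weights (0.2, 0.6, 0.2) over features A, B, C.
A, B, C = torch.ones(4), 2 * torch.ones(4), 3 * torch.ones(4)
w = torch.tensor([0.2, 0.6, 0.2])
D = (w.unsqueeze(-1) * torch.stack([A, B, C])).sum(0)  # A*0.2 + B*0.6 + C*0.2
```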
And 406, performing feature extraction on the at least two sensor data segments to obtain respective sensor local features of the at least two sensor data segments.
In the embodiment of the application, the inertial sensor data acquired in the target time period includes at least two sensor data segments, and local feature extraction is performed on the at least two sensor data segments respectively, so that sensor local features corresponding to the at least two sensor data segments can be obtained.
In a possible implementation manner, the terminal may perform local feature extraction on each sensor data segment through the second convolutional neural network to obtain a sensor local feature of each sensor data segment.
The second convolutional neural network is used for extracting local features of each sensor data segment, and redundant feature information in each sensor data segment can be removed by utilizing the convolutional neural network to extract the local features.
For example, if the target time period is 2s and each sensor data segment is 200ms, inertial sensor data acquired in the target time period may be divided into 10 sensor data segments of 200ms, and each sensor data segment is subjected to local feature extraction by the second convolutional neural network, so as to obtain a sensor local feature corresponding to each sensor data segment.
Step 407, processing the sensor local features of the at least two sensor data segments based on an attention-free mechanism according to the time domain sequence of the at least two sensor data segments, so as to obtain global inertial sensor features.
In the embodiment of the application, the terminal performs self-attention processing on the sensor local features corresponding to the sensor data segments in order from first to last in the time domain to obtain a global inertial sensor feature corresponding to the inertial sensor data in the target time period.
That is, the global inertial sensor feature is a global feature obtained by fusing the local sensor features of the respective sensor data segments based on the time sequence relationship between the respective sensor data segments.
In a possible implementation manner, the terminal may process the sensor local features of the at least two sensor data segments based on a self-attention mechanism according to a time domain sequence of the at least two sensor data segments, so as to obtain attention weights of the at least two sensor data segments; then, based on the attention weight of each of the at least two sensor data segments, the local sensor features of each of the at least two sensor data segments are weighted to obtain global inertial sensor features.
The terminal can input the local sensor features of the at least two sensor data segments into the second self-attention network according to the time domain sequence, and the second self-attention network extracts the time sequence related global features to obtain the global inertial sensor features.
Illustratively, if the target time period is 2 s and each sensor data segment is 200 ms, the terminal inputs the sensor local features corresponding to the 10 sensor data segments output by the second convolutional neural network into the second self-attention network in time-domain order; the second self-attention network determines the attention weight corresponding to each sensor data segment based on the self-attention mechanism, and weights the sensor local features of the 10 sensor data segments with their corresponding attention weights to obtain the global inertial sensor feature.
Based on the attention weight of each of the at least two sensor data segments, the local features of the inertial sensors of the at least two sensor data segments can be subjected to weighted summation processing or weighted splicing processing, so as to obtain the global inertial sensor features.
For example, taking three inertial sensor local features X, Y and Z, and assuming that the attention weights obtained by the self-attention mechanism are (0.1, 0.3, 0.6): if the 3 inertial sensor local features are weighted and summed with their corresponding attention weights, the global inertial sensor feature W = X × 0.1 + Y × 0.3 + Z × 0.6.
In another possible implementation manner, the terminal may also multiply and splice the attention weight corresponding to each inertial sensor data segment with the sensor local feature corresponding to each inertial sensor data segment, so as to use the multiplied attention weight as the global inertial sensor feature.
For example, when the local feature of the inertial sensor is X, Y, Z, the attention weight obtained by the self-attention mechanism is (0.1, 0.3, 0.6), and the global inertial sensor feature W is (X × 0.1, Y × 0.3, Z × 0.6).
Step 408, the global acoustic features and the global inertial sensor features are spliced.
In the embodiment of the application, the terminal performs feature splicing on the global sound feature and the global inertial sensor feature to obtain the spliced global feature.
In one possible implementation, the global sound feature includes a global sound sub-feature corresponding to each of at least two time periods within the target time period; the global inertial sensor features include global inertial sensor sub-features corresponding to at least two time periods within the target time period. And splicing the global sound sub-features and the global inertial sensor sub-features corresponding to at least two time periods in the target time period.
For example, if the global sound feature is D and the global inertial sensor feature is W, the spliced global feature may be (D, W); if D is (A × 0.2, B × 0.6, C × 0.2), the global sound sub-features may be A × 0.2, B × 0.6, and C × 0.2; if W is (X × 0.1, Y × 0.3, Z × 0.6), the global inertial sensor sub-features may be X × 0.1, Y × 0.3, and Z × 0.6.
Wherein the number of dimensions of the global sound sub-features is the same as the number of dimensions of the global inertial sensor sub-features.
In one possible implementation, the respective number of dimensions of the global sound sub-features is determined by the output feature dimension of the first self-attention network, and the respective number of dimensions of the global inertial sensor sub-features is determined by the output feature dimension of the second self-attention network.
For example, the first convolutional neural network extracts the local sound feature of each audio data segment. If the feature dimension of the local sound feature is N and the target time period contains 10 audio data segments, the first convolutional neural network outputs a 10 × N local sound feature matrix formed by the 10 local features; this is input into the first self-attention network, which extracts a 10 × N global sound feature matrix. Similarly, the second convolutional neural network extracts the sensor local feature of each sensor data segment; if the feature dimension of the sensor local feature is N and the target time period contains 10 sensor data segments, it outputs a 10 × N sensor local feature matrix, which is input into the second self-attention network to extract a 10 × N global inertial sensor feature matrix. The 10 × N global sound feature matrix and the 10 × N global inertial sensor feature matrix are stacked by rows to obtain a 20 × N spliced global feature matrix, which is input into the third self-attention network to finally obtain a 20 × N fusion feature matrix, where 20 is the time-step dimension, that is, the fusion feature consists of 20 N-dimensional feature vectors.
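The shape bookkeeping in this example can be checked in a few lines of Python (a sketch only; N = 64 is an assumed feature dimension, and zeros stand in for real features):

```python
import numpy as np

N = 64
global_sound = np.zeros((10, N))   # 10 x N global sound features
global_imu = np.zeros((10, N))     # 10 x N global inertial sensor features

spliced = np.vstack([global_sound, global_imu])  # stack by rows -> (20, N)
assert spliced.shape == (20, N)
# the third self-attention network then maps (20, N) -> (20, N) fusion features,
# i.e. 20 N-dimensional vectors along the time dimension
```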
Step 409, processing the spliced global sound features and global inertial sensor features based on a self-attention mechanism to obtain respective attention weights of the global sound features and the global inertial sensor features, and acquiring the fusion features based on these attention weights.
In the embodiment of the application, the terminal processes the spliced global features through a self-attention mechanism to obtain the attention weights corresponding to the two modalities in the global features, namely the global sound feature and the global inertial sensor feature. The terminal then performs feature fusion on the global features of the two modalities based on these attention weights to obtain the fusion feature.
In one possible implementation manner, the spliced global sound features and global inertial sensor features are processed based on the self-attention mechanism to obtain the respective attention weights of the global sound sub-features and of the global inertial sensor sub-features; the fusion feature is then acquired based on these attention weights.
The global sound sub-feature corresponding to each time segment may be a feature obtained after the local sound feature corresponding to each audio data segment is processed based on the self-attention mechanism. The global inertial sensor sub-feature corresponding to each time segment may be a feature obtained after the sensor local feature corresponding to each sensor data segment is processed based on the self-attention mechanism.
Through a third self-attention network, the spliced global sound feature and global inertial sensor feature can capture the relation between the different modalities, and global feature extraction is performed, taking the influence of the time sequence on the fusion feature into consideration, to obtain the fusion feature.
In a possible implementation manner, the terminal performs weighted summation or weighted average on the global sound feature and the global inertial sensor feature based on the attention weights of the global sound feature and the global inertial sensor feature, so as to obtain a fusion feature.
For example, based on the self-attention mechanism, it may be determined that the attention weights corresponding to the global sound feature and the global inertial sensor feature are 0.2 and 0.8, respectively. If the spliced global feature is (D, W), the fusion feature obtained by weighted summation is D × 0.2 + W × 0.8, and the fusion feature obtained by weighted averaging is E = (D × 0.2 + W × 0.8)/2.
In one possible implementation, the fusion feature is obtained by performing weighted summation or weighted average on the global sound sub-feature and the global inertial sensor sub-feature based on the attention weights of the global sound sub-feature and the global inertial sensor sub-feature, respectively.
For example, if the spliced global feature is (A, B, X, Y), the attention weights corresponding to the global sound sub-features and the global inertial sensor sub-features may be determined, based on the self-attention mechanism, to be 0.2, 0.3, 0.1, and 0.5, respectively. The fusion feature obtained by multiplying each sub-feature by its attention weight and summing is A × 0.2 + B × 0.3 + X × 0.1 + Y × 0.5; the weighted average result, obtained by multiplying by the respective attention weights, summing, and averaging, is E = (A × 0.2 + B × 0.3 + X × 0.1 + Y × 0.5)/4. In this case, the resulting fusion feature can be expressed as (E, E, E, E).
In another possible implementation manner, the terminal may also multiply the global sound feature and the global inertial sensor feature by their respective attention weights, and concatenate the two weighted features to serve as the fusion feature.
For example, based on the self-attention mechanism, it may be determined that the attention weights corresponding to the global sound feature and the global inertial sensor feature are 0.2 and 0.8, respectively; if the spliced global feature is (D, W), the fusion feature is (D × 0.2, W × 0.8).
For example, if the spliced global features are (A × 0.2, B × 0.6, C × 0.2, X × 0.1, Y × 0.3, Z × 0.6), with modality attention weights of 0.2 for the sound features and 0.8 for the inertial sensor features, the fusion feature obtained by multiplying each sub-feature by its modality weight and concatenating is (A × 0.2 × 0.2, B × 0.6 × 0.2, C × 0.2 × 0.2, X × 0.1 × 0.8, Y × 0.3 × 0.8, Z × 0.6 × 0.8).
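A compact way to express the three fusion variants described above is sketched below (illustrative only; `fuse` and its arguments are hypothetical names, and the modality weights 0.2/0.8 come from the example):

```python
import numpy as np

def fuse(sound, imu, w_sound, w_imu, mode="concat"):
    # sound, imu: global features of the two modalities
    # w_sound, w_imu: modality-level attention weights
    if mode == "sum":
        return w_sound * sound + w_imu * imu
    if mode == "avg":
        return (w_sound * sound + w_imu * imu) / 2
    return np.concatenate([w_sound * sound, w_imu * imu])

D = np.array([0.2, 0.6, 0.2])  # global sound sub-features from the example
W = np.array([0.1, 0.3, 0.6])  # global inertial sensor sub-features
fused = fuse(D, W, 0.2, 0.8)   # concatenation variant -> (D x 0.2, W x 0.8)
```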
Optionally, in this embodiment of the application, when the terminal performs feature fusion on the global features of the two modalities based on the attention weights respectively corresponding to the two modalities, the terminal may further perform feature fusion in combination with the smoothing parameter of the environmental sound data.
In a possible implementation manner, for at least two audio data segments in the environmental sound data, the terminal may obtain respective volume averages of the at least two audio data segments; acquiring the volume mean value of the environmental sound data based on the volume mean value of each of the at least two audio data segments; and then, according to the volume average value of the environment sound data and the volume average value of each of the at least two audio data segments, obtaining a smoothing parameter of the volume of the environment sound data, wherein the smoothing parameter is used for indicating the smoothing degree of the volume of the environment sound data. Before feature fusion is carried out on the global features of the two modes, the terminal can obtain an adjustment coefficient of the global sound features according to a smoothing parameter of the volume of the environment sound data, the adjusted global sound features are obtained by multiplying the adjustment coefficient and the global sound features, and the adjusted global sound features and the global inertial sensor features can be spliced when feature fusion is carried out on the global features of the two modes subsequently.
The adjustment coefficient is negatively correlated with the smoothing parameter of the volume of the environmental sound data; that is, the smoother the volume of the environmental sound data, the smaller the smoothing parameter and the larger the adjustment coefficient; correspondingly, the larger the smoothing parameter of the volume of the environmental sound data, the smaller the adjustment coefficient. Optionally, the smoothing parameter may be a parameter such as a standard deviation or a variance, which indicates the degree of dispersion of a data set.
Considering that irregular environmental noises, such as sudden noises, are usually generated in the vehicle cabin during the operation of a public transportation vehicle and may affect the accuracy of the global sound features, in the solution shown in the embodiment of the present application the terminal may suppress or enhance the global sound features according to the smoothness of the volume of the environmental sound data before fusing the global sound features and the global inertial sensor features, so as to dynamically adjust the proportion of the global sound features in the fusion features. For example, a higher smoothing parameter of the environmental sound data indicates that the environmental sound data contains more irregular noise, which has a greater influence on the subsequent arrival detection; in this case, the global sound features may be suppressed by a smaller adjustment coefficient (e.g., 0.9) to reduce their proportion in the fusion features. Conversely, a lower smoothing parameter of the environmental sound data indicates that the environmental sound data contains less irregular noise, with a smaller influence on the subsequent arrival detection; in this case, the global sound features may be enhanced by a larger adjustment coefficient (e.g., 1.1) to increase their proportion in the fusion features.
By performing feature fusion in combination with the smoothing parameter of the environmental sound data, the proportion of the global sound features in the fusion features can be flexibly adjusted, further improving the accuracy of the subsequent arrival reminding judgment.
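A hedged sketch of this smoothing-parameter logic follows (the standard deviation is used as the smoothing parameter, as the text suggests; the threshold and the coefficients 0.9/1.1 are the illustrative values from the example, not prescribed values):

```python
import numpy as np

def adjustment_coefficient(audio_segments, threshold=0.1, suppress=0.9, enhance=1.1):
    """audio_segments: list of 1-D arrays, the audio data segments in the target time period."""
    seg_means = np.array([np.abs(s).mean() for s in audio_segments])  # per-segment volume mean
    smoothing = seg_means.std()  # dispersion of segment volumes around the overall mean
    # negatively correlated mapping: a smoother volume (small std) gives a larger coefficient
    return suppress if smoothing > threshold else enhance

# adjusted_global_sound = adjustment_coefficient(segments) * global_sound_feature
```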
Step 410, acquiring traffic operation information based on the fusion features.
Wherein the traffic operation information is used to indicate an operation state of the public transportation means within the target time period.
In a possible implementation manner, the terminal can classify the fusion features through a fully-connected network and a classifier, output the traffic operation information of the public transportation means, and determine the operation state of the public transportation means within the target time period.
For example, fig. 6 is a diagram of a classification model architecture according to an embodiment of the present application. As shown in fig. 6, the classification model is stored in the terminal and is used for determining the operation state of the public transportation means based on the environmental sound data and the inertial sensor data collected by the terminal. The classification model includes a first convolutional network layer 61, a second convolutional network layer 62, a first self-attention network layer 63, a second self-attention network layer 64, a third self-attention network layer 65, a fully-connected network layer 66 and a classifier 67. The first convolutional network layer 61 is used for performing local feature extraction on the environmental sound data: the 2s of environmental sound data is divided into 10 audio data segments of 200ms, the MFCC features of the 10 audio data segments are computed by a feature extraction module, and the extracted MFCC features are sequentially input into the first convolutional network layer 61 for local feature extraction to obtain the sound local features corresponding to each audio data segment; the sound local features are then input into the first self-attention network layer 63 in time domain order, and the time sequence-related global feature extraction of the first self-attention network layer 63 yields the global sound features. Meanwhile, the second convolutional network layer 62 is configured to perform local feature extraction on the inertial sensor data: the 2s of inertial sensor data is divided into 10 sensor data segments of 200ms, which are sequentially input into the second convolutional network layer 62 for local feature extraction to obtain the inertial sensor local features corresponding to each sensor data segment; the inertial sensor local features are then input into the second self-attention network layer 64 in time domain order, and the time sequence-related global feature extraction of the second self-attention network layer 64 yields the global inertial sensor features. The global inertial sensor features and the global sound features obtained within the 2s are input into the third self-attention network layer 65, which performs self-attention weight distribution over the multiple modalities to extract global features; the third self-attention network layer 65 outputs the fusion features, which are input into the fully-connected network layer 66 and the classifier 67, and the traffic operation information corresponding to the fusion features is output.
The classifier 67 may adopt classifiers based on different algorithms, such as a Support Vector Machine (SVM), a decision tree classification algorithm, a binary classification model, and the like.
In a possible implementation manner, the training process of the classification model may be as follows: the model training equipment acquires sample environmental sound data and sample inertial sensor data within a sample time period; the sample environmental sound data and the sample inertial sensor data are input into the classification model to obtain the predicted traffic operation information output by the classification model; a loss function value is obtained based on the predicted traffic operation information and the traffic operation information label corresponding to the sample environmental sound data and the sample inertial sensor data; and the model parameters of the classification model are updated based on the loss function value.
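For orientation, the following PyTorch sketch mirrors the topology of fig. 6 and the training step above. It is an assumption-laden illustration, not the patent's implementation: the layer sizes, head count, channel counts (12 MFCC bins, 6 IMU channels, 64-dimensional features), and the use of `nn.MultiheadAttention` for the three self-attention networks are all illustrative choices.

```python
import torch
import torch.nn as nn

class SelfAttentionLayer(nn.Module):
    """One self-attention layer over a sequence of segment features."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (batch, seq, dim)
        out, _ = self.attn(x, x, x)
        return out

class ArrivalClassifier(nn.Module):
    def __init__(self, mfcc_bins=12, imu_ch=6, dim=64, n_cls=2):
        super().__init__()
        # per-segment local feature extractors (1-D CNNs; sizes are illustrative)
        self.audio_cnn = nn.Sequential(
            nn.Conv1d(mfcc_bins, dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1))
        self.imu_cnn = nn.Sequential(
            nn.Conv1d(imu_ch, dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1))
        self.audio_attn = SelfAttentionLayer(dim)   # first self-attention network
        self.imu_attn = SelfAttentionLayer(dim)     # second self-attention network
        self.fusion_attn = SelfAttentionLayer(dim)  # third self-attention network
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(20 * dim, n_cls))

    def forward(self, mfcc, imu):
        # mfcc: (batch, 10, mfcc_bins, frames); imu: (batch, 10, imu_ch, samples)
        b = mfcc.size(0)
        a = self.audio_cnn(mfcc.flatten(0, 1)).squeeze(-1).view(b, 10, -1)
        s = self.imu_cnn(imu.flatten(0, 1)).squeeze(-1).view(b, 10, -1)
        g = torch.cat([self.audio_attn(a), self.imu_attn(s)], dim=1)  # (b, 20, dim)
        return self.head(self.fusion_attn(g))

# one illustrative training step on random stand-in data
model = ArrivalClassifier()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
mfcc = torch.randn(8, 10, 12, 64)    # 10 segments of 12 x 64 MFCC features each
imu = torch.randn(8, 10, 6, 200)     # 10 segments of 6-axis inertial sensor data
labels = torch.randint(0, 2, (8,))   # stopped / not-stopped labels
loss = nn.functional.cross_entropy(model(mfcc, imu), labels)
opt.zero_grad(); loss.backward(); opt.step()
```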
Step 411, performing arrival reminding based on the traffic operation information.
In the embodiment of the application, the terminal determines the running condition of the public transportation means in the target time period according to the acquired traffic running information in the target time period, and performs the arrival reminding based on the acquired running condition of the public transportation means.
The traffic operation information here is used for indicating whether the operation state of the public transportation means within the target time period is a stopped state. In the case that the traffic operation information indicates that the operation state within the target time period is a stopped state, the arrival reminding is performed.
In one possible implementation, in a case where the traffic operation information indicates that the operation state of the public transportation is a stopped state within the target time period, the terminal may acquire the current position of the public transportation; in the case where the current position of the public transportation matches the designated station on the target route, the arrival reminder is executed.
Wherein the designated site may be a destination site or a transfer site on the target route.
When acquiring the current position of the public transportation means, if the public transportation means is a rail transportation means such as a subway or a high-speed rail, the terminal may determine the station corresponding to the current position by combining the stops made up to the current time during the ride.
Illustratively, in response to detecting that the public transportation means stops for the ith time (for example, the public transportation means is considered to have stopped if the duration for which it is detected to be in the stopped state is greater than or equal to a first threshold), the terminal determines that the station where the public transportation means is located on the target line is the ith station after the starting station; in response to the ith station or the (i+1)th station being a designated station, the terminal determines the station type of the designated station, where the station type includes destination station and transfer station; acquires reminding information corresponding to the station type based on the station type of the designated station; and performs the arrival reminding based on the reminding information.
Or, the terminal may also acquire the current position of the public transportation means when detecting that the operation state of the public transportation means is the stop state, and determine a stop corresponding to the current position of the public transportation means by combining the current position of the public transportation means and the positions of the respective stops in the target route.
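Either strategy can be folded into a small helper like the one below (a sketch with hypothetical names and data structures; the route is assumed to be an ordered list of stations after the starting station, with a separate map of station types):

```python
def arrival_reminder(stop_count, route, designated, station_types):
    """route: ordered station names after the starting station (route[0] is the 1st stop).
    station_types: e.g. {"StationC": "transfer", "StationF": "destination"} (illustrative)."""
    current = route[min(stop_count - 1, len(route) - 1)]   # ith stop -> ith station
    upcoming = route[min(stop_count, len(route) - 1)]      # the (i+1)th station
    for station in (current, upcoming):
        if station == designated:
            kind = station_types.get(station, "destination")
            return f"Approaching {station}: {kind} reminder"
    return None

print(arrival_reminder(2, ["StationA", "StationB", "StationC"], "StationC",
                       {"StationC": "transfer"}))  # transfer reminder at the 3rd station
```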
When acquiring the current position of the public transportation means, the terminal can also acquire the current position of the terminal through a positioning system, or the current position of the terminal can be determined through inertial sensor data.
Since a public transportation means may stop at a non-station position (for example, a bus may stop at a red light, and even a rail transportation means may stop at a non-station position due to route scheduling or a route failure), the station at which the public transportation means is located cannot always be accurately determined from the number of stops alone. To address this, the method in the embodiment of the present application may further obtain the current position of the terminal through a positioning system (for example, through satellite positioning, cellular network positioning, wireless access point positioning, and the like). Optionally, a position that cannot be located by the positioning system may exist on the target route of the public transportation means (for example, in an underground track without signal coverage); at this time, the terminal may determine its moving track through the inertial sensor data starting from the terminal position last obtained by the positioning system, and determine the current position of the terminal by combining the map data built into the terminal with the moving track.
In one possible implementation manner, the arrival reminding is executed in the case that the traffic operation information acquired for N times continuously indicates that the operation state of the public transportation means is the stop state.
Fig. 7 is a flowchart of an arrival judging method according to an embodiment of the present application. As shown in fig. 7, step 71 is first executed to collect the environmental sound data within the target time period. Step 72 is then executed to perform audio feature preprocessing on the environmental sound data and extract the MFCC features of each audio data segment corresponding to the environmental sound data, namely the 12 × 64 feature matrix corresponding to each 200ms window of data in the above embodiment. Step 73 is then executed: the extracted MFCC features of each audio data segment and the collected inertial sensor data are input into the convolutional neural networks, and multi-modal feature fusion is performed by the self-attention networks. The environmental sound data and the inertial sensor data both span 2s; since the self-attention network is suited to extracting time sequence features, the 2s of data are each split into 10 data segments of 200ms, each 200ms segment being an independent frame. The CNN extracts the local features of each independent frame, forming 10 local features, which are then input in time sequence into the self-attention network to extract the overall features of the data. This combines the respective advantages of the CNN and the self-attention network: the CNN is better at extracting local features, while the self-attention network is better at extracting global time sequence features. After the fusion features are obtained, whether the subway is in a stopped state within the target time period is judged; if so, step 74 is executed to continuously detect the operation state of the public transportation means. For example, a subway generally stops at a station for more than 20s, while the data window length of the input model in the embodiment of the present application is 2s, so the operation state can be detected continuously; if the subway is detected to be in a stopped state 5 consecutive times, the subway is considered to have arrived at a station.
That is, if the public transportation means needs to be continuously in the stopped state for 20s or more when entering a station, and the target time period in the above embodiment is 2s, then in the case that the traffic operation information obtained 5 consecutive times indicates that the operation state of the public transportation means is the stopped state, whether the current station is the designated station is determined; if the current station is determined to be the designated station, the arrival reminding corresponding to the designated station is performed.
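The N-consecutive-windows rule can be sketched as a tiny state machine (illustrative only; `ArrivalDetector` is a hypothetical name and N = 5 follows the example above):

```python
class ArrivalDetector:
    """Counts consecutive windows classified as 'stopped'; fires after n_required."""
    def __init__(self, n_required=5):
        self.n_required = n_required
        self.consecutive = 0

    def update(self, is_stopped: bool) -> bool:
        self.consecutive = self.consecutive + 1 if is_stopped else 0
        return self.consecutive >= self.n_required

detector = ArrivalDetector()
states = [True, True, False, True, True, True, True, True]  # one flag per 2s window
for s in states:
    if detector.update(s):
        print("considered arrived at a station")  # fires on the 5th consecutive stop
```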
For example, if the terminal judges that the user has arrived at the destination station, it reminds the user that the station is approaching and to prepare to get off, and may push map information near the designated station; if the terminal judges that the user has arrived at the transfer station, it reminds the user to transfer and may also present the first-departure and last-departure time information of the transfer vehicle.
To sum up, in the embodiment of the present application, the environmental sound data and the inertial sensor data are collected in real time, time sequence-related global feature extraction is performed on each of them, and feature extraction combining the relation between the different modalities is then performed based on the global sound features and the global inertial sensor features, thereby avoiding the poor accuracy caused by external interference when the operation state of the public transportation means is judged through a single modality feature alone.
Fig. 8 shows a block diagram of a station arrival reminding device according to an exemplary embodiment of the present application. The arrival reminding device is used in a terminal, and comprises:
a data obtaining module 810, configured to obtain environmental sound data in a target time period and inertial sensor data in the target time period;
a first feature extraction module 820, configured to perform feature extraction on the environmental sound data based on a time sequence of the environmental sound data to obtain a global sound feature;
a second feature extraction module 830, configured to perform feature extraction on the inertial sensor data based on a time sequence of the inertial sensor data to obtain a global inertial sensor feature;
a feature fusion module 840, configured to perform fusion processing on the global sound feature and the global inertial sensor feature based on a self-attention mechanism to obtain a fusion feature;
an information obtaining module 850, configured to obtain traffic operation information based on the fusion feature; the traffic operation information is used for indicating the operation state of public transport means in the target time period;
and the reminding module 860 is used for executing the arrival reminding based on the traffic operation information.
In one possible implementation manner, the feature fusion module 840 includes:
the feature splicing submodule is used for splicing the global sound feature and the global inertial sensor feature;
the weight obtaining submodule is used for processing the spliced global sound feature and global inertial sensor feature based on a self-attention mechanism to obtain respective attention weights of the global sound feature and the global inertial sensor feature;
and the feature fusion submodule is used for acquiring the fusion features based on the attention weights of the global sound features and the global inertial sensor features.
In one possible implementation manner, the global sound feature includes a global sound sub-feature corresponding to each of at least two time periods within the target time period; the global inertial sensor features comprise global inertial sensor sub-features corresponding to at least two time periods in the target time period respectively;
the feature splicing submodule comprises:
the splicing unit is used for splicing the global sound sub-features and the global inertial sensor sub-features corresponding to at least two time periods in the target time period; the dimension number of the global sound sub-features is the same as that of the global inertial sensor sub-features;
the weight obtaining submodule includes:
the weighting unit is used for processing the spliced global sound features and global inertial sensor features based on a self-attention mechanism to obtain respective attention weights of the global sound sub-features and the global inertial sensor sub-features;
the feature fusion submodule includes:
a fusion feature obtaining unit, configured to obtain the fusion feature based on the attention weight of each of the global sound sub-features and the attention weight of each of the global inertial sensor sub-features.
In one possible implementation manner, the fusion feature obtaining unit is configured to perform weighted summation or weighted averaging on the global sound sub-features and the global inertial sensor sub-features based on the respective attention weights of the global sound sub-features and the respective attention weights of the global inertial sensor sub-features, so as to obtain the fusion features.
In one possible implementation, the ambient sound data includes at least two audio data segments;
the first feature extraction module 820 includes:
the first extraction submodule is used for respectively extracting audio features of at least two audio data segments to obtain respective Mel frequency cepstrum coefficient features of the at least two audio data segments;
the first local acquisition submodule is used for carrying out feature extraction on the Mel frequency cepstrum coefficient features of the at least two audio data segments to obtain the sound local features of the at least two audio data segments;
and the first global acquisition submodule is used for processing the respective sound local characteristics of the at least two audio data segments based on an attention mechanism according to the time domain sequence of the at least two audio data segments to obtain the global sound characteristics.
In one possible implementation manner, the first global obtaining sub-module includes:
the first weight acquisition unit is used for processing the local sound characteristics of the at least two audio data segments based on an attention mechanism according to the time domain sequence of the at least two audio data segments to obtain the attention weight of the at least two audio data segments;
and the first global acquisition unit is used for weighting the local sound characteristics of the at least two audio data segments based on the attention weights of the at least two audio data segments to obtain the global sound characteristics.
In one possible implementation, the inertial sensor data includes at least two sensor data segments;
the second feature extraction module 830 includes:
the second local acquisition submodule is used for extracting the characteristics of at least two sensor data segments to obtain the respective sensor local characteristics of the at least two sensor data segments;
and the second global acquisition submodule is used for processing the local sensor characteristics of the at least two sensor data segments based on an attention mechanism according to the time domain sequence of the at least two sensor data segments to obtain the global inertial sensor characteristics.
In a possible implementation manner, the second global obtaining sub-module includes:
the second weight acquisition unit is used for processing the local sensor characteristics of the at least two sensor data segments based on an attention mechanism according to the time domain sequence of the at least two sensor data segments to obtain the attention weights of the at least two sensor data segments;
and the second global acquisition unit is used for weighting the local sensor characteristics of the at least two sensor data segments based on the attention weights of the at least two sensor data segments to obtain the global inertial sensor characteristics.
In one possible implementation, the traffic operation information is used to indicate whether an operation state of the public transportation means within the target time period is a stop state;
the reminder module 860 includes:
and the reminding sub-module is used for executing the arrival reminding under the condition that the traffic operation information indicates that the operation state of the public transport means in the target time period is a stop state.
In a possible implementation manner, the reminding sub-module includes:
a position acquisition unit configured to acquire a current position of a public transportation means in a case where the traffic operation information indicates that an operation state of the public transportation means within the target time period is a stopped state;
the reminding unit is used for executing arrival reminding under the condition that the current position of the public transport means is matched with the specified station on the target route; the designated station is a destination station or a transfer station on the target route.
In one possible implementation, the apparatus further includes:
the interface display module is used for displaying a route setting interface before acquiring the environmental sound data in a target time period and the inertial sensor data in the target time period;
and the target route obtaining module is used for predicting a route according to a starting station and a target station which are set in the route setting interface by the user or according to the historical movement track of the user to obtain the target route.
In a possible implementation manner, the reminding sub-module includes:
and the arrival reminding unit is used for executing arrival reminding under the condition that the traffic operation information acquired for N times continuously indicates that the operation state of the public transport means is a stop state.
To sum up, in the embodiment of the present application, the environmental sound data and the inertial sensor data are collected in real time, time sequence-related global feature extraction is performed on each of them, and feature extraction combining the relation between the different modalities is then performed based on the global sound features and the global inertial sensor features, thereby avoiding the poor accuracy caused by external interference when the operation state of the public transportation means is judged through a single modality feature alone.
Fig. 9 is a block diagram illustrating a structure of a terminal according to an exemplary embodiment of the present application. The terminal may be an electronic device on which an application is installed and running, such as a smart phone, a tablet computer, an e-book reader, or a portable personal computer. The terminal in the present application may include one or more of the following components: a processor 910, a memory 920, and a screen 930.
Processor 910 may include one or more processing cores. The processor 910 connects various parts within the entire terminal using various interfaces and lines, and performs various functions of the terminal and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 920 and calling data stored in the memory 920. Optionally, the processor 910 may be implemented in hardware using at least one of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 910 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, the application programs, and the like; the GPU is responsible for rendering and drawing the content that the screen 930 needs to display; and the modem is used to handle wireless communications. It is understood that the modem may also not be integrated into the processor 910 but be implemented by a separate communication chip.
The memory 920 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). Optionally, the memory 920 includes a non-transitory computer-readable medium. The memory 920 may be used to store instructions, programs, code sets, or instruction sets. The memory 920 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, an image playing function, and the like), instructions for implementing the various method embodiments described above, and the like; the operating system may be an Android system (including systems developed in depth on the basis of the Android system), an iOS system developed by Apple Inc. (including systems developed in depth on the basis of the iOS system), or another system. The data storage area may also store data created by the terminal in use, such as a phonebook, audio and video data, chat log data, and the like.
The screen 930 may be a capacitive touch display screen for receiving a touch operation of a user on or near the screen using a finger, a stylus, or any other suitable object, and displaying a user interface of each application. The touch display screen is generally provided at a front panel of the terminal. The touch display screen may be designed as a full-face screen, a curved screen, or a profiled screen. The touch display screen can also be designed to be a combination of a full-face screen and a curved-face screen, and a combination of a special-shaped screen and a curved-face screen, which is not limited in the embodiment of the present application.
In addition, those skilled in the art will appreciate that the terminal configurations illustrated in the above figures do not constitute limitations on the terminal: the terminal may include more or fewer components than those illustrated, some components may be combined, or a different arrangement of components may be used. For example, the terminal may further include a radio frequency circuit, a camera component, sensors, an audio circuit, a Wireless Fidelity (WiFi) component, a power supply, a Bluetooth component, and other components, which are not described herein again.
The embodiment of the present application further provides a computer-readable storage medium, in which at least one computer instruction is stored, and the at least one computer instruction is loaded and executed by a processor to implement the arrival reminding method according to the above embodiments.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the terminal reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the terminal executes the arrival reminding method provided in the various optional implementation modes of the above aspects.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in the embodiments of the present application may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable storage medium. Computer-readable storage media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (16)

1. An arrival reminding method, characterized in that the method comprises:
acquiring environmental sound data in a target time period and inertial sensor data in the target time period;
based on the time sequence of the environmental sound data, carrying out feature extraction on the environmental sound data to obtain global sound features;
performing feature extraction on the inertial sensor data based on the time sequence of the inertial sensor data to obtain global inertial sensor features;
performing fusion processing on the global sound features and the global inertial sensor features based on a self-attention mechanism to obtain fusion features;
acquiring traffic operation information based on the fusion features; the traffic operation information is used for indicating the operation state of public transport means in the target time period;
and executing the arrival reminding based on the traffic operation information.
2. The method according to claim 1, wherein the fusion processing of the global sound feature and the global inertial sensor feature based on the self-attention mechanism to obtain a fusion feature comprises:
stitching the global sound features and the global inertial sensor features;
processing the spliced global sound features and global inertial sensor features based on a self-attention mechanism to obtain respective attention weights of the global sound features and the global inertial sensor features;
and acquiring the fusion feature based on the attention weights of the global sound feature and the global inertial sensor feature.
3. The method of claim 2, wherein the global sound feature comprises a global sound sub-feature corresponding to each of at least two time periods within the target time period; the global inertial sensor features comprise global inertial sensor sub-features corresponding to at least two time periods in the target time period respectively;
the stitching the global sound feature and the global inertial sensor feature includes:
splicing the global sound sub-features and the global inertial sensor sub-features corresponding to at least two time periods in the target time period; the dimension number of the global sound sub-features is the same as that of the global inertial sensor sub-features;
the processing the spliced global sound feature and the global inertial sensor feature based on the self-attention mechanism to obtain respective attention weights of the global sound feature and the global inertial sensor feature includes:
processing the spliced global sound features and global inertial sensor features based on a self-attention mechanism to obtain respective attention weights of the global sound sub-features and respective attention weights of the global inertial sensor sub-features;
the obtaining the fusion feature based on the attention weights of the global sound feature and the global inertial sensor feature comprises:
and acquiring the fusion feature based on the attention weight of the global sound sub-feature and the attention weight of the global inertial sensor sub-feature.
4. The method of claim 3, wherein obtaining the fused features based on the respective attention weights of the global sound sub-features and the respective attention weights of the global inertial sensor sub-features comprises:
and carrying out weighted summation or weighted average on the global sound sub-features and the global inertial sensor sub-features based on the respective attention weights of the global sound sub-features and the respective attention weights of the global inertial sensor sub-features to obtain the fusion features.
5. The method of claim 1, wherein the ambient sound data comprises at least two audio data segments; the extracting the features of the environmental sound data based on the time sequence of the environmental sound data to obtain the global sound features comprises:
respectively extracting audio features of at least two audio data segments to obtain respective Mel frequency cepstrum coefficient features of the at least two audio data segments;
carrying out feature extraction on the Mel frequency cepstrum coefficient features of the at least two audio data segments to obtain the sound local features of the at least two audio data segments;
and processing the local sound characteristics of the at least two audio data segments based on a self-attention mechanism according to the time domain sequence of the at least two audio data segments to obtain the global sound characteristics.
6. The method of claim 5, wherein the obtaining the global sound feature by processing the local sound feature of each of the at least two audio data segments based on a self-attention mechanism according to a time-domain order of the at least two audio data segments comprises:
processing the local sound characteristics of the at least two audio data segments based on an attention mechanism according to the time domain sequence of the at least two audio data segments to obtain the attention weight of the at least two audio data segments;
and weighting the local sound characteristics of the at least two audio data segments based on the attention weights of the at least two audio data segments to obtain the global sound characteristics.
7. The method of claim 1, wherein the inertial sensor data comprises at least two sensor data segments; the extracting features of the inertial sensor data based on the time sequence of the inertial sensor data to obtain global inertial sensor features comprises:
performing feature extraction on at least two sensor data segments to obtain respective sensor local features of the at least two sensor data segments;
and processing the local sensor characteristics of the at least two sensor data segments based on a self-attention mechanism according to the time domain sequence of the at least two sensor data segments to obtain the global inertial sensor characteristics.
8. The method of claim 7, wherein the obtaining the global inertial sensor characteristics by processing sensor local characteristics of each of at least two of the sensor data segments based on a self-attention mechanism in a time-domain order of the at least two of the sensor data segments comprises:
processing the local sensor characteristics of the at least two sensor data segments based on an attention mechanism according to the time domain sequence of the at least two sensor data segments to obtain the attention weight of the at least two sensor data segments;
and weighting the local sensor features of the at least two sensor data segments based on the attention weights of the at least two sensor data segments to obtain the global inertial sensor feature.
9. The method according to any one of claims 1 to 8, wherein the transportation operation information is used to indicate whether an operation state of a public transportation means within the target time period is a stopped state;
the executing of the arrival reminding based on the traffic operation information comprises:
and in the case that the traffic operation information indicates that the operation state of the public transport means in the target time period is a stop state, performing arrival reminding.
10. The method according to claim 9, wherein in the case that the traffic operation information indicates that the operation state of the public transportation means in the target time period is a stop state, performing the arrival reminding comprises:
acquiring the current position of the public transport means under the condition that the traffic operation information indicates that the operation state of the public transport means in the target time period is a stop state;
executing a station arrival reminder in the case that the current position of the public transportation means matches a designated station on a target route; the designated station is a destination station or a transfer station on the target route.
11. The method of claim 10, wherein prior to acquiring the ambient sound data over a target time period and the inertial sensor data over the target time period, further comprising:
displaying a route setting interface;
and predicting a route according to a starting station and a destination station set in the route setting interface by the user or according to the historical movement track of the user to obtain the target route.
12. The method according to claim 9, wherein in the case that the traffic operation information is used for indicating that the operation state of the public transportation means in the target time period is a stop state, the performing of the arrival reminding comprises:
and under the condition that the traffic running information acquired for N times continuously indicates that the running state of the public transport means is a stop state, executing arrival reminding.
13. An arrival reminding apparatus, the apparatus comprising:
the data acquisition module is used for acquiring environmental sound data in a target time period and inertial sensor data in the target time period;
the first feature extraction module is used for extracting features of the environmental sound data based on the time sequence of the environmental sound data to obtain global sound features;
the second feature extraction module is used for performing feature extraction on the inertial sensor data based on the time sequence of the inertial sensor data to obtain global inertial sensor features;
the feature fusion module is used for carrying out fusion processing on the global sound feature and the global inertial sensor feature based on a self-attention mechanism to obtain a fusion feature;
the information acquisition module is used for acquiring traffic operation information based on the fusion characteristics; the traffic operation information is used for indicating the operation state of public transport means in the target time period;
and the reminding module is used for executing the arrival reminding based on the traffic operation information.
14. A terminal, characterized in that the terminal comprises a processor and a memory; the memory has stored therein at least one computer instruction that is loaded and executed by the processor to implement the method of arrival alert according to any of claims 1 to 12.
15. A computer-readable storage medium having stored therein at least one computer instruction, which is loaded and executed by a processor to implement the method of arrival alert according to any one of claims 1 to 12.
16. A computer program, characterized in that the computer program comprises computer instructions which, when executed by a processor of a terminal, cause the terminal to perform the arrival reminding method according to any of claims 1 to 12.
CN202111249921.8A 2021-10-26 2021-10-26 Arrival reminding method, device, terminal and storage medium Active CN113984078B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111249921.8A CN113984078B (en) 2021-10-26 2021-10-26 Arrival reminding method, device, terminal and storage medium
PCT/CN2022/124453 WO2023071768A1 (en) 2021-10-26 2022-10-10 Station-arrival reminding method and apparatus, and terminal, storage medium and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111249921.8A CN113984078B (en) 2021-10-26 2021-10-26 Arrival reminding method, device, terminal and storage medium

Publications (2)

Publication Number Publication Date
CN113984078A true CN113984078A (en) 2022-01-28
CN113984078B CN113984078B (en) 2024-03-08

Family

ID=79741844

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111249921.8A Active CN113984078B (en) 2021-10-26 2021-10-26 Arrival reminding method, device, terminal and storage medium

Country Status (2)

Country Link
CN (1) CN113984078B (en)
WO (1) WO2023071768A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023071768A1 (en) * 2021-10-26 2023-05-04 上海瑾盛通信科技有限公司 Station-arrival reminding method and apparatus, and terminal, storage medium and program product

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117334072B (en) * 2023-12-01 2024-02-23 青岛城运数字科技有限公司 Bus arrival time prediction method and device

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20140110302A (en) * 2013-03-07 2014-09-17 최용석 Online transfer and monitoring system for moving image
CN110570858A (en) * 2019-09-19 2019-12-13 芋头科技(杭州)有限公司 Voice awakening method and device, intelligent sound box and computer readable storage medium
CN110660201A (en) * 2019-09-23 2020-01-07 Oppo广东移动通信有限公司 Arrival reminding method, device, terminal and storage medium
CN110880328A (en) * 2019-11-20 2020-03-13 Oppo广东移动通信有限公司 Arrival reminding method, device, terminal and storage medium
CN111009261A (en) * 2019-12-10 2020-04-14 Oppo广东移动通信有限公司 Arrival reminding method, device, terminal and storage medium
CN111079547A (en) * 2019-11-22 2020-04-28 武汉大学 Pedestrian moving direction identification method based on mobile phone inertial sensor
CN111383628A (en) * 2020-03-09 2020-07-07 第四范式(北京)技术有限公司 Acoustic model training method and device, electronic equipment and storage medium
CN111402617A (en) * 2020-03-12 2020-07-10 Oppo广东移动通信有限公司 Site information determination method, device, terminal and storage medium
CN112489635A (en) * 2020-12-03 2021-03-12 杭州电子科技大学 Multi-mode emotion recognition method based on attention enhancement mechanism
CN112556692A (en) * 2020-11-27 2021-03-26 绍兴市北大信息技术科创中心 Vision and inertia odometer method and system based on attention mechanism
CN112651267A (en) * 2019-10-11 2021-04-13 阿里巴巴集团控股有限公司 Recognition method, model training, system and equipment
CN113177133A (en) * 2021-04-23 2021-07-27 深圳依时货拉拉科技有限公司 Image retrieval method, device, equipment and storage medium
CN113810539A (en) * 2021-09-17 2021-12-17 上海瑾盛通信科技有限公司 Method, device, terminal and storage medium for reminding arrival

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8400294B2 (en) * 2009-12-21 2013-03-19 Garmin Switzerland Gmbh Transit stop detection
US10495466B2 (en) * 2015-08-25 2019-12-03 Siemens Mobility, Inc. System and method for determining a location of a vehicle relative to a stopping point
CN110718089A (en) * 2019-10-15 2020-01-21 Oppo(重庆)智能科技有限公司 Travel service method, travel service device and computer readable storage medium
CN111353467B (en) * 2020-03-12 2023-06-13 Oppo广东移动通信有限公司 Driving state identification method, device, terminal and storage medium
CN113380043B (en) * 2021-08-12 2022-01-14 深圳市城市交通规划设计研究中心股份有限公司 Bus arrival time prediction method based on deep neural network calculation
CN113984078B (en) * 2021-10-26 2024-03-08 上海瑾盛通信科技有限公司 Arrival reminding method, device, terminal and storage medium


Also Published As

Publication number Publication date
WO2023071768A1 (en) 2023-05-04
CN113984078B (en) 2024-03-08

Similar Documents

Publication Publication Date Title
CN110660201B (en) Arrival reminding method, device, terminal and storage medium
WO2023071768A1 (en) Station-arrival reminding method and apparatus, and terminal, storage medium and program product
US10539586B2 (en) Techniques for determination of a motion state of a mobile device
CN104737101B (en) The computing device of the non-vision response triggered with power
CN109792577B (en) Information processing apparatus, information processing method, and computer-readable storage medium
CN111325386B (en) Method, device, terminal and storage medium for predicting running state of vehicle
CN104202466B (en) A kind of method for carrying out safety instruction using mobile phone in traveling
CN103460722B (en) Method, equipment and device for carrying out activity classification using the time scaling to the feature on the basis of the time
WO2021169742A1 (en) Method and device for predicting operating state of transportation means, and terminal and storage medium
CN110972112B (en) Subway running direction determining method, device, terminal and storage medium
CN107886045B (en) Facility satisfaction calculation device
JP6894946B2 (en) Motion detection methods, devices, equipment and media
CN110880328B (en) Arrival reminding method, device, terminal and storage medium
US20160253594A1 (en) Method and apparatus for determining probabilistic context awreness of a mobile device user using a single sensor and/or multi-sensor data fusion
WO2021115232A1 (en) Arrival reminding method and device, terminal, and storage medium
US20170294139A1 (en) Systems and methods for individualized driver prediction
US20170344123A1 (en) Recognition of Pickup and Glance Gestures on Mobile Devices
CN110278324B (en) Method and device for detecting subway station entrance and exit states, terminal equipment and storage medium
CN111402617B (en) Site information determination method, device, terminal and storage medium
CN113780978B (en) Arrival reminding method and device, storage medium and electronic equipment
CN113128115A (en) Subway running state prediction and model training method and device and storage medium
EP3382570A1 (en) Method for characterizing driving events of a vehicle based on an accelerometer sensor
CN113810539B (en) Method, device, terminal and storage medium for reminding arrival
JP2018059721A (en) Parking position search method, parking position search device, parking position search program and mobile body
CN107203259A (en) For the method and apparatus for the probability perception of content that mobile device user is determined using list and/or Fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant