CN110660201B - Arrival reminding method, device, terminal and storage medium

Arrival reminding method, device, terminal and storage medium

Info

Publication number
CN110660201B
Authority
CN
China
Prior art keywords
target
sound
alarm ring
audio
stations
Prior art date
Legal status
Active
Application number
CN201910897452.7A
Other languages
Chinese (zh)
Other versions
CN110660201A (en)
Inventor
刘文龙
Current Assignee
Shanghai Jinsheng Communication Technology Co ltd
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Shanghai Jinsheng Communication Technology Co ltd
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Jinsheng Communication Technology Co ltd and Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN201910897452.7A
Publication of CN110660201A
Application granted
Publication of CN110660201B
Legal status: Active

Classifications

    • G PHYSICS
    • G08 SIGNALLING
    • G08B SIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
    • G08B 21/00 Alarms responsive to a single specified undesired or abnormal condition and not otherwise provided for
    • G08B 21/18 Status alarms
    • G08B 21/24 Reminder alarms, e.g. anti-loss alarms
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/06 Decision making techniques; Pattern matching strategies
    • G10L 17/08 Use of distortion metrics or a particular distance between probe pattern and reference templates
    • G10L 17/26 Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/18 Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Business, Economics & Management (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Emergency Management (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Burglar Alarm Systems (AREA)

Abstract

The embodiments of the present application disclose an arrival reminding method, apparatus, terminal and storage medium, belonging to the field of artificial intelligence. The method includes: collecting ambient sounds by a microphone while in a vehicle; identifying the environmental sound; when the environmental sound is identified to contain the target alarm ring, incrementing the number of traveled stations by one; and when the number of traveled stations reaches the target number of stations, issuing an arrival reminder. With the method provided by the embodiments of the present application, environmental sound is collected in real time and checked for the target alarm ring, the number of traveled stations is updated each time the target alarm ring is recognized, and an arrival reminder is issued when that number reaches the target number of stations. Because the alarm ring is designed to warn passengers, its acoustic features are distinctive and easy to recognize, so basing the arrival reminder on the alarm ring in the environmental sound improves the accuracy and effectiveness of the reminder.

Description

Arrival reminding method, device, terminal and storage medium
Technical Field
The embodiments of the present application relate to the field of artificial intelligence, and in particular to an arrival reminding method, apparatus, terminal and storage medium.
Background
When traveling on public transport such as a subway, passengers must constantly pay attention to whether the current stop is their target station. An arrival reminding function reminds passengers to get off in time when they arrive at their target station.
In the related art, a terminal typically uses speech recognition to obtain the current station information from the arrival announcements broadcast on the subway, judges whether the current station is the passenger's target station, and reminds the passenger if it is.
However, when station information is obtained in this way, passengers' speech and the noise of subway operation strongly affect the speech recognition result, which easily leads to delayed or inaccurate reminders.
Disclosure of Invention
The embodiments of the present application provide an arrival reminding method, apparatus, terminal and storage medium. The technical solutions are as follows:
In one aspect, an embodiment of the present application provides an arrival reminding method, the method including:
collecting ambient sounds by a microphone while in a vehicle;
identifying the environmental sound;
when the environmental sound is identified to contain a target alarm ring, incrementing the number of traveled stations by one, the target alarm ring being a door-opening alarm ring or a door-closing alarm ring; and
when the number of traveled stations reaches the target number of stations, issuing an arrival reminder, the target number of stations being the number of stations between the origin station and the target station, and the target station being a transfer station or a destination station.
In another aspect, an embodiment of the present application provides an arrival reminding apparatus, the apparatus including:
an acquisition module, configured to collect environmental sound through a microphone while in a vehicle;
an identification module, configured to identify the environmental sound;
a counting module, configured to increment the number of traveled stations by one when the environmental sound is identified to contain a target alarm ring, the target alarm ring being a door-opening alarm ring or a door-closing alarm ring; and
a reminding module, configured to issue an arrival reminder when the number of traveled stations reaches the target number of stations, the target number of stations being the number of stations between the origin station and the target station, and the target station being a transfer station or a destination station.
In another aspect, an embodiment of the present application provides a terminal, where the terminal includes a processor and a memory; the memory stores at least one instruction for execution by the processor to implement the arrival reminder method of the above aspect.
In another aspect, an embodiment of the present application provides a computer-readable storage medium, where the storage medium stores at least one instruction, and the at least one instruction is used for being executed by a processor to implement the arrival reminding method in the foregoing aspect.
The technical solutions provided by the embodiments of the present application bring at least the following beneficial effects:
In the embodiments of the present application, environmental sound is collected in real time and checked for the target alarm ring; the number of traveled stations is updated each time the target alarm ring is recognized, and an arrival reminder is issued when that number reaches the target number of stations. Because the alarm ring is designed to warn passengers, its acoustic features are distinctive and easy to recognize, so basing the arrival reminder on the alarm ring in the environmental sound improves the accuracy and effectiveness of the reminder.
Drawings
FIG. 1 is a flow chart illustrating a method of arrival reminders according to an exemplary embodiment;
FIG. 2 is a flow chart illustrating a method of arrival reminders according to another exemplary embodiment;
FIG. 3 is a flow chart illustrating a method of arrival reminders according to another exemplary embodiment;
FIG. 4 is a flow diagram illustrating audio data pre-processing according to an exemplary embodiment;
FIG. 5 is a flow diagram illustrating a voice recognition process in accordance with another illustrative embodiment;
FIG. 6 is a flow chart illustrating a method of arrival reminders according to another exemplary embodiment;
FIG. 7 is a flow diagram illustrating a primary sound recognition model for sound recognition in accordance with an exemplary embodiment;
FIG. 8 is a spectral diagram illustrating an ambient sound according to an exemplary embodiment;
FIG. 9 is a block diagram illustrating a two-level sound recognition model in accordance with an exemplary embodiment;
FIG. 10 is a block diagram illustrating the structure of a station arrival reminder according to an exemplary embodiment;
fig. 11 is a block diagram illustrating a structure of a terminal according to an exemplary embodiment.
With the foregoing drawings in mind, certain embodiments of the present disclosure have been shown and are described in more detail below. These drawings and the written description are not intended to limit the scope of the disclosed concepts in any way, but rather to illustrate the concepts of the disclosure to those skilled in the art by reference to specific embodiments.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
Reference herein to a "module" generally refers to a program or instructions stored in memory that is capable of performing certain functions; reference herein to "a unit" generally refers to a logically partitioned functional structure, and the "unit" may be implemented by pure hardware or a combination of hardware and software.
Reference herein to "a plurality" means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
The arrival reminding method provided by each embodiment of the application is used for a terminal with audio acquisition and processing functions, and the terminal can be a smart phone, a tablet computer, an electronic book reader, a personal portable computer and the like. In a possible implementation manner, the arrival reminding method provided by the embodiment of the application can be implemented as an application program or a part of the application program, and is installed in the terminal. The application can be manually started (or automatically started) when the user takes the vehicle, so that the user is reminded of arriving at the station through the application.
In the related art, speech recognition is generally used to determine the name of the station at which the vehicle is currently located from the station announcement broadcast when the vehicle arrives, and the user is reminded when the vehicle reaches the target station. However, noise generated while the vehicle is moving and ambient voices such as passengers' speech interfere with speech recognition, which easily leads to recognition errors; moreover, a speech recognition model is difficult to run on a terminal and generally has to rely on the cloud.
In addition, in the related art, an accelerometer is used to detect whether the vehicle is accelerating or decelerating in order to judge whether it is entering a station. However, the acceleration direction recorded by the accelerometer in the terminal depends on how the user holds the terminal, the user walking inside the vehicle also affects the sensor readings, and the vehicle sometimes stops temporarily between two stations, so it is difficult to accurately judge the position of the vehicle from accelerometer timing.
To solve the above problems, an embodiment of the present application provides an arrival reminding method, the flow of which is shown in FIG. 1. Before the arrival reminding function is used for the first time, the terminal executes step 101 and stores the vehicle route map. When the arrival reminding function is started, the terminal first executes step 102 to determine the riding route. After the user boards the vehicle, the terminal executes step 103 and collects environmental sound in real time through the microphone. In step 104, the terminal identifies whether the environmental sound contains the target alarm ring; if it does not, the terminal continues to identify the next segment of environmental sound, and if it does, the terminal executes step 105 and increments the number of traveled stations by one. In step 106, the terminal judges from the number of traveled stations whether the current station is the destination station; if so, it executes step 107 and issues an arrival reminder. If not, it executes step 108 and judges whether the current station is a transfer station; if so, it again executes step 107 and issues an arrival reminder, otherwise it continues to identify the next segment of environmental sound.
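For illustration only, the flow of FIG. 1 can be summarized as the following Python-style control loop; the function names (collect_audio_frame, contains_target_alarm, remind) are hypothetical placeholders and are not part of the patent.

    def arrival_reminder_loop(stations_to_destination, transfer_counts,
                              collect_audio_frame, contains_target_alarm, remind):
        """Illustrative control loop for steps 103-108 of FIG. 1.

        stations_to_destination: number of stations from the origin to the destination.
        transfer_counts: set of traveled-station counts at which a transfer is needed.
        The three callables are hypothetical stand-ins for microphone capture,
        alarm-ring recognition and the reminder user interface.
        """
        traveled = 0
        while traveled < stations_to_destination:        # step 103: keep listening
            ambient = collect_audio_frame()               # real-time microphone capture
            if not contains_target_alarm(ambient):        # step 104: recognize the alarm ring
                continue                                  # no alarm ring: next audio segment
            traveled += 1                                 # step 105: one more station traveled
            if traveled == stations_to_destination:       # step 106: destination reached
                remind("destination")                     # step 107: arrival reminder
            elif traveled in transfer_counts:             # step 108: transfer station
                remind("transfer")                        # step 107: arrival reminder
        return traveled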
Compared with the arrival reminding methods provided in the related art, the embodiments of the present application judge how many stations the vehicle has traveled by identifying whether the current environmental sound contains the target alarm ring. The target alarm ring has distinctive features compared with other environmental sounds and is affected by fewer factors, so the recognition result is highly accurate; furthermore, no complex speech recognition model is needed, which helps reduce the power consumption of the terminal.
Referring to fig. 2, a flowchart of an arrival reminding method according to an embodiment of the present application is shown. In this embodiment, an arrival reminding method is described as an example for a terminal with audio acquisition and processing functions, and the method includes:
in step 201, ambient sounds are collected by a microphone while in a vehicle.
When the terminal is in a vehicle, the terminal starts the arrival reminding function and collects the environmental sound in real time through the microphone.
In a possible implementation, when the arrival reminding method is applied in a map navigation application, the terminal acquires the user's position information in real time and starts the arrival reminding function when the position information indicates that the user has entered a vehicle.
Optionally, when the user swipes a card through a payment application to board the vehicle, the terminal confirms that the user has boarded and starts the arrival reminding function.
Optionally, in order to reduce power consumption of the terminal, the terminal may use a low power consumption microphone to perform real-time acquisition.
Step 202, the environmental sounds are identified.
Optionally, the terminal converts the environmental sound collected by the microphone in real time into audio data, processes the audio data, and identifies whether the processed audio data contains the audio data of the target alarm ring.
In one possible implementation, when acquiring the vehicle route map of a city, the terminal also acquires the alarm rings of different vehicles and stores their audio data locally. If the terminal cannot acquire the alarm ring of the current city or vehicle, the user needs to turn on the microphone to record and store the alarm ring the first time the vehicle is taken, so that the terminal can learn the alarm ring.
Step 203, when the environmental sound is identified to contain the target alarm ring, incrementing the number of traveled stations by one, where the target alarm ring is a door-opening alarm ring or a door-closing alarm ring.
When the terminal identifies that the current environmental sound contains the target alarm ring, this indicates that the vehicle has arrived at a station, so the number of traveled stations is incremented by one. Because vehicles usually sound an alarm ring both when the doors open and when they close, the terminal can be set in advance to recognize only the door-opening alarm ring or only the door-closing alarm ring in order to avoid double counting. The interval between the door-opening and door-closing rings is generally short, so when the two rings are identical, two rings recognized within a fixed time window are treated as a single door-opening or door-closing event.
Step 204, when the number of traveled stations reaches the target number of stations, issuing an arrival reminder, where the target number of stations is the number of stations between the origin station and the target station, and the target station is a transfer station or a destination station.
After each increment, if the number of traveled stations has reached the target number of stations, the current station is the target station and the user is reminded of the arrival. The target number of stations is the number of stations between the origin station and the target station, i.e., the number of stations the vehicle needs to travel from the origin station to the target station; target stations include transfer stations and the destination station.
Optionally, to prevent the user from missing the chance to get off because the reminder arrives only after the doors have closed and the vehicle has departed for the next station, the terminal may be configured to send an arrival message when the vehicle reaches the station immediately before the target station, so that the user can prepare to get off in advance.
Optionally, the arrival reminding mode includes but is not limited to: voice prompt, vibration prompt and interface prompt.
Regarding how the target number of stations is obtained, in one possible implementation the terminal downloads and stores the route map of the vehicles in the current city in advance, where the route map includes the station information, transfer information, first and last departure times, and maps near the stations of each line. Before turning on the microphone to collect environmental sound, the terminal first acquires the user's riding information, including the origin station, the target station, the map near the station, the first and last departure times, and so on, and determines the target number of stations from this riding information.
Optionally, the riding information may be entered manually by the user, for example the names of the origin station and the target station; the terminal then selects a suitable riding route from the entered information and the vehicle route map, and when the target station is reached, the terminal sends an arrival reminder together with a map of the area near the target station.
Alternatively, the riding information entered manually by the user may be only the number of stations between the origin station and the target station. As described above, the terminal judges the current station from the alarm ring sounded when the vehicle opens or closes its doors, and increments the number of traveled stations each time the target alarm ring is recognized, until that number equals the number of stations from the origin station to the target station. Therefore, when the user already knows the riding route, only the number of stations on the route needs to be entered; when the route involves a transfer, the user can be prompted to enter the number of stations between the origin station and the transfer station and the number of stations between the transfer station and the destination station.
Optionally, the terminal may predict the user's riding route from the user's historical riding records, treat routes whose ride count reaches a threshold as preferred routes, and prompt the user to select one.
In summary, in the embodiments of the present application, environmental sound is collected in real time and checked for the target alarm ring; the number of traveled stations is updated each time the target alarm ring is recognized, and an arrival reminder is issued when that number reaches the target number of stations. Because the alarm ring is designed to warn passengers, its acoustic features are distinctive and easy to recognize, so basing the arrival reminder on the alarm ring in the environmental sound improves the accuracy and effectiveness of the reminder.
In a possible implementation, to improve recognition accuracy when identifying whether the environmental sound contains the target alarm ring, the audio data corresponding to the environmental sound is first pre-processed and then fed into a sound recognition model; whether the current environmental sound contains the target alarm ring is then determined from the target alarm ring recognition result output by the model. This is described below through exemplary embodiments.
Referring to fig. 3, a flow chart of an arrival reminding method according to another embodiment of the present application is shown. In this embodiment, an arrival reminding method is described as an example for a terminal with audio acquisition and processing functions, and the method includes:
in step 301, ambient sounds are collected by a microphone while in a vehicle.
The implementation of step 301 may refer to step 201, and this embodiment is not described herein again.
Step 302, performing framing processing on the audio data corresponding to the environmental sounds to obtain audio data corresponding to different audio frames.
Since the sound recognition model cannot recognize raw audio data directly, the audio data must first be processed into digital features that the model can recognize. Because the terminal microphone collects environmental sound continuously, the audio signal is not stationary as a whole, but short local segments can be regarded as stationary, and the sound recognition model can only recognize stationary data; therefore, the audio data is first divided into frames to obtain the audio data corresponding to different audio frames.
In one possible embodiment, the audio data pre-processing flow is shown in FIG. 4. The audio data is first pre-emphasized by the pre-emphasis module 401. Pre-emphasis uses a high-pass filter, which passes only signal components above a certain frequency and suppresses components below it, removing unnecessary low-frequency interference such as speech, footsteps, and mechanical noise and flattening the spectrum of the audio signal. The transfer function of the high-pass filter is:
H(z) = 1 - a·z⁻¹
where a is the pre-emphasis (correction) coefficient, generally between 0.95 and 0.97, and z is the complex variable of the z-transform.
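As a minimal sketch of this pre-emphasis step, the function below applies the time-domain equivalent y[t] = x[t] - a·x[t-1] of the filter H(z) = 1 - a·z⁻¹; the default value a = 0.97 is an assumption within the stated 0.95 to 0.97 range.

    import numpy as np

    def pre_emphasis(signal: np.ndarray, a: float = 0.97) -> np.ndarray:
        """First-order high-pass (pre-emphasis) filter: y[t] = x[t] - a * x[t-1].

        Time-domain form of H(z) = 1 - a * z^-1; a = 0.97 is an illustrative
        choice within the 0.95-0.97 range given above.
        """
        return np.append(signal[0], signal[1:] - a * signal[:-1])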
The denoised audio data is then divided into frames by the framing and windowing module 402, yielding the audio data corresponding to different audio frames.
Illustratively, in this embodiment every 1024 data points of audio form one frame; with a sampling frequency of 16000 Hz, one frame of audio lasts 64 ms. To avoid abrupt changes between adjacent frames and to avoid losing data at both ends of a frame after windowing, the frames are not taken back to back; instead, after one frame is taken, the window slides by 32 ms before the next frame is taken, so that adjacent frames overlap by 32 ms.
The framed audio data will later undergo a discrete Fourier transform during feature extraction, but a single frame of audio has no obvious periodicity, i.e., its left and right ends are discontinuous, so the Fourier-transformed data deviates from the original data, and the more frames there are, the larger the error. Therefore, to make the framed audio data continuous and let each frame exhibit the characteristics of a periodic function, framing and windowing are performed by the framing and windowing module 402.
In one possible implementation, a Hamming window is used to window each audio frame. Each frame of data is multiplied by the Hamming window function, and the resulting audio data exhibits clear periodicity. The Hamming window has the form:
w(n) = 0.54 - 0.46·cos(2πn / M)
where n is an integer ranging from 0 to M, and M is the number of Fourier transform points; illustratively, 1024 data points are used as the Fourier transform length in this embodiment.
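A sketch of the framing and windowing described above, assuming 16 kHz audio, 1024-sample (64 ms) frames and a 512-sample (32 ms) hop; it uses NumPy's built-in Hamming window (0.54 - 0.46·cos(2πn/(M-1))) rather than the exact window expression of this embodiment.

    import numpy as np

    FRAME_LEN = 1024   # 64 ms at a 16 kHz sampling rate
    HOP_LEN = 512      # 32 ms slide, so adjacent frames overlap by 32 ms

    def frame_and_window(signal: np.ndarray) -> np.ndarray:
        """Split a 1-D signal into overlapping frames and apply a Hamming window.

        Returns an array of shape (num_frames, FRAME_LEN).
        """
        num_frames = 1 + (len(signal) - FRAME_LEN) // HOP_LEN
        window = np.hamming(FRAME_LEN)
        frames = np.empty((num_frames, FRAME_LEN))
        for i in range(num_frames):
            start = i * HOP_LEN
            frames[i] = signal[start:start + FRAME_LEN] * window
        return frames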
Step 303, performing feature extraction on the audio data corresponding to each audio frame to obtain a corresponding audio feature matrix.
After the audio data is subjected to frame windowing, feature extraction is required to be performed, and a feature matrix which can be identified by a sound identification model is obtained.
In a possible embodiment, the Mel-Frequency Cepstral Coefficients (MFCCs) of each audio frame are extracted. As shown in FIG. 4, because it is difficult to obtain the characteristics of an audio signal in the time domain, the time-domain signal is generally converted into an energy distribution in the frequency domain for processing. The terminal therefore first feeds the audio frame data into the Fourier transform module 403, and then feeds the transformed data into the energy spectrum calculation module 404 to compute its energy spectrum. To convert the energy spectrum into a Mel spectrum that matches human auditory perception, the energy spectrum is passed through the Mel filtering module; the Mel frequency corresponding to a linear frequency f used in this filtering is:
Mel(f) = 2595 · log10(1 + f / 700)
where f is a frequency point after the Fourier transform.
After obtaining the Mel spectrum of the audio frame, the terminal takes its logarithm and applies a discrete cosine transform through the Discrete Cosine Transform (DCT) module 406; the resulting DCT coefficients are the MFCC features.
Illustratively, this embodiment uses 40-dimensional MFCC features. During actual feature extraction, the input window length of the audio data is 1056 ms, one frame of signal lasts 64 ms, and adjacent frames overlap by 32 ms, so every 1056 ms input window yields a 32 × 40 feature matrix.
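A sketch of the 40-dimensional MFCC extraction; the patent does not name a library, so librosa is used here purely as an assumption for illustration. With a 1056 ms window at 16 kHz (16896 samples), a 1024-point FFT and a 512-sample hop, the output is the 32 × 40 matrix described above.

    import numpy as np
    import librosa

    SR = 16000                 # sampling rate in Hz
    WINDOW_SAMPLES = 16896     # 1056 ms at 16 kHz

    def mfcc_matrix(window_audio: np.ndarray) -> np.ndarray:
        """Return a (32, 40) MFCC feature matrix for one 1056 ms input window."""
        mfcc = librosa.feature.mfcc(
            y=window_audio, sr=SR,
            n_mfcc=40,          # 40-dimensional MFCC features
            n_fft=1024,         # 64 ms frame
            hop_length=512,     # 32 ms slide (32 ms overlap between frames)
            center=False)       # no padding: 1 + (16896 - 1024) // 512 = 32 frames
        return mfcc.T           # shape (32, 40): frames x coefficients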
Step 304, inputting the audio feature matrix into the sound recognition model to obtain the target alarm ring recognition result output by the sound recognition model, where the target alarm ring recognition result indicates whether the audio frame contains the target alarm ring.
Optionally, the terminal inputs the audio feature matrix obtained from feature extraction of the audio frame into the sound recognition model, which recognizes whether the current audio frame contains the target alarm ring and outputs the recognition result.
In a possible implementation, if the terminal cannot automatically acquire the alarm ring of the vehicles in the current city, the user is required to record the target alarm ring in advance; after it is recorded, the audio data containing the target alarm ring undergoes the framing and feature extraction of steps 302 to 303, and the audio feature matrices of the different target alarm rings are stored locally.
Step 305, when the number of the audio frames containing the target alarm ring within the preset time reaches the number threshold, determining that the environmental sound contains the target alarm ring.
Because the terminal frames the audio data before recognizing the target alarm ring, and one audio frame is very short, a single frame that appears to contain the target alarm ring may in fact be a similar sound, or the result of an error introduced during feature extraction, so it cannot be immediately concluded that the environmental sound contains the target alarm ring. Therefore, the terminal sets a predetermined duration, and only when the output of the sound recognition model indicates that the number of audio frames containing the target alarm ring within that duration reaches the number threshold is the environmental sound determined to contain the target alarm ring.
Illustratively, the terminal sets the predetermined time length to be 5 seconds, the number threshold value to be 2, and when the terminal recognizes that 2 or more than 2 audio frames contain the target alarm ring within 5 seconds, it is determined that the current environment sound contains the target alarm ring.
Step 306, acquiring the previous alarm ring recognition time, where the previous alarm ring recognition time is the time at which the environmental sound was last identified as containing the target alarm ring.
When the output of the sound recognition model indicates that the number of audio frames containing the target alarm ring within the preset duration reaches the number threshold, the terminal records the current time and retrieves the time at which the environmental sound was last identified as containing the target alarm ring, i.e., the previous alarm ring recognition time.
Step 307, if the time interval between the previous alarm ring recognition time and the current alarm ring recognition time is greater than the time interval threshold, incrementing the number of traveled stations by one.
During an actual ride, the door-closing alarm ring of the vehicle may be identical to its door-opening alarm ring, which would cause the terminal to recognize the alarm ring twice at the same station. In addition, other vehicles of the same type may use the same alarm ring as the vehicle the terminal is on; if a nearby vehicle sounds the same alarm ring while the terminal's vehicle is stopped at a station, the count would be wrong. The terminal therefore sets a time interval threshold in advance, and the number of traveled stations is incremented only if the interval between the previous alarm ring recognition time and the current alarm ring recognition time is greater than the time interval threshold.
Illustratively, the time interval threshold is preset to 1 minute. Each time the terminal identifies that the environmental sound contains the target alarm ring, it records the current time and retrieves the previous alarm ring recognition time; if the interval between the two is more than one minute, the vehicle is determined to have traveled one more station and the number of traveled stations is incremented by one.
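The post-processing of steps 305 to 307 can be sketched as the stateful helper below; it is an illustrative assumption rather than code from the patent, and the default values (2 frames within 5 seconds, 60-second interval) are the example values given in this embodiment.

    import time

    class AlarmPostProcessor:
        """Illustrative post-processing for steps 305-307.

        An alarm is confirmed when at least count_threshold frames contain the
        target alarm ring within window_seconds, and the traveled-station count
        is incremented only if more than interval_seconds have passed since the
        previously confirmed alarm.
        """

        def __init__(self, window_seconds=5.0, count_threshold=2, interval_seconds=60.0):
            self.window_seconds = window_seconds
            self.count_threshold = count_threshold
            self.interval_seconds = interval_seconds
            self.frame_hits = []          # timestamps of frames flagged as alarm
            self.last_alarm_time = None   # previous alarm ring recognition time
            self.traveled_stations = 0

        def on_frame_result(self, frame_contains_alarm, now=None):
            """Feed one per-frame recognition result; return the traveled-station count."""
            now = time.time() if now is None else now
            if frame_contains_alarm:
                self.frame_hits.append(now)
            # keep only hits inside the sliding window (step 305)
            self.frame_hits = [t for t in self.frame_hits if now - t <= self.window_seconds]
            if len(self.frame_hits) >= self.count_threshold:
                # the environmental sound is considered to contain the target alarm ring
                if (self.last_alarm_time is None
                        or now - self.last_alarm_time > self.interval_seconds):  # step 307
                    self.traveled_stations += 1
                self.last_alarm_time = now
                self.frame_hits.clear()
            return self.traveled_stations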
Step 308, when the number of traveled stations reaches the target number of stations, issuing an arrival reminder, where the target number of stations is the number of stations between the origin station and the target station, and the target station is a transfer station or a destination station.
For the implementation of step 308, refer to step 204; details are not repeated here.
In the embodiments of the present application, the audio data of the environmental sound is framed and windowed and the audio frames undergo feature extraction, producing data that the sound recognition model can recognize. By post-processing the output of the sound recognition model, it is determined whether the recognized alarm ring really is the target alarm ring, which prevents the alarm rings of other vehicles or similar sounds from being mistaken for the target alarm ring and improves the accuracy of the arrival reminder.
While the vehicle is moving, the terminal must keep the microphone on to collect environmental sound and feed its audio data into the sound recognition model, so the model is always working. To reduce the power consumption of the terminal, in one possible implementation the terminal uses a simple similarity model as the primary sound recognition model and a Convolutional Neural Network (CNN) model as the secondary sound recognition model to improve recognition accuracy. The recognition flow is shown in FIG. 5: the terminal executes step 501 to input the environmental sound; before recognition, it extracts the audio data features in step 502 and feeds the resulting audio feature matrix into the primary sound recognition model; in step 503 it makes a similarity judgment, and when the similarity judgment cannot determine whether the environmental sound contains the target alarm ring, the audio feature matrix is fed into the secondary sound recognition model for CNN classification in step 504. Step 505 performs post-processing, and step 506 decides whether to perform the increment operation if the CNN model's result is that the environmental sound contains the target alarm ring; if the result is that it does not, the terminal continues to recognize the next frame of audio data.
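The two-level recognition of FIG. 5 can be sketched as a cascade in which the cheap similarity model runs on every frame and the CNN classifier is consulted only when the similarity score is inconclusive. The thresholds 0.9 and 0.5 are the illustrative values given later in this embodiment; predict_similarity and cnn_contains_alarm are hypothetical stand-ins for the two models.

    def contains_target_alarm(feature_matrix,
                              predict_similarity,      # primary model: returns max cosine similarity
                              cnn_contains_alarm,      # secondary model: returns True or False
                              high=0.9, low=0.5):
        """Two-level recognition (FIG. 5): similarity model first, CNN only when unsure."""
        sim = predict_similarity(feature_matrix)    # step 503: similarity judgment
        if sim >= high:                             # confident positive: target alarm ring
            return True
        if sim < low:                               # confident negative: no target alarm ring
            return False
        return cnn_contains_alarm(feature_matrix)   # step 504: inconclusive, use CNN classification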
In a possible implementation, on the basis of FIG. 3 and as shown in FIG. 6, step 304 includes steps 304a to 304d.
Step 304a, inputting the audio characteristic matrix into the primary sound recognition model to obtain a first recognition result output by the primary sound recognition model.
In a possible implementation, a cosine similarity model is used as the primary sound recognition model. The model construction and recognition flow is shown in FIG. 7. The upper branch is the model construction process: first, audio features are extracted from the collected audio data of the target alarm rings in step 701 (see steps 302 to 303 for details); the extracted audio features are then averaged in step 702, the resulting 32 × 40 audio feature matrix of each target alarm ring is converted into a one-dimensional feature vector of 1280 elements and added to the database, and finally the feature database is generated in step 703. After the model is built, the recognition flow is shown in the lower branch of FIG. 7: after the environmental audio data is input, audio features are extracted in step 704; in step 705 a feature vector is generated from the extracted features (the audio feature matrix of the audio frame is converted into a one-dimensional feature vector of 1280 elements); and in step 706 the one-dimensional feature vectors Yi in the feature database are traversed and the cosine similarity between each Yi and the one-dimensional feature vector X of the current audio frame is calculated by the following formula:
cos(X, Yi) = (X · Yi) / (|X| · |Yi|),  i = 1, 2, …, n
where n is the number of target alarm ring categories, i.e., the total number of one-dimensional feature vectors in the feature database.
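A sketch of this primary recognition step: the 32 × 40 feature matrix is flattened to a 1280-element vector and compared against every stored vector Yi by cosine similarity. Returning the largest similarity as the first recognition result is an assumption; the embodiment only states that the database is traversed.

    import numpy as np

    def max_cosine_similarity(feature_matrix: np.ndarray,
                              feature_database: np.ndarray) -> float:
        """Primary sound recognition: cosine similarity against the feature database.

        feature_matrix: (32, 40) MFCC matrix of the current input window.
        feature_database: (n, 1280) array, one flattened vector per target alarm ring.
        Returns the largest cosine similarity over the n database vectors.
        """
        x = feature_matrix.reshape(-1)                  # 32 * 40 -> 1280-element vector
        x = x / (np.linalg.norm(x) + 1e-12)             # normalize the current vector
        db_norm = np.linalg.norm(feature_database, axis=1, keepdims=True) + 1e-12
        db = feature_database / db_norm                 # normalize each database vector Yi
        return float(np.max(db @ x))                    # max over cos(X, Yi), i = 1..n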
In other possible embodiments, the primary sound recognition model may use other similarity models, such as a manhattan distance model, a mahalanobis distance model, a euclidean distance model, and the like. The embodiments of the present application only take the cosine similarity model as an example for schematic illustration.
After obtaining the first recognition result of the primary sound recognition model for the audio feature matrix, the terminal cannot yet directly determine whether the current environmental sound contains the target alarm ring; it must first judge whether the first recognition result meets the output condition.
In one possible implementation, step 304a includes the following steps one through two.
Firstly, if the first recognition result indicates that the similarity is greater than a first similarity threshold or less than a second similarity threshold, the first recognition result is determined to meet the output condition, and the first similarity threshold is greater than the second similarity threshold.
In one possible implementation mode, when the first identification result indicates that the similarity is greater than or equal to a first similarity threshold, the terminal determines that the current environment sound contains a target alarm ring; and when the first identification result indicates that the similarity is smaller than a second similarity threshold, the terminal determines that no target alarm ring exists in the current environment sound, and the first similarity threshold is larger than the second similarity threshold. The two results meet the output condition.
Illustratively, the first similarity threshold is set to 0.9 and the second similarity threshold to 0.5; when the first recognition result indicates that the similarity is greater than or equal to 0.9 or less than 0.5, the first recognition result is determined to meet the output condition.
And secondly, if the first recognition result indicates that the similarity is greater than the second similarity threshold and smaller than the first similarity threshold, determining that the first recognition result does not meet the output condition.
When the first recognition result indicates that the similarity is greater than the second similarity threshold and smaller than the first similarity threshold, the primary sound recognition model cannot determine whether the environmental sound contains the target alarm ring, the first recognition result does not meet the output condition, and the secondary sound recognition model is required to further recognize the audio feature matrix of the current audio frame.
Illustratively, the first similarity threshold is set to be 0.9, the second similarity threshold is set to be 0.5, and when the first recognition result is smaller than 0.9 and larger than 0.5, it is determined that the first recognition result does not meet the output condition.
And step 304b, if the first identification result meets the output condition, taking the first identification result as the identification result of the target alarm ring and outputting the identification result.
When the first recognition result meets the output condition, the first-level sound recognition model can determine whether the current environmental sound contains the target alarm ring, and therefore the terminal takes the first recognition result as the target alarm ring recognition result and outputs the target alarm ring recognition result.
Illustratively, when the cosine similarity of the first recognition result is 0.95, it is determined that the current environmental sound contains the target alarm ring, and the first recognition result is output to be 0.95.
And step 304c, if the first recognition result does not meet the output condition, inputting the audio characteristic matrix into the secondary sound recognition model to obtain a second recognition result output by the secondary sound recognition model.
When the first recognition result does not meet the output condition, the primary sound recognition model cannot determine whether the current environmental sound contains the target alarm ring. In this case the secondary sound recognition model performs further recognition: the audio feature matrix of the audio frame whose first recognition result did not meet the output condition is input into the secondary sound recognition model to obtain the second recognition result output by that model.
In one possible implementation, the secondary sound recognition model adopts a CNN classification model, and the model training process is as follows:
firstly, converting the collected environment sound containing the target alarm ring into a spectrogram.
As shown in FIG. 8, the difference between the target alarm ring and other environmental sounds is clearly visible: the short segments inside the black box are the spectrum of the target alarm ring. The target alarm ring is labeled as the positive sample and the remaining environmental sounds are negative samples.
And secondly, extracting the characteristics of the collected environmental sounds.
The feature extraction method of the environmental sound collected in advance is the same as that in the above embodiment, the audio feature matrix corresponding to each audio frame is used as a training sample, the label corresponding to the audio feature matrix of the target alarm ring is 0, and the labels corresponding to the audio feature matrices of the other environmental sounds are 1.
And thirdly, constructing a CNN model.
In a possible embodiment, the CNN model structure is shown in FIG. 9: the first convolutional layer 901 and the second convolutional layer 902 extract features from the input audio feature matrix, the first fully connected layer 903 and the second fully connected layer 904 integrate the class-discriminative information from the convolutional layers 901 and 902, and finally a normalized exponential function (Softmax) layer 905 classifies the information integrated by the fully connected layers to produce the second recognition result. A code sketch of this structure, together with the loss and training described below, is given after step five.
And fourthly, constructing a loss function of the model.
Because the target alarm ring sounds for only about 5 seconds during a ride while the remaining environmental sound lasts several minutes, the positive and negative samples are highly imbalanced. The focal loss function (Focal Loss) is therefore chosen to deal with this sample imbalance; its formula is:
FL = -α · y · (1 - y')^γ · log(y') - (1 - α) · (1 - y) · (y')^γ · log(1 - y')
where y' is the probability output by the CNN classification model, y is the label of the training sample, and α and γ are manually tuned parameters used to balance the contributions of the positive and negative samples.
And fifthly, importing training samples to perform model training.
In one possible implementation, the CNN classification model can be trained with the TensorFlow framework using the focal loss and a gradient descent algorithm until the model converges.
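The following Keras sketch combines the structure of FIG. 9 with the focal loss and training described above. The filter counts, kernel sizes, pooling, learning rate, and the values alpha = 0.25 and gamma = 2.0 are assumptions (common defaults), not values stated in the patent; the patent only specifies the layer types and their order, the loss formula, and the use of TensorFlow with gradient descent.

    import tensorflow as tf
    from tensorflow.keras import layers, models

    def build_cnn(input_shape=(32, 40, 1), num_classes=2):
        """Two convolutional layers, two fully connected layers and a Softmax output (FIG. 9)."""
        return models.Sequential([
            layers.Input(shape=input_shape),
            layers.Conv2D(16, (3, 3), activation="relu", padding="same"),  # first convolutional layer 901
            layers.MaxPooling2D((2, 2)),
            layers.Conv2D(32, (3, 3), activation="relu", padding="same"),  # second convolutional layer 902
            layers.MaxPooling2D((2, 2)),
            layers.Flatten(),
            layers.Dense(64, activation="relu"),              # first fully connected layer 903
            layers.Dense(32, activation="relu"),              # second fully connected layer 904
            layers.Dense(num_classes, activation="softmax"),  # Softmax layer 905
        ])

    def focal_loss(alpha=0.25, gamma=2.0):
        """Two-class focal loss matching the formula above.

        y_true is one-hot; column 1 is treated here as the positive (target alarm
        ring) class, so with the 0/1 labeling of this embodiment the target-ring
        label must be mapped to that column. alpha and gamma are manually tuned.
        """
        def loss(y_true, y_pred):
            y_true = tf.cast(y_true, y_pred.dtype)
            p = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)
            y = y_true[:, 1]        # positive-class label
            p_pos = p[:, 1]         # predicted probability y' of the positive class
            pos = -alpha * y * tf.pow(1.0 - p_pos, gamma) * tf.math.log(p_pos)
            neg = -(1.0 - alpha) * (1.0 - y) * tf.pow(p_pos, gamma) * tf.math.log(1.0 - p_pos)
            return tf.reduce_mean(pos + neg)
        return loss

    # Illustrative training: features have shape (num_samples, 32, 40, 1) and the
    # labels are one-hot vectors.
    # model = build_cnn()
    # model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
    #               loss=focal_loss(), metrics=["accuracy"])
    # model.fit(train_features, train_labels, epochs=20, batch_size=32)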
In a possible implementation manner, the secondary sound recognition model may also use other conventional machine learning classifiers or deep learning classification models, which is not limited in this embodiment.
And step 304d, taking the second identification result as an identification result of the target alarm ring and outputting the identification result.
The secondary sound recognition model is a CNN classification model with higher precision, so that after a second recognition result of the CNN classification model is obtained, the second recognition result is used as a target alarm ringtone recognition result and is output.
In the embodiments of the present application, two sound recognition models are used. The primary sound recognition model has low power consumption and is easy to implement, so it stays on to recognize the audio data of the environmental sound in real time; the secondary sound recognition model is more accurate but consumes more power, and it is invoked only when the primary model cannot determine whether the current environmental sound contains the target alarm ring. This improves the accuracy of the recognition result while reducing the power consumption of the terminal.
Referring to fig. 10, a block diagram of a station arrival reminding apparatus according to an exemplary embodiment of the present application is shown. The apparatus may be implemented as all or a portion of the terminal in software, hardware, or a combination of both. The device includes:
an acquisition module 1001, configured to collect environmental sound through a microphone while in a vehicle;
an identifying module 1002, configured to identify the environmental sound;
the counting module 1003 is configured to add one to the number of stations that have traveled when the environmental sound is identified to include a target alarm ring, where the target alarm ring is a door opening alarm ring or a door closing alarm ring;
and a reminding module 1004, configured to perform arrival reminding when the number of stations that have traveled reaches a target number of stations, where the target number of stations is the number of stations between an origin station and a target station, and the target station is a transit station or a destination station.
Optionally, the identifying module 1002 includes:
the framing processing unit is used for framing the audio data corresponding to the environmental sound to obtain audio data corresponding to different audio frames;
the feature extraction unit is used for extracting features of the audio data corresponding to each audio frame to obtain a corresponding audio feature matrix;
and the voice identification unit is used for inputting the audio characteristic matrix into a voice identification model to obtain a target alarm ring identification result output by the voice identification model, and the target alarm ring identification result is used for indicating whether the audio frame contains the target alarm ring or not.
Optionally, the sound recognition model includes a primary sound recognition model and a secondary sound recognition model, and the recognition accuracy of the secondary sound recognition model is higher than that of the primary sound recognition model;
the voice recognition unit is further configured to:
inputting the audio characteristic matrix into the primary sound recognition model to obtain a first recognition result output by the primary sound recognition model;
if the first identification result meets the output condition, taking the first identification result as the identification result of the target alarm ring and outputting the identification result;
if the first recognition result does not accord with the output condition, inputting the audio characteristic matrix into the secondary sound recognition model to obtain a second recognition result output by the secondary sound recognition model;
and taking the second identification result as the identification result of the target alarm ring and outputting the identification result.
Optionally, the primary sound identification model is configured to calculate a similarity between the audio feature matrix and a sample feature matrix in a feature database, where the sample feature matrix is an audio feature matrix of the target alarm ring;
the voice recognition unit is further configured to:
if the first recognition result indicates that the similarity is greater than a first similarity threshold or less than a second similarity threshold, determining that the first recognition result meets the output condition, wherein the first similarity threshold is greater than the second similarity threshold;
and if the first recognition result indicates that the similarity is greater than the second similarity threshold and smaller than the first similarity threshold, determining that the first recognition result does not meet the output condition.
Optionally, the identifying module 1002 further includes:
and the determining unit is used for determining that the environmental sound contains the target alarm ring when the number of the audio frames containing the target alarm ring within the preset time reaches a number threshold.
Optionally, the counting module 1003 includes:
the acquiring unit is used for acquiring the last alarm bell identification time, wherein the last alarm bell identification time is the time when the target alarm bell is included in the environment sound identified last time;
and the comparison unit is used for adding one to the number of the running stations if the time interval between the last alarm bell identification time and the current alarm bell identification time is greater than a time interval threshold value.
In summary, the arrival reminding apparatus provided by this embodiment collects environmental sound in real time and identifies whether the current environmental sound contains the target alarm ring; it updates the number of traveled stations when the target alarm ring is recognized, and issues an arrival reminder when that number reaches the target number of stations. Because the alarm ring is designed to warn passengers, its acoustic features are distinctive and easy to recognize, so basing the arrival reminder on the alarm ring in the environmental sound improves the accuracy and effectiveness of the reminder.
Referring to fig. 11, a block diagram of a terminal 1100 according to an exemplary embodiment of the present application is shown. The terminal 1100 may be an electronic device in which an application is installed and run, such as a smart phone, a tablet computer, an electronic book, a portable personal computer, and the like. Terminal 1100 in the present application may include one or more of the following components: a processor 1110, a memory 1120, and a screen 1130.
Processor 1110 may include one or more processing cores. The processor 1110 interfaces with various interfaces and circuitry throughout the various portions of the terminal 1100, and performs various functions of the terminal 1100 and processes data by executing or executing instructions, programs, code sets, or instruction sets stored in the memory 1120, and invoking data stored in the memory 1120. Alternatively, the processor 1110 may be implemented in hardware using at least one of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 1110 may integrate one or a combination of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. Wherein, the CPU mainly processes an operating system, a user interface, an application program and the like; the GPU is responsible for rendering and drawing the content that the screen 1130 needs to display; the modem is used to handle wireless communications. It is to be appreciated that the modem can be implemented by a single communication chip without being integrated into the processor 1110.
The memory 1120 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). Optionally, the memory 1120 includes a non-transitory computer-readable medium. The memory 1120 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 1120 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, an image playing function, and the like), instructions for implementing the above method embodiments, and the like; the operating system may be an Android system (including systems developed in depth on the basis of the Android system), an iOS system developed by Apple Inc. (including systems developed in depth on the basis of the iOS system), or another system. The data storage area may also store data created by the terminal 1100 during use (e.g., phone book, audio and video data, chat log data), etc.
The screen 1130 may be a capacitive touch display screen for receiving a touch operation of a user thereon or nearby using any suitable object such as a finger, a stylus, or the like, and displaying a user interface of each application. The touch display screen is generally provided on the front panel of the terminal 1100. The touch display screen may be designed as a full-face screen, a curved screen, or a profiled screen. The touch display screen can also be designed to be a combination of a full-face screen and a curved-face screen, and a combination of a special-shaped screen and a curved-face screen, which is not limited in the embodiment of the present application.
In addition, those skilled in the art will appreciate that the structure of the terminal 1100 shown in the above figures does not constitute a limitation on the terminal 1100; the terminal may include more or fewer components than those shown, combine certain components, or adopt a different arrangement of components. For example, the terminal 1100 may further include a radio frequency circuit, a camera component, a sensor, an audio circuit, a Wireless Fidelity (WiFi) component, a power supply, a Bluetooth component, and other components, which are not described here again.
An embodiment of the present application further provides a computer-readable medium storing at least one instruction, where the at least one instruction is loaded and executed by a processor to implement the arrival reminding method according to the above embodiments.
An embodiment of the present application further provides a computer program product storing at least one instruction, where the at least one instruction is loaded and executed by a processor to implement the arrival reminding method according to the above embodiments.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in the embodiments of the present application may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media, the latter including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (14)

1. An arrival reminding method, characterized in that the method comprises:
collecting environmental sounds through a microphone while in a vehicle;
recognizing the environmental sound by utilizing a primary sound recognition model;
if the similarity indicated by the first recognition result output by the primary sound recognition model meets the output condition, taking the first recognition result as a target alarm ring recognition result and outputting the target alarm ring recognition result, wherein the target alarm ring recognition result is used for indicating whether an audio frame corresponding to the environmental sound contains a target alarm ring, and the target alarm ring is a door opening alarm ring or a door closing alarm ring;
if the first recognition result does not meet the output condition, recognizing the environmental sound by using a secondary sound recognition model, wherein the recognition accuracy of the secondary sound recognition model is higher than that of the primary sound recognition model;
taking a second recognition result output by the secondary sound recognition model as the target alarm ring recognition result and outputting the second recognition result;
when it is recognized that the environmental sound contains the target alarm ring, adding one to the number of stations traveled;
and when the number of stations traveled reaches a target number of stations, performing arrival reminding, wherein the target number of stations is the number of stations between a starting station and a target station, and the target station is a transfer station or a destination station.
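Purely by way of illustration, the two-stage recognition defined in claim 1 might be organized as the cascade sketched below; the model callables and the decision helper are assumptions for the sketch (one possible decision helper appears after claim 3). The design intent is simply that the cheaper primary model handles clear-cut cases while the more accurate secondary model is reserved for ambiguous ones.

    from typing import Callable, Optional
    import numpy as np

    def recognize_target_alarm_ring(
        feature_matrix: np.ndarray,
        primary_similarity: Callable[[np.ndarray], float],          # primary model: similarity score
        secondary_classifier: Callable[[np.ndarray], bool],         # secondary model: higher accuracy
        decide_from_similarity: Callable[[float], Optional[bool]],  # None means "output condition not met"
    ) -> bool:
        """Output the primary model's result when it is decisive; otherwise
        fall back to the more accurate (and typically more expensive)
        secondary model."""
        similarity = primary_similarity(feature_matrix)
        decision = decide_from_similarity(similarity)
        if decision is not None:
            return decision                          # first recognition result is output directly
        return secondary_classifier(feature_matrix)  # second recognition result is output instead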
2. The method of claim 1, wherein before the recognizing the environmental sound by utilizing the primary sound recognition model, the method further comprises:
performing framing processing on the audio data corresponding to the environmental sounds to obtain audio data corresponding to different audio frames;
performing feature extraction on the audio data corresponding to each audio frame to obtain a corresponding audio feature matrix;
the recognizing the environmental sound by utilizing the primary sound recognition model comprises the following steps:
inputting the audio feature matrix into the primary sound recognition model to obtain the first recognition result;
the recognizing the environmental sound by using the secondary sound recognition model comprises the following steps:
and inputting the audio feature matrix into the secondary sound recognition model to obtain the second recognition result.
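One conventional (assumed) way to realize the framing and feature extraction of claim 2 is MFCC extraction; the sketch below uses the librosa library, and the sample rate, frame length, hop length, and feature dimension are illustrative values not taken from the patent.

    import numpy as np
    import librosa  # assumed third-party dependency, used only for this sketch

    def audio_to_feature_matrix(
        samples: np.ndarray,
        sample_rate: int = 16000,  # assumed
        frame_length: int = 400,   # 25 ms frames at 16 kHz (assumed)
        hop_length: int = 160,     # 10 ms hop (assumed)
        n_mfcc: int = 40,          # assumed feature dimension
    ) -> np.ndarray:
        """Split the environmental sound into audio frames and extract one
        feature vector per frame, yielding an (n_frames, n_mfcc) matrix."""
        mfcc = librosa.feature.mfcc(
            y=samples.astype(np.float32),
            sr=sample_rate,
            n_mfcc=n_mfcc,
            n_fft=frame_length,
            hop_length=hop_length,
        )
        return mfcc.T  # one row per audio frame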
3. The method of claim 2, wherein the primary sound recognition model is used to calculate the similarity between the audio feature matrix and a sample feature matrix in a feature database, and the sample feature matrix is the audio feature matrix of the target alarm ring;
after the audio feature matrix is input into the primary sound recognition model and the first recognition result is obtained, the method further includes:
if the first recognition result indicates that the similarity is greater than a first similarity threshold or less than a second similarity threshold, determining that the first recognition result meets the output condition, wherein the first similarity threshold is greater than the second similarity threshold;
and if the first recognition result indicates that the similarity is greater than the second similarity threshold and smaller than the first similarity threshold, determining that the first recognition result does not meet the output condition.
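The output condition of claim 3 might be expressed as the small helper below, usable with the cascade sketched after claim 1; the two threshold values are assumptions.

    from typing import Optional

    def decide_from_similarity(
        similarity: float,
        first_threshold: float = 0.9,   # assumed value; must exceed second_threshold
        second_threshold: float = 0.3,  # assumed value
    ) -> Optional[bool]:
        """Return True/False when the similarity is decisive (output condition
        met), or None when it falls between the two thresholds and the
        secondary model must be consulted."""
        if similarity > first_threshold:
            return True    # confidently contains the target alarm ring
        if similarity < second_threshold:
            return False   # confidently does not contain it
        return None        # output condition not met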
4. The method of claim 1, wherein the secondary sound recognition model is a binary classification model using a Convolutional Neural Network (CNN), and the binary classification model is trained on positive and negative samples by a gradient descent algorithm with focal loss as the loss function.
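A minimal PyTorch-style sketch of a binary CNN classifier and a binary focal-loss term of the kind named in claim 4 is given below; the network shape and the alpha/gamma values are illustrative assumptions, not values from the patent.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AlarmRingCNN(nn.Module):
        """Binary classifier over a (1, n_frames, n_features) feature matrix."""
        def __init__(self) -> None:
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.classifier = nn.Linear(32, 1)  # single logit: target alarm ring or not

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, 1, n_frames, n_features)
            return self.classifier(self.features(x).flatten(1)).squeeze(1)

    def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
                   alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
        """Binary focal loss (targets are 0/1 floats): down-weights easy samples
        so gradient descent focuses on hard positives and negatives."""
        bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        p_t = torch.exp(-bce)  # probability assigned to the true class
        alpha_t = alpha * targets + (1.0 - alpha) * (1.0 - targets)
        return (alpha_t * (1.0 - p_t) ** gamma * bce).mean()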
5. The method of any of claims 2 to 4, further comprising:
and when the number of audio frames containing the target alarm ring within a preset time reaches a number threshold, determining that the environmental sound contains the target alarm ring.
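The per-frame aggregation of claim 5 might be implemented as a rolling vote, sketched below; the window length and the count threshold are assumed values.

    from collections import deque

    class FrameVoteAggregator:
        """Declare that the environmental sound contains the target alarm ring
        only when enough audio frames within a rolling window (the preset time)
        were individually recognized as containing it."""
        def __init__(self, window_frames: int = 50, count_threshold: int = 10) -> None:
            self._window = deque(maxlen=window_frames)   # assumed window length
            self._count_threshold = count_threshold      # assumed number threshold

        def update(self, frame_contains_ring: bool) -> bool:
            self._window.append(frame_contains_ring)
            return sum(self._window) >= self._count_threshold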
6. The method according to any one of claims 1 to 4, wherein said adding one to the number of stations traveled comprises:
acquiring a last alarm ring recognition moment, wherein the last alarm ring recognition moment is the moment when the target alarm ring was most recently recognized in the environmental sound;
and if the time interval between the last alarm ring recognition moment and the current alarm ring recognition moment is greater than a time interval threshold, adding one to the number of stations traveled.
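The interval check of claim 6 might be implemented as the debouncing counter sketched below, so that a single door-alarm event is not counted as two stations; the 60-second threshold is an assumed value.

    import time
    from typing import Optional

    class StationCounter:
        """Add one to the number of stations traveled only when the newly
        recognized alarm ring is far enough in time from the previous one."""
        def __init__(self, interval_threshold_s: float = 60.0) -> None:  # assumed threshold
            self._interval_threshold_s = interval_threshold_s
            self._last_ring_time: Optional[float] = None
            self.stations_traveled = 0

        def on_alarm_ring(self, now: Optional[float] = None) -> int:
            now = time.monotonic() if now is None else now
            if (self._last_ring_time is None
                    or now - self._last_ring_time > self._interval_threshold_s):
                self.stations_traveled += 1
            self._last_ring_time = now  # the current recognition becomes the "last" moment
            return self.stations_traveled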
7. An arrival reminding apparatus, the apparatus comprising:
the acquisition module is used for collecting environmental sounds through a microphone while in a vehicle;
the recognition module is used for recognizing the environmental sound by utilizing a primary sound recognition model; if the similarity indicated by the first recognition result output by the primary sound recognition model meets the output condition, taking the first recognition result as a target alarm ring recognition result and outputting the target alarm ring recognition result, wherein the target alarm ring recognition result is used for indicating whether an audio frame corresponding to the environmental sound contains a target alarm ring, and the target alarm ring is a door opening alarm ring or a door closing alarm ring; if the first recognition result does not meet the output condition, recognizing the environmental sound by using a secondary sound recognition model, wherein the recognition accuracy of the secondary sound recognition model is higher than that of the primary sound recognition model; and taking a second recognition result output by the secondary sound recognition model as the target alarm ring recognition result and outputting the second recognition result;
the counting module is used for adding one to the number of stations traveled when it is recognized that the environmental sound contains the target alarm ring;
and the reminding module is used for performing arrival reminding when the number of stations traveled reaches a target number of stations, wherein the target number of stations is the number of stations between a starting station and a target station, and the target station is a transfer station or a destination station.
8. The apparatus of claim 7, wherein the identification module comprises:
the framing processing unit is used for framing the audio data corresponding to the environmental sound to obtain audio data corresponding to different audio frames;
the feature extraction unit is used for extracting features of the audio data corresponding to each audio frame to obtain a corresponding audio feature matrix;
the voice recognition unit is used for inputting the audio characteristic matrix into the primary voice recognition model to obtain the first recognition result; and inputting the audio characteristic matrix into the secondary sound recognition model to obtain the second recognition result.
9. The apparatus of claim 8, wherein the primary sound recognition model is configured to calculate similarity between the audio feature matrix and a sample feature matrix in a feature database, and the sample feature matrix is an audio feature matrix of the target alarm ring;
the voice recognition unit is further configured to:
if the first recognition result indicates that the similarity is greater than a first similarity threshold or less than a second similarity threshold, determining that the first recognition result meets the output condition, wherein the first similarity threshold is greater than the second similarity threshold;
and if the first recognition result indicates that the similarity is greater than the second similarity threshold and smaller than the first similarity threshold, determining that the first recognition result does not meet the output condition.
10. The apparatus of claim 7, wherein the secondary sound recognition model is a binary classification model using a CNN, the binary classification model being trained on positive and negative samples by a gradient descent algorithm with focal loss as the loss function.
11. The apparatus of any one of claims 8 to 10, wherein the identification module further comprises:
and the determining unit is used for determining that the environmental sound contains the target alarm ring when the number of audio frames containing the target alarm ring within a preset time reaches a number threshold.
12. The apparatus of any one of claims 7 to 10, wherein the counting module comprises:
the acquiring unit is used for acquiring a last alarm ring recognition moment, wherein the last alarm ring recognition moment is the moment when the target alarm ring was most recently recognized in the environmental sound;
and the comparison unit is used for adding one to the number of stations traveled if the time interval between the last alarm ring recognition moment and the current alarm ring recognition moment is greater than a time interval threshold.
13. A terminal, characterized in that the terminal comprises a processor and a memory; the memory stores at least one instruction for execution by the processor to implement the arrival reminding method of any one of claims 1 to 6.
14. A computer-readable storage medium having stored thereon at least one instruction for execution by a processor to implement the arrival reminding method of any one of claims 1 to 6.


