CN115662396A - Method, device, equipment and storage medium for determining starting and stopping time of sound event


Info

Publication number
CN115662396A
CN115662396A
Authority
CN
China
Prior art keywords
sound event
determining
sound
audio data
relation
Prior art date
Legal status
Pending
Application number
CN202211294647.0A
Other languages
Chinese (zh)
Inventor
姜彦吉
郭丁旭
郭佳鑫
郑四发
Current Assignee
Suzhou Automotive Research Institute of Tsinghua University
Original Assignee
Suzhou Automotive Research Institute of Tsinghua University
Priority date
Filing date
Publication date
Application filed by Suzhou Automotive Research Institute of Tsinghua University
Priority to CN202211294647.0A
Publication of CN115662396A
Legal status: Pending


Abstract

The invention discloses a method, a device, equipment and a storage medium for determining the starting and ending time of a sound event. The method comprises the following steps: determining the acoustic time sequence characteristics of the audio data to be detected based on a convolutional neural network and a recurrent neural network; determining a sound event recognition result of the audio data to be detected according to the acoustic time sequence characteristics and predetermined sound event relation characteristics, wherein the sound event relation characteristics are determined based on a pre-constructed sound event relation graph, and the sound event relation graph is determined according to the statistical result of the sound events in an audio data set; determining target frames in the audio data to be detected according to the sound event recognition result; and determining the starting and ending time of the sound events in the audio data to be detected according to the number of audio frames spaced between adjacent target frames. The technical scheme solves the problem of low accuracy in positioning the start-stop time of a sound event and can greatly improve the positioning accuracy.

Description

Method, device, equipment and storage medium for determining starting and stopping time of sound event
Technical Field
The present invention relates to the field of sound detection technologies, and in particular, to a method, an apparatus, a device, and a storage medium for determining the start-stop time of a sound event.
Background
As automobiles develop toward intelligence and networking, ever higher requirements are placed on their perception capability. Besides visual technology, sound signals can also provide much useful information, with the advantages of requiring no illumination and being unaffected by rain or darkness. Audio detection equipment implementing the technology for determining the starting and ending time of sound events is low-cost, small, easy to install and maintain, reliable and hard to damage, and can be widely applied in the cabin of an intelligent vehicle.
At present, schemes for positioning the start and end time of a sound event mainly determine the type of the sound event with a deep learning model and then determine its start and end time from the audio frames related to that event. A convolutional neural network (CNN) is good at capturing local characteristics of sound signals and can classify them well. Combining a CNN with a recurrent neural network (RNN) to form a convolutional recurrent neural network (CRNN) yields better detection performance, and the CRNN has become the mainstream model for determining the start and stop time of sound events.
However, existing solutions do not consider the relationships between the sound events in a sound scene, so their positioning accuracy falls short when determining the start and stop times of polyphonic sound events, especially when several sound events overlap in time.
Disclosure of Invention
The invention provides a method, a device, equipment and a storage medium for determining the start-stop time of a sound event, which are used for solving the problem of low accuracy of positioning the start-stop time of the sound event and can greatly improve the accuracy of positioning the start-stop time.
According to an aspect of the present invention, there is provided a method of determining a sound event start-stop time, the method comprising:
determining the acoustic time sequence characteristics of the audio data to be detected based on a convolutional neural network and a recurrent neural network;
determining a sound event recognition result of the audio data to be detected according to the acoustic time sequence characteristics and predetermined sound event relation characteristics; wherein the sound event relation characteristics are determined based on a pre-constructed sound event relation graph; the sound event relation graph is determined according to the statistical result of the sound events in an audio data set;
determining a target frame in the audio data to be detected according to the sound event recognition result;
and determining the starting and ending time of the sound event in the audio data to be detected according to the number of the audio frames spaced between the adjacent target frames.
According to another aspect of the present invention, there is provided a sound event start-stop time determination apparatus, the apparatus comprising:
the acoustic time sequence characteristic determining module is used for determining the acoustic time sequence characteristics of the audio data to be detected based on the convolutional neural network and the recurrent neural network;
the recognition result determining module is used for determining a sound event recognition result of the audio data to be detected according to the acoustic time sequence characteristics and the predetermined sound event relation characteristics; wherein the sound event relation characteristics are determined based on a pre-constructed sound event relation graph; the sound event relation graph is determined according to the statistical result of the sound events in the audio data set;
the target frame determining module is used for determining a target frame in the audio data to be detected according to the sound event recognition result;
and the starting and ending time determining module is used for determining the starting and ending time of the sound event in the audio data to be detected according to the number of the audio frames spaced between the adjacent target frames.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the method of determining a sound event start and stop time according to any embodiment of the invention.
According to another aspect of the present invention, there is provided a computer-readable storage medium storing computer instructions for causing a processor to implement the method for determining the start and stop time of a sound event according to any one of the embodiments of the present invention when the computer instructions are executed.
According to the technical scheme of the embodiment of the invention, the sound event recognition result of the audio data to be detected is determined through the acoustic time sequence characteristics and the prior sound event relation characteristics of the audio data to be detected; and determining target frames in the audio data to be detected according to the sound event identification result, and further determining the start-stop time of the sound event in the audio data to be detected according to the number of audio frames spaced between adjacent target frames. According to the technical scheme, the problem of low accuracy of positioning the start-stop time of the sound event is solved, and the accuracy of positioning the start-stop time can be greatly improved.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present invention, nor are they intended to limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1A is a flowchart of a method for determining a start-stop time of a sound event according to an embodiment of the present invention;
FIG. 1B is a schematic diagram of a relationship between sound events provided according to an embodiment of the present invention;
fig. 2A is a flowchart of a method for determining a start-stop time of a sound event according to a second embodiment of the present invention;
FIG. 2B is a schematic diagram of positioning the start-stop time of a sound event according to an embodiment of the present invention;
FIG. 2C is a diagram illustrating a positioning result of a start-stop time of a sound event according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an apparatus for determining a start-stop time of a sound event according to a third embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device implementing the method for determining the start-stop time of the sound event according to the embodiment of the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. According to the technical scheme, the data acquisition, storage, use, processing and the like meet relevant regulations of national laws and regulations.
Example one
Fig. 1A is a flowchart of a method for determining a start-stop time of a sound event according to an embodiment of the present invention, which is applicable to a scenario for determining a start-stop time of a sound event in a vehicle cabin, and which can be performed by a device for determining a start-stop time of a sound event, and which can be implemented in hardware and/or software, and which can be configured in an electronic device. As shown in fig. 1A, the method includes:
s110, determining the acoustic time sequence characteristics of the audio data to be detected based on the convolutional neural network and the circular convolutional neural network.
The scheme can be executed by a vehicle-mounted system, which can acquire the audio data to be detected in the vehicle through a sound pickup device such as a microphone. The vehicle-mounted system can perform framing, windowing, discrete Fourier transform and other processing on the audio data to be detected, and extract acoustic characteristics of the audio data such as amplitude and phase. The vehicle-mounted system can then process these acoustic characteristics based on the convolutional neural network and the recurrent neural network to obtain the acoustic time sequence characteristics.
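By way of illustration, the framing, windowing and discrete Fourier transform steps can be sketched as follows in Python; this is a minimal sketch assuming a mono signal and non-overlapping frames, and the function and parameter names are illustrative rather than taken from the disclosure:

import numpy as np

def frame_magnitudes(signal, frame_len=512, hop_len=512):
    # Frame a mono signal, apply a Hann window, and take the discrete
    # Fourier transform of each frame; the magnitude is the "amplitude"
    # characteristic named above (np.angle would give the phase).
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    frames = np.stack([signal[k * hop_len : k * hop_len + frame_len]
                       for k in range(n_frames)])
    spectra = np.fft.rfft(frames * window, axis=1)
    return np.abs(spectra)  # shape: (n_frames, frame_len // 2 + 1)

# Two seconds of audio at the 44.1 kHz sampling rate cited later in the text
features = frame_magnitudes(np.random.randn(88200))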
The convolutional neural network can be connected in sequence with the recurrent neural network: the acoustic characteristics obtained from speech signal processing are input into the convolutional neural network, and the deep acoustic characteristics extracted by the convolutional neural network are used as the input of the recurrent neural network to obtain the acoustic time sequence characteristics. It is easy to understand that the convolutional neural network extracts deep acoustic characteristics on the basis of the acoustic characteristics output by speech signal processing, while the recurrent neural network extracts the time sequence characteristics among them. The convolutional neural network can be constructed based on network structures such as AlexNet, VGG-Net or ResNet, or a custom structure can be used according to actual application needs. The recurrent neural network may be constructed based on a Gated Recurrent Unit network (GRU), a Bidirectional Recurrent Neural Network (Bi-RNN), or a Long Short-Term Memory network (LSTM).
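A minimal sketch of such a CNN-plus-recurrent arrangement, written here with PyTorch; the layer sizes and the GRU choice are placeholders standing in for the AlexNet/VGG/ResNet and GRU/Bi-RNN/LSTM options named above, not the patented architecture:

import torch
import torch.nn as nn

class CRNN(nn.Module):
    # CNN front end for local spectral patterns, GRU for temporal context.
    def __init__(self, n_mels=64, rnn_hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 2)),  # pool over frequency, keep time resolution
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 2)),
        )
        self.gru = nn.GRU(64 * (n_mels // 4), rnn_hidden,
                          batch_first=True, bidirectional=True)

    def forward(self, spec):               # spec: (batch, time, n_mels)
        x = spec.unsqueeze(1)              # -> (batch, 1, time, n_mels)
        x = self.cnn(x)                    # -> (batch, 64, time, n_mels // 4)
        x = x.permute(0, 2, 1, 3).flatten(2)
        out, _ = self.gru(x)               # acoustic time sequence characteristics
        return out                         # (batch, time, 2 * rnn_hidden)

# Example: a batch of 8 clips, 343 frames, 64 frequency bins
feats = CRNN()(torch.randn(8, 343, 64))   # -> (8, 343, 256)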
And S120, determining a sound event recognition result of the audio data to be detected according to the acoustic time sequence characteristics and the predetermined sound event relation characteristics.
It is understood that a sound event may be a sound representing a physical event, such as a whistle, braking, or speaking. The sound event relation characteristics may be pre-stored in the vehicle-mounted system and may be determined based on a pre-constructed sound event relation graph, which in turn is determined from the statistical result of the sound events in an audio data set. The audio data set may be a public audio data set or one collected and produced by developers as needed. It may include audio data and label information associated with the audio data, wherein the audio data may be collected in a vehicle cabin and the label information may include the types of sound events in the audio, the starting and ending time of each sound event, and the like. Using an electronic device such as a computer or a server, the sound events in the audio data set can be counted to obtain the occurrence frequency of each type of sound event, the frequency with which types of sound events occur simultaneously, the frequency with which they occur in sequence, and so on. According to this statistical result, the electronic device can construct a sound event relation graph representing the incidence relations among the various types of sound events.
Fig. 1B is a schematic diagram of a sound event relationship provided according to an embodiment of the present invention; after obtaining the sound event relations, the electronic device may draw the sound event relation graph shown in fig. 1B. The electronic equipment can extract the sound event relation characteristics contained in the sound event relation graph to supplement the acoustic characteristics of the audio data to be detected and so support sound event recognition in that data.
The vehicle-mounted system can fuse the extracted acoustic time sequence characteristics with the predetermined sound event relation characteristics. According to the fused characteristics and based on a sound event classification model, the vehicle-mounted system can determine the sound event recognition result of the audio data to be detected. The recognition result may include the types of sound events present in the audio data to be detected and the occurrence probability of each type.
In this embodiment, optionally, the determining process of the sound event relation characteristics includes:
acquiring label information of each audio data in the audio data set;
determining the time sequence relations among the types of sound events according to the label information of each audio data, determining the occurrence probability matched with each time sequence relation, and determining the commonness of each type of sound event;
constructing a sound event relation graph according to the commonness of each type of sound event, the time sequence relations among the types of sound events, and the occurrence probabilities matched with the time sequence relations;
and determining the relation characteristics of the sound events based on the graph convolution neural network according to the sound event relation graph.
The label information may include the distribution information of the sound events in the audio data, namely the starting and ending time of the audio data, the time sequence relations of the sound events, and the starting and ending time of the sound events. According to this distribution information, the electronic device can count the time sequence relations of each type of sound event in the audio data set, the occurrence probability of each time sequence relation, and the commonness of each type of sound event. The commonness is used to evaluate how common a sound event is, and can be determined according to the occurrence frequency and/or duration of the sound event.
In a specific example, an audio data set contains 10000 pieces of audio data, among which 500 pieces contain both a type I sound event and a type II sound event. In 300 of these 500 pieces the type I sound event occurs before the type II sound event, and in 200 pieces the type II sound event occurs before the type I sound event. The electronic device can therefore count that the probability of a type I sound event occurring before a type II sound event is 300/500 = 0.6, and the probability of a type II sound event occurring before a type I sound event is 200/500 = 0.4.
According to the starting and ending times of each piece of audio data, the duration of each piece is 30 s, so the total duration of the 10000 pieces is 300000 s. According to the starting and ending times of the type I sound events in the 500 pieces of audio data, the total duration of the type I sound events is 5000 s, so the time-based commonness of the type I sound event can be 5000/300000 ≈ 0.017.
According to the sound event types present in the 10000 pieces of audio data, the electronic device can determine the total number of occurrences of each type of sound event. For example, 7 types of sound events exist in the 10000 pieces and occur 15000 times in total, of which the type II sound event occurs 500 times; the frequency-based commonness of the type II sound event can then be expressed as 500/15000 ≈ 0.033.
According to the commonness of each type of sound event, the electronic device can correct the time sequence relations among the sound events and the occurrence probabilities matched with them, so as to construct a reliable sound event relation graph. After the correction, the electronic device may construct the sound event relation graph according to the corrected time sequence relations and their occurrence probabilities.
The electronic equipment can use the types of sound events as nodes, construct edges between nodes according to the time sequence relations among the sound events, and use the occurrence probability matched with each time sequence relation as the weight of the corresponding edge, thereby obtaining the sound event relation graph. The time sequence relation between sound events can be directed or undirected, so the sound event relation graph can be a directed graph or an undirected graph.
After obtaining the sound event relation graph, the electronic device may determine its adjacency matrix, degree matrix, and node feature vector matrix. By extracting relation characteristics with a graph convolutional neural network from the adjacency matrix, degree matrix and node feature vector matrix, the electronic device can obtain the sound event relation characteristics.
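One propagation step of a graph convolutional network over these matrices might look as follows; the normalized-adjacency form used here is the standard graph convolution rule and is an assumption, since the description only names the adjacency, degree and feature matrices:

import torch

def gcn_layer(A, H, W):
    # One graph convolution step: H' = relu(D^-1/2 (A + I) D^-1/2 H W),
    # where A is the (N, N) weighted adjacency matrix of the sound event
    # relation graph, H the (N, M) node feature matrix and W a learned
    # (M, M') weight matrix.
    A_hat = A + torch.eye(A.size(0))       # add self-loops
    d = A_hat.sum(dim=1)
    D_inv_sqrt = torch.diag(d.pow(-0.5))
    return torch.relu(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)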
On the basis of the scheme, the label information comprises the type of the sound event existing in the audio data and the starting and ending time of the sound event;
the determining the time sequence relations among the types of sound events according to the label information of each audio data, determining the occurrence probability matched with each time sequence relation, and determining the commonness of each type of sound event includes:
determining the number of occurrences of each type of sound event according to the sound event types existing in each audio data; determining the time sequence relations among the types of sound events according to the sound event types existing in each audio data and the starting and ending times of the sound events in the audio data;
determining the occurrence probability matched with each time sequence relation according to the time sequence relations among the types of sound events and the numbers of occurrences of the types of sound events;
determining the occurrence frequency of each type of sound event according to its number of occurrences, and determining the time ratio of each type of sound event relative to all audio data according to the starting and ending times of the sound events;
and determining the commonness of each type of sound event according to the occurrence frequency and the time ratio.
An electronic device such as a computer or a server can obtain the label information of the audio data from the audio data set, and from it count the number of occurrences of each type of sound event, the total number of occurrences of sound events, and the number of times multiple types of sound events occur simultaneously or in sequence within one piece of audio data. Specifically, each type of sound event may be denoted L_i, i = 0, 1, 2, 3, ..., where i is the sound event type index. L_ij, with i, j = 0, 1, 2, 3, ... and i ≠ j, denotes that sound events L_i and L_j both occur, simultaneously or in sequence, in the same audio data. In order that the counted sound event sequences within one piece of audio data conform to real-world regularity, the duration of a single piece of audio data can be restricted to a certain range, such as 30 s to 60 s, to ensure the integrity and relevance of the audio data information. According to the types of sound events existing in each piece of audio data, the number of occurrences of each type of sound event is counted and denoted X_i, the number of occurrences of sound event L_i. Adding the occurrence counts of all types gives the total number of occurrences of sound events, denoted X. The number of times two types of sound events occur together is denoted X_ij, the number of times sound events L_i and L_j each occurred in the same audio data.
The electronic equipment can calculate the occurrence probability of each type of sound event according to its number of occurrences, and then calculate the probability of types of sound events occurring simultaneously or in sequence through a prior probability formula. Specifically, according to the number of occurrences X_i of sound event L_i and the total number of occurrences X of all sound events, the occurrence frequency of each type of sound event, which serves as its occurrence probability, is calculated as
P(L_i) = Y_i^fre = X_i / X.
The probability P_ij of types of sound events occurring simultaneously or in sequence is then calculated: according to the number of occurrences X_i of each type of sound event and the number of co-occurrences X_ij of two types, the probability that sound event L_j occurs given that sound event L_i occurs is P_ij = P(L_j | L_i) = X_ij / X_i. It should be noted that P_ij and P_ji are not necessarily equal and express different meanings: P_ij is the probability that sound event L_j occurs when sound event L_i occurs, and P_ji is the probability that sound event L_i occurs when sound event L_j occurs.
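The counting and the prior probability formula above can be sketched as follows; the label layout (a dict of event onsets per clip) is an assumed illustration, not the patent's storage format:

from collections import Counter

def cooccurrence_stats(clips):
    # clips: one dict per piece of audio data mapping an event type index
    # to that event's onset time. Returns X_i, the occurrence count of
    # each type, and P_ij = P(L_j | L_i) = X_ij / X_i, where X_ij counts
    # clips in which L_i occurs before L_j.
    X, X_pair = Counter(), Counter()
    for onsets in clips:
        X.update(onsets.keys())
        for i, t_i in onsets.items():
            for j, t_j in onsets.items():
                if i != j and t_i < t_j:
                    X_pair[(i, j)] += 1
    return X, {(i, j): n / X[i] for (i, j), n in X_pair.items()}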
The electronic equipment can also calculate the proportion of the occurrence duration of each type of sound event to the total duration of all sound events in the audio data set. Specifically, according to the starting and ending times of the sound events, the electronic device may calculate the occurrence duration X_i^time of each type of sound event and the total duration X^time of all sound events, and then calculate the time ratio of sound event L_i as
Y_i^time = X_i^time / X^time.
Some types of sound events occur many times and in almost all of the audio data, such as engine operating sounds, while other types occur very rarely, such as the severe collision sounds of car accidents. Therefore, judging whether a sound event is common only by its occurrence frequency is not comprehensive. To complete the prior information of the sound event relations, the electronic device may combine the occurrence frequency Y_i^fre of a sound event with its time ratio Y_i^time, adjusting the weight of the two by setting an adjusting parameter σ.
In one possible implementation, the commonness is calculated as
Y_i = σ · Y_i^fre + (1 - σ) · Y_i^time,
wherein Y_i^fre denotes the occurrence frequency, Y_i^time denotes the time ratio, σ denotes the adjusting parameter with σ ∈ [0, 1], and i denotes the sound event type index.
This scheme quantizes the commonness of sound events and adjusts the sound event relation graph through the commonness of the sound event types, so that the prior information of the sound event relations can be adjusted and its reliability ensured.
In a preferred embodiment, the constructing a sound event relation graph according to the commonness of each type of sound event, the time sequence relations among the types of sound events, and the occurrence probabilities matched with the time sequence relations includes:
if the commonness of every type of sound event is greater than a preset commonness threshold, constructing the sound event relation graph according to the time sequence relations among the sound events and the occurrence probabilities matched with the time sequence relations;
if the commonness of a target type sound event is smaller than or equal to the preset commonness threshold and the target type sound event is in a preset set of important uncommon events, correcting the occurrence probabilities matched with the time sequence relations associated with the target type sound event according to a preset probability correction principle, and constructing the sound event relation graph according to the time sequence relations among the types of sound events and the corrected occurrence probabilities matched with the time sequence relations.
If the commonness of every type of sound event is greater than the commonness threshold, all sound events are common events, and the current time sequence relations and occurrence probabilities can be used directly as the basis for constructing the sound event relation graph. The electronic equipment can use each type of sound event as a node, determine directed edges between nodes according to the time sequence relations among the types of sound events, and use the occurrence probability matched with each time sequence relation as the weight of the corresponding directed edge, thereby constructing the sound event relation graph.
If among the types of sound events there is at least one target type whose commonness is less than or equal to the preset commonness threshold, and that target type is in the preset set of important uncommon events, the target type sound event must be preserved. In order to strengthen the time sequence relations of important uncommon events, the electronic device may directly set the occurrence probability matched with the time sequence relations associated with the target type sound event to a larger value, for example 1, or may gradually increase that occurrence probability by a preset correction step until it is greater than or equal to a preset occurrence probability threshold. After the correction is completed, the electronic device may construct the sound event relation graph according to the time sequence relations among the types of sound events and the corrected occurrence probabilities matched with them.
If among the types of sound events there is at least one target type whose commonness is less than or equal to the preset commonness threshold, and that target type is not in the preset set of important uncommon events, the target type sound event is a rare event. The relations of rare events are not strong, so, to preserve the accuracy of the sound event relations, the electronic device can directly delete the time sequence relations and occurrence probabilities associated with the target type sound event and construct the sound event relation graph from the corrected time sequence relations and occurrence probabilities.
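A minimal sketch of these two correction rules; the threshold value and the boosted weight of 1.0 are illustrative configuration choices, not values fixed by the disclosure:

def correct_relation_graph(P, commonness, important_rare, thresh=0.02):
    # P: edge weights P[(i, j)]; commonness: per-type commonness values;
    # important_rare: the preset set of important uncommon event types.
    corrected = {}
    for (i, j), p in P.items():
        rare = [e for e in (i, j) if commonness[e] <= thresh]
        if not rare:                      # both events common: keep edge as-is
            corrected[(i, j)] = p
        elif all(e in important_rare for e in rare):
            corrected[(i, j)] = 1.0       # strengthen important uncommon events
        # otherwise: rare and unimportant, so the edge is dropped entirely
    return corrected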
According to this scheme, the sound event relation graph can be corrected against the deficiencies of the audio data set, thereby ensuring the accuracy of the sound event relation prior information and improving the accuracy of sound event recognition.
And S130, determining target frames in the audio data to be detected according to the sound event recognition result.
The audio data to be detected may include at least two audio frames, and the sound event recognition result may include a sound event type recognition result for each audio frame; each audio frame may contain one type of sound event or several types. The vehicle-mounted system may take an audio frame whose recognition result includes the target type sound event as a target frame matched with that target type, that is, a frame in which the target type sound event exists.
S140, determining the starting and ending time of the sound event in the audio data to be detected according to the number of the audio frames spaced between the adjacent target frames.
It can be understood that the frequency range of normal human hearing is about 20 Hz to 20 kHz; therefore, the sampling frequency of the audio data to be detected is usually set to 44.1 kHz, and the frame length of an audio frame is 512 sampling points. Humans can hardly perceive an interruption shorter than 0.1 s, so determining the starting and ending time of a sound event directly from the target frames cannot present a complete, smooth sound event. Meanwhile, the sound event recognition result has a certain probability of misjudgment, and determining the start-stop time directly from the target frames cannot avoid positioning errors caused by recognition errors. Therefore, the vehicle-mounted system can eliminate misjudged target frames according to the number of audio frames spaced between adjacent target frames, ensuring the integrity and fluency of the sound events within the located start and stop times.
Specifically, the vehicle-mounted system can determine the target frames associated with a target type sound event in the audio data to be detected, and take every two adjacent target frames, together with the audio frames in the interval between them, as an analysis group. If the number of audio frames spaced between the current group of adjacent target frames is greater than or equal to a preset interval threshold, the ending time of the current sound event may be determined according to the preceding target frame, and the starting time of the next sound event according to the following target frame. If the number of audio frames spaced between the current group of adjacent target frames is less than the preset interval threshold, it may be determined that the sound event continues across the current group, and the comparison proceeds to the next group of adjacent target frames to determine the ending time of the current sound event.
According to the technical scheme of this embodiment, the sound event recognition result of the audio data to be detected is determined through the acoustic time sequence characteristics of the audio data and the prior sound event relation characteristics; the target frames in the audio data are determined according to the recognition result; and the starting and ending times of the sound events are then determined according to the number of audio frames spaced between adjacent target frames. The technical scheme solves the problem of low accuracy in positioning the start-stop time of a sound event and can greatly improve the positioning accuracy.
Example two
Fig. 2A is a flowchart of a method for determining a start-stop time of a sound event according to a second embodiment of the present invention, which is further detailed based on the above-mentioned embodiments. As shown in fig. 2A, the method includes:
s210, determining the acoustic time sequence characteristics of the audio data to be detected based on the convolutional neural network and the circular convolutional neural network.
S220, fusing the acoustic time sequence characteristics and the sound event relation characteristics to obtain joint characteristics.
The vehicle-mounted system can perform feature fusion on the acoustic time sequence characteristics and the sound event relation characteristics. It can directly splice the two to obtain the joint characteristics, or perform matrix calculation on them. In one specific example, the sound event relation characteristics may be represented as an N × M matrix, where N is the number of sound event types and M is the feature dimension; for sufficient fusion with the sound event relation characteristics, the acoustic time sequence characteristics may be an M × N matrix. On this basis, the feature fusion may take the following three forms (a sketch of the three modes follows the list):
(1) adjusting the sound event relation characteristics and the acoustic time sequence characteristics to the same dimensions through matrix transposition, and adding corresponding elements of the two to obtain the joint characteristics;
(2) adjusting the sound event relation characteristics and the acoustic time sequence characteristics to the same dimensions through matrix transposition, and multiplying corresponding elements of the two to obtain the joint characteristics;
(3) directly performing matrix multiplication on the sound event relation characteristics and the acoustic time sequence characteristics to obtain the joint characteristics.
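A minimal sketch of the three fusion modes; the shapes follow the N × M / M × N convention stated above, and the mode names are illustrative:

import torch

def fuse(relation_feat, acoustic_feat, mode="matmul"):
    # relation_feat: (N, M) sound event relation characteristics;
    # acoustic_feat: (M, N) acoustic time sequence characteristics.
    if mode == "add":        # mode (1): transpose to same shape, then add
        return relation_feat + acoustic_feat.T
    if mode == "mul":        # mode (2): transpose to same shape, then multiply
        return relation_feat * acoustic_feat.T
    return relation_feat @ acoustic_feat  # mode (3): matrix product, (N, N)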
The scheme can perform feature fusion on the prior sound event relation features and the acoustic time sequence features to enrich the features of the sound events and further improve the detection probability of the sound events.
And S230, determining the recognition probability of each type of sound event in each audio frame based on the sound event recognition network according to the joint characteristics.
As will be readily appreciated, the sound event recognition result may include the recognition probabilities of the types of sound events, and the audio data to be detected may comprise at least two audio frames. The vehicle-mounted system can take the joint characteristics as the input of the sound event recognition network and determine the sound event recognition result of the audio data to be detected according to its output. The sound event recognition network may be trained in advance to recognize sound events, and can be built based on a neural network or a convolutional neural network. Its output may include the recognition probability of each type of sound event.
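A minimal sketch of such a frame-wise recognition head; the per-class sigmoid and the sizes are assumptions consistent with the multi-label recognition described above, not the patented network:

import torch.nn as nn

class EventClassifier(nn.Module):
    # Frame-wise multi-label head over the joint characteristics; one
    # sigmoid output per event type gives the per-frame recognition
    # probability. Sizes are placeholders.
    def __init__(self, feat_dim=256, n_classes=7):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feat_dim, n_classes), nn.Sigmoid())

    def forward(self, joint_feat):        # (batch, time, feat_dim)
        return self.head(joint_feat)      # (batch, time, n_classes)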
S240, determining target frames of the various types of sound events according to the recognition probability of the various types of sound events in the various audio frames.
If the recognition probability of the target type sound event in the current audio frame is greater than a preset probability threshold, for example 0.5, the current audio frame is determined to be a target frame of the target type sound event.
And S250, determining the starting and ending time of the sound event in the audio data to be detected according to the number of the audio frames spaced between the adjacent target frames.
In a possible solution, the determining the start-stop time of the sound event in the audio data to be detected according to the number of audio frames spaced between adjacent target frames includes:
if the number of audio frames spaced between the current group of adjacent target frames is greater than or equal to a preset interval threshold, determining the ending time of the current sound event according to the preceding target frame of the adjacent target frames, and determining the starting time of the next sound event according to the following target frame;
and if the number of audio frames spaced between the current group of adjacent target frames is less than the preset interval threshold, determining the ending time of the current sound event according to the number of audio frames spaced between at least one subsequent group of adjacent target frames.
The vehicle-mounted system can determine the target frames associated with a target type sound event in the audio data to be detected, and take every two adjacent target frames, together with the audio frames in the interval between them, as an analysis group. If the number of audio frames spaced between the current group of adjacent target frames is greater than or equal to the preset interval threshold, the ending time of the current sound event may be determined according to the preceding target frame and the starting time of the next sound event according to the following target frame. If that number is less than the preset interval threshold, it may be determined that the sound event continues across the current group, and the comparison proceeds to the next group of adjacent target frames to determine the ending time of the current sound event.
Fig. 2B is a schematic diagram illustrating the positioning of the start-stop time of a sound event according to an embodiment of the present invention; A, B and C in fig. 2B each show a target frame distribution of a target type sound event, where gray circles represent target frames and white circles represent non-target frames. In a specific example, the interval threshold may be set to 5. In the target frame distribution shown as A in fig. 2B, the second and third target frames are separated by 4 audio frames, which is less than the threshold of 5, so it is determined that the target type sound event persists between the second and third target frames. In the distribution shown as B in fig. 2B, 6 audio frames lie between the second and third target frames, which exceeds the threshold of 5, so it is determined that the target type sound event is interrupted there: the ending time of the current sound event is determined according to the second target frame, and the starting time of the next sound event according to the third target frame.
It should be noted that, when determining the starting or ending time of a sound event from a target frame, the vehicle-mounted system may directly use the time point associated with that target frame, or may derive the starting or ending time from that time point according to a preset time compensation rule, for example by adding or subtracting a preset duration.
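Putting the probability threshold and the interval rule together, a sketch of the start-stop localization; the frame-to-time conversion assumes non-overlapping 512-sample frames at 44.1 kHz, which the description implies but does not fix:

def locate_events(probs, p_thresh=0.5, gap_thresh=5, hop_s=512 / 44100):
    # probs: per-frame probabilities of one event type. Frames above
    # p_thresh are target frames; a run of target frames ends when the
    # gap to the next target frame reaches gap_thresh frames. Returns
    # (start, end) pairs in seconds.
    targets = [k for k, p in enumerate(probs) if p > p_thresh]
    events, start = [], None
    for prev, nxt in zip(targets, targets[1:] + [None]):
        if start is None:
            start = prev
        if nxt is None or nxt - prev - 1 >= gap_thresh:
            events.append((start * hop_s, (prev + 1) * hop_s))
            start = None
    return events

# With gap_thresh=5, a 4-frame gap (case A above) keeps one event and a
# 6-frame gap (case B above) splits it into two.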
The scheme can adapt to the perception time interval of human hearing, and is favorable for realizing the completeness and fluency of sound event presentation.
In a preferred embodiment, after determining the start-stop time of the sound event in the audio data to be detected according to the number of audio frames spaced between adjacent target frames, the method further includes:
and if the number of the audio frames associated with each sound event is greater than the preset continuous number, correcting the start-stop time of each sound event according to the number of target frames in each continuous number of audio frames in the audio frames associated with each sound event.
On this basis, the vehicle-mounted system can correct the starting and ending times of each sound event. It may determine the audio frames covered by each sound event and count them. If the number of audio frames of a sound event is greater than the preset continuation threshold, the duration of the sound event is suspiciously long and a target frame may have been misjudged. The vehicle-mounted system may then decide whether the start-stop time of the sound event needs correction according to the number of target frames in each consecutive stretch of audio frames associated with the sound event.
In a specific example, the continuation threshold may be set to 10. In the target frame distribution shown as C in fig. 2B, according to the positioning manner above, the sound event would continue from the second target frame to the fourth. However, only two of the 10 consecutive audio frames starting from the second target frame are target frames, which indicates that the third target frame in C in fig. 2B was misjudged and the sound event needs to be divided. The vehicle-mounted system may determine the ending time of the preceding sound event from the second target frame and the starting time of the following sound event from the fourth target frame.
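A sketch of the density check behind this correction; both thresholds are the illustrative values used in the example above, and the splitting itself (re-running the interval rule on the sparse stretch) is omitted:

def needs_split(frame_flags, window=10, min_targets=3):
    # frame_flags: 1 for a target frame, 0 otherwise, over the frames
    # covered by one located event. If any window-length stretch holds
    # fewer than min_targets target frames, a target frame inside it was
    # probably misjudged and the event should be divided.
    for k in range(len(frame_flags) - window + 1):
        if sum(frame_flags[k : k + window]) < min_targets:
            return True
    return False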
Fig. 2C is a schematic diagram of the positioning result of the start-stop time of a sound event according to an embodiment of the present invention. On the basis of the above scheme, as shown in fig. 2C, each bar-shaped cell may represent one audio frame, and the start-stop time may be determined according to the audio frames covered by the sound event.
The method and the device can realize accurate positioning of the starting time and the ending time of the sound event, and are favorable for avoiding the influence of misjudgment of the target frame on the integrity and the fluency of the sound event.
According to the technical scheme of this embodiment, the sound event recognition result of the audio data to be detected is determined through the acoustic time sequence characteristics of the audio data and the prior sound event relation characteristics; the target frames in the audio data are determined according to the recognition result; and the starting and ending times of the sound events are then determined according to the number of audio frames spaced between adjacent target frames. The technical scheme solves the problem of low accuracy in positioning the start-stop time of a sound event and can greatly improve the positioning accuracy.
EXAMPLE III
Fig. 3 is a schematic structural diagram of a device for determining a start-stop time of a sound event according to a third embodiment of the present invention. As shown in fig. 3, the apparatus includes:
the acoustic time sequence characteristic determining module 310 is configured to determine an acoustic time sequence characteristic of the audio data to be detected based on the convolutional neural network and the cyclic convolutional neural network;
the recognition result determining module 320 is configured to determine a sound event recognition result of the audio data to be detected according to the acoustic time sequence characteristic and a predetermined sound event relation characteristic; wherein the sound event relation characteristics are determined based on a pre-constructed sound event relation graph; the sound event relation graph is determined according to the statistical result of the sound events in the audio data set;
the target frame determining module 330 is configured to determine a target frame in the audio data to be detected according to the sound event identification result;
the start-stop time determining module 340 is configured to determine start-stop time of a sound event in the audio data to be detected according to the number of audio frames spaced between adjacent target frames.
In this embodiment, optionally, the apparatus further includes a relationship characteristic determining module, where the relationship characteristic determining module includes:
the tag information acquisition unit is used for acquiring tag information of each audio data in the audio data set;
the system comprises a popularity determining unit, a probability determining unit and a popularity determining unit, wherein the popularity determining unit is used for determining the time sequence relation among various types of sound events according to the label information of various audio data, determining the occurrence probability matched with the time sequence relation and determining the popularity of various types of sound events;
the relation graph determining unit is used for constructing a sound event relation graph according to the common degree of each type of sound event, the time sequence relation among each type of sound event and the occurrence probability matched with the time sequence relation;
and the relation characteristic determining unit is used for determining the relation characteristic of the sound event based on the graph convolution neural network according to the sound event relation graph.
On the basis of the scheme, the tag information comprises the type of the sound event existing in the audio data and the starting and ending time of the sound event;
the popularity determination unit includes:
the time sequence relation determining subunit is used for determining the occurrence frequency of each type of sound event according to the sound event type existing in each audio data; determining the time sequence relation between the sound events of various types according to the sound event types existing in the audio data and the start-stop time of the sound events in the audio data;
the occurrence probability determining subunit is used for determining the occurrence probability matched with the time sequence relation according to the time sequence relation among the various types of sound events and the occurrence times of the various types of sound events;
the frequency and time ratio determining subunit is used for determining the occurrence frequency of each type of sound event according to the occurrence frequency of each type of sound event and determining the time ratio of each type of sound event to all audio data according to the starting and ending time of the sound event;
and the popularity determining subunit is used for determining the popularity of each type of sound event according to the occurrence frequency and the time ratio.
On the basis of the scheme, the commonness calculation formula is:
Y_i = σ · Y_i^fre + (1 - σ) · Y_i^time,
wherein Y_i^fre denotes the occurrence frequency, Y_i^time denotes the time ratio, σ denotes the adjusting parameter with σ ∈ [0, 1], and i denotes the sound event type index.
In this embodiment, optionally, the relationship diagram determining unit includes:
the first relation graph determining subunit is used for constructing a sound event relation graph according to the time sequence relation among the sound events of the types and the occurrence probability matched with the time sequence relation if the common degree of the sound events of the types is greater than a preset common degree threshold value;
and the second relation graph determining subunit is used for correcting the occurrence probability matched with the time sequence relation associated with the target type sound event according to a preset probability correction principle if the popularity of the target type sound event is smaller than or equal to a preset popularity threshold and the target type sound event is in a preset important uncommon event set, and constructing a sound event relation graph according to the time sequence relation among the various types of sound events and the corrected occurrence probability matched with the time sequence relation.
Optionally, the sound event recognition result includes the recognition probabilities of the types of sound events; the audio data to be detected comprises at least two audio frames;
the recognition result determining module 320 includes:
the joint characteristic determining unit is used for fusing the acoustic time sequence characteristic and the sound event relation characteristic to obtain a joint characteristic;
the recognition probability determining unit is used for determining the recognition probability of each type of sound event in each audio frame based on the sound event recognition network according to the joint characteristics;
the target frame determining module 330 is specifically configured to:
and determining the target frame of each type of sound event according to the identification probability of each type of sound event in each audio frame.
On the basis of the above scheme, the start-stop time determining module 340 is specifically configured to:
if the number of audio frames spaced between the current group of adjacent target frames is greater than or equal to a preset interval threshold, determining the ending time of the current sound event according to the preceding target frame of the adjacent target frames, and determining the starting time of the next sound event according to the following target frame;
and if the number of audio frames spaced between the current group of adjacent target frames is less than the preset interval threshold, determining the ending time of the current sound event according to the number of audio frames spaced between at least one subsequent group of adjacent target frames.
In a preferred aspect, the apparatus further comprises:
and the time correction module is used for correcting the start-stop time of a sound event according to the number of target frames within each consecutive stretch of audio frames among those associated with the sound event, if the number of audio frames associated with the sound event is greater than a preset continuation threshold.
The device for determining the starting and ending time of the sound event, provided by the embodiment of the invention, can execute the method for determining the starting and ending time of the sound event, provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
Example four
FIG. 4 shows a schematic block diagram of an electronic device 410 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in FIG. 4, the electronic device 410 includes at least one processor 411 and a memory communicatively connected to the at least one processor 411, such as a read-only memory (ROM) 412 and a random access memory (RAM) 413. The memory stores computer programs executable by the at least one processor, and the processor 411 may perform various appropriate actions and processes according to the computer program stored in the ROM 412 or loaded from the storage unit 418 into the RAM 413. The RAM 413 may also store various programs and data required for the operation of the electronic device 410. The processor 411, the ROM 412, and the RAM 413 are connected to one another through a bus 414. An input/output (I/O) interface 415 is also connected to the bus 414.
A number of components in the electronic device 410 are connected to the I/O interface 415, including: an input unit 416 such as a keyboard, a mouse, or the like; an output unit 417 such as various types of displays, speakers, and the like; a storage unit 418, such as a magnetic disk, optical disk, or the like; and a communication unit 419 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 419 allows the electronic device 410 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
Processor 411 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of processor 411 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. The processor 411 performs the various methods and processes described above, such as the determination of the sound event start and stop times.
In some embodiments, the method of determining the start and stop times of a sound event may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as storage unit 418. In some embodiments, part or all of a computer program may be loaded onto and/or installed onto electronic device 410 via ROM 412 and/or communications unit 419. When the computer program is loaded into RAM 413 and executed by processor 411, one or more steps of the method for determining a start-stop time of a sound event described above may be performed. Alternatively, in other embodiments, the processor 411 may be configured by any other suitable means (e.g., by means of firmware) to perform the method of determining the sound event start and stop times.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for implementing the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the Internet.
The computing system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server (also called a cloud computing server or cloud host), a host product in a cloud computing service system that overcomes the drawbacks of difficult management and poor service scalability found in traditional physical hosts and Virtual Private Server (VPS) services.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired result of the technical solution of the present invention can be achieved.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for determining a start-stop time of a sound event, the method comprising:
determining an acoustic time sequence feature of audio data to be detected based on a convolutional neural network and a recurrent neural network;
determining a sound event recognition result of the audio data to be detected according to the acoustic time sequence feature and a predetermined sound event relation feature; wherein the sound event relation feature is determined based on a pre-constructed sound event relation graph, and the sound event relation graph is determined according to statistics of the sound events in an audio data set;
determining a target frame in the audio data to be detected according to the sound event recognition result;
and determining the start-stop time of a sound event in the audio data to be detected according to the number of audio frames spaced between adjacent target frames.
2. The method of claim 1, wherein the determination of the sound event relation feature comprises:
acquiring label information of each piece of audio data in the audio data set;
determining time sequence relations among the types of sound events according to the label information of each piece of audio data, determining the occurrence probability matched with each time sequence relation, and determining the commonness of each type of sound event;
constructing a sound event relation graph according to the commonness of each type of sound event, the time sequence relations among the sound events, and the occurrence probabilities matched with the time sequence relations;
and determining the sound event relation feature based on a graph convolutional neural network according to the sound event relation graph.
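By way of illustration only (this is not part of the claims): the final step of claim 2 could be realised with a single graph convolution layer over the relation graph, in the spirit of a standard GCN; the learned node embeddings and all dimensions are assumptions:

```python
import torch
import torch.nn as nn

class RelationGCN(nn.Module):
    """One graph convolution over the sound event relation graph; the output
    rows are the per-class sound event relation features."""

    def __init__(self, num_classes=10, in_dim=64, out_dim=128):
        super().__init__()
        self.node_emb = nn.Parameter(torch.randn(num_classes, in_dim))
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, adj):
        # adj: (C, C) weighted adjacency built from the dataset statistics.
        # Symmetric normalization of A + I, as in a standard GCN layer.
        a = adj + torch.eye(adj.shape[0])
        d = a.sum(dim=1).rsqrt()
        a_norm = d[:, None] * a * d[None, :]
        return torch.relu(self.linear(a_norm @ self.node_emb))  # (C, out_dim)
```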
3. The method of claim 2, wherein the label information includes the types of sound events present in the audio data and the start-stop times of those sound events;
the determining time sequence relations among the types of sound events according to the label information of each piece of audio data, determining the occurrence probability matched with each time sequence relation, and determining the commonness of each type of sound event comprises:
determining the number of occurrences of each type of sound event according to the sound event types present in each piece of audio data, and determining the time sequence relations among the types of sound events according to the sound event types present in the audio data and the start-stop times of the sound events in the audio data;
determining the occurrence probability matched with each time sequence relation according to the time sequence relations among the types of sound events and the numbers of occurrences of the types of sound events;
determining the occurrence frequency of each type of sound event according to its number of occurrences, and determining the ratio of the duration of each type of sound event to the total duration of all the audio data according to the start-stop times of the sound events;
and determining the commonness of each type of sound event according to the occurrence frequency and the time ratio.
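By way of illustration only (not claim text): claim 3 does not fix how the occurrence frequency and the time ratio combine into commonness; a weighted average is one plausible choice, so alpha and the formula below are assumptions:

```python
def commonness(clips_with_event, total_clips, event_seconds, total_seconds,
               alpha=0.5):
    """Combine the occurrence frequency (share of clips containing the
    event) with the time ratio (share of total audio the event occupies)."""
    freq = clips_with_event / total_clips
    time_ratio = event_seconds / total_seconds
    return alpha * freq + (1.0 - alpha) * time_ratio
```

For instance, an event appearing in 2 of 100 clips and covering 0.1% of the total audio would score 0.5 * 0.02 + 0.5 * 0.001 = 0.0105 under these assumptions.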
4. The method according to claim 2, wherein the constructing a sound event relation graph according to the commonness of each type of sound event, the time sequence relations among the types of sound events, and the occurrence probabilities matched with the time sequence relations comprises:
if the commonness of every type of sound event is greater than a preset commonness threshold, constructing the sound event relation graph according to the time sequence relations among the sound events and the occurrence probabilities matched with the time sequence relations;
and if the commonness of a target-type sound event is less than or equal to the preset commonness threshold and the target-type sound event is in a preset set of important uncommon events, correcting the occurrence probability matched with the time sequence relation associated with that sound event according to a preset probability correction principle, and constructing the sound event relation graph according to the time sequence relations among the types of sound events and the corrected occurrence probabilities matched with the time sequence relations.
5. The method of claim 1, wherein the sound event recognition result comprises the recognition probability of each type of sound event, and the audio data to be detected comprises at least two audio frames;
the determining a sound event recognition result of the audio data to be detected according to the acoustic time sequence feature and the predetermined sound event relation feature comprises:
fusing the acoustic time sequence feature with the sound event relation feature to obtain a joint feature;
determining the recognition probability of each type of sound event in each audio frame based on a sound event recognition network according to the joint feature;
and the determining a target frame in the audio data to be detected according to the sound event recognition result comprises:
determining the target frame of each type of sound event according to the recognition probability of that type of sound event in each audio frame.
6. The method according to claim 5, wherein the determining the start-stop time of a sound event in the audio data to be detected according to the number of audio frames spaced between adjacent target frames comprises:
if the number of audio frames spaced between the current group of adjacent target frames is greater than or equal to a preset interval threshold, determining the termination time of the current sound event from the earlier of the adjacent target frames, and determining the start time of the next sound event from the later of the adjacent target frames;
and if the number of audio frames spaced between the current group of adjacent target frames is less than the preset interval threshold, determining the termination time of the current sound event from the number of audio frames spaced between at least one group of adjacent target frames after the current group.
7. The method of claim 6, wherein after the start-stop time of a sound event in the audio data to be detected is determined according to the number of audio frames spaced between adjacent target frames, the method further comprises:
if the number of audio frames associated with a sound event is greater than a preset continuous number, correcting the start-stop time of that sound event according to the number of target frames within each run of the preset continuous number of audio frames associated with it.
8. An apparatus for determining a start-stop time of a sound event, comprising:
the acoustic time sequence feature determining module, configured to determine an acoustic time sequence feature of audio data to be detected based on a convolutional neural network and a recurrent neural network;
the recognition result determining module, configured to determine a sound event recognition result of the audio data to be detected according to the acoustic time sequence feature and a predetermined sound event relation feature; wherein the sound event relation feature is determined based on a pre-constructed sound event relation graph, and the sound event relation graph is determined according to statistics of the sound events in an audio data set;
the target frame determining module, configured to determine a target frame in the audio data to be detected according to the sound event recognition result;
and the start-stop time determining module, configured to determine the start-stop time of a sound event in the audio data to be detected according to the number of audio frames spaced between adjacent target frames.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the method for determining the start-stop time of a sound event of any one of claims 1-7.
10. A computer-readable storage medium storing computer instructions which, when executed, cause a processor to implement the method for determining the start-stop time of a sound event of any one of claims 1-7.
CN202211294647.0A 2022-10-21 2022-10-21 Method, device, equipment and storage medium for determining starting and stopping time of sound event Pending CN115662396A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211294647.0A CN115662396A (en) 2022-10-21 2022-10-21 Method, device, equipment and storage medium for determining starting and stopping time of sound event

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211294647.0A CN115662396A (en) 2022-10-21 2022-10-21 Method, device, equipment and storage medium for determining starting and stopping time of sound event

Publications (1)

Publication Number Publication Date
CN115662396A 2023-01-31

Family

ID=84988547

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211294647.0A Pending CN115662396A (en) 2022-10-21 2022-10-21 Method, device, equipment and storage medium for determining starting and stopping time of sound event

Country Status (1)

Country Link
CN (1) CN115662396A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination