CN114299997A - Audio data processing method and device, electronic equipment, storage medium and product - Google Patents
- Publication number
- CN114299997A (application number CN202111539880.6A)
- Authority
- CN
- China
- Prior art keywords
- jump
- identifier
- audio data
- sequence
- decoding
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The application provides an audio data processing method and apparatus, an electronic device, a storage medium, and a product, and belongs to the technical field of voice interaction. The method comprises the following steps: receiving input audio data, performing recognition processing on the audio data, and outputting a recognition result; in the case that the recognition result comprises a wake-up word, acquiring a decoding graph of the audio data, wherein the decoding graph comprises a jump identifier sequence of a decoding path corresponding to the audio data, and the jump identifier sequence is used for representing the phoneme changes between adjacent audio frames in the audio data; determining a target jump identifier from the jump identifier sequence, wherein the target jump identifier satisfies a target condition, and the target condition indicates that a jump identifier in the jump identifier sequence belongs to the jump identifiers corresponding to the wake-up phoneme sequence of the wake-up word; and determining the head endpoint of the wake-up audio data corresponding to the wake-up word based on the target jump identifier. The scheme realizes endpoint detection at the phoneme level and can accurately detect the head endpoint of the wake-up audio data.
Description
Technical Field
The present application relates to the field of voice interaction technologies, and in particular, to an audio data processing method, an audio data processing apparatus, an electronic device, a storage medium, and a product.
Background
Endpoint detection is an important step in audio processing. It determines the head endpoint (i.e., the starting point) and the tail endpoint (i.e., the termination point) of a target audio segment within a piece of audio, so that the target segment can be extracted from a long segment of audio for further processing. For example, wake-up audio (i.e., the audio corresponding to a wake-up word) can be extracted from a long segment of input audio and then applied to model training. Therefore, how to accurately detect the endpoints of the wake-up audio has become a technical problem that needs to be solved.
Disclosure of Invention
The embodiments of the present application provide an audio data processing method and apparatus, an electronic device, a storage medium, and a product, which can improve the accuracy of determining the head endpoint of the audio of a wake-up word. The technical solution is as follows:
in one aspect, a method for processing audio data is provided, the method comprising:
receiving input audio data, performing recognition processing on the audio data, and outputting a recognition result;
in the case that the recognition result comprises a wake-up word, acquiring a decoding graph of the audio data, wherein the decoding graph comprises a jump identifier sequence of a decoding path corresponding to the audio data, and the jump identifier sequence is used for representing the phoneme changes between adjacent audio frames in the audio data;
determining a target jump identifier from the jump identifier sequence, wherein the target jump identifier satisfies a target condition, and the target condition indicates that a jump identifier in the jump identifier sequence belongs to the jump identifiers corresponding to the wake-up phoneme sequence of the wake-up word;
and determining the head endpoint of the wake-up audio data corresponding to the wake-up word based on the target jump identifier.
In one possible implementation manner, the decoding graph includes a plurality of decoding paths corresponding to the audio data and the jump identifier sequences of the plurality of decoding paths;
the determining the target jump identifier from the jump identifier sequence includes:
selecting a decoding path satisfying a parameter condition from the plurality of decoding paths based on decoding parameters of the plurality of decoding paths;
and determining the target jump identifier from the jump identifier sequence of the selected decoding path.
In another possible implementation manner, the selecting, from the plurality of decoding paths, a decoding path that satisfies a parameter condition based on decoding parameters of the plurality of decoding paths includes:
selecting a decoding path with the largest decoding parameter from the plurality of decoding paths based on the decoding parameters of the plurality of decoding paths; or,
selecting a decoding path of which the decoding parameter exceeds a parameter threshold from the plurality of decoding paths based on the decoding parameters of the plurality of decoding paths; or,
selecting a first target number of decoding paths from the plurality of decoding paths based on the decoding parameters of the plurality of decoding paths, wherein the decoding parameters of the first target number of decoding paths are greater than the decoding parameters of other decoding paths except the first target number of decoding paths in the plurality of decoding paths.
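The three selection strategies above (single maximum, threshold, and first-target-number by decoding parameter) can be sketched as follows. This is an illustrative sketch only: treating the decoding parameter as a per-path numeric score, and all names and data shapes, are assumptions, not the patent's implementation.

```python
# Illustrative sketch of the three path-selection strategies; "score" stands
# in for a path's decoding parameter (an assumption for this example).

def select_max(paths):
    # Strategy 1: the single path with the largest decoding parameter.
    return max(paths, key=lambda p: p["score"])

def select_above_threshold(paths, threshold):
    # Strategy 2: every path whose decoding parameter exceeds the threshold.
    return [p for p in paths if p["score"] > threshold]

def select_top_k(paths, k):
    # Strategy 3: the first target number (top-k) of paths by decoding parameter.
    return sorted(paths, key=lambda p: p["score"], reverse=True)[:k]

paths = [{"id": 0, "score": 0.4}, {"id": 1, "score": 0.9}, {"id": 2, "score": 0.7}]
```

All three strategies end with one or more candidate paths whose jump identifier sequences are then searched for the target jump identifier.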
In another possible implementation manner, the recognition result further includes an index of a decoding path, where the index indicates that the recognition result is obtained by decoding along the decoding path;
the determining the target jump identifier from the jump identifier sequence includes:
determining the corresponding decoding path based on the index;
and determining the target jump identifier from the jump identifier sequence of the determined decoding path.
In another possible implementation manner, the determining a target jump identifier from the jump identifier sequence includes:
querying the jump identifiers in the jump identifier sequence in order;
and in the case that a queried jump identifier does not satisfy the target condition, continuing to query the next jump identifier until a jump identifier satisfying the target condition is queried from the jump identifier sequence.
In another possible implementation manner, the querying the jump identifiers in the jump identifier sequence in order includes:
querying the jump identifiers in the jump identifier sequence of the decoding path in order from the front of the decoding path to the back; or,
querying the jump identifiers in the jump identifier sequence of the decoding path in order from the back of the decoding path to the front.
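The sequential query described above amounts to a linear scan over the jump identifier sequence in either direction. A minimal sketch, with assumed names and a generic target-condition callback (the patent does not specify a data structure):

```python
def find_target(jump_ids, is_target, back_to_front=False):
    """Query jump identifiers one by one, front-to-back or back-to-front,
    and return the index of the first identifier satisfying the target
    condition, or None if no identifier satisfies it."""
    order = range(len(jump_ids) - 1, -1, -1) if back_to_front else range(len(jump_ids))
    for i in order:
        if is_target(jump_ids[i]):
            return i
    return None
```

The `is_target` callback stands in for any of the target conditions discussed below (phoneme-mapping lookup, membership in a jump identifier set, and so on).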
In another possible implementation manner, the querying the jump identifiers in the jump identifier sequence in order includes:
querying a plurality of consecutive jump identifiers in the jump identifier sequence in order;
the continuing to query the next jump identifier until a jump identifier satisfying the target condition is queried from the jump identifier sequence includes:
in the case that the queried plurality of consecutive jump identifiers do not satisfy the target condition, continuing to query the next plurality of consecutive jump identifiers until a plurality of consecutive jump identifiers satisfying the target condition are queried from the jump identifier sequence, wherein the target condition indicates that each of the plurality of consecutive jump identifiers in the jump identifier sequence belongs to the jump identifiers corresponding to the wake-up phoneme sequence, and the jump path represented by the plurality of consecutive jump identifiers comprises all or part of the jump path of the wake-up phoneme sequence.
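A minimal sketch of this consecutive-identifier variant, under the assumption that the wake-up phoneme sequence's jump path can itself be represented as a list of jump identifiers: a window of n consecutive identifiers satisfies the target condition when it is a contiguous sub-path of that wake-up jump path.

```python
def find_consecutive_match(jump_ids, wake_jump_path, n):
    """Return the start index of the first window of n consecutive jump
    identifiers that forms a contiguous sub-path of the wake-up jump path,
    or None if no such window exists."""
    # Every contiguous sub-path of length n within the wake-up jump path.
    subpaths = {tuple(wake_jump_path[j:j + n])
                for j in range(len(wake_jump_path) - n + 1)}
    # Slide a window of n consecutive jump identifiers over the sequence.
    for i in range(len(jump_ids) - n + 1):
        if tuple(jump_ids[i:i + n]) in subpaths:
            return i
    return None
```

Requiring several consecutive matches, rather than a single one, reduces the chance that an isolated identifier coincidentally matching the wake-up path is mistaken for the wake-up audio.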
In another possible implementation manner, the determining a head endpoint of the wake-up audio data corresponding to the wake-up word based on the target jump identifier includes:
determining a head-endpoint jump identifier from the plurality of target jump identifiers based on the jump path represented by each target jump identifier, wherein the jump path represented by the head-endpoint jump identifier comprises the first phoneme in the wake-up phoneme sequence;
and determining the audio data corresponding to the head-endpoint jump identifier as the head endpoint of the wake-up audio data.
In another possible implementation manner, the continuing to query the next jump identifier until a jump identifier satisfying the target condition is queried from the jump identifier sequence includes:
acquiring a mapping relationship between the queried jump identifiers and phonemes;
determining the phoneme corresponding to a queried jump identifier based on the mapping relationship;
and in the case that the determined phoneme does not belong to the wake-up phoneme sequence, continuing to query the next jump identifier until a jump identifier whose corresponding phoneme belongs to the wake-up phoneme sequence is queried from the jump identifier sequence.
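The mapping-based query above can be sketched as a dictionary lookup. The identifier values and phoneme labels below are made up purely for illustration; the patent does not define the concrete mapping.

```python
def first_wake_jump(jump_ids, id_to_phoneme, wake_phonemes):
    """Map each queried jump identifier to its phoneme via the mapping
    relationship, and stop at the first identifier whose phoneme belongs
    to the wake-up phoneme sequence."""
    wake_set = set(wake_phonemes)
    for i, jid in enumerate(jump_ids):
        # Unknown identifiers map to None, which never matches a wake phoneme.
        if id_to_phoneme.get(jid) in wake_set:
            return i
    return None
```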
In another possible implementation manner, the target jump identifier is a jump identifier queried from the jump identifier sequence that is the same as any jump identifier in a jump identifier set corresponding to the wake-up phoneme sequence; the continuing to query the next jump identifier until a jump identifier satisfying the target condition is queried from the jump identifier sequence includes:
acquiring the jump identifier set corresponding to the wake-up phoneme sequence, wherein the jump identifiers in the jump identifier set are used for representing the jump paths of adjacent phonemes in the wake-up phoneme sequence;
and in the case that a queried jump identifier differs from every jump identifier in the jump identifier set, continuing to query the next jump identifier until a jump identifier that is the same as a jump identifier in the jump identifier set is queried from the jump identifier sequence.
In another possible implementation manner, the method further includes:
and in the case that a queried jump identifier does not satisfy the target condition, discarding the audio data corresponding to that jump identifier from the audio data.
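Combining the set-based query with the discarding step above, a rough sketch (the per-frame alignment of audio data to jump identifiers is an assumption) drops the leading audio frames whose jump identifiers fall outside the wake-up set:

```python
def trim_to_wake(frames, jump_ids, wake_jump_set):
    """Discard audio frames whose jump identifiers do not belong to the
    wake-up jump identifier set; the first retained frame corresponds to
    the head endpoint of the wake-up audio data."""
    for i, jid in enumerate(jump_ids):
        if jid in wake_jump_set:
            return frames[i:]
    return []  # no frame matched the wake-up set
```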
In another possible implementation manner, the determining a head endpoint of the wake-up audio data corresponding to the wake-up word based on the target jump identifier includes:
and determining the audio data corresponding to the target jump identifier as the head endpoint of the wake-up audio data.
In another possible implementation manner, the receiving input audio data, performing recognition processing on the audio data, and outputting a recognition result includes:
receiving the audio data, the audio data comprising a plurality of audio frames;
when the received number of audio frames in the audio data reaches a second target number, performing recognition processing on the received audio data to obtain a recognition result of the received audio data;
and outputting the recognition result.
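The frame-count-triggered recognition described above can be sketched as a small buffer, where the chunk size plays the role of the "second target number". All names are assumptions, and the recognizer is a placeholder callback:

```python
class ChunkedRecognizer:
    """Buffer incoming audio frames and run recognition each time the
    number of buffered frames reaches the chunk size."""

    def __init__(self, chunk_size, recognize):
        self.chunk_size = chunk_size
        self.recognize = recognize  # callback: list of frames -> result
        self.buffer = []
        self.results = []

    def feed(self, frame):
        self.buffer.append(frame)
        if len(self.buffer) == self.chunk_size:
            self.results.append(self.recognize(list(self.buffer)))
            self.buffer.clear()
```

Batching frames this way trades a small amount of latency for fewer recognizer invocations than running recognition on every single frame.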
In another aspect, an audio data processing apparatus is provided, the apparatus comprising:
the processing module is used for receiving input audio data, performing recognition processing on the audio data, and outputting a recognition result;
an obtaining module, configured to obtain a decoding graph of the audio data when the recognition result includes a wake-up word, where the decoding graph includes a jump identifier sequence of a decoding path corresponding to the audio data, and the jump identifier sequence is used to represent the phoneme changes between adjacent audio frames in the audio data;
a first determining module, configured to determine a target jump identifier from the jump identifier sequence, where the target jump identifier satisfies a target condition, and the target condition indicates that a jump identifier in the jump identifier sequence belongs to the jump identifiers corresponding to the wake-up phoneme sequence of the wake-up word;
and a second determining module, configured to determine the head endpoint of the wake-up audio data corresponding to the wake-up word based on the target jump identifier.
In one possible implementation manner, the decoding graph includes a plurality of decoding paths corresponding to the audio data and the jump identifier sequences of the plurality of decoding paths; the first determining module includes:
a selecting unit, configured to select, based on the decoding parameters of the multiple decoding paths, a decoding path that satisfies a parameter condition from the multiple decoding paths;
and a determining unit, configured to determine the target jump identifier from the jump identifier sequence of the selected decoding path.
In another possible implementation manner, the selecting unit is configured to select, based on the decoding parameters of the multiple decoding paths, a decoding path with a largest decoding parameter from the multiple decoding paths; or,
the selecting unit is configured to select, based on the decoding parameters of the multiple decoding paths, a decoding path whose decoding parameter exceeds a parameter threshold from the multiple decoding paths; or,
the selecting unit is configured to select, based on the decoding parameters of the multiple decoding paths, a first target number of decoding paths from the multiple decoding paths, where the decoding parameters of the first target number of decoding paths are greater than the decoding parameters of other decoding paths except for the first target number of decoding paths in the multiple decoding paths.
In another possible implementation manner, the recognition result further includes an index of a decoding path, where the index indicates that the recognition result is obtained by decoding along the decoding path;
the first determining module is configured to determine the corresponding decoding path based on the index, and determine the target jump identifier from the jump identifier sequence of the determined decoding path.
In another possible implementation manner, the target jump identifier is a jump identifier queried from the jump identifier sequence that satisfies the target condition, and the first determining module is configured to query the jump identifiers in the jump identifier sequence in order; and in the case that a queried jump identifier does not satisfy the target condition, continue to query the next jump identifier until a jump identifier satisfying the target condition is queried from the jump identifier sequence.
In another possible implementation manner, the first determining module is configured to query the jump identifiers in the jump identifier sequence of the decoding path in order from the front of the decoding path to the back; or,
the first determining module is configured to query the jump identifiers in the jump identifier sequence of the decoding path in order from the back of the decoding path to the front.
In another possible implementation manner, the target jump identifier is a plurality of consecutive jump identifiers queried from the jump identifier sequence that satisfy the target condition;
the first determining module is configured to query a plurality of consecutive jump identifiers in the jump identifier sequence in order; and in the case that the queried plurality of consecutive jump identifiers do not satisfy the target condition, continue to query the next plurality of consecutive jump identifiers until a plurality of consecutive jump identifiers satisfying the target condition are queried from the jump identifier sequence, where the target condition indicates that each of the plurality of consecutive jump identifiers belongs to the jump identifiers corresponding to the wake-up phoneme sequence, and the jump path represented by the plurality of consecutive jump identifiers includes all or part of the jump path of the wake-up phoneme sequence.
In another possible implementation manner, the second determining module is configured to determine a head-endpoint jump identifier from the target jump identifiers, where the jump path represented by the head-endpoint jump identifier includes the first phoneme in the wake-up phoneme sequence; and determine the audio data corresponding to the head-endpoint jump identifier as the head endpoint of the wake-up audio data.
In another possible implementation manner, the target jump identifier is a jump identifier queried from the jump identifier sequence whose corresponding phoneme belongs to the wake-up phoneme sequence, and the first determining module is configured to obtain a mapping relationship between the queried jump identifiers and phonemes; determine the phoneme corresponding to a queried jump identifier based on the mapping relationship; and in the case that the determined phoneme does not belong to the wake-up phoneme sequence, continue to query the next jump identifier until a jump identifier whose corresponding phoneme belongs to the wake-up phoneme sequence is queried from the jump identifier sequence.
In another possible implementation manner, the target jump identifier is a jump identifier queried from the jump identifier sequence that is the same as any jump identifier in a jump identifier set corresponding to the wake-up phoneme sequence; the first determining module is configured to obtain the jump identifier set corresponding to the wake-up phoneme sequence, where the jump identifiers in the jump identifier set are used to represent the jump paths of adjacent phonemes in the wake-up phoneme sequence; and in the case that a queried jump identifier differs from every jump identifier in the jump identifier set, continue to query the next jump identifier until a jump identifier that is the same as a jump identifier in the jump identifier set is queried from the jump identifier sequence.
In another possible implementation manner, the apparatus further includes:
and a discarding module, configured to discard, from the audio data, the audio data corresponding to a queried jump identifier in the case that the queried jump identifier does not satisfy the target condition.
In another possible implementation manner, the second determining module is configured to determine the audio data corresponding to the target jump identifier as the head endpoint of the wake-up audio data.
In another possible implementation manner, the processing module includes:
a receiving unit, configured to receive the audio data, the audio data comprising a plurality of audio frames;
a processing unit, configured to perform recognition processing on the received audio data to obtain a recognition result of the received audio data when the received number of audio frames in the audio data reaches a second target number;
and an output unit, configured to output the recognition result.
In another aspect, an electronic device is provided, which includes one or more processors and one or more memories, and at least one program code is stored in the one or more memories, and the at least one program code is loaded and executed by the one or more processors to implement the audio data processing method according to any one of the above implementations.
In another aspect, a computer-readable storage medium is provided, in which at least one program code is stored, the at least one program code being loaded and executed by a processor to implement the audio data processing method according to any of the above-mentioned implementations.
In another aspect, a computer program product is provided, which comprises at least one program code, which is loaded and executed by a processor, to implement the audio data processing method according to any of the implementations described above.
The technical solutions provided by the embodiments of the present application bring at least the following beneficial effects:
the embodiment of the application provides an audio data processing method, which can search the head end point of the awakening audio based on a decoding path, and because the unique phoneme can be determined by the skip identifier in the decoding path, the scheme realizes detection of the head end point at the phoneme level, can more accurately detect the head end point of the awakening audio, and improves the accuracy of the head end point of the awakening audio.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings required in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application;
Fig. 2 is a flowchart of an audio data processing method provided by an embodiment of the present application;
Fig. 3 is a flowchart of an audio data processing method provided by an embodiment of the present application;
Fig. 4 is a block diagram of an audio data processing apparatus provided by an embodiment of the present application;
Fig. 5 is a block diagram of an audio data processing apparatus provided by an embodiment of the present application;
Fig. 6 is a block diagram of an electronic device provided by an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "comprising" and "having," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Fig. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application. Referring to Fig. 1, the implementation environment includes an electronic device 101 and a server 102. The electronic device 101 runs a client for which the server 102 provides services, and a user can realize functions such as data transmission and voice interaction between the client on the electronic device 101 and the server 102. The client has at least an audio recognition function, for example, recognizing whether the input audio data wakes up the electronic device 101, and the client may also have a voice control function. The client may be a voice assistant, a voice control application, or the like.
In one possible implementation, the electronic device 101 receives input audio data, recognizes the audio data, finds the head endpoint of the wake-up word after the wake-up word is recognized, and reports the head endpoint to the server 102. In another possible implementation, the electronic device 101 receives input audio data and sends the audio data to the server 102; the server 102 recognizes the audio data and, after recognizing the wake-up word, finds the head endpoint of the wake-up word.
The electronic device 101 may be a computer, a mobile phone, a stereo, an air conditioner, a television, or other electronic devices. The server 102 may be a server, a server cluster composed of several servers, or a cloud computing service center.
Fig. 2 is a flowchart of an audio data processing method provided by an embodiment of the present application. Referring to Fig. 2, the method includes:
201. Receive input audio data, perform recognition processing on the audio data, and output a recognition result.
202. In the case that the recognition result includes a wake-up word, acquire a decoding graph of the audio data, where the decoding graph includes a jump identifier sequence of a decoding path corresponding to the audio data, and the jump identifier sequence is used to represent the phoneme changes between adjacent audio frames in the audio data.
203. Determine a target jump identifier from the jump identifier sequence, where the target jump identifier satisfies a target condition, and the target condition indicates that a jump identifier in the jump identifier sequence belongs to the jump identifiers corresponding to the wake-up phoneme sequence of the wake-up word.
204. Determine the head endpoint of the wake-up audio data corresponding to the wake-up word based on the target jump identifier.
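Steps 201-204 can be strung together in a rough end-to-end sketch. Every function, name, and data shape here is an assumption made for illustration only, not the patent's implementation:

```python
def detect_head_endpoint(audio_frames, recognize, wake_word,
                         get_jump_id_sequence, wake_jump_set):
    """Sketch of steps 201-204: recognize the audio, check for the wake-up
    word, fetch the jump identifier sequence from the decoding graph, find
    the first target jump identifier, and return its frame index as the
    head endpoint (None if the wake-up word is absent or no match)."""
    result = recognize(audio_frames)                  # 201: recognition result
    if wake_word not in result:                       # 202: only on wake-up
        return None
    jump_ids = get_jump_id_sequence(audio_frames)     # 202: decoding-graph path
    for i, jid in enumerate(jump_ids):                # 203: target condition
        if jid in wake_jump_set:
            return i                                  # 204: head endpoint index
    return None
```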
The audio data processing method provided by the embodiments of the present application can search for the head endpoint of the wake-up audio based on a decoding path. Because a jump identifier in the decoding path determines a unique phoneme, the scheme realizes head-endpoint detection at the phoneme level, can detect the head endpoint of the wake-up audio more accurately, and improves the accuracy of the detected head endpoint.
In one possible implementation manner, the decoding graph includes a plurality of decoding paths corresponding to the audio data and the jump identifier sequences of the plurality of decoding paths;
the determining the target jump identifier from the jump identifier sequence includes:
selecting a decoding path satisfying a parameter condition from the plurality of decoding paths based on decoding parameters of the plurality of decoding paths;
and determining the target jump identifier from the jump identifier sequence of the selected decoding path.
In another possible implementation manner, the selecting, from the plurality of decoding paths, a decoding path that satisfies a parameter condition based on decoding parameters of the plurality of decoding paths includes:
selecting a decoding path with the largest decoding parameter from the plurality of decoding paths based on the decoding parameters of the plurality of decoding paths; or,
selecting a decoding path of which the decoding parameter exceeds a parameter threshold from the plurality of decoding paths based on the decoding parameters of the plurality of decoding paths; or,
selecting a first target number of decoding paths from the plurality of decoding paths based on the decoding parameters of the plurality of decoding paths, wherein the decoding parameters of the first target number of decoding paths are greater than the decoding parameters of other decoding paths except the first target number of decoding paths in the plurality of decoding paths.
In another possible implementation manner, the recognition result further includes an index of a decoding path, where the index indicates that the recognition result is obtained by decoding along the decoding path;
the determining the target jump identifier from the jump identifier sequence includes:
determining the corresponding decoding path based on the index;
and determining the target jump identifier from the jump identifier sequence of the determined decoding path.
In another possible implementation manner, the determining a target jump identifier from the jump identifier sequence includes:
querying the jump identifiers in the jump identifier sequence in order;
and in the case that a queried jump identifier does not satisfy the target condition, continuing to query the next jump identifier until a jump identifier satisfying the target condition is queried from the jump identifier sequence.
In another possible implementation manner, the querying the jump identifiers in the jump identifier sequence in order includes:
querying the jump identifiers in the jump identifier sequence of the decoding path in order from the front of the decoding path to the back; or,
querying the jump identifiers in the jump identifier sequence of the decoding path in order from the back of the decoding path to the front.
In another possible implementation manner, the sequentially querying the jump identifiers in the jump identifier sequence includes:
sequentially querying a plurality of consecutive jump identifiers in the jump identifier sequence;
and the continuing to query the next jump identifier until a jump identifier satisfying the target condition is queried, under the condition that the queried jump identifier does not satisfy the target condition, includes:
under the condition that the queried plurality of consecutive jump identifiers do not satisfy the target condition, continuing to query the next plurality of consecutive jump identifiers until a plurality of consecutive jump identifiers satisfying the target condition are queried from the jump identifier sequence, where the target condition indicates that each of the plurality of consecutive jump identifiers in the jump identifier sequence belongs to the jump identifiers corresponding to the wake-up phoneme sequence, and the jump path represented by the plurality of consecutive jump identifiers includes all or part of the jump path of the wake-up phoneme sequence.
In another possible implementation manner, the determining a head endpoint of the wake-up audio data corresponding to the wake-up word based on the target jump identifier includes:
determining a head-endpoint jump identifier from the target jump identifiers based on the jump path represented by the target jump identifiers, where the jump path represented by the head-endpoint jump identifier includes the first phoneme in the wake-up phoneme sequence;
and determining the audio data corresponding to the head-endpoint jump identifier as the head endpoint of the wake-up audio data.
In another possible implementation manner, the continuing to query the next jump identifier until a jump identifier satisfying the target condition is queried from the jump identifier sequence includes:
acquiring a mapping relationship between jump identifiers and phonemes;
determining the phoneme corresponding to a queried jump identifier based on the mapping relationship;
and under the condition that the determined phoneme does not belong to the wake-up phoneme sequence, continuing to query the next jump identifier until a jump identifier whose corresponding phoneme belongs to the wake-up phoneme sequence is queried from the jump identifier sequence.
In another possible implementation manner, the target jump identifier is a jump identifier, queried from the jump identifier sequence, that is the same as any jump identifier in a jump identifier set corresponding to the wake-up phoneme sequence; and the continuing to query the next jump identifier until a jump identifier satisfying the target condition is queried from the jump identifier sequence, under the condition that the queried jump identifier does not satisfy the target condition, includes:
acquiring the jump identifier set corresponding to the wake-up phoneme sequence, where the jump identifiers in the jump identifier set represent the jump paths of adjacent phonemes in the wake-up phoneme sequence;
and under the condition that a queried jump identifier differs from every jump identifier in the jump identifier set, continuing to query the next jump identifier until a jump identifier that is the same as a jump identifier in the jump identifier set is queried from the jump identifier sequence.
In another possible implementation manner, the method further includes:
and under the condition that a queried jump identifier does not satisfy the target condition, discarding the audio data corresponding to that jump identifier from the audio data.
In another possible implementation manner, the determining a head endpoint of the wake-up audio data corresponding to the wake-up word based on the target jump identifier includes:
and determining the audio data corresponding to the target jump identifier as the head endpoint of the wake-up audio data.
In another possible implementation manner, the receiving input audio data, performing recognition processing on the audio data, and outputting a recognition result includes:
receiving the audio data, the audio data comprising a plurality of audio frames;
when the number of received audio frames in the audio data reaches a second target number, performing recognition processing on the received audio data to obtain a recognition result of the received audio data;
and outputting the recognition result.
All the above optional technical solutions may be combined arbitrarily to form optional embodiments of the present application, and are not described herein again.
Fig. 3 is a flowchart of an audio data processing method provided in an embodiment of the present application, illustrated by taking an electronic device as the execution subject. Referring to fig. 3, the method includes:
301. The electronic device receives input audio data, performs recognition processing on the audio data, and outputs a recognition result.
In this embodiment of the application, the audio data may be audio data acquired by the electronic device, audio data acquired from other devices such as a server, or audio data stored locally, and the audio data is not limited in this embodiment of the application. Optionally, the electronic device has an audio capture function, and the audio data is audio data captured by the electronic device in real time.
The recognition result includes text data representing the content expressed by the human voice in the audio data. By processing the audio data, the electronic device can thus recognize the sentence spoken in the audio data. For example, when a user says "today is good weather" to the electronic device, the electronic device acquires the audio data through its audio capture function, performs recognition processing on the audio data, and outputs the recognition result "today is good weather".
The embodiment of the present application does not limit the process by which the electronic device performs recognition processing on the audio data; that process is only described below by way of example.
In one possible implementation, the recognition processing of the audio data may be performed by a decoder. Optionally, the electronic device includes a decoder, and the receiving input audio data, performing recognition processing on the audio data, and outputting a recognition result includes: the decoder receives the input audio data, performs recognition processing on the audio data, and outputs the recognition result.
The decoder may be a decoder based on a chain model. Optionally, the decoder performs recognition processing on the audio data through a generated decoding graph to obtain the recognition result. The generated decoding graph may be a WFST (Weighted Finite-State Transducer) decoding graph, and may be obtained by training a chain model.
For example, the electronic device divides the audio data into a plurality of audio frames, performs feature extraction on each audio frame to obtain an audio frame feature of each audio frame, and processes the audio frame feature of each audio frame to obtain a probability distribution corresponding to each audio frame, where the probability distribution represents the probability that the pronunciation in the audio frame corresponds to each phoneme. The phoneme change between adjacent audio frames can be represented by a jump identifier (transition id), and the decoder can score the jump paths of a phoneme according to the probability of the phoneme together with the probabilities of the phonemes before and after it, finally decoding along the path with the highest score to obtain the recognition result.
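As an illustrative sketch only (not part of the patent's disclosure), scoring candidate phoneme jump paths against per-frame phoneme probabilities could look as follows; the data layout and function names are assumptions:

```python
def score_path(path, frame_probs):
    """Sum the probability each frame assigns to the phoneme the path
    expects at that frame; a higher total means a more likely path."""
    return sum(frame_probs[t][phoneme] for t, phoneme in enumerate(path))

def best_path(candidate_paths, frame_probs):
    """Select the candidate jump path with the highest score; decoding
    then proceeds along that path to produce the recognition result."""
    return max(candidate_paths, key=lambda p: score_path(p, frame_probs))

# Two frames, each with a probability per phoneme (toy values).
frame_probs = [{"n": 0.9, "h": 0.1}, {"i": 0.8, "a": 0.2}]
print(best_path([("n", "i"), ("h", "a")], frame_probs))  # → ('n', 'i')
```

A real chain-model decoder scores transitions inside the WFST rather than enumerating whole paths; this merely illustrates the "highest-scoring path wins" idea.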
In addition, in the embodiment of the present application, the audio data may be recognized after it has been completely acquired, or the acquired audio data may be recognized in real time while it is being acquired. Optionally, the receiving input audio data, performing recognition processing on the audio data, and outputting a recognition result includes: receiving the audio data, the audio data comprising a plurality of audio frames; when the number of received audio frames in the audio data reaches a second target number, performing recognition processing on the received audio data to obtain a recognition result of the received audio data; and outputting the recognition result. In this way, the embodiment of the present application can process the acquired audio data without waiting for the audio data to be completely received.
The second target number may be any number, for example, the second target number is 1, 3, 5, 10, etc. The second target number may be an empirical value, or may be any value set by a technician, or a default value of the system, and the second target number is not limited in the embodiment of the present application.
In addition, in the embodiment of the present application, the electronic device may perform recognition processing on the audio data through one decoder or through a plurality of decoders. In one possible implementation manner, the electronic device performs recognition processing on the audio data through one decoder and outputs the recognition result. In another possible implementation manner, the audio data is divided into a plurality of audio data segments, and the electronic device performs recognition processing on each audio data segment through the plurality of decoders to obtain the recognition result output by each decoder. To avoid the wake-up audio being split across two audio data segments, the segments may be made to overlap when the audio data is divided, so that the audio around a segmentation point also falls entirely within some segment.
For example, the audio data from 0 to 10 seconds is divided into one audio data segment and the audio data from 10 to 20 seconds into another; to avoid splitting wake-up audio around the 10th second, the audio data from 5 to 15 seconds may additionally be divided into an audio data segment.
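The overlapping segmentation above can be sketched as follows; a minimal illustration, with the segment length and overlap as assumed parameters:

```python
def split_with_overlap(frames, seg_len, overlap):
    """Split a frame list into segments of seg_len frames, each sharing
    `overlap` frames with its predecessor, so audio near a cut point
    also appears whole inside some segment (e.g. 0-10 s, 5-15 s, 10-20 s)."""
    step = seg_len - overlap
    return [frames[i:i + seg_len]
            for i in range(0, max(len(frames) - overlap, 1), step)]

frames = list(range(20))  # pretend each int is one second of audio
segments = split_with_overlap(frames, seg_len=10, overlap=5)
# yields frames 0-9, 5-14, and 10-19, mirroring the example above
```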
302. Under the condition that the recognition result includes the wake-up word, the electronic device acquires a decoding graph of the audio data, where the decoding graph includes a jump identifier sequence of a decoding path corresponding to the audio data, and the jump identifier sequence represents the phoneme changes between adjacent audio frames in the audio data.
If the recognition result includes the wake-up word, the user has spoken the wake-up word. Because the audio data may include silent audio, audio corresponding to other sentences, and the like in addition to the wake-up audio corresponding to the wake-up word, in this embodiment of the application the electronic device may further detect the head endpoint of the wake-up audio in the audio data, so as to subsequently extract the wake-up audio from the audio data. The extracted wake-up audio can be used for storage, transmission, model training, and the like; the embodiment of the present application does not limit the use of the wake-up audio.
A wake-up word is a word used to wake up a certain client, function, or device. For example, if the wake-up word is "hello", the electronic device wakes up the voice assistant after recognizing "hello". It should be noted that the wake-up word may be set by the user or be a system default; the embodiment of the present application does not limit the wake-up word.
The decoding graph of the audio data in this step 302 is obtained by inputting the relevant information of the audio data into the decoding graph already generated by the electronic device. The decoding graph includes nodes and edges connecting the nodes. The nodes represent phoneme states, one node per phoneme state, and the edges connecting the nodes represent the transitions between phoneme states.
After the electronic device divides the audio data into a plurality of audio frames, the audio frame feature corresponding to each audio frame may be extracted, where the audio frame feature may be an MFCC (Mel-Frequency Cepstral Coefficient) feature, an FBank (Filter Bank) feature, or an energy feature. The audio frame feature of each audio frame is processed to obtain a probability distribution corresponding to each audio frame, which represents the probability that the pronunciation corresponding to the audio frame is each phoneme. When decoding, the decoder scores the jump paths between phonemes based on the probability of each phoneme in the current audio frame, the probabilities of each phoneme in the preceding audio frames, and the probabilities of each phoneme in the following audio frames; the jump path with the highest score can be regarded as the most likely jump path. Therefore, the decoder can decode based on the decoding path with the highest score and output the recognition result.
The decoding path is any jump path from the start node to the end node of the decoding graph. The jump identifier sequence of a decoding path consists of the jump identifiers corresponding to the edges in the decoding path; it represents the jump path through the nodes of the decoding path and is determined based on the phoneme changes between adjacent audio frames in the audio data.
In the process of decoding with the decoding graph to obtain the recognition result, the phoneme state corresponding to each audio frame can be input into the decoding graph, and the input data is then decoded based on the decoding graph to obtain, in turn, the jump identifier sequence, the phoneme sequence, the word sequence, and the sentence.
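A toy rendering of that decoding order (jump identifiers to phonemes to words to a sentence); the mapping tables below are invented for illustration, since a real decoder resolves them through the WFST's labels:

```python
def transitions_to_sentence(transition_ids, tid2phone, lexicon):
    """transition ids -> phonemes -> words -> sentence, using greedy
    longest-match lookup of phoneme runs in a toy lexicon."""
    phones = tuple(tid2phone[t] for t in transition_ids)
    words, i = [], 0
    while i < len(phones):
        for j in range(len(phones), i, -1):  # try the longest run first
            if phones[i:j] in lexicon:
                words.append(lexicon[phones[i:j]])
                i = j
                break
        else:
            i += 1  # no word covers this phoneme; skip it
    return " ".join(words)

tid2phone = {1: "n", 2: "i", 3: "h", 4: "a", 5: "o"}        # invented
lexicon = {("n", "i"): "ni", ("h", "a", "o"): "hao"}        # invented
print(transitions_to_sentence([1, 2, 3, 4, 5], tid2phone, lexicon))  # ni hao
```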
303. The electronic device sequentially queries the jump identifiers in the jump identifier sequence.
Since a jump identifier uniquely determines a phoneme, embodiments of the present application can use jump identifiers to implement phoneme-level endpoint detection.
In one possible implementation manner, the electronic device sequentially queries the jump identifiers in the jump identifier sequence of the decoding path in order from the front to the back of the decoding path. In another possible implementation manner, the electronic device sequentially queries the jump identifiers in the jump identifier sequence of the decoding path in order from the back to the front of the decoding path.
In the embodiment of the present application, head-endpoint detection of the wake-up audio is performed under the condition that the output recognition result includes the wake-up word, and the electronic device can also process the input audio data in real time while it is being input. Therefore, the electronic device can stop processing the audio data once the output recognition result includes the wake-up word, and sequentially query the jump identifiers in the jump identifier sequence in order from the back to the front of the decoding path, so that the head endpoint of the wake-up audio is detected more quickly.
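A sketch of the back-to-front query (function names hypothetical): since the wake-up word was just recognized, its audio lies near the tail of the sequence, so scanning backwards reaches a matching jump identifier sooner:

```python
def find_target_backwards(jump_ids, satisfies):
    """Scan the jump-identifier sequence from the back of the decoding
    path toward the front; return (index, identifier) of the first
    identifier satisfying the target condition, or None if none does."""
    for idx in range(len(jump_ids) - 1, -1, -1):
        if satisfies(jump_ids[idx]):
            return idx, jump_ids[idx]
    return None
```

The front-to-back variant is the same loop over `range(len(jump_ids))`.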
It should be noted that the decoding graph includes a plurality of decoding paths, and the electronic device may perform head-endpoint detection based on only one decoding path or based on a plurality of decoding paths. The embodiment of the present application does not limit this.
In one possible implementation manner, the decoding graph includes a plurality of decoding paths corresponding to the audio data and the jump identifier sequences of the plurality of decoding paths. The electronic device selects a decoding path satisfying a condition from the plurality of decoding paths. The sequentially querying the jump identifiers in the jump identifier sequence by the electronic device includes: selecting, from the plurality of decoding paths, a decoding path satisfying the parameter condition based on the decoding parameters of the plurality of decoding paths; and sequentially querying the jump identifiers in the jump identifier sequence of the selected decoding path.
Optionally, the electronic device selects, based on the decoding parameters of the multiple decoding paths, a decoding path that satisfies a parameter condition from the multiple decoding paths, including: selecting a decoding path with the maximum decoding parameter from the multiple decoding paths based on the decoding parameters of the multiple decoding paths; or, based on the decoding parameters of the plurality of decoding paths, selecting the decoding path of which the decoding parameter exceeds the parameter threshold from the plurality of decoding paths; or, based on the decoding parameters of the multiple decoding paths, selecting a first target number of decoding paths from the multiple decoding paths, where the decoding parameters of the first target number of decoding paths are greater than the decoding parameters of other decoding paths except the first target number of decoding paths in the multiple decoding paths.
Wherein the first target number may be any number, e.g., 2, 3, 5, etc. The first target number is not limited in the embodiment of the present application. The first target amount may be an empirical value, any value set by a technician, or a default value of the electronic device, etc. The parameter threshold may be any value, e.g., 10, 20, 30, etc. The embodiment of the present application does not limit the parameter threshold. The parameter threshold may be an empirical value, any value set by a technician, or a default value of the electronic device.
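The three selection rules (single best path, paths above a threshold, top-k paths) might be sketched as follows; the score table and mode names are assumptions:

```python
def select_paths(paths, scores, mode="max", threshold=None, k=None):
    """Select decoding paths by their decoding parameter (score):
    the single highest-scoring path, every path above a threshold,
    or the first-target-number (top-k) highest-scoring paths."""
    if mode == "max":
        return [max(paths, key=lambda p: scores[p])]
    if mode == "threshold":
        return [p for p in paths if scores[p] > threshold]
    if mode == "topk":
        return sorted(paths, key=lambda p: scores[p], reverse=True)[:k]
    raise ValueError(f"unknown mode: {mode}")

scores = {"path_a": 5.0, "path_b": 9.0, "path_c": 7.0}   # toy scores
paths = ["path_a", "path_b", "path_c"]
# select_paths(paths, scores)                              → ["path_b"]
# select_paths(paths, scores, "threshold", threshold=6.0)  → ["path_b", "path_c"]
# select_paths(paths, scores, "topk", k=2)                 → ["path_b", "path_c"]
```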
Based on the decoding parameters and the parameter condition, the electronic device can select a decoding path with a higher score from the plurality of decoding paths, ensuring the reliability of the decoding path and, in turn, the reliability of the detected head endpoint.
In a possible implementation manner, the recognition result output by the electronic device further includes an index of the decoding path, and the index indicates that the recognition result is obtained by decoding along that decoding path. The electronic device decodes based on the decoding path to obtain the recognition result, and under the condition that the recognition result includes the wake-up word, the electronic device performs head-endpoint detection based on the decoding path. Optionally, the sequentially querying the jump identifiers in the jump identifier sequence by the electronic device includes: determining the corresponding decoding path based on the index; and sequentially querying the jump identifiers in the jump identifier sequence of that decoding path.
304. Under the condition that a queried jump identifier does not satisfy the target condition, the electronic device continues to query the next jump identifier until a jump identifier satisfying the target condition is queried from the jump identifier sequence, where the target condition indicates that the jump identifier in the jump identifier sequence belongs to the jump identifiers corresponding to the wake-up phoneme sequence of the wake-up word.
In the embodiment of the present application, if a queried jump identifier does not satisfy the target condition, the audio data corresponding to the jump path of that jump identifier is not necessarily the wake-up audio data, so the electronic device continues to query the next jump identifier until a jump identifier satisfying the target condition is queried, and then determines the head endpoint of the wake-up audio data based on the jump identifier satisfying the target condition. The jump identifier satisfying the target condition can be regarded as the target jump identifier in the jump identifier sequence; querying a jump identifier satisfying the target condition from the jump identifier sequence is thus determining the target jump identifier from the jump identifier sequence.
In the embodiment of the present application, the target condition indicates that the jump identifier in the jump identifier sequence belongs to the jump identifiers corresponding to the wake-up phoneme sequence of the wake-up word. A jump identifier corresponding to the wake-up phoneme sequence may be the identifier of the jump path of any two or more adjacent phonemes in the wake-up phoneme sequence; or the identifier of the jump path between any phoneme and the first phoneme or the first N phonemes (N is an integer greater than 1) of the wake-up phoneme sequence; or the identifier of the jump path between the last phoneme or the last N phonemes (N is an integer greater than 1) of the wake-up phoneme sequence and any phoneme.
In addition, in this embodiment of the present application, a jump identifier in the jump identifier sequence of the decoding path may be a jump identifier corresponding to a single phoneme or a jump identifier corresponding to a triphone, which is not limited in this embodiment. A triphone is a combination of three phonemes.
Because a jump identifier determines a unique phoneme, in the embodiment of the present application the phoneme corresponding to a jump identifier can be determined from the jump identifier in the jump identifier sequence, and it can then be determined whether that phoneme is a phoneme in the wake-up phoneme sequence so as to detect the head endpoint of the wake-up audio data. In one possible implementation manner, the target jump identifier is a jump identifier, queried from the jump identifier sequence, whose corresponding phoneme belongs to the wake-up phoneme sequence, and the continuing to query the next jump identifier until a jump identifier satisfying the target condition is queried includes: acquiring a mapping relationship between jump identifiers and phonemes; determining the phoneme corresponding to the queried jump identifier based on the mapping relationship; and under the condition that the determined phoneme does not belong to the wake-up phoneme sequence, continuing to query the next jump identifier until a jump identifier whose corresponding phoneme belongs to the wake-up phoneme sequence is queried from the jump identifier sequence.
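A minimal sketch of this mapping-based query, with a hypothetical transition-id-to-phoneme table:

```python
def find_by_phone_mapping(jump_ids, tid2phone, wake_phones):
    """Query jump identifiers in order; skip those whose mapped phoneme
    is outside the wake-up phoneme sequence, and return the index of the
    first one whose phoneme belongs to it (None if none does)."""
    for idx, tid in enumerate(jump_ids):
        if tid2phone.get(tid) in wake_phones:
            return idx
    return None

tid2phone = {1: "s", 2: "n", 3: "i"}      # invented mapping
wake_phones = {"n", "i", "h", "a", "o"}   # phonemes of "nihao"
# find_by_phone_mapping([1, 1, 2, 3], tid2phone, wake_phones) → 2
```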
It should be noted that when a user wakes up a device with the wake-up word, the user rarely speaks other sentences. Therefore, in one possible implementation manner, after querying a jump identifier whose corresponding phoneme belongs to the wake-up phoneme sequence, the electronic device may consider that the wake-up audio data has been found. In another possible implementation manner, to detect the head endpoint of the wake-up audio data more accurately, the electronic device continues querying until a plurality of consecutive jump identifiers whose corresponding phonemes belong to the wake-up phoneme sequence are found, and then determines the head endpoint of the wake-up audio data based on those consecutive jump identifiers.
In addition, the electronic device may acquire the jump identifier set corresponding to the wake-up phoneme sequence and compare each queried jump identifier with the jump identifiers in the set. If a queried jump identifier is the same as any jump identifier in the set, the jump path it represents includes a wake-up phoneme, meaning the electronic device has found audio data related to the wake-up audio data. In another possible implementation manner, the target jump identifier is a jump identifier, queried from the jump identifier sequence, that is the same as any jump identifier in the jump identifier set corresponding to the wake-up phoneme sequence; and the continuing to query the next jump identifier until a jump identifier satisfying the target condition is queried includes: acquiring the jump identifier set corresponding to the wake-up phoneme sequence, where the jump identifiers in the set represent the jump paths of adjacent phonemes in the wake-up phoneme sequence; and under the condition that a queried jump identifier differs from every jump identifier in the set, continuing to query the next jump identifier until a jump identifier that is the same as a jump identifier in the set is queried from the jump identifier sequence.
The jump identifier set may be obtained by the electronic device querying the mapping relationship between jump identifiers and phonemes, or may be input by a technician; the embodiment of the present application does not limit this.
It should be noted that when a user wakes up a device with the wake-up word, the user rarely speaks other sentences. Therefore, in one possible implementation manner, after querying a jump identifier that is the same as any jump identifier in the jump identifier set, the electronic device may consider that the wake-up audio data has been found. In another possible implementation manner, to detect the head endpoint of the wake-up audio data more accurately, the electronic device continues querying until a plurality of consecutive jump identifiers that are the same as jump identifiers in the jump identifier set are found.
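The set-based variant, including the stricter "several consecutive matches" check, could read as below; the run length is an assumed tuning parameter:

```python
def find_run_in_set(jump_ids, wake_tid_set, run_len=1):
    """Keep querying until `run_len` consecutive jump identifiers all
    belong to the wake-up word's jump-identifier set; return the index
    where that run starts, or None if no such run exists."""
    run = 0
    for idx, tid in enumerate(jump_ids):
        run = run + 1 if tid in wake_tid_set else 0
        if run == run_len:
            return idx - run_len + 1
    return None

# find_run_in_set([7, 4, 5, 6], {4, 5, 6}, run_len=3) → 1
```

With `run_len=1` this reduces to the simpler "first matching identifier" behavior.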
305. The electronic device determines the head endpoint of the wake-up audio data corresponding to the wake-up word based on a target jump identifier, where the target jump identifier is a jump identifier satisfying the target condition.
In a possible implementation manner, the determining, by the electronic device, the head endpoint of the wake-up audio data corresponding to the wake-up word based on the target jump identifier includes: determining the audio data corresponding to the target jump identifier as the head endpoint of the wake-up audio data.
Optionally, the audio data corresponding to the target jump identifier is an audio frame, and determining that audio data as the head endpoint of the wake-up audio data means determining the audio frame corresponding to the target jump identifier as the head frame of the wake-up audio data. Optionally, the head endpoint of the wake-up audio data is represented as a time point: the electronic device determines the start time point of the audio data corresponding to the target jump identifier as the head endpoint of the wake-up audio data.
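Expressed as code, converting the matched audio frame into a time-point head endpoint is a one-liner; the 10 ms frame shift is an assumption, not specified by the patent:

```python
def head_endpoint_time(frame_index, frame_shift_ms=10.0):
    """Start time (seconds) of the audio frame matched by the target
    jump identifier, taken as the head endpoint of the wake-up audio."""
    return frame_index * frame_shift_ms / 1000.0

# Frame 150 at a 10 ms shift starts 1.5 s into the audio.
```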
It should be noted that steps 303 to 305 above describe head-endpoint detection by way of example with a single target jump identifier; in another embodiment, there are a plurality of target jump identifiers. In one possible implementation manner, the electronic device sequentially queries the jump identifiers in the jump identifier sequence by sequentially querying a plurality of consecutive jump identifiers in the jump identifier sequence.
When sequentially querying a plurality of consecutive jump identifiers in the jump identifier sequence, the electronic device may advance by one jump identifier at a time or by several jump identifiers at a time. For example, the electronic device queries the 1st to 5th jump identifiers in the jump identifier sequence; if they do not satisfy the target condition, the electronic device may next query the 2nd to 6th jump identifiers, or the 6th to 10th jump identifiers.
In a possible implementation manner, the target jump identifiers are a plurality of consecutive jump identifiers, queried from the jump identifier sequence, that satisfy the target condition; and under the condition that the queried jump identifiers do not satisfy the target condition, the electronic device continues querying until a plurality of consecutive jump identifiers satisfying the target condition are queried from the jump identifier sequence, where the target condition indicates that each of the plurality of consecutive jump identifiers belongs to the jump identifiers corresponding to the wake-up phoneme sequence, and that the plurality of consecutive jump identifiers represent the jump path of the wake-up phoneme sequence.
The plurality of consecutive jump identifiers representing the jump path of the wake-up phoneme sequence may mean either that the jump paths they represent include the complete jump path of the wake-up phoneme sequence, or that they include a partial jump path of the wake-up phoneme sequence.
For example, if the wake-up phoneme sequence is "nihao" and the jump path represented by the target jump identifiers is "h → ai → n → i → h", the target jump identifiers are considered to satisfy the target condition.
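One simple reading of "all or part of the jump path" is a contiguous-subsequence check on the phoneme path that a window of consecutive jump identifiers represents; a sketch under that assumption:

```python
def window_matches_wake(window_phones, wake_phones):
    """True if the phoneme path represented by a window of consecutive
    jump identifiers equals the wake-up phoneme sequence or a contiguous
    part of it."""
    w, s = list(wake_phones), list(window_phones)
    n = len(s)
    return any(w[i:i + n] == s for i in range(len(w) - n + 1))

wake = ["n", "i", "h", "a", "o"]  # phonemes of "nihao"
# window_matches_wake(["n", "i", "h"], wake) → True
# window_matches_wake(["i", "a"], wake)      → False
```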
In one possible implementation manner, the determining, by the electronic device, the head endpoint of the wake-up audio data corresponding to the wake-up word based on the target jump identifiers includes: determining a head-endpoint jump identifier from the target jump identifiers based on the jump path represented by the target jump identifiers, where the jump path represented by the head-endpoint jump identifier includes the first phoneme in the wake-up phoneme sequence; and determining the audio data corresponding to the head-endpoint jump identifier as the head endpoint of the wake-up audio data.
Optionally, the audio data corresponding to the head-endpoint jump identifier is an audio frame, and determining that audio data as the head endpoint of the wake-up audio data means determining the corresponding audio frame as the first frame of the wake-up audio data. Optionally, the head endpoint of the wake-up audio data is expressed as a time point; in that case, the electronic device determines the start time point of the audio data corresponding to the head-endpoint jump identifier as the head endpoint of the wake-up audio data.
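A minimal sketch of the second option above: picking the head-endpoint jump identifier and converting its audio frame into a start time point. The names `jump_to_phoneme` and `frame_index_of`, and the 10 ms frame shift, are illustrative assumptions.

```python
# Hypothetical sketch of head-endpoint determination: pick the jump
# identifier whose phoneme is the first phoneme of the wake-up sequence,
# then express the endpoint as the start time of its audio frame.

def head_endpoint_ms(target_jumps, jump_to_phoneme, first_phoneme,
                     frame_index_of, frame_shift_ms=10):
    """Return the start time point (ms) of the wake-up audio data,
    or None if no target jump identifier covers the first phoneme."""
    for jid in target_jumps:
        if jump_to_phoneme[jid] == first_phoneme:
            # endpoint expressed as a time point rather than a frame
            return frame_index_of[jid] * frame_shift_ms
    return None
```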
It should be noted that, in a possible implementation manner of this embodiment of the present application, the audio data processing method further includes: when a queried jump identifier does not satisfy the target condition, discarding the audio data corresponding to that jump identifier from the audio data. In this way, the audio data that precedes the wake-up audio data can be discarded, which facilitates the subsequent acquisition of the wake-up audio data.
Another point to be explained is that, in the embodiment of the present application, the procedure in which the electronic device sequentially queries the jump identifiers in the jump identifier sequence and, when a queried jump identifier does not satisfy the target condition, continues querying the next jump identifier until the target jump identifier satisfying the target condition is queried, is only an exemplary way of determining the target jump identifier from the jump identifier sequence.
In another embodiment, after acquiring the decoding graph of the audio data, the electronic device determines the target jump identifier from the jump identifier sequence in some other manner, and then determines the head endpoint of the wake-up audio data corresponding to the wake-up word based on that target jump identifier.
It should be noted that the decoding graph includes a plurality of decoding paths, and the electronic device may perform head end point detection based on only one decoding path, or may perform head end point detection based on a plurality of decoding paths. The embodiment of the present application does not limit this.
In one possible implementation manner, the decoding graph includes a plurality of decoding paths corresponding to the audio data and the jump identifier sequences of the plurality of decoding paths. Determining the target jump identifier from the jump identifier sequence then includes: selecting, based on the decoding parameters of the plurality of decoding paths, a decoding path that satisfies a parameter condition from the plurality of decoding paths; and determining the target jump identifier from the jump identifier sequence of the selected decoding path.
Based on the decoding parameters of the plurality of decoding paths and the parameter condition, the electronic device can select a decoding path with a higher score from the plurality of decoding paths, which ensures the reliability of the decoding path and, in turn, the reliability of the detected head endpoint.
Optionally, selecting, based on the decoding parameters of the plurality of decoding paths, a decoding path that satisfies the parameter condition includes: selecting the decoding path with the largest decoding parameter from the plurality of decoding paths; or selecting each decoding path whose decoding parameter exceeds a parameter threshold; or selecting a first target number of decoding paths from the plurality of decoding paths, where the decoding parameters of the first target number of decoding paths are greater than the decoding parameters of the remaining decoding paths.
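The three selection rules above can be sketched as follows; `paths`, `scores`, and the strategy names are illustrative assumptions, with `scores[i]` standing in for the decoding parameter (e.g. a path score) of `paths[i]`.

```python
# Sketch of the three parameter conditions for selecting decoding paths.

def select_paths(paths, scores, strategy="max", threshold=None, k=None):
    """Return the decoding paths satisfying the chosen parameter
    condition: the single best path, every path above a threshold,
    or the k best paths."""
    ranked = sorted(zip(scores, paths), key=lambda t: t[0], reverse=True)
    if strategy == "max":          # largest decoding parameter
        return [ranked[0][1]]
    if strategy == "threshold":    # parameter exceeds a threshold
        return [p for s, p in ranked if s > threshold]
    if strategy == "topk":         # first target number of paths
        return [p for _, p in ranked[:k]]
    raise ValueError("unknown strategy: " + strategy)
```

All three variants rank by the same decoding parameter; they differ only in how many of the best-scoring paths are retained for the subsequent jump-identifier query.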
In another possible implementation manner, the recognition result output by the electronic device further includes an index of the decoding path, where the index indicates that the processing result is obtained by decoding along that decoding path. The electronic device decodes along the decoding path to obtain the recognition result and, when the recognition result includes the wake-up word, performs head-endpoint detection based on that same decoding path. Optionally, determining the target jump identifier from the jump identifier sequence includes: determining the corresponding decoding path based on the index; and determining the target jump identifier from the jump identifier sequence of the determined decoding path.
The embodiment of the present application provides an audio data processing method that searches for the head endpoint of the wake-up audio based on a decoding path. Because a jump identifier in the decoding path determines a unique phoneme, the scheme detects the head endpoint at the phoneme level, and the head endpoint of the wake-up audio can therefore be detected more accurately.
In addition, in the embodiment of the present application, a single decoder is used to process the audio data, which not only saves computing resources but also prevents the wake-up word from being split across different audio segments, so that the head endpoint of the wake-up audio can be detected more accurately.
Fig. 4 is a block diagram of an audio data processing apparatus provided in an embodiment of the present application, and referring to fig. 4, the apparatus includes:
the processing module 401 is configured to receive input audio data, perform recognition processing on the audio data, and output a recognition result;
an obtaining module 402, configured to obtain a decoding graph of the audio data when the recognition result includes a wake-up word, where the decoding graph includes a jump identifier sequence of a decoding path corresponding to the audio data, and the jump identifier sequence represents the phoneme changes between adjacent audio frames in the audio data;
a first determining module 403, configured to determine a target jump identifier from the jump identifier sequence, where the target jump identifier satisfies a target condition, and the target condition indicates that a jump identifier in the jump identifier sequence belongs to the jump identifiers corresponding to the wake-up phoneme sequence of the wake-up word;
a second determining module 404, configured to determine, based on the target jump identifier, a head endpoint of the wake-up audio data corresponding to the wake-up word.
The embodiment of the present application provides an audio data processing apparatus that searches for the head endpoint of the wake-up audio based on a decoding path. Because a jump identifier in the decoding path determines a unique phoneme, the scheme detects the head endpoint at the phoneme level, and the head endpoint of the wake-up audio can therefore be detected more accurately.
As shown in fig. 5, in a possible implementation manner, the decoding graph includes a plurality of decoding paths corresponding to the audio data and the jump identifier sequences of the plurality of decoding paths; the first determining module 403 includes:
a selecting unit 4031, configured to select, based on the decoding parameters of the multiple decoding paths, a decoding path that satisfies a parameter condition from the multiple decoding paths;
a determining unit 4032, configured to determine the target jump identifier from the jump identifier sequence of the selected decoding path.
In another possible implementation manner, the selecting unit 4031 is configured to select, based on the decoding parameters of the multiple decoding paths, a decoding path with a largest decoding parameter from the multiple decoding paths; or,
the selecting unit 4031 is configured to select, based on the decoding parameters of the multiple decoding paths, a decoding path whose decoding parameter exceeds a parameter threshold from the multiple decoding paths; or,
the selecting unit 4031 is configured to select, based on the decoding parameters of the multiple decoding paths, a first target number of decoding paths from the multiple decoding paths, where the decoding parameters of the first target number of decoding paths are greater than the decoding parameters of decoding paths other than the first target number of decoding paths in the multiple decoding paths.
In another possible implementation manner, the recognition result further includes an index of a decoding path, where the index indicates that the processing result is obtained by decoding along that decoding path;
the first determining module 403 is configured to determine a corresponding decoding path based on the index; and determining the target jump identification from the jump identification sequence of the determined decoding path.
In another possible implementation manner, the target jump identifier is a jump identifier that is queried from the jump identifier sequence and satisfies the target condition, and the first determining module 403 is configured to sequentially query the jump identifiers in the jump identifier sequence; and, when a queried jump identifier does not satisfy the target condition, continue querying the next jump identifier until a jump identifier satisfying the target condition is queried from the jump identifier sequence.
In another possible implementation manner, the first determining module 403 is configured to sequentially query the jump identifiers in the jump identifier sequence of the decoding path in order from the front of the decoding path to the back; or,
the first determining module 403 is configured to sequentially query the jump identifiers in the jump identifier sequence of the decoding path in order from the back of the decoding path to the front.
In another possible implementation manner, the target jump identifier is a plurality of consecutive jump identifiers that are queried from the jump identifier sequence and satisfy the target condition;
the first determining module 403 is configured to sequentially query a plurality of consecutive skip identifiers in the skip identifier sequence; and under the condition that the inquired continuous jumping identifications do not meet the target condition, continuously inquiring the continuous jumping identifications until the continuous jumping identifications meeting the target condition are inquired from the jumping identification sequence, wherein the target condition indicates that each jumping identification of the continuous jumping identifications in the jumping identification sequence belongs to the jumping identification corresponding to the awakening phoneme sequence, and jumping paths shown by the continuous jumping identifications comprise all or part of jumping paths of the awakening phoneme sequence.
In another possible implementation manner, the second determining module 404 is configured to determine a head-endpoint jump identifier from the target jump identifiers based on the jump path represented by the target jump identifiers, where the jump path represented by the head-endpoint jump identifier includes the first phoneme in the wake-up phoneme sequence; and to determine the audio data corresponding to the head-endpoint jump identifier as the head endpoint of the wake-up audio data.
In another possible implementation manner, the target jump identifier is a jump identifier, queried from the jump identifier sequence, whose corresponding phoneme belongs to the wake-up phoneme sequence. The first determining module 403 is configured to obtain a mapping relationship between the queried jump identifier and phonemes; determine the phoneme corresponding to the jump identifier based on the mapping relationship; and, when the determined phoneme does not belong to the wake-up phoneme sequence, continue querying the next jump identifier until a jump identifier whose corresponding phoneme belongs to the wake-up phoneme sequence is queried from the jump identifier sequence.
In another possible implementation manner, the target jump identifier is a jump identifier, queried from the jump identifier sequence, that is the same as some jump identifier in the jump identifier set corresponding to the wake-up phoneme sequence. The first determining module 403 is configured to obtain the jump identifier set corresponding to the wake-up phoneme sequence, where each jump identifier in the set represents a jump path between adjacent phonemes in the wake-up phoneme sequence; and, when a queried jump identifier differs from every jump identifier in the set, continue querying the next jump identifier until a jump identifier that matches one in the set is queried from the jump identifier sequence.
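The set-based query described above can be sketched as follows; `jump_seq` and `wake_jump_set` are illustrative names. Checking set membership replaces the per-identifier phoneme lookup of the mapping-based variant, so no jump-to-phoneme mapping needs to be consulted during the scan.

```python
# Sketch of the set-based query; wake_jump_set is the assumed set of
# jump identifiers representing jumps between adjacent wake-up phonemes.

def query_by_jump_set(jump_seq, wake_jump_set):
    """Query jump identifiers in order, skipping those that differ from
    every identifier in the set; return the index of the first match,
    or None if no identifier in the sequence is in the set."""
    for i, jid in enumerate(jump_seq):
        if jid in wake_jump_set:
            return i
    return None
```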
In another possible implementation manner, the apparatus further includes:
a discarding module 405, configured to discard, from the audio data, the audio data corresponding to a queried jump identifier when that jump identifier does not satisfy the target condition.
In another possible implementation manner, the second determining module 404 is configured to determine the audio data corresponding to the target jump identifier as a head end point of the wake-up audio data.
In another possible implementation manner, the processing module 401 includes:
a receiving unit 4011, configured to receive the audio data, where the audio data includes a plurality of audio frames;
a processing unit 4012, configured to, each time the number of received audio frames in the audio data reaches a second target number, perform recognition processing on the received audio data to obtain a recognition result of the received audio data;
an output unit 4013, configured to output the recognition result.
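The batch-wise recognition performed by the processing unit can be sketched as follows; `recognize` stands in for the (unspecified) recognition routine and `batch_size` for the second target number — both are illustrative assumptions.

```python
# Sketch of the processing unit's batching behaviour: recognition runs
# on everything received so far each time batch_size more frames arrive.

def stream_recognize(frames, batch_size, recognize):
    """Buffer incoming audio frames; each time the number of received
    frames reaches a multiple of batch_size, run recognition on all
    audio received so far and collect the recognition result."""
    received, results = [], []
    for frame in frames:
        received.append(frame)
        if len(received) % batch_size == 0:
            results.append(recognize(list(received)))
    return results
```

Recognizing the accumulated audio, rather than each batch in isolation, matches the single-decoder design above: the wake-up word is never split across separately recognized segments.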
Fig. 6 shows a block diagram of an electronic device 600 according to an exemplary embodiment of the present invention. The electronic device 600 may be a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The electronic device 600 may also be referred to by other names such as user equipment, portable electronic device, laptop electronic device, or desktop electronic device.
In general, the electronic device 600 includes: a processor 601 and a memory 602.
The processor 601 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on. The processor 601 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 601 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 601 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, processor 601 may also include an AI (Artificial Intelligence) processor for processing computational operations related to machine learning.
The memory 602 may include one or more computer-readable storage media, which may be non-transitory. The memory 602 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in the memory 602 is used to store at least one instruction for execution by the processor 601 to implement the audio data processing method provided by the method embodiments herein.
In some embodiments, the electronic device 600 may further optionally include: a peripheral interface 603 and at least one peripheral. The processor 601, memory 602, and peripheral interface 603 may be connected by buses or signal lines. Various peripheral devices may be connected to the peripheral interface 603 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 604, a display 605, a camera 606, an audio circuit 607, a positioning component 608, and a power supply 609.
The peripheral interface 603 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 601 and the memory 602. In some embodiments, the processor 601, memory 602, and peripheral interface 603 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 601, the memory 602, and the peripheral interface 603 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The Radio Frequency circuit 604 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 604 communicates with communication networks and other communication devices via electromagnetic signals. The rf circuit 604 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 604 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuitry 604 may communicate with other electronic devices via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generation mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the rf circuit 604 may further include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display 605 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display 605 is a touch display, it also has the ability to capture touch signals on or over its surface. The touch signal may be input to the processor 601 as a control signal for processing. At this point, the display 605 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display 605, disposed on the front panel of the electronic device 600; in other embodiments, there may be at least two displays 605, disposed on different surfaces of the electronic device 600 or in a foldable design; in still other embodiments, the display 605 may be a flexible display disposed on a curved or folded surface of the electronic device 600. The display 605 may even be arranged in a non-rectangular irregular pattern, i.e., a shaped screen. The display 605 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), or other materials.
The camera assembly 606 is used to capture images or video. Optionally, the camera assembly 606 includes a front camera and a rear camera. Generally, the front camera is disposed on the front panel of the electronic device, and the rear camera is disposed on the rear surface of the electronic device. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting, VR (Virtual Reality) shooting, or other fusion shooting functions. In some embodiments, the camera assembly 606 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash; a dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash and can be used for light compensation at different color temperatures.
The positioning component 608 is used to locate the current geographic location of the electronic device 600 to implement navigation or LBS (Location Based Service). The positioning component 608 may be based on the United States' GPS (Global Positioning System), China's BeiDou system, Russia's GLONASS system, or the European Union's Galileo system.
The power supply 609 is used to supply power to the various components in the electronic device 600. The power supply 609 may use alternating current, direct current, disposable batteries, or rechargeable batteries. When the power supply 609 includes a rechargeable battery, the battery may support wired or wireless charging, and may also support fast-charging technology.
In some embodiments, the electronic device 600 also includes one or more sensors 610. The one or more sensors 610 include, but are not limited to: acceleration sensor 611, gyro sensor 612, pressure sensor 613, fingerprint sensor 614, optical sensor 615, and proximity sensor 616.
The acceleration sensor 611 may detect the magnitude of acceleration in three coordinate axes of a coordinate system established with the electronic device 600. For example, the acceleration sensor 611 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 601 may control the display screen 605 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 611. The acceleration sensor 611 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 612 may detect a body direction and a rotation angle of the electronic device 600, and the gyro sensor 612 and the acceleration sensor 611 may cooperate to acquire a 3D motion of the user on the electronic device 600. The processor 601 may implement the following functions according to the data collected by the gyro sensor 612: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
The pressure sensor 613 may be disposed on a side bezel of the electronic device 600 and/or on a lower layer of the display screen 605. When the pressure sensor 613 is disposed on a side frame of the electronic device 600, a user's holding signal of the electronic device 600 can be detected, and the processor 601 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 613. When the pressure sensor 613 is disposed at the lower layer of the display screen 605, the processor 601 controls the operability control on the UI interface according to the pressure operation of the user on the display screen 605. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 614 is used for collecting a fingerprint of a user, and the processor 601 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 614, or the fingerprint sensor 614 identifies the identity of the user according to the collected fingerprint. Upon identifying that the user's identity is a trusted identity, the processor 601 authorizes the user to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying, and changing settings, etc. The fingerprint sensor 614 may be disposed on the front, back, or side of the electronic device 600. When a physical button or vendor Logo is provided on the electronic device 600, the fingerprint sensor 614 may be integrated with the physical button or vendor Logo.
The optical sensor 615 is used to collect the ambient light intensity. In one embodiment, processor 601 may control the display brightness of display screen 605 based on the ambient light intensity collected by optical sensor 615. Specifically, when the ambient light intensity is high, the display brightness of the display screen 605 is increased; when the ambient light intensity is low, the display brightness of the display screen 605 is adjusted down. In another embodiment, the processor 601 may also dynamically adjust the shooting parameters of the camera assembly 606 according to the ambient light intensity collected by the optical sensor 615.
The proximity sensor 616, also referred to as a distance sensor, is typically disposed on the front panel of the electronic device 600. The proximity sensor 616 is used to capture the distance between the user and the front of the electronic device 600. In one embodiment, when the proximity sensor 616 detects that this distance gradually decreases, the processor 601 controls the display 605 to switch from the bright-screen state to the screen-off state; when the proximity sensor 616 detects that the distance gradually increases, the processor 601 controls the display 605 to switch from the screen-off state to the bright-screen state.
Those skilled in the art will appreciate that the configuration shown in fig. 6 does not constitute a limitation of the electronic device 600, and may include more or fewer components than those shown, or combine certain components, or employ a different arrangement of components.
An embodiment of the present application further provides a computer-readable storage medium, in which at least one program code is stored, and the at least one program code is loaded and executed by a processor to implement the audio data processing method according to any of the above-mentioned implementation manners.
An embodiment of the present application further provides a computer program product, which includes at least one program code, and the at least one program code is loaded and executed by a processor to implement the audio data processing method according to any of the above implementation manners.
In some embodiments, the computer program according to the embodiments of the present application may be deployed to be executed on one computer device, on multiple computer devices located at one site, or on multiple computer devices distributed across multiple sites and interconnected by a communication network; the multiple computer devices distributed across multiple sites and interconnected by a communication network may constitute a blockchain system.
The present application is intended to cover various modifications, alternatives, and equivalents, which may be included within the spirit and scope of the present application.
Claims (17)
1. A method of audio data processing, the method comprising:
receiving input audio data, identifying the audio data, and outputting an identification result;
under the condition that the recognition result comprises a wake-up word, acquiring a decoding graph of the audio data, wherein the decoding graph comprises a jump identification sequence of a decoding path corresponding to the audio data, and the jump identification sequence is used for representing the phoneme change condition between adjacent audio frames in the audio data;
determining a target jump identifier from the jump identifier sequence, wherein the target jump identifier meets a target condition, and the target condition indicates that the jump identifier in the jump identifier sequence belongs to a jump identifier corresponding to a wake-up phoneme sequence of the wake-up word;
and determining the head end point of the awakening audio data corresponding to the awakening word based on the target jump identification.
2. The method according to claim 1, wherein the decoding map comprises a plurality of decoding paths corresponding to the audio data and jump identification sequences of the plurality of decoding paths;
the determining the target jump identifier from the jump identifier sequence includes:
selecting a decoding path satisfying a parameter condition from the plurality of decoding paths based on decoding parameters of the plurality of decoding paths;
and determining the target jump identification from the jump identification sequence of the selected decoding path.
3. The method according to claim 2, wherein the selecting, based on the decoding parameters of the plurality of decoding paths, a decoding path from the plurality of decoding paths that satisfies a parameter condition comprises:
selecting a decoding path with the largest decoding parameter from the plurality of decoding paths based on the decoding parameters of the plurality of decoding paths; or,
selecting a decoding path of which the decoding parameter exceeds a parameter threshold from the plurality of decoding paths based on the decoding parameters of the plurality of decoding paths; or,
selecting a first target number of decoding paths from the plurality of decoding paths based on the decoding parameters of the plurality of decoding paths, wherein the decoding parameters of the first target number of decoding paths are greater than the decoding parameters of other decoding paths except the first target number of decoding paths in the plurality of decoding paths.
4. The method of claim 1, wherein the recognition result further comprises an index of a decoding path, wherein the index indicates that the processing result is obtained by decoding the decoding path;
the determining the target jump identifier from the jump identifier sequence includes:
determining a corresponding decoding path based on the index;
and determining the target jump identification from the jump identification sequence of the determined decoding path.
5. The method as claimed in claim 1, wherein the target jump identifier is a jump identifier that is queried from the jump identifier sequence and satisfies the target condition, and the determining a target jump identifier from the jump identifier sequence comprises:
sequentially querying the jump identifiers in the jump identifier sequence;
and, when a queried jump identifier does not satisfy the target condition, continuing to query the next jump identifier until a jump identifier satisfying the target condition is queried from the jump identifier sequence.
6. The method of claim 5, wherein the sequentially querying the jump identifiers in the jump identifier sequence comprises:
sequentially inquiring the jump identifiers in the jump identifier sequence of the decoding path according to the sequence of the decoding path from front to back; or,
and sequentially inquiring the jump identifiers in the jump identifier sequence of the decoding path according to the sequence of the decoding path from back to front.
7. The method according to claim 5 or 6, wherein the target jump identifier is a plurality of consecutive jump identifiers that satisfy the target condition and are found in the jump identifier sequence, and the sequentially querying the jump identifiers in the jump identifier sequence comprises:
sequentially querying a plurality of consecutive jump identifiers in the jump identifier sequence;
the continuing to query, in a case that a queried jump identifier does not satisfy the target condition, a next jump identifier until a jump identifier satisfying the target condition is found in the jump identifier sequence comprises:
in a case that the queried consecutive jump identifiers do not satisfy the target condition, continuing to query further consecutive jump identifiers until consecutive jump identifiers satisfying the target condition are found in the jump identifier sequence, wherein the target condition indicates that each of the consecutive jump identifiers in the jump identifier sequence belongs to the jump identifiers corresponding to the wake-up phoneme sequence, and the jump paths represented by the consecutive jump identifiers comprise all or part of the jump paths of the wake-up phoneme sequence.
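The consecutive-identifier condition of claim 7 amounts to scanning for a run of identifiers that all belong to the wake-word jump set. One way to sketch it (the minimum run length and the return convention of start/end indices are assumptions for illustration):

```python
def find_consecutive_run(jump_ids, wake_jump_set, min_len):
    """Return (start, end) indices of the first run of at least min_len
    consecutive identifiers that all belong to the wake-word jump set,
    or None if no such run exists."""
    start = None
    for i, jid in enumerate(jump_ids):
        if jid in wake_jump_set:
            if start is None:
                start = i          # a candidate run begins here
            if i - start + 1 >= min_len:
                return start, i    # run long enough to satisfy the condition
        else:
            start = None           # run broken; restart the search
    return None
```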
8. The method according to claim 7, wherein the determining a head endpoint of wake-up audio data corresponding to the wake-up word based on the target jump identifier comprises:
determining a head-endpoint jump identifier from the target jump identifiers based on the jump paths represented by the target jump identifiers, wherein the jump path represented by the head-endpoint jump identifier comprises a first phoneme in the wake-up phoneme sequence;
and determining the audio data corresponding to the head-endpoint jump identifier as the head endpoint of the wake-up audio data.
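Within the matched run, claim 8 selects the identifier whose jump path contains the first phoneme of the wake-up sequence. A sketch under the assumption that each identifier maps to the pair of phonemes on its jump path (the mapping structure and names are hypothetical):

```python
def head_endpoint_jump(run_jump_ids, jump_to_phonemes, first_phoneme):
    """From the identifiers making up the matched run, return the first one
    whose jump path contains the first phoneme of the wake-up sequence."""
    for jid in run_jump_ids:
        if first_phoneme in jump_to_phonemes.get(jid, ()):
            return jid
    return None
```

The audio frame associated with the returned identifier would then be taken as the head endpoint of the wake-up audio data.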
9. The method according to claim 5, wherein the target jump identifier is a jump identifier, found in the jump identifier sequence, whose corresponding phoneme belongs to the wake-up phoneme sequence, and the continuing to query, in a case that a queried jump identifier does not satisfy the target condition, a next jump identifier until a jump identifier satisfying the target condition is found in the jump identifier sequence comprises:
acquiring a mapping relationship between queried jump identifiers and phonemes;
determining a phoneme corresponding to a queried jump identifier based on the mapping relationship;
and in a case that the determined phoneme does not belong to the wake-up phoneme sequence, continuing to query a next jump identifier until a jump identifier whose corresponding phoneme belongs to the wake-up phoneme sequence is found in the jump identifier sequence.
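The mapping-based variant of claim 9 resolves each identifier to its phoneme before testing membership in the wake-up phoneme sequence. A minimal sketch, with a hypothetical dict as the mapping relationship:

```python
def find_jump_by_phoneme(jump_ids, jump_to_phoneme, wake_phonemes):
    """Map each queried identifier to its phoneme and return the first
    identifier whose phoneme belongs to the wake-up phoneme sequence."""
    for jid in jump_ids:
        # .get returns None for unmapped identifiers, which never matches
        if jump_to_phoneme.get(jid) in wake_phonemes:
            return jid
    return None
```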
10. The method according to claim 5, wherein the target jump identifier is a jump identifier, found in the jump identifier sequence, that is the same as any jump identifier in a jump identifier set corresponding to the wake-up phoneme sequence; and the continuing to query, in a case that a queried jump identifier does not satisfy the target condition, a next jump identifier until a jump identifier satisfying the target condition is found in the jump identifier sequence comprises:
acquiring the jump identifier set corresponding to the wake-up phoneme sequence, wherein the jump identifiers in the jump identifier set represent jump paths between adjacent phonemes in the wake-up phoneme sequence;
and in a case that a queried jump identifier differs from every jump identifier in the jump identifier set, continuing to query a next jump identifier until a jump identifier that is the same as a jump identifier in the jump identifier set is found in the jump identifier sequence.
11. The method according to claim 5, further comprising:
in a case that a queried jump identifier does not satisfy the target condition, discarding the audio data corresponding to that jump identifier from the audio data.
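Claims 10 and 11 together suggest a single pass that tests set membership and discards the audio of non-matching identifiers. A sketch assuming identifiers and frames are parallel sequences (the pairing and return values are illustrative, not mandated by the claims):

```python
def filter_by_jump_set(jump_ids, frames, wake_jump_set):
    """Walk jump identifiers alongside their audio frames; discard frames
    whose identifier matches nothing in the wake-word jump set, and return
    the index of the first match (or None) plus the surviving frames."""
    kept, first_match = [], None
    for i, (jid, frame) in enumerate(zip(jump_ids, frames)):
        if jid in wake_jump_set:
            if first_match is None:
                first_match = i    # first identifier satisfying the condition
            kept.append(frame)
        # non-matching frames are simply dropped (claim 11)
    return first_match, kept
```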
12. The method according to claim 1, wherein the determining a head endpoint of wake-up audio data corresponding to the wake-up word based on the target jump identifier comprises:
determining the audio data corresponding to the target jump identifier as the head endpoint of the wake-up audio data.
13. The method according to claim 1, wherein the receiving input audio data, performing recognition processing on the audio data, and outputting a recognition result comprises:
receiving the audio data, the audio data comprising a plurality of audio frames;
when the number of received audio frames in the audio data reaches a second target amount, performing recognition processing on the received audio data to obtain a recognition result of the received audio data;
and outputting the recognition result.
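One reading of claim 13 is a frame accumulator that fires recognition each time another target amount of frames arrives, operating over everything received so far. This sketch assumes recognition is an injected callable; the class and attribute names are hypothetical:

```python
class FrameBuffer:
    """Buffer incoming audio frames; once the received count reaches a
    multiple of the second target amount, run recognition on all frames
    received so far and record the result."""

    def __init__(self, target_amount, recognize):
        self.target_amount = target_amount
        self.recognize = recognize   # callable: list of frames -> result
        self.frames = []
        self.results = []

    def push(self, frame):
        self.frames.append(frame)
        if len(self.frames) % self.target_amount == 0:
            # recognition runs over all frames received so far
            self.results.append(self.recognize(list(self.frames)))
```

Whether recognition should consume only the newest batch or the whole buffer is not pinned down by the claim; the cumulative reading is used here.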
14. An audio data processing apparatus, characterized in that the apparatus comprises:
the processing module is used for receiving input audio data, identifying the audio data and outputting an identification result;
an obtaining module, configured to acquire a decoding graph of the audio data when the recognition result comprises a wake-up word, wherein the decoding graph comprises a jump identifier sequence of a decoding path corresponding to the audio data, and the jump identifier sequence represents phoneme changes between adjacent audio frames in the audio data;
a first determining module, configured to determine a target jump identifier from the jump identifier sequence, wherein the target jump identifier satisfies a target condition, and the target condition indicates that a jump identifier in the jump identifier sequence belongs to the jump identifiers corresponding to a wake-up phoneme sequence of the wake-up word;
and a second determining module, configured to determine, based on the target jump identifier, a head endpoint of wake-up audio data corresponding to the wake-up word.
15. An electronic device, characterized in that the electronic device comprises:
a processor and a memory, the memory having stored therein at least one program code, the at least one program code being loaded and executed by the processor to implement the audio data processing method of any of claims 1 to 13.
16. A computer-readable storage medium, characterized in that at least one program code is stored in the storage medium, which is loaded and executed by a processor, to implement the audio data processing method according to any one of claims 1 to 13.
17. A computer program product, characterized in that the computer program product comprises at least one program code which is loaded and executed by a processor for implementing the audio data processing method as claimed in any one of claims 1 to 13.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111539880.6A CN114299997A (en) | 2021-12-15 | 2021-12-15 | Audio data processing method and device, electronic equipment, storage medium and product |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114299997A true CN114299997A (en) | 2022-04-08 |
Family
ID=80968121
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114299997A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115188370A (en) * | 2022-06-27 | 2022-10-14 | 北京声智科技有限公司 | Voice wake-up method and device and electronic equipment |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2005242182A (en) * | 2004-02-27 | 2005-09-08 | Toshiba Corp | Speech detecting device, speech recognizing device, speech detecting method, and speech recognizing method |
US20100179811A1 (en) * | 2009-01-13 | 2010-07-15 | Crim | Identifying keyword occurrences in audio data |
CN105869628A (en) * | 2016-03-30 | 2016-08-17 | 乐视控股(北京)有限公司 | Voice endpoint detection method and device |
CN111429921A (en) * | 2020-03-02 | 2020-07-17 | 厦门快商通科技股份有限公司 | Voiceprint recognition method, system, mobile terminal and storage medium |
CN111862963A (en) * | 2019-04-12 | 2020-10-30 | 阿里巴巴集团控股有限公司 | Voice wake-up method, device and equipment |
CN111968680A (en) * | 2020-08-14 | 2020-11-20 | 北京小米松果电子有限公司 | Voice processing method, device and storage medium |
CN112151015A (en) * | 2020-09-03 | 2020-12-29 | 腾讯科技(深圳)有限公司 | Keyword detection method and device, electronic equipment and storage medium |
CN113496696A (en) * | 2020-04-03 | 2021-10-12 | 中国科学院深圳先进技术研究院 | Speech function automatic evaluation system and method based on voice recognition |
Non-Patent Citations (3)
Title |
---|
DING HAO, YAO TIANREN: "Endpoint detection based on mel-scale features and phoneme segmentation", 7th International Conference on Signal Processing, 2004, pages 667-670 *
LI KAI, XU QIANG-QIANG, ZUO WAN-LI: "Research of speech endpoint detection based on the variety of fractal feature", Journal of Chinese Computer Systems, vol. 28, no. 8, 2007, pages 1523-1526 *
ZHU Jie, WEI Xiaodong: "Endpoint detection of speech signals using the HMM model approach", 1999 China Conference on Neural Networks and Signal Processing, December 1999, pages 413-416 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108615526B (en) | Method, device, terminal and storage medium for detecting keywords in voice signal | |
CN111933112B (en) | Awakening voice determination method, device, equipment and medium | |
CN109151593B (en) | Anchor recommendation method, device and storage medium | |
CN111564152B (en) | Voice conversion method and device, electronic equipment and storage medium | |
CN110572716B (en) | Multimedia data playing method, device and storage medium | |
CN110556127A (en) | method, device, equipment and medium for detecting voice recognition result | |
CN111027490B (en) | Face attribute identification method and device and storage medium | |
CN110956971A (en) | Audio processing method, device, terminal and storage medium | |
CN111681655A (en) | Voice control method and device, electronic equipment and storage medium | |
CN114299933A (en) | Speech recognition model training method, device, equipment, storage medium and product | |
CN112261491B (en) | Video time sequence marking method and device, electronic equipment and storage medium | |
CN111613213B (en) | Audio classification method, device, equipment and storage medium | |
CN111081277B (en) | Audio evaluation method, device, equipment and storage medium | |
CN112667844A (en) | Method, device, equipment and storage medium for retrieving audio | |
CN112992127A (en) | Voice recognition method and device | |
CN110837557A (en) | Abstract generation method, device, equipment and medium | |
CN114360494A (en) | Rhythm labeling method and device, computer equipment and storage medium | |
CN114299935A (en) | Awakening word recognition method, awakening word recognition device, terminal and storage medium | |
CN113744736A (en) | Command word recognition method and device, electronic equipment and storage medium | |
CN108831423A (en) | Extract method, apparatus, terminal and the storage medium of theme track in audio data | |
CN114299997A (en) | Audio data processing method and device, electronic equipment, storage medium and product | |
CN110992954A (en) | Method, device, equipment and storage medium for voice recognition | |
CN113160802B (en) | Voice processing method, device, equipment and storage medium | |
CN113301444B (en) | Video processing method and device, electronic equipment and storage medium | |
CN111611414A (en) | Vehicle retrieval method, device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||