CN113162837B - Voice message processing method, device, equipment and storage medium

Info

Publication number
CN113162837B
Authority
CN
China
Prior art keywords
audio data
audio
audio frame
frame
data
Prior art date
Legal status
Active
Application number
CN202010013975.3A
Other languages
Chinese (zh)
Other versions
CN113162837A (en)
Inventor
梁俊斌
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Priority to CN202010013975.3A
Publication of CN113162837A
Application granted
Publication of CN113162837B
Status: Active


Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 51/00 - User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L 51/04 - Real-time or near real-time messaging, e.g. instant messaging [IM]
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/45 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/78 - Detection of presence or absence of voice signals
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 51/00 - User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L 51/07 - User-to-user messaging in packet-switching networks characterised by the inclusion of specific contents
    • H04L 51/10 - Multimedia information
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 51/00 - User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L 51/07 - User-to-user messaging in packet-switching networks characterised by the inclusion of specific contents
    • H04L 51/18 - Commands or executable codes

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The application discloses a voice message processing method, device, equipment and storage medium, belonging to the field of computer technology. The method comprises: acquiring first audio data and a first reference number of consecutive second audio data, where the first audio data corresponds to the first audio frame currently to be processed in a voice message and the consecutive second audio data correspond to second audio frames; determining the validity of the first audio frame based on the first audio data and the second audio data; and, in response to the first audio frame being invalid, determining the validity of audio frames located after the first audio frame in the voice message until a valid audio frame is obtained, and acquiring the audio data to be played corresponding to the valid audio frame. Because only the audio data to be played corresponding to valid audio frames is acquired, playing quality is preserved while playing time is effectively shortened when the voice message is played based on that audio data, so the voice message is processed to better effect.

Description

Voice message processing method, device, equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a method, a device, equipment and a storage medium for processing voice messages.
Background
With the rapid development of intelligent terminal technology, application programs have grown ever more varied and their functions increasingly diversified. Currently, most instant messaging applications support a voice message function, through which a user's terminal can send voice messages to, or receive voice messages from, other users' terminals. The terminal processes a voice message to obtain audio data to be played, and then plays the voice message based on that audio data.
In the related art, when processing a voice message, the terminal sequentially decodes each audio frame in the voice message to obtain the audio data corresponding to each audio frame, and then takes all of that audio data as the audio data to be played.
In carrying out the present application, the inventors found that the related art has at least the following problem:
because the audio data corresponding to every audio frame is used as audio data to be played, the playing duration of the voice message equals its recording duration, so playing the voice message consumes a long time and the voice message is processed to poor effect.
Disclosure of Invention
The embodiment of the application provides a method, a device, equipment and a storage medium for processing voice messages, which can be used for improving the processing effect of the voice messages. The technical scheme is as follows:
in one aspect, an embodiment of the present application provides a method for processing a voice message, where the method includes:
acquiring first audio data and a first reference number of continuous second audio data, wherein the first audio data corresponds to a first audio frame to be processed currently in a voice message, the first reference number of continuous second audio data corresponds to a second audio frame, and the second audio frame is a continuous audio frame positioned behind the first audio frame in the voice message;
determining validity of the first audio frame based on the first audio data and the second audio data;
and responding to the invalidation of the first audio frame, determining the validity of the audio frame positioned behind the first audio frame in the voice message until the valid audio frame is obtained, and acquiring the audio frequency to be played corresponding to the valid audio frame.
In another aspect, there is provided a processing apparatus for a voice message, the apparatus comprising:
the first acquisition module is used for acquiring first audio data and a first reference number of continuous second audio data, wherein the first audio data corresponds to a first audio frame to be processed currently in a voice message, the first reference number of continuous second audio data corresponds to a second audio frame, and the second audio frame is a continuous audio frame positioned behind the first audio frame in the voice message;
A determining module configured to determine validity of the first audio frame based on the first audio data and the second audio data;
the determining module is further configured to determine, in response to the first audio frame being invalid, validity of an audio frame located after the first audio frame in the voice message until a valid audio frame is obtained;
and the second acquisition module is used for acquiring the audio data to be played corresponding to the effective audio frames.
In one possible implementation manner, the second obtaining module is further configured to splice the first audio data and the target audio data to obtain spliced audio data, and use the spliced audio data as audio data to be played corresponding to the first audio frame.
In one possible implementation manner, the second obtaining module is further configured to perform windowing on a first sampling point set in the target audio data and a second sampling point set in the first audio data to obtain a third sampling point set, where the first sampling point set includes a second reference number of sampling points located at an end portion in the target audio data, and the second sampling point set includes a second reference number of sampling points located at a start portion in the first audio data;
And based on the third sampling point set, splicing the first audio data and the target audio data to obtain spliced audio data.
In one possible implementation manner, the determining module is configured to obtain a detection result corresponding to the first audio data and a detection result corresponding to each second audio data, where the detection result corresponding to any audio data is used to indicate whether the any audio data is voice signal data; and determining the validity of the first audio frame based on the detection result corresponding to the first audio data and the detection result corresponding to each second audio data.
In one possible implementation manner, the determining module is configured to determine that the first audio frame is invalid in response to the detection result corresponding to the first audio data and the detection result corresponding to each second audio data meeting an invalidation condition; and determining that the first audio frame is valid in response to the detection result corresponding to the first audio data and the detection result corresponding to each second audio data not meeting the invalid condition.
In one possible implementation manner, the detection result corresponding to the first audio data and the detection result corresponding to each second audio data meet an invalidation condition, including: the detection results corresponding to the first audio data indicate that the first audio data are non-voice signal data, and the detection results corresponding to the second audio data indicate that the second audio data are non-voice signal data.
In one possible implementation, the first reference number of consecutive second audio data includes a first portion of second audio data and a second portion of second audio data, the first portion of second audio data corresponding to other audio frames of the second audio frame than the last audio frame, the second portion of second audio data corresponding to the last audio frame of the second audio frame; the first acquisition module is used for extracting the first audio data and the first part of second audio data from the cache; and decoding the code stream of the last audio frame in the second audio frame to obtain the second part of second audio data.
In one possible implementation manner, the first obtaining module is configured to decode a code stream of the first audio frame to obtain the first audio data; and respectively decoding the code streams of all the audio frames in the second audio frames to obtain the first reference number of continuous second audio data.
In another aspect, a computer device is provided, where the computer device includes a processor and a memory, where the memory stores at least one program code, where the at least one program code is loaded and executed by the processor to implement any of the above-mentioned methods for processing a voice message.
In another aspect, there is provided a computer readable storage medium having at least one program code stored therein, the at least one program code loaded and executed by a processor to implement any of the above-described methods of processing a voice message.
The technical scheme provided by the embodiment of the application at least has the following beneficial effects:
The validity of the first audio frame is determined from the first audio data and the second audio data; when the first audio frame is invalid, the validity of subsequent audio frames is determined until a valid audio frame is obtained, and the audio data to be played corresponding to the valid audio frame is acquired. During processing of the voice message, only the audio data to be played corresponding to valid audio frames is acquired; this preserves playing quality while effectively shortening the time playback consumes when the voice message is played based on that audio data, so the voice message is processed to good effect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an implementation environment of a voice message processing method according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for processing a voice message according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a Hanning window function provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of a voice message processing procedure according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a process for processing a first audio frame according to an embodiment of the present application;
fig. 6 is a schematic diagram of a voice message processing apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a voice message processing device according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.
With the rapid development of intelligent terminal technology, application programs have grown ever more varied and their functions increasingly diversified. Currently, most instant messaging applications support a voice message function, through which a user's terminal can send voice messages to, or receive voice messages from, other users' terminals. The terminal can process a voice message during instant messaging to obtain audio data to be played, and then play the voice message based on that audio data.
In this regard, an embodiment of the present application provides a method for processing a voice message, please refer to fig. 1, which is a schematic diagram illustrating an implementation environment of the method for processing a voice message provided in the embodiment of the present application. The implementation environment may include: a terminal 11 and a server 12.
The terminal 11 is provided with an instant messaging application program supporting a voice message function, and the terminal 11 can transmit or receive a voice message based on the application program. Before playing the voice message in the instant messaging process, the terminal 11 may process the voice message in the instant messaging process by using the method provided by the embodiment of the present application to obtain the audio data to be played. The server 12 may be a background server supporting an instant messaging application program of a voice message function, and may be capable of providing data support for an application program installed in the terminal 11.
In one possible implementation, the terminal 11 may be a smart device such as a cell phone, tablet, personal computer, or the like. The server 12 may be a server, a server cluster comprising a plurality of servers, or a cloud computing service center. The terminal 11 establishes a communication connection with the server 12 through a wired or wireless network.
Those skilled in the art will appreciate that the terminal 11 and server 12 described above are only examples; other existing or future terminals or servers, where applicable to the present application, also fall within its scope of protection and are hereby incorporated by reference.
Based on the implementation environment shown in fig. 1, the embodiment of the application provides a method for processing a voice message, which is applied to a terminal as an example. As shown in fig. 2, the method provided by the embodiment of the present application may include the following steps:
in step 201, first audio data and a first reference number of consecutive second audio data are acquired.
The first audio data corresponds to a first audio frame to be processed currently in the voice message, the first reference number of continuous second audio data corresponds to a second audio frame, and the second audio frame is a continuous audio frame positioned behind the first audio frame in the voice message. The second audio frames comprise a first reference number of audio frames, and each audio frame corresponds to one second audio data.
The voice message in the embodiment of the application refers to any voice message to be played in the instant messaging process, and the voice message can be a voice message sent by a terminal or a voice message received by the terminal.
Because the voice message is obtained by encoding the recorded voice signal, the voice message can be divided into a plurality of audio frames which are sequentially arranged according to the sequence of the recording time. The process of dividing the voice message into a plurality of audio frames is a process of dividing the complete code stream corresponding to the voice message into a plurality of code streams, and each audio frame corresponds to one code stream. In one possible implementation, a plurality of audio frames may be numbered sequentially such that each audio frame corresponds to a sequence number, each sequence number may be used to uniquely identify an audio frame.
Before playing the voice message, each audio frame in the voice message needs to be processed to obtain audio data to be played, which corresponds to the voice message. In the embodiment of the application, the audio frame to be processed currently in the voice message is taken as the first audio frame. The first audio data is data obtained by decoding a code stream of the first audio frame.
The first reference number of consecutive second audio data corresponds to the second audio frame, that is, the first reference number of consecutive second audio data is data obtained by decoding a code stream of the second audio frame. The second audio frame includes a first reference number of audio frames in the voice message that follow the first audio frame and are consecutive to the first audio frame. It should be noted that the first reference number may be set empirically, or may be freely adjusted according to practical situations, which is not limited in the embodiment of the present application. Illustratively, when the number of unprocessed audio frames other than the first audio frame in the voice message is not less than n (an integer not less than 0), the first reference number may be set to n; when the number of unprocessed audio frames in the voice message other than the first audio frame is smaller than n, the first reference number may be set equal to the number of unprocessed audio frames in the voice message other than the first audio frame.
It should be noted that, when the first audio frame is the last audio frame in the voice message, the first reference number is 0, and no second audio data exists at this time, that is, when the first audio frame is the last audio frame in the voice message, only the first audio data needs to be acquired.
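As a minimal illustration of this rule, the following Python sketch (the function and variable names are hypothetical, not taken from the patent) computes the first reference number for a given frame position:

    def lookahead_count(total_frames: int, current_index: int, n: int) -> int:
        """First reference number for the frame at current_index: n when at
        least n unprocessed frames follow it, otherwise however many remain
        (0 for the last frame, in which case only the first audio data is
        acquired)."""
        remaining = total_frames - current_index - 1  # frames after the current one
        return min(n, remaining)

For example, with total_frames = 10, n = 5 and current_index = 7, two unprocessed frames remain, so the first reference number is 2.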
In one possible implementation, when the first audio frame is the first audio frame in the voice message, the terminal may perform an operation of acquiring the first audio data and the first reference number of consecutive second audio data based on a selected instruction of the voice message. The selection instruction of the voice message may be a trigger instruction of the user or an automatic selection instruction, which is not limited in the embodiment of the present application.
In one possible implementation, when the first audio frame is not the first audio frame in the voice message, the case where the terminal performs the operation of acquiring the first audio data and the first reference number of consecutive second audio data includes, but is not limited to, the following two:
Case one: in response to the previous audio frame of the first audio frame having been processed, automatically performing the operation of acquiring the first audio data and the first reference number of consecutive second audio data.
The precondition for this case is that no pause-playing instruction for the voice message is detected when processing of the previous audio frame of the first audio frame finishes.
Case two: in response to a continue-playing instruction for the voice message, performing the operation of acquiring the first audio data and the first reference number of consecutive second audio data.
The precondition for this case is that a pause-playing instruction for the voice message is detected when the previous audio frame of the first audio frame has been processed. Under this precondition, after detecting the pause-playing instruction, the terminal records the identification information of the first audio frame currently to be processed, and when it detects a continue-playing instruction for the voice message, it performs the operation of acquiring the first audio data and the first reference number of consecutive second audio data. For example, the identification information of the first audio frame may be the sequence number of the first audio frame.
In one possible implementation, the manner in which the terminal obtains the first audio data and the first reference number of consecutive second audio data includes, but is not limited to, the following two types:
Mode one: the first reference number of consecutive second audio data includes a first portion of second audio data and a second portion of second audio data. The first audio data and the first portion of second audio data are extracted from the cache, and the code stream of the last audio frame in the second audio frames is decoded to obtain the second portion of second audio data.
The first portion of second audio data corresponds to the audio frames in the second audio frames other than the last one, and the second portion of second audio data corresponds to the last audio frame in the second audio frames. It should be noted that, where the audio frames are numbered sequentially, the last audio frame in the second audio frames is the audio frame whose sequence number differs from that of the first audio frame by the first reference number.
In one possible implementation, the conditions under which this occurs are: the first audio frame is not the first audio frame in the voice message and the audio data obtained in the previous processing is stored in the buffer data.
In the case where the above condition is satisfied, the code stream of the first audio frame has been decoded into the first audio data, which has been stored in the buffer data, in the course of processing the audio frame preceding the first audio frame, and therefore, the first audio data can be directly extracted from the buffer. Likewise, the streams of the other audio frames of the second audio frame than the last audio frame have also been decoded into the first part of the second audio data during the processing of the audio frame preceding the first audio frame, which has also been stored in the buffer data, so that the first part of the second audio data can be extracted directly from the buffer.
However, because processing the audio frame preceding the first audio frame did not involve decoding the code stream of the last audio frame in the second audio frames, that code stream must still be decoded to obtain the second portion of second audio data.
That is, in the process of acquiring the first audio data and the first reference number of second audio data based on this manner, only the code stream of the last audio frame in the second audio frame needs to be decoded to obtain the second part of second audio data. Both the first audio data and the first portion of the second audio data may be directly extracted from the buffer.
Illustratively, assume that the first audio frame has a sequence number of 2 and the first reference number is 5. The second audio frame includes five audio frames with sequence numbers of 3, 4, 5, 6 and 7, and the last audio frame in the second audio frame is the audio frame with sequence number of 7. In the process of processing the audio frame with the sequence number of 1, the code stream of the audio frame with the sequence number of 2 (the first audio frame) and the code streams of the four audio frames with the sequence numbers of 3, 4, 5 and 6 (the other audio frames except the last audio frame in the second audio frame) are decoded, the obtained audio data are stored in the buffer data, and only the code stream of the audio frame with the sequence number of 7 (the last audio frame in the second audio frame) is not decoded. Therefore, in the process of acquiring the first audio data and the first reference number of consecutive second audio data, data of audio frames with sequence numbers of 2 (first audio data) and data of audio frames with sequence numbers of 3, 4, 5, 6 (first part of second audio data) are extracted from the buffer; then, the data (second part of second audio data) of the audio frame with the sequence number 7 is obtained by decoding the code stream of the audio frame with the sequence number 7.
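The sliding reuse in this example can be pictured as a small decode cache. The sketch below is an assumed illustration of mode one rather than the patent's implementation; decode_fn is a hypothetical stand-in for the codec's per-frame decoder:

    class DecodeCache:
        """Caches decoded audio data by frame sequence number (mode one)."""

        def __init__(self, decode_fn):
            self.decode_fn = decode_fn  # assumed: sequence number -> decoded audio data
            self.cache = {}

        def get_window(self, first_index: int, lookahead: int) -> list:
            """Return the decoded data for the first audio frame at first_index
            plus its `lookahead` consecutive second audio frames, decoding only
            the frames not already cached (normally just the last one)."""
            window = []
            for index in range(first_index, first_index + lookahead + 1):
                if index not in self.cache:
                    self.cache[index] = self.decode_fn(index)  # decode only new frames
                window.append(self.cache[index])
            return window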
Mode two: decoding the code stream of the first audio frame to obtain first audio data; and respectively decoding the code streams of all the audio frames in the second audio frames to obtain a first reference number of continuous second audio data.
In one possible implementation, such a second occurrence condition includes, but is not limited to, the following two:
condition 1: the first audio frame is the first audio frame in the voice message.
Condition 2: the first audio frame is not the first audio frame in the voice message, but the audio data obtained in the previous processing is not saved in the buffer data.
In the case that any of the above conditions is satisfied, since the first audio data and the first reference number of consecutive second audio data cannot be directly extracted from the buffer, the first audio data and the first reference number of consecutive second audio data are obtained by respectively performing decoding processing on the code stream of the first audio frame and the code stream of each of the second audio frames.
In a possible implementation manner, in the case that the above condition 1 is satisfied, that is, when the first audio frame is the first audio frame in the voice message, after the first audio data and the first reference number of consecutive second audio data are acquired, the first audio data and the first reference number of consecutive second audio data may be stored in the buffer data, so as to directly extract the relevant audio data from the buffer when other audio frames are subsequently processed.
It should be noted that the embodiment of the application does not limit the format of the audio data obtained after decoding, as long as that format can be identified and played by the playing device of the terminal. For example, the audio data obtained after decoding may be in PCM (Pulse Code Modulation) format. PCM-format audio data is standard digital audio data obtained by sampling, quantizing, and encoding an analog signal. The embodiment of the application likewise does not limit the playing device; for example, the playing device may be a sound card device.
In step 202, the validity of the first audio frame is determined based on the first audio data and the second audio data.
The validity of the first audio frame is used for indicating whether the audio data to be played corresponding to the first audio frame needs to be acquired. When the first audio frame is valid, audio data to be played corresponding to the first audio frame is required to be acquired; when the first audio frame is invalid, the audio data to be played corresponding to the first audio frame does not need to be acquired. In one possible implementation, the process of determining the validity of the first audio frame based on the first audio data and the second audio data comprises the following steps 2021 and 2022:
Step 2021: and acquiring a detection result corresponding to the first audio data and a detection result corresponding to each second audio data.
The detection result corresponding to any audio data is used for indicating whether any audio data is voice signal data or not. That is, the meaning of the detection result indication corresponding to any one audio data includes two kinds: any one of the audio data is speech signal data, and any one of the audio data is non-speech signal data.
In an actual application scenario, cases in which audio data is non-speech signal data include, but are not limited to: the audio data is silence data; the audio data carries no specific semantics, for example drawn-out filler sounds; or the audio data is noise data, for example ambient noise or irrelevant speech.
The detection result corresponding to the first audio data and the detection result corresponding to each second audio data may be obtained by detecting the first audio data and each second audio data with a VAD (Voice Activity Detection) algorithm. A VAD algorithm detects whether input audio data is voice signal data. The embodiment of the application does not limit the type of VAD algorithm; illustratively, the types include, but are not limited to: VAD algorithms based on signal-to-noise ratio (Signal Noise Ratio, SNR), on energy stationarity, on deep neural networks (Deep Neural Networks, DNN), and on hidden Markov models (Hidden Markov Model, HMM), etc.
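Purely as an illustration, the sketch below implements one member of that list, a toy signal-to-noise-ratio VAD; the patent does not fix a particular algorithm, and the threshold and noise-floor handling here are assumptions:

    import numpy as np

    def vad_detect(frame: np.ndarray, noise_floor: float,
                   snr_threshold_db: float = 6.0) -> int:
        """Return 1 (voice signal data) when the frame's mean energy exceeds
        the estimated noise floor by snr_threshold_db, else 0 (non-voice)."""
        energy = float(np.mean(frame.astype(np.float64) ** 2)) + 1e-12
        snr_db = 10.0 * np.log10(energy / (noise_floor + 1e-12))
        return 1 if snr_db > snr_threshold_db else 0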
It should be noted that, when the first audio frame is the last audio frame in the voice message, the second audio data does not exist, and only the detection result corresponding to the first audio data needs to be obtained at this time.
In one possible implementation manner, the manner of obtaining the detection result corresponding to the first audio data and the detection result corresponding to each second audio data includes, but is not limited to, the following two manners:
mode one: extracting a detection result corresponding to the first audio data and a detection result corresponding to the first part of the second audio data from the cache; and detecting the second audio data of the second part based on the VAD algorithm to obtain a detection result corresponding to the second audio data of the second part.
In one possible implementation, the conditions under which this occurs are: the first audio frame is not the first audio frame in the voice message, and the detection result corresponding to the audio data obtained in the previous processing process is stored in the cache data.
In the case where the above condition is satisfied, the detection result corresponding to the first audio data and the detection result corresponding to the first portion of the second audio data have been obtained and stored in the buffer data during the processing of the audio frame preceding the first audio frame. Therefore, the detection result corresponding to the first audio data and the detection result corresponding to the first part of the second audio data can be directly extracted from the buffer memory.
However, in the process of processing the audio frame before the first audio frame, the process of detecting the second portion of the second audio data based on the VAD algorithm is not involved, so that the second portion of the second audio data needs to be detected based on the VAD algorithm to obtain a detection result corresponding to the second portion of the second audio data.
That is, in the process of acquiring the detection result corresponding to the first audio data and the detection result corresponding to each second audio data based on the mode, only the second part of the second audio data is detected based on the VAD algorithm, so as to obtain the detection result corresponding to the second part of the second audio data; the detection result corresponding to the first audio data and the detection result corresponding to the first part of the second audio data can be directly extracted from the cache.
Mode two: detecting the first audio data based on the VAD algorithm to obtain a detection result corresponding to the first audio data; and detecting each second audio data based on the VAD algorithm to obtain detection results corresponding to each second audio data.
In one possible implementation, such a second occurrence condition includes, but is not limited to, the following two:
Condition 1: the first audio frame is the first audio frame in the voice message.
Condition 2: the first audio frame is not the first audio frame in the voice message, but the detection result corresponding to the audio data obtained in the previous processing process is not stored in the buffer data.
Under the condition that any one of the conditions is met, the detection result corresponding to the first audio data and the detection result corresponding to each second audio data cannot be directly extracted from the cache, so that the detection result corresponding to the first audio data and the detection result corresponding to each second audio data are obtained by detecting the first audio data and each second audio data based on the VAD algorithm.
In one possible implementation manner, in the case that the above condition 1 is satisfied, that is, when the first audio frame is the first audio frame in the voice message, after the detection result corresponding to the first audio data and the detection result corresponding to each second audio data are obtained, the detection result corresponding to the first audio data and the detection result corresponding to each second audio data may be stored in the buffer data, so that when other audio frames are processed later, the detection result corresponding to the relevant audio data may be directly extracted from the buffer.
Whether by the first or the second manner, once the detection result corresponding to the first audio data and the detection results corresponding to each second audio data have been acquired, step 2022 is performed.
Step 2022: and determining the validity of the first audio frame based on the detection result corresponding to the first audio data and the detection result corresponding to each second audio data.
In one possible implementation manner, the implementation process of this step is as follows: determining that the first audio frame is invalid in response to the detection result corresponding to the first audio data and the detection result corresponding to each second audio data meeting the invalidation condition; and determining that the first audio frame is valid in response to the detection result corresponding to the first audio data and the detection result corresponding to each second audio data not meeting the invalid condition.
In one possible implementation manner, the detection result corresponding to the first audio data and the detection result corresponding to each second audio data meeting the invalidation condition may refer to: the detection results corresponding to the first audio data indicate that the first audio data are non-voice signal data, and the detection results corresponding to the second audio data indicate that the second audio data are non-voice signal data.
When the detection result corresponding to the first audio data and the detection result corresponding to each second audio data meet the invalidation condition, the first audio data and the first reference number of continuous second audio data are all non-voice signal data, and at the moment, the first audio frame is determined to be invalid.
When the detection result corresponding to the first audio data and the detection result corresponding to each second audio data do not meet the invalid condition, at least one audio data in the first audio data and the first reference number of continuous second audio data is indicated to be voice signal data, and at the moment, the first audio frame is determined to be valid.
For example, the detection result may be represented by a first value such as 0 and a second value such as 1, where 1 denotes voice signal data and 0 denotes non-voice signal data. Assuming the first reference number is N (an integer not less than 0), the first audio data and the first reference number of consecutive second audio data correspond to (N + 1) detection results in total. When all (N + 1) detection results are 0, the invalidation condition is satisfied and the first audio frame is invalid; when the (N + 1) detection results include at least one value of 1, the invalidation condition is not satisfied and the first audio frame is valid.
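With the 0/1 convention above, the decision reduces to a single check over the (N + 1) detection results, as in this sketch (names are illustrative, not from the patent):

    def first_frame_is_valid(detection_results: list) -> bool:
        """detection_results holds the (N + 1) results for the first audio data
        and its N consecutive second audio data (1 = voice, 0 = non-voice).
        The first audio frame is invalid only when every result is 0."""
        return any(result == 1 for result in detection_results)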
Based on steps 2021 and 2022 above, the validity of the first audio frame can be determined. When the first audio frame is invalid, step 203 is performed; when the first audio frame is valid, the audio data to be played corresponding to the first audio frame is acquired based on the first audio data.
In one possible implementation manner, based on the first audio data, the process of acquiring the audio data to be played corresponding to the first audio frame includes the following steps a to C:
step A: the validity of the third audio frame is obtained.
The third audio frame is an audio frame located before the first audio frame in the voice message.
When the first audio frame is valid, the validity of a third audio frame positioned in front of the first audio frame in the voice message needs to be acquired; when the third audio frame is valid, executing the step B; when the third audio frame is invalid, step C is performed.
In one possible implementation, the terminal may store the correspondence between the sequence number of the audio frame of known validity and the validity in the buffer data during processing of the voice message. The method for the terminal to acquire the validity of the third audio frame is as follows: and the terminal queries the validity corresponding to the sequence number of the third audio frame in the corresponding relation between the sequence number of the audio frame and the validity based on the sequence number of the third audio frame.
Step B: in response to the third audio frame being valid, taking the first audio data as the audio data to be played corresponding to the first audio frame.
When the third audio frame is valid, two consecutive audio frames are both valid, and the first audio data can be used directly, without further processing, as the audio data to be played corresponding to the first audio frame.
Step C: in response to the third audio frame being invalid, acquiring target audio data, and acquiring the audio data to be played corresponding to the first audio frame based on the first audio data and the target audio data.
The target audio data is audio data to be played corresponding to target audio frames, and the target audio frames are audio frames closest to the first audio frame in all the audio frames which are determined to be valid.
And when the third audio frame is invalid, acquiring target audio data. And then acquiring audio data to be played corresponding to the first audio frame based on the first audio data and the target audio data.
It should be noted that, the audio data to be played needs to be put into the play buffer, so that the audio data can be played by the playing device of the terminal. In one possible implementation, the manner in which the target audio data is obtained includes, but is not limited to, the following two:
Mode one: the target audio data is extracted from the play-out buffer.
The conditions under which this occurs are: after the target audio data is obtained, the target audio data is put into a play buffer. That is, after obtaining new audio data to be played, the new audio data to be played is put into the playing buffer.
Under the above conditions, the target audio data needs to be extracted from the play buffer, and when the audio data to be played corresponding to the first audio frame is obtained based on the target audio data and the first audio data, the audio data to be played corresponding to the first audio frame is put into the play buffer.
Mode two: target audio data is extracted from the buffer.
The conditions under which this occurs are: after the target audio data are obtained, the target audio data are temporarily stored in the cache data. That is, after obtaining new audio data to be played, the new audio data to be played is temporarily stored in the buffer data. Under this condition, after determining that the target audio data does not need to be spliced with other audio data, the target audio data is put into the play buffer.
Under the above conditions, it is necessary to extract target audio data from the buffer memory, and after obtaining audio data to be played corresponding to the first audio frame based on the target audio data and the first audio data, continue temporarily storing the audio data to be played corresponding to the first audio frame in the buffer memory data. After determining that the audio data to be played corresponding to the first audio frame does not need to be spliced with other audio data, the audio data to be played corresponding to the first audio frame is put into a play buffer.
This mode avoids the redundant operations of putting the target audio data into the play buffer and then extracting it from the play buffer again.
Whether the target audio data is acquired based on the first mode or the second mode, after the target audio data is acquired, the audio data to be played corresponding to the first audio frame is acquired based on the first audio data and the target audio data. In one possible implementation manner, based on the first audio data and the target audio data, the manner of obtaining the audio data to be played corresponding to the first audio frame is as follows: and performing splicing processing on the first audio data and the target audio data to obtain spliced audio data, and taking the spliced audio data as audio data to be played, which corresponds to the first audio frame.
Because the first audio frame and the target audio frame are not continuous audio frames, the first audio data and the target audio data need to be spliced so as to ensure the playing fluency of the voice message.
In one possible implementation manner, the process of splicing the first audio data and the target audio data to obtain the spliced audio data includes the following two steps:
Step 1: and windowing the first sampling point set in the target audio data and the second sampling point set in the first audio data to obtain a third sampling point set.
Wherein the first set of sampling points includes a second reference number of sampling points located at an end portion of the target audio data, and the second set of sampling points includes a second reference number of sampling points located at a start portion of the first audio data. The second reference number may be set empirically, or may be freely adjusted according to the application scenario, which is not limited in the embodiment of the present application. For convenience of explanation, it is assumed that the second reference number is N (an integer not less than 1). The first set of sampling points includes N sampling points; the second set of sample points also includes N sample points.
In one possible implementation, the third sampling point set is obtained by windowing as follows: for the i-th sampling point (1 ≤ i ≤ N) in the first sampling point set and the i-th sampling point in the second sampling point set, a windowing calculation is performed using Hanning window functions, and the i-th sampling point in the third sampling point set is then obtained through arithmetic addition and limit-value processing.
Specifically, the i-th sampling point in the third sampling point set is obtained as follows: based on formula 1, the product of the value of the i-th sampling point in the first sampling point set and the first Hanning window function is arithmetically added to the product of the value of the i-th sampling point in the second sampling point set and the second Hanning window function; limit-value processing is then applied to the sum based on formula 2, yielding the value of the i-th sampling point in the third sampling point set.

x(i) = x1(i) · hanning(N - i) + x2(i) · hanning(i)    (formula 1)

Here x(i) denotes the value of the i-th sampling point in the third sampling point set, x1(i) the value of the i-th sampling point in the first sampling point set, and x2(i) the value of the i-th sampling point in the second sampling point set. hanning(N - i) is the first Hanning window function and hanning(i) is the second Hanning window function, whose expression is given by formula 3. As shown in FIG. 3, curve 301 represents the first Hanning window function hanning(N - i) and curve 302 represents the second Hanning window function hanning(i); the abscissa of FIG. 3 is the value of i and the ordinate is the function value of the Hanning window.

As i varies, the value of every sampling point in the third sampling point set can be obtained from formulas 1 and 2, that is, the third sampling point set is obtained.
Step 2: and based on the third sampling point set, splicing the first audio data and the target audio data to obtain spliced audio data.
After the third sampling point set is obtained, the first audio data and the target audio data can be spliced based on it. In one possible implementation, the splicing proceeds as follows: concatenate, in order, the sampling points of the target audio data other than those in the first sampling point set, the sampling points in the third sampling point set, and the sampling points of the first audio data other than those in the second sampling point set, obtaining the spliced audio data. The spliced audio data is the audio data to be played corresponding to the first audio frame. Obtaining the spliced audio data in this way avoids the noise that would result from directly butting the tail and head sampling points of the two pieces of audio data.
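The sketch below illustrates steps 1 and 2 for 16-bit PCM samples. The half-Hanning ramp shape and the clipping range are assumptions (formulas 2 and 3 are referenced above but not reproduced here); the point is that the two complementary windows sum to 1 across the overlap, so the crossfaded region keeps a natural level:

    import numpy as np

    def crossfade_splice(target: np.ndarray, first: np.ndarray, n: int) -> np.ndarray:
        """Splice target audio data and first audio data (int16 PCM): window the
        last n samples of target against the first n samples of first (formula 1),
        clip the sum (our reading of the formula-2 limit step), and concatenate
        the three segments per step 2."""
        i = np.arange(n)
        fade_in = 0.5 * (1.0 - np.cos(np.pi * i / max(n - 1, 1)))  # rises 0 -> 1 (assumed shape)
        fade_out = fade_in[::-1]                                   # falls 1 -> 0; the two sum to 1
        mixed = (target[-n:].astype(np.float64) * fade_out
                 + first[:n].astype(np.float64) * fade_in)
        mixed = np.clip(mixed, -32768, 32767).astype(np.int16)     # limit-value processing
        # target without its tail, the windowed overlap, first without its head
        return np.concatenate([target[:-n], mixed, first[n:]])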
It should be noted that, the audio data to be played obtained based on the first audio data and the target audio data is not only the audio data to be played corresponding to the first audio frame, but also the updated audio data to be played corresponding to the target audio frame.
In one possible implementation, when the first audio frame is the first audio frame in the voice message, there is no third audio frame nor the target audio frame. In this case, when the first audio frame is valid, the first audio data is directly taken as audio data to be played, which corresponds to the first audio frame.
When the first audio frame is valid, the audio data to be played corresponding to it can be obtained through steps A to C above. After that audio data is acquired, whether the first audio frame is the last audio frame in the voice message may be judged. If it is, processing of the current voice message is complete and the next voice message can be processed; if it is not, the audio frame following the first audio frame in the voice message is taken as the new first audio frame to be processed, and the process repeats until the first audio frame is the last audio frame in the voice message.
In step 203, in response to the first audio frame being invalid, validity of an audio frame located after the first audio frame in the voice message is determined until a valid audio frame is obtained, and audio data to be played corresponding to the valid audio frame is obtained.
When the first audio frame is invalid, audio data to be played corresponding to the first audio frame is not acquired. Based on step 201 and steps 2021 and 2022 in step 202, the validity of the audio frame following the first audio frame in the voice message is determined until a valid audio frame is obtained.
Specifically, the validity of the audio frame immediately following the first audio frame is determined first. When that audio frame is valid, it is the valid audio frame; when it is invalid, the validity of the next audio frame to be processed in the voice message is determined, and this process repeats until a valid audio frame is obtained.
After the valid audio frame is obtained, the audio data to be played corresponding to it is acquired based on steps A to C in step 202. Through the above process, only the audio data to be played corresponding to valid audio frames is obtained, so invalid audio frames in the voice message are effectively compressed.
In one possible implementation, when any audio frame is invalid, the audio data corresponding to the any audio frame may be placed in a data queue that does not need to be played, and cleared periodically.
Based on the above process, audio data to be played corresponding to the valid audio frame in the voice message can be obtained, so that the playing device of the terminal plays the voice message based on the audio data to be played corresponding to the valid audio frame.
In summary, the processing procedure of the voice message may be as shown in steps 401 to 410 in FIG. 4: a selection instruction for the voice message is obtained, and whether an unprocessed audio frame exists in the voice message is then judged; if an unprocessed audio frame exists, whether a pause-playing instruction has been detected is judged; if no pause-playing instruction is detected, the first audio data and the first reference number of consecutive second audio data are acquired directly; if a pause-playing instruction is detected, the sequence number of the first audio frame is saved and playing stops, and when a continue-playing instruction is detected, the first audio data and the first reference number of consecutive second audio data are acquired. The first audio frame is then processed based on the first audio data and the first reference number of consecutive second audio data; after the first audio frame is processed, the procedure returns to the judgment of whether an unprocessed audio frame exists, looping until no unprocessed audio frame remains in the voice message. During this procedure, the audio data to be played corresponding to each valid audio frame is put into the play buffer, and the playing device of the terminal plays the voice message based on that audio data.
The processing of the first audio frame based on the first audio data and the second audio data may be seen in steps 501 to 509 in FIG. 5. After the first audio data and the first reference number of consecutive second audio data are acquired, the VAD detection result corresponding to the first audio data and the VAD detection results corresponding to each second audio data are obtained. Whether those detection results are all 0 is then judged: if they are all 0, the first audio frame is determined to be invalid and processing continues with the next audio frame; if they are not all 0, the first audio frame is determined to be valid. Whether the previous frame of the first audio frame is valid is then judged: if it is valid, the first audio data is taken as the audio data to be played corresponding to the first audio frame; if it is invalid, the first audio data is spliced with the audio data to be played corresponding to the most recent valid audio frame, obtaining the audio data to be played corresponding to the first audio frame.
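Putting the pieces together, the following sketch mirrors the flow of FIG. 4 and FIG. 5 under the same assumptions as the earlier snippets; it reuses the hypothetical DecodeCache, first_frame_is_valid and crossfade_splice helpers shown above, takes vad_fn as a stand-in for whatever VAD is used, and omits pause/resume handling and the play buffer:

    import numpy as np

    def process_voice_message(num_frames: int, decode_fn, vad_fn, n: int,
                              crossfade_n: int = 160) -> np.ndarray:
        """Walk the voice message frame by frame, skip frames judged invalid,
        and crossfade-splice across each skipped run; returns the audio data
        to be played. crossfade_n (the second reference number) is an assumed
        default."""
        cache = DecodeCache(decode_fn)
        played = None        # audio data accumulated for playback
        prev_valid = False   # whether the previous audio frame was valid
        for k in range(num_frames):
            lookahead = min(n, num_frames - k - 1)
            window = cache.get_window(k, lookahead)
            if not first_frame_is_valid([vad_fn(data) for data in window]):
                prev_valid = False                  # invalid frame: skip it
                continue
            data = window[0]
            if played is None:
                played = data                       # first valid frame
            elif prev_valid:
                played = np.concatenate([played, data])  # consecutive valid frames
            else:
                played = crossfade_splice(played, data, crossfade_n)  # bridge skipped run
            prev_valid = True
        return played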
Based on the voice message processing manner provided by the embodiment of the application, invalid audio frames can be effectively compressed, which improves the efficiency with which a user listens to voice messages; according to statistics, the method can save more than 20% of the user's listening time.
In the embodiment of the application, the validity of the first audio frame is determined according to the first audio data and the second audio data; when the first audio frame is invalid, the validity of subsequent audio frames is determined until a valid audio frame is obtained, and the audio data to be played corresponding to that valid audio frame is acquired. During the processing of the voice message, only the audio data to be played corresponding to valid audio frames is acquired. When the voice message is played based on this data, playing quality is ensured while the time consumed by playing is effectively shortened, so the voice message is processed to good effect.
Based on the same technical concept, referring to fig. 6, an embodiment of the present application provides a processing apparatus for a voice message, including:
the first obtaining module 601 is configured to obtain first audio data and a first reference number of consecutive second audio data, where the first audio data corresponds to a first audio frame to be currently processed in the voice message, the first reference number of consecutive second audio data corresponds to a second audio frame, and the second audio frame is a consecutive audio frame located after the first audio frame in the voice message;
a determining module 602, configured to determine validity of the first audio frame based on the first audio data and the second audio data;
the determining module 602 is further configured to determine, in response to the first audio frame being invalid, validity of an audio frame located after the first audio frame in the voice message until a valid audio frame is obtained;
the second obtaining module 603 is configured to obtain audio data to be played corresponding to the valid audio frame.
In one possible implementation, the second obtaining module 603 is further configured to obtain, in response to the first audio frame being valid, audio data to be played corresponding to the first audio frame based on the first audio data.
In one possible implementation manner, the second obtaining module 603 is further configured to obtain validity of a third audio frame, where the third audio frame is an audio frame located before the first audio frame in the voice message;
in response to the third audio frame being valid, take the first audio data as the audio data to be played corresponding to the first audio frame;
and in response to the third audio frame being invalid, acquire target audio data and obtain the audio data to be played corresponding to the first audio frame based on the first audio data and the target audio data, where the target audio data is the audio data to be played corresponding to a target audio frame, and the target audio frame is the audio frame nearest to the first audio frame among all audio frames already determined to be valid.
In one possible implementation manner, the second obtaining module 603 is further configured to splice the first audio data and the target audio data to obtain spliced audio data, and take the spliced audio data as audio data to be played corresponding to the first audio frame.
In one possible implementation manner, the second obtaining module 603 is further configured to perform windowing on a first sampling point set in the target audio data and a second sampling point set in the first audio data to obtain a third sampling point set, where the first sampling point set includes a second reference number of sampling points located at the end portion of the target audio data, and the second sampling point set includes a second reference number of sampling points located at the start portion of the first audio data;
and based on the third sampling point set, splicing the first audio data and the target audio data to obtain spliced audio data.
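A sketch of this windowed splice follows; the linear (triangular) cross-fade window and the overlap length are assumptions of the sketch, since the embodiment fixes neither a window shape nor a value for the second reference number:

    import numpy as np

    SECOND_REFERENCE_NUMBER = 80  # assumed overlap; 80 samples = 5 ms at 16 kHz

    def windowed_splice(target, first, n=SECOND_REFERENCE_NUMBER):
        """Splices target and first audio data through a windowed third sampling
        point set; assumes both inputs are at least n samples long."""
        target = np.asarray(target, dtype=float)
        first = np.asarray(first, dtype=float)
        fade_out = np.linspace(1.0, 0.0, n)  # window over the tail of the target data
        fade_in = np.linspace(0.0, 1.0, n)   # window over the start of the first data
        third_set = target[-n:] * fade_out + first[:n] * fade_in
        # Keep the un-windowed parts and join them through the third sampling
        # point set, smoothing the discontinuity between non-adjacent frames.
        return np.concatenate([target[:-n], third_set, first[n:]])

The cross-fade avoids the audible click that a hard concatenation of two frames that were not adjacent in the original message would otherwise tend to produce.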
In one possible implementation manner, the determining module 602 is configured to acquire a detection result corresponding to the first audio data and a detection result corresponding to each second audio data, where the detection result corresponding to any audio data is used to indicate whether that audio data is voice signal data; and to determine the validity of the first audio frame based on the detection result corresponding to the first audio data and the detection results corresponding to the second audio data.
In one possible implementation manner, the determining module 602 is configured to determine that the first audio frame is invalid in response to the detection result corresponding to the first audio data and the detection results corresponding to the second audio data meeting an invalidation condition; and to determine that the first audio frame is valid in response to those detection results not meeting the invalidation condition.
In one possible implementation manner, the detection result corresponding to the first audio data and the detection results corresponding to the second audio data meet the invalidation condition when the detection result corresponding to the first audio data indicates that the first audio data is non-voice signal data and the detection result corresponding to each second audio data indicates that the corresponding second audio data is non-voice signal data.
In one possible implementation, the first reference number of consecutive second audio data includes a first portion of second audio data, corresponding to the audio frames in the second audio frames other than the last audio frame, and a second portion of second audio data, corresponding to the last audio frame in the second audio frames. The first obtaining module 601 is configured to extract the first audio data and the first portion of second audio data from the buffer, and to decode the code stream of the last audio frame in the second audio frames to obtain the second portion of second audio data.
In one possible implementation manner, the first obtaining module 601 is configured to decode the code stream of the first audio frame to obtain the first audio data, and to decode the code streams of the audio frames in the second audio frames respectively to obtain the first reference number of consecutive second audio data.
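The difference between the two acquisition paths above can be illustrated with a small decode cache: once the sliding window is warm, only the newest second audio frame has to be decoded, and everything else is extracted from the buffer. Here decode() is a placeholder for the actual speech codec, not an API named by the patent:

    def decode(code_stream):
        # Placeholder for the speech codec's frame decoder (returns PCM samples).
        raise NotImplementedError

    def acquire_window(code_streams, cache, i, n):
        """Returns (first audio data, up to n consecutive second audio data) for
        frame i, decoding from the code streams only what the cache lacks."""
        window = []
        for j in range(i, min(i + n + 1, len(code_streams))):
            if j not in cache:        # cold path: decode the frame's code stream
                cache[j] = decode(code_streams[j])
            window.append(cache[j])   # warm path: extract from the buffer
        return window[0], window[1:]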
In the embodiment of the application, the validity of the first audio frame is determined according to the first audio data and the second audio data; when the first audio frame is invalid, the validity of subsequent audio frames is determined until a valid audio frame is obtained, and the audio data to be played corresponding to that valid audio frame is acquired. During the processing of the voice message, only the audio data to be played corresponding to valid audio frames is acquired. When the voice message is played based on this data, playing quality is ensured while the time consumed by playing is effectively shortened, so the voice message is processed to good effect.
It should be noted that the division into the functional modules described above is only an example of how the apparatus provided in the foregoing embodiment performs its functions; in practical applications, these functions may be allocated to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules so as to perform all or part of the functions described above. In addition, the apparatus embodiment and the method embodiments provided above belong to the same concept; for the specific implementation process of the apparatus, see the method embodiments, which is not repeated here.
Fig. 7 is a schematic structural diagram of a voice message processing device according to an embodiment of the present application. The device may be a terminal, for example a smartphone, a tablet computer, a notebook computer, or a desktop computer. A terminal may also be referred to by other names, such as user equipment, portable terminal, laptop terminal, or desktop terminal.
Generally, the terminal includes: a processor 701 and a memory 702.
Processor 701 may include one or more processing cores, for example a 4-core processor or an 8-core processor. The processor 701 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), or PLA (Programmable Logic Array). The processor 701 may also include a main processor and a coprocessor: the main processor, also referred to as a CPU (Central Processing Unit), is a processor for processing data in the awake state; the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 701 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 701 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 702 may include one or more computer-readable storage media, which may be non-transitory. The memory 702 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in the memory 702 is used to store at least one instruction, the at least one instruction being executed by the processor 701 to implement the voice message processing method provided by the method embodiments of the present application.
In some embodiments, the terminal may further optionally include: a peripheral interface 703 and at least one peripheral. The processor 701, the memory 702, and the peripheral interface 703 may be connected by a bus or signal lines. The individual peripheral devices may be connected to the peripheral device interface 703 via buses, signal lines or a circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 704, touch display 705, camera assembly 706, audio circuitry 707, positioning assembly 708, and power supply 709.
The peripheral interface 703 may be used to connect at least one I/O (Input/Output)-related peripheral device to the processor 701 and the memory 702. In some embodiments, the processor 701, the memory 702, and the peripheral interface 703 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 701, the memory 702, and the peripheral interface 703 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 704 is configured to receive and transmit RF (Radio Frequency) signals, also referred to as electromagnetic signals. The radio frequency circuit 704 communicates with communication networks and other communication devices via electromagnetic signals, converting an electrical signal into an electromagnetic signal for transmission, or converting a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 704 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so on. The radio frequency circuit 704 may communicate with other terminals via at least one wireless communication protocol, including but not limited to: metropolitan area networks, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 704 may also include NFC (Near Field Communication) related circuitry, which is not limited in the application.
The display screen 705 is used to display a UI (User Interface), which may include graphics, text, icons, video, and any combination thereof. When the display screen 705 is a touch display screen, it also has the ability to collect touch signals on or above its surface. Such a touch signal may be input to the processor 701 as a control signal for processing; the display screen 705 may then also provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 705, disposed on the front panel of the terminal; in other embodiments, there may be at least two display screens 705, disposed on different surfaces of the terminal or in a folded design; in still other embodiments, the display screen 705 may be a flexible display screen disposed on a curved or folded surface of the terminal. The display screen 705 may even be arranged in an irregular, non-rectangular shape, that is, a shaped screen. The display screen 705 may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).
The camera assembly 706 is used to capture images or video. Optionally, the camera assembly 706 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal and the rear camera is disposed on the back of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting, VR (Virtual Reality) shooting, or other fused shooting functions. In some embodiments, the camera assembly 706 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash; a dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash and can be used for light compensation under different color temperatures.
The audio circuit 707 may include a microphone and a speaker. The microphone is used for collecting sound waves of users and environments, converting the sound waves into electric signals, and inputting the electric signals to the processor 701 for processing, or inputting the electric signals to the radio frequency circuit 704 for voice communication. For the purpose of stereo acquisition or noise reduction, a plurality of microphones can be respectively arranged at different parts of the terminal. The microphone may also be an array microphone or an omni-directional pickup microphone. The speaker is used to convert electrical signals from the processor 701 or the radio frequency circuit 704 into sound waves. The speaker may be a conventional thin film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only the electric signal can be converted into a sound wave audible to humans, but also the electric signal can be converted into a sound wave inaudible to humans for ranging and other purposes. In some embodiments, the audio circuit 707 may also include a headphone jack.
The positioning component 708 is used to locate the current geographic position of the terminal to enable navigation or LBS (Location Based Service). The positioning component 708 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
The power supply 709 is used to supply power to the various components in the terminal. The power supply 709 may be alternating current, direct current, a disposable battery, or a rechargeable battery. When the power supply 709 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging, and may also support fast-charging technology.
In some embodiments, the terminal further includes one or more sensors 710. The one or more sensors 710 include, but are not limited to: acceleration sensor 711, gyroscope sensor 712, pressure sensor 713, fingerprint sensor 714, optical sensor 715, and proximity sensor 716.
The acceleration sensor 711 can detect the magnitudes of accelerations on three coordinate axes of a coordinate system established with the terminal. For example, the acceleration sensor 711 may be used to detect the components of the gravitational acceleration in three coordinate axes. The processor 701 may control the touch display screen 705 to display a user interface in a landscape view or a portrait view according to the gravitational acceleration signal acquired by the acceleration sensor 711. The acceleration sensor 711 may also be used for the acquisition of motion data of a game or a user.
The gyro sensor 712 may detect the body direction and rotation angle of the terminal, and may cooperate with the acceleration sensor 711 to collect the user's 3D actions on the terminal. Based on the data collected by the gyro sensor 712, the processor 701 may implement functions such as motion sensing (e.g., changing the UI according to a tilting operation by the user), image stabilization during shooting, game control, and inertial navigation.
The pressure sensor 713 may be disposed on a side frame of the terminal and/or under the touch display screen 705. When disposed on a side frame, it can detect the user's grip signal on the terminal, and the processor 701 performs left/right-hand recognition or quick operations according to the grip signal collected by the pressure sensor 713. When disposed under the touch display screen 705, the processor 701 controls the operability controls on the UI according to the user's pressure operations on the touch display screen 705. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The fingerprint sensor 714 is used to collect the user's fingerprint, and the processor 701 identifies the user's identity according to the fingerprint collected by the fingerprint sensor 714, or the fingerprint sensor 714 identifies the user's identity according to the collected fingerprint. Upon recognizing the user's identity as trusted, the processor 701 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, changing settings, and the like. The fingerprint sensor 714 may be provided on the front, back, or side of the terminal. When a physical button or vendor logo is provided on the terminal, the fingerprint sensor 714 may be integrated with the physical button or vendor logo.
The optical sensor 715 is used to collect the ambient light intensity. In one embodiment, the processor 701 may control the display brightness of the touch display 705 based on the ambient light intensity collected by the optical sensor 715. Specifically, when the intensity of the ambient light is high, the display brightness of the touch display screen 705 is turned up; when the ambient light intensity is low, the display brightness of the touch display screen 705 is turned down. In another embodiment, the processor 701 may also dynamically adjust the shooting parameters of the camera assembly 706 based on the ambient light intensity collected by the optical sensor 715.
A proximity sensor 716, also referred to as a distance sensor, is typically provided on the front panel of the terminal and is used to collect the distance between the user and the front face of the terminal. In one embodiment, when the proximity sensor 716 detects that this distance gradually decreases, the processor 701 controls the touch display screen 705 to switch from the bright-screen state to the off-screen state; when the proximity sensor 716 detects that the distance gradually increases, the processor 701 controls the touch display screen 705 to switch from the off-screen state back to the bright-screen state.
It will be appreciated by those skilled in the art that the structure shown in fig. 7 is not limiting of the terminal and may include more or fewer components than shown, or may combine certain components, or may employ a different arrangement of components.
In an exemplary embodiment, a computer device is also provided; see fig. 8. The computer device includes a processor 801 and a memory 802, and the memory 802 has at least one piece of program code stored therein. The at least one piece of program code is loaded and executed by the processor 801 to implement any of the above methods for processing a voice message.
In an exemplary embodiment, there is also provided a computer-readable storage medium having at least one piece of program code stored therein, the at least one piece of program code being loaded and executed by a processor of a computer device to implement any of the above-mentioned methods for processing a voice message.
Alternatively, the above-mentioned computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, or the like.
It should be understood that references herein to "a plurality" mean two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate that A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
The foregoing description of the exemplary embodiments of the application is not intended to limit the application to the particular embodiments disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the application.

Claims (15)

1. A method for processing a voice message, the method comprising:
acquiring first audio data and a first reference number of consecutive second audio data, wherein the first audio data corresponds to a first audio frame to be processed currently in a voice message, the first reference number of consecutive second audio data corresponds to a second audio frame, and the second audio frame is a consecutive audio frame located after the first audio frame in the voice message;
determining validity of the first audio frame based on the first audio data and the second audio data;
and in response to the first audio frame being invalid, determining the validity of the audio frames located after the first audio frame in the voice message until a valid audio frame is obtained, and acquiring the audio data to be played corresponding to the valid audio frame.
2. The method according to claim 1, wherein the method further comprises:
in response to the first audio frame being valid, acquiring the audio data to be played corresponding to the first audio frame based on the first audio data.
3. The method of claim 2, wherein the obtaining audio data to be played corresponding to the first audio frame based on the first audio data comprises:
acquiring the validity of a third audio frame, wherein the third audio frame is an audio frame located before the first audio frame in the voice message;
in response to the third audio frame being valid, taking the first audio data as the audio data to be played corresponding to the first audio frame;
and in response to the third audio frame being invalid, acquiring target audio data, and acquiring the audio data to be played corresponding to the first audio frame based on the first audio data and the target audio data, wherein the target audio data is the audio data to be played corresponding to a target audio frame, and the target audio frame is the audio frame nearest to the first audio frame among all audio frames already determined to be valid.
4. The method of claim 3, wherein the obtaining audio data to be played corresponding to the first audio frame based on the first audio data and the target audio data comprises:
performing splicing processing on the first audio data and the target audio data to obtain spliced audio data, and taking the spliced audio data as the audio data to be played corresponding to the first audio frame.
5. The method of claim 4, wherein the splicing the first audio data and the target audio data to obtain spliced audio data comprises:
windowing a first sampling point set in the target audio data and a second sampling point set in the first audio data to obtain a third sampling point set, wherein the first sampling point set comprises a second reference number of sampling points positioned at the tail part of the target audio data, and the second sampling point set comprises a second reference number of sampling points positioned at the starting part of the first audio data;
and based on the third sampling point set, splicing the first audio data and the target audio data to obtain spliced audio data.
6. The method of claim 1, wherein the determining the validity of the first audio frame based on the first audio data and the second audio data comprises:
acquiring a detection result corresponding to the first audio data and a detection result corresponding to each second audio data, wherein the detection result corresponding to any audio data is used for indicating whether that audio data is voice signal data;
and determining the validity of the first audio frame based on the detection result corresponding to the first audio data and the detection result corresponding to each second audio data.
7. The method of claim 6, wherein the determining the validity of the first audio frame based on the detection results for the first audio data and the detection results for the respective second audio data comprises:
determining that the first audio frame is invalid in response to the detection result corresponding to the first audio data and the detection result corresponding to each second audio data meeting an invalidation condition;
and determining that the first audio frame is valid in response to the detection result corresponding to the first audio data and the detection result corresponding to each second audio data not meeting the invalidation condition.
8. The method of claim 7, wherein the detection result corresponding to the first audio data and the detection result corresponding to each second audio data meet the invalidation condition when:
the detection result corresponding to the first audio data indicates that the first audio data is non-voice signal data, and the detection result corresponding to each second audio data indicates that the corresponding second audio data is non-voice signal data.
9. The method of claim 1, wherein the first reference number of consecutive second audio data comprises a first portion of second audio data, corresponding to the audio frames in the second audio frames other than the last audio frame, and a second portion of second audio data, corresponding to the last audio frame in the second audio frames; the acquiring the first audio data and the first reference number of consecutive second audio data includes:
extracting the first audio data and the first portion of second audio data from the buffer;
and decoding the code stream of the last audio frame in the second audio frames to obtain the second portion of second audio data.
10. The method of claim 1, wherein the acquiring the first audio data and the first reference number of consecutive second audio data comprises:
decoding the code stream of the first audio frame to obtain the first audio data;
and respectively decoding the code streams of all the audio frames in the second audio frames to obtain the first reference number of continuous second audio data.
11. A device for processing a voice message, the device comprising:
the first acquisition module is used for acquiring first audio data and a first reference number of consecutive second audio data, wherein the first audio data corresponds to a first audio frame to be processed currently in a voice message, the first reference number of consecutive second audio data corresponds to a second audio frame, and the second audio frame is a consecutive audio frame located after the first audio frame in the voice message;
a determining module configured to determine validity of the first audio frame based on the first audio data and the second audio data;
the determining module is further configured to determine, in response to the first audio frame being invalid, validity of an audio frame located after the first audio frame in the voice message until a valid audio frame is obtained;
and the second acquisition module is used for acquiring the audio data to be played corresponding to the valid audio frame.
12. The apparatus of claim 11, wherein the second obtaining module is further configured to obtain audio data to be played corresponding to the first audio frame based on the first audio data in response to the first audio frame being valid.
13. The apparatus of claim 12, wherein the second obtaining module is further configured to obtain a validity of a third audio frame, the third audio frame being an audio frame preceding the first audio frame in the voice message;
in response to the third audio frame being valid, take the first audio data as the audio data to be played corresponding to the first audio frame;
and in response to the third audio frame being invalid, acquire target audio data, and acquire the audio data to be played corresponding to the first audio frame based on the first audio data and the target audio data, wherein the target audio data is the audio data to be played corresponding to a target audio frame, and the target audio frame is the audio frame nearest to the first audio frame among all audio frames already determined to be valid.
14. A computer device comprising a processor and a memory, wherein the memory has stored therein at least one program code that is loaded and executed by the processor to implement the method of processing a voice message according to any of claims 1 to 10.
15. A computer readable storage medium having stored therein at least one program code, the at least one program code being loaded and executed by a processor to implement a method of processing a voice message according to any one of claims 1 to 10.
CN202010013975.3A 2020-01-07 2020-01-07 Voice message processing method, device, equipment and storage medium Active CN113162837B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010013975.3A CN113162837B (en) 2020-01-07 2020-01-07 Voice message processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010013975.3A CN113162837B (en) 2020-01-07 2020-01-07 Voice message processing method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113162837A CN113162837A (en) 2021-07-23
CN113162837B true CN113162837B (en) 2023-09-26

Family

ID=76881369

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010013975.3A Active CN113162837B (en) 2020-01-07 2020-01-07 Voice message processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113162837B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113936698B (en) * 2021-09-26 2023-04-28 度小满科技(北京)有限公司 Audio data processing method and device and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108566558A (en) * 2018-04-24 2018-09-21 腾讯科技(深圳)有限公司 Video stream processing method, device, computer equipment and storage medium
CN109348247A (en) * 2018-11-23 2019-02-15 广州酷狗计算机科技有限公司 Determine the method, apparatus and storage medium of audio and video playing timestamp
CN110634497A (en) * 2019-10-28 2019-12-31 普联技术有限公司 Noise reduction method and device, terminal equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8238726B2 (en) * 2008-02-06 2012-08-07 Panasonic Corporation Audio-video data synchronization method, video output device, audio output device, and audio-video output system
CN106409313B (en) * 2013-08-06 2021-04-20 华为技术有限公司 Audio signal classification method and device
CN108877778B (en) * 2018-06-13 2019-09-17 百度在线网络技术(北京)有限公司 Sound end detecting method and equipment
CN109473123B (en) * 2018-12-05 2022-05-31 百度在线网络技术(北京)有限公司 Voice activity detection method and device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108566558A (en) * 2018-04-24 2018-09-21 腾讯科技(深圳)有限公司 Video stream processing method, device, computer equipment and storage medium
CN109348247A (en) * 2018-11-23 2019-02-15 广州酷狗计算机科技有限公司 Determine the method, apparatus and storage medium of audio and video playing timestamp
CN110634497A (en) * 2019-10-28 2019-12-31 普联技术有限公司 Noise reduction method and device, terminal equipment and storage medium

Also Published As

Publication number Publication date
CN113162837A (en) 2021-07-23

Similar Documents

Publication Publication Date Title
CN110022489B (en) Video playing method, device and storage medium
CN110059686B (en) Character recognition method, device, equipment and readable storage medium
WO2020249025A1 (en) Identity information determining method and apparatus, and storage medium
WO2021052306A1 (en) Voiceprint feature registration
CN110933468A (en) Playing method, playing device, electronic equipment and medium
CN110931048A (en) Voice endpoint detection method and device, computer equipment and storage medium
CN111681655A (en) Voice control method and device, electronic equipment and storage medium
CN111613213B (en) Audio classification method, device, equipment and storage medium
CN110798327B (en) Message processing method, device and storage medium
CN112269559A (en) Volume adjustment method and device, electronic equipment and storage medium
CN111862972B (en) Voice interaction service method, device, equipment and storage medium
CN113162837B (en) Voice message processing method, device, equipment and storage medium
CN108831423B (en) Method, device, terminal and storage medium for extracting main melody tracks from audio data
CN111341317A (en) Method and device for evaluating awakening audio data, electronic equipment and medium
CN113744736B (en) Command word recognition method and device, electronic equipment and storage medium
CN113362836B (en) Vocoder training method, terminal and storage medium
CN115035187A (en) Sound source direction determining method, device, terminal, storage medium and product
CN114333821A (en) Elevator control method, device, electronic equipment, storage medium and product
CN114360494A (en) Rhythm labeling method and device, computer equipment and storage medium
CN114388001A (en) Multimedia file playing method, device, equipment and storage medium
CN111028846B (en) Method and device for registration of wake-up-free words
CN110336881B (en) Method and device for executing service processing request
CN112015612B (en) Method and device for acquiring stuck information
CN111681654A (en) Voice control method and device, electronic equipment and storage medium
CN110989963B (en) Wake-up word recommendation method and device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40048301

Country of ref document: HK

SE01 Entry into force of request for substantive examination
GR01 Patent grant