CN116760923A - Audio real-time playing method, device, equipment and readable storage medium - Google Patents


Info

Publication number
CN116760923A
Authority
CN
China
Prior art keywords: audio, file, time, real, duration
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310714665.8A
Other languages
Chinese (zh)
Inventor
冯坤
朱灿
林恩
王怀彬
Current Assignee
China Merchants Bank Co Ltd
Original Assignee
China Merchants Bank Co Ltd
Priority date
Filing date
Publication date
Application filed by China Merchants Bank Co Ltd filed Critical China Merchants Bank Co Ltd
Priority to CN202310714665.8A
Publication of CN116760923A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M1/00 Substation equipment, e.g. for use by subscribers
    • H04M1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72403 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
    • H04M1/7243 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages
    • H04M1/72433 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages for voice messaging, e.g. dictaphones
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 Sound input; Sound output
    • G06F3/165 Management of the audio stream, e.g. setting of volume, audio stream path
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/02 Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • G11B27/031 Electronic editing of digitised analogue information signals, e.g. audio or video signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M1/00 Substation equipment, e.g. for use by subscribers
    • H04M1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72403 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
    • H04M1/7243 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages
    • H04M1/72436 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages for text messaging, e.g. short messaging services [SMS] or e-mails

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Computational Linguistics (AREA)
  • Business, Economics & Management (AREA)
  • Acoustics & Sound (AREA)
  • General Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management Or Editing Of Information On Record Carriers (AREA)

Abstract

The application discloses a method, device, equipment, and readable storage medium for playing audio in real time. The method comprises the following steps: acquiring an audio data file to be played; preprocessing the audio data file and determining the portion to be edited in it according to the preprocessing result; caching the audio file within the audio data file in segments according to a preset audio buffer and the portion to be edited, and sequentially performing editing processing on the cached segments; and, while the preset audio buffer caches the audio file, synchronously writing the edited data into an output stream and playing the corresponding audio content in real time according to that stream. By preprocessing the audio data file and determining its portion to be edited in advance, the portion to be edited can be processed directly as it is cached in the preset audio buffer, so that caching data and writing the output stream happen simultaneously, achieving real-time audio playback.

Description

Audio real-time playing method, device, equipment and readable storage medium
Technical Field
The present application relates to the field of audio processing technologies, and in particular, to a method, an apparatus, a device, and a readable storage medium for playing audio in real time.
Background
To ensure that customer service agents provide good service when customers call a telephone service line, it is common to record the call between a customer and an agent and to play the recording back in certain scenarios, for example, to spot-check whether the service provided meets a standard.
When a customer communicates with an agent by telephone, the conversation may involve private information such as ID numbers and telephone numbers, so playing back the recorded content may leak the customer's personal privacy. A corresponding audio processing technology is therefore generally used to convert the recording into text, desensitize part of that text, and then convert the desensitized text back into audio.
However, in the above method, converting the audio into text and regenerating audio after text processing must be done offline; that is, personnel must wait for the entire processing flow of the audio data to finish before the audio can be played, and the waiting time is long.
Disclosure of Invention
In view of the foregoing, the present application provides a method, apparatus, device, and readable storage medium for playing audio in real time, which aim to reduce the waiting time when re-listening to recorded content.
In order to achieve the above object, the present application provides an audio real-time playing method, which includes the following steps:
acquiring an audio data file to be played;
preprocessing the audio data file, and determining a portion to be edited in the audio data file according to the preprocessing result;
caching the audio file in the audio data file in segments according to a preset audio buffer and the portion to be edited, and sequentially performing editing processing on the cached segments;
and, while the preset audio buffer caches the audio file, synchronously writing the edited data into an output stream, and playing corresponding audio content in real time according to the output stream.
Illustratively, the audio data file includes an audio file, and a text file obtained by text conversion according to the content of the audio file, and the step of preprocessing the audio data file includes:
reading the content of the text file, and analyzing to obtain a text information list of the audio dialogue content;
analyzing, according to the text information list, a first audio duration occupied by each word in the audio file and a second audio duration between each pair of adjacent words;
predicting the total audio duration after editing the audio file according to the first audio duration and the second audio duration, where the preprocessing comprises parsing the text file and predicting the total audio duration after the editing processing.
Illustratively, the portion to be edited includes a first portion and a second portion, and the step of determining the portion to be edited in the audio data file according to the result of the preprocessing includes:
determining a first part which needs to be subjected to desensitization processing in the audio data file according to the text information list;
and determining a second part which needs to be compressed in the audio data file according to the text information list and the total audio duration.
Illustratively, the step of determining the first portion of the audio data file that needs to be desensitized according to the text information list includes:
acquiring a semantic information list matched with the content in the text information list;
analyzing, according to the semantic information list, the numeric word contents in the text information list, and determining the word contents adjacent to them;
if an adjacent word is a number, continuing to analyze whether its own neighbor is a number, until the analyzed content is non-numeric;
taking the words that have been determined to be numeric as the first portion requiring desensitization.
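The consecutive-digit scan described above can be sketched as follows; the word-list shape and the function name are illustrative assumptions, not taken from the patent:

```python
def find_digit_runs(words):
    """Scan the parsed word list and return [start, end) index ranges of
    consecutive numeric words (e.g. ID or card numbers), which form the
    first portion requiring desensitization."""
    runs = []
    i = 0
    while i < len(words):
        if words[i].isdigit():
            start = i
            # keep extending the run while the adjacent word is also a number
            while i < len(words) and words[i].isdigit():
                i += 1
            runs.append((start, i))
        else:
            i += 1
    return runs
```

For example, `find_digit_runs(["card", "number", "6", "2", "2", "5", "ok"])` yields `[(2, 6)]`, marking the four-digit run for replacement.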
The step of predicting the total audio duration after editing the audio file according to the first audio duration and the second audio duration includes:
when the second audio duration is longer than a preset duration, predicting the compressed audio duration after compressing the second audio duration;
and predicting the total audio duration after editing the audio file according to the compressed audio duration and the first audio duration.
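A minimal sketch of this prediction, assuming (the patent does not specify the exact policy) that every second audio duration longer than the preset threshold is compressed down to that threshold:

```python
def predict_total_duration(word_durations, gap_durations, max_gap=1.0):
    """Predict the edited total audio duration in seconds: all first audio
    durations (time occupied by words) are kept, while each second audio
    duration (gap between adjacent words) longer than max_gap is
    compressed to max_gap."""
    speech = sum(word_durations)
    gaps = sum(min(g, max_gap) for g in gap_durations)
    return speech + gaps
```

With word durations `[0.4, 0.5, 0.3]` and gaps `[0.2, 5.0]`, the 5-second silence is compressed to 1 second and the predicted total is 2.4 seconds.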
The step of buffering the audio files in the audio data file in a segmented manner according to a preset audio buffer and the portion to be edited, and sequentially performing editing processing on the audio files comprises the following steps:
determining, according to the preset audio buffer, the first portion and the second portion involved in the current segment when caching a segmented audio file from the audio data file;
converting the first portion into sine-wave audio of a preset fixed frequency, and cutting the second portion, so as to edit the segment-cached file.
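One way to sketch this per-segment editing on decoded 16-bit PCM samples; the fixed frequency, amplitude, and range representation are illustrative assumptions:

```python
import math
import array

def sine_tone(n_samples, freq=1000.0, rate=8000, amplitude=12000):
    """Generate n_samples of a fixed-frequency sine tone as 16-bit PCM."""
    return array.array("h", (
        int(amplitude * math.sin(2 * math.pi * freq * i / rate))
        for i in range(n_samples)))

def edit_segment(samples, sensitive, cut):
    """Edit one cached segment: replace each sensitive [start, end) sample
    range with a sine tone (the first portion), and delete each cut range
    entirely (the second portion, i.e. overlong silence)."""
    out = array.array("h", samples)
    for start, end in sensitive:
        out[start:end] = sine_tone(end - start)
    # delete back-to-front so earlier indices remain valid
    for start, end in sorted(cut, reverse=True):
        del out[start:end]
    return out
```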
The step of synchronously writing the edited data into the output stream when the preset audio buffer buffers the audio file includes:
synchronously converting the edited data into a PCM-encoded WAV-format real-time audio stream and writing it into the output stream while the preset audio buffer caches the audio file.
To achieve the above object, the present application further provides an audio real-time playing device, including:
the acquisition module is used for acquiring an audio data file to be played;
the determining module is used for preprocessing the audio data file and determining a part to be edited in the audio data file according to a preprocessing result;
the processing module is used for buffering the audio files in the audio data files in a segmented mode according to a preset audio buffer and the part to be edited, and editing the audio files;
and the playing module is used for synchronously writing the edited data into an output stream when the preset audio buffer caches the audio file, and playing corresponding audio content in real time according to the output stream.
To achieve the above object, the present application further provides an audio real-time playing device, including: the system comprises a memory, a processor and an audio real-time playing program stored on the memory and capable of running on the processor, wherein the audio real-time playing program is configured to realize the steps of the audio real-time playing method.
To achieve the above object, the present application also provides a computer-readable storage medium having stored thereon an audio real-time playing program which, when executed by a processor, implements the steps of the audio real-time playing method described above.
In the related art, personnel must wait for the entire processing flow of the audio data to finish before the audio can be played, so the time spent waiting to play the audio is long. By contrast, in the present application, the audio data file to be played is acquired; the audio data file is preprocessed, and its portion to be edited is determined according to the preprocessing result; the audio file in the audio data file is cached in segments according to a preset audio buffer and the portion to be edited, and editing processing is performed on the segments in sequence; and, while the preset audio buffer caches the audio file, the edited data is synchronously written into an output stream and the corresponding audio content is played in real time according to the output stream. That is, because the audio data file is preprocessed and its portion to be edited is determined in advance, each cached segment can be edited in a targeted manner, and the edited data is written into the output stream while the preset audio buffer is still caching the file. The file is thus cached and processed segment by segment while its already-processed parts are played synchronously, so personnel can play the processed parts in real time during processing without waiting for the whole flow to finish, which reduces the waiting time when re-listening to recorded content.
Drawings
FIG. 1 is a flowchart of a first embodiment of an audio real-time playing method according to the present application;
fig. 2 is a schematic diagram of a refinement flow of step S120 in the first embodiment of the audio real-time playing method according to the present application;
FIG. 3 is a schematic diagram of a preset audio buffer application flow of the audio real-time playing method of the present application;
fig. 4 is a schematic structural diagram of a hardware running environment according to an embodiment of the present application.
The achievement of the objects, functional features and advantages of the present application will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
Referring to fig. 1, fig. 1 is a flow chart of a first embodiment of an audio real-time playing method according to the present application.
The embodiments of the present application provide embodiments of the audio real-time playing method. It should be noted that, although a logical sequence is shown in the flowchart, in some cases the steps shown or described may be performed in a different order. For convenience of description, the executing entity of each step is omitted below. The audio real-time playing method includes:
Step S110: acquiring an audio data file to be played;
in this embodiment, the main content of the voice recording of customer service and customer calls is voice, and in order to achieve both recording efficiency and space occupation, a VOX format is adopted, and Adaptive Differential Pulse Code Modulation (ADPCM) coding is adopted. Wherein, because audio contents such as a number of a certificate, a card number and the like may exist in the recording, the risk of leakage of the privacy of the client may exist during the hearing back. Most of the current similar audio desensitization schemes are a scheme of processing audio offline, generating text, desensitizing the text, and generating audio by contrast, and do not support a VOX audio format. These solutions have some drawbacks: offline processing and replaying are long in time consumption and large in required storage space; the text is desensitized to regenerate audio, so that the accuracy of the undesensitized part is difficult to ensure; the VOX audio format is not supported, so that the VOX audio format cannot be directly played on an HTML5 page; in the process of handling business for clients, the waiting process of the clients may be needed, and the long-time silence time in the recording is not processed, so that the experience in the hearing back process is poor.
Therefore, to avoid the above problems, this embodiment provides an audio playing method. It mainly addresses the problem that personnel must wait a long time for audio processing when re-listening to audio, and it also improves on the poor accuracy of text-based desensitization, the lack of VOX support in regenerated audio, and the long silent intervals that degrade the playback experience.
It should be noted that, when personnel want to re-listen to an audio data file, they perform the corresponding operation in the operation interface; the operation instruction is sent to the system, which retrieves the corresponding data according to the instruction and processes it. As the data is processed, it is written into the output stream to produce the audio playback output, so that the personnel hear the recorded audio content.
When the system receives a request from personnel to re-listen to a recording, it retrieves the corresponding audio data file from the database storing recordings. Each audio data file carries unique identification information, for example, an identifier generated from the recording time or from the agent associated with the recording, and the recorded content can be matched in the database by this identifier.
The audio data file to be played refers to the recording that the personnel want to listen back to; the recording and the audio data file have the same content, namely the recorded data of the call between the customer and the agent.
Step S120: preprocessing the audio data file, and determining a part to be edited in the audio data file according to a preprocessing result;
After the audio data file is obtained, to avoid exposing the customer's privacy during playback, part of the content in the audio data file needs to undergo editing processing, which may include desensitization, cutting, or compression. The main purpose of the editing processing is to edit out sensitive information and useless audio content. Preprocessing is performed before the editing processing; it mainly analyzes the audio data file to determine its portion to be edited, for example, determining the time periods that need to be compressed, or the audio content that needs to be desensitized.
It should be noted that the audio data file includes the call recording between the customer and the agent, and also the text file translated from that recording. When preprocessing the audio data file, the audio file is used to locate the audio content corresponding to each node, and the text file is used as a reference to determine whether that content is sensitive information, meaningless silence, and so on.
The preprocessing amounts to an overall analysis of the audio data file in advance: the file is marked and measured beforehand so that its portion to be edited is determined early, which improves the efficiency and accuracy of the subsequent processing.
Step S130: according to a preset audio buffer and the part to be edited, the audio files in the audio data files are cached in a segmented mode, and editing processing is sequentially carried out on the audio files;
The preset audio buffer is a buffer for audio data. Its main function is to cache and read the audio data file to be played, perform the corresponding editing processing on the data read, and write the processed data into the output stream, so that the audio data file can be played as a file stream according to the output stream.
While the preset audio buffer caches the audio file, the size of each cached segment, or its duration, can be configured, for example caching 100 seconds of audio at a time.
Because the size of each segment cached by the preset audio buffer is fixed while the duration of the actual recording is not, the length of a call recording varies from case to case and is generally greater than the preset segment length. The preset audio buffer therefore cannot cache the whole audio file in one pass and will cache the same audio file multiple times. When the audio file is cached in multiple segments, the node reached by each cache operation is automatically recorded, which guarantees that the whole audio file is cached continuously, segment by segment.
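A minimal sketch of segmented caching with the recorded node; the chunk size and byte-offset bookkeeping are illustrative assumptions:

```python
def buffered_segments(path, chunk_bytes=4000):
    """Cache an audio file segment by segment, yielding (offset, data)
    pairs; the recorded offset is the node from which the next cache
    operation resumes, so the whole file is covered continuously."""
    offset = 0
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk_bytes)
            if not data:
                break
            yield offset, data
            offset += len(data)
```

At the VOX parameters stated later in this embodiment (8000 Hz, 4-bit ADPCM, i.e. roughly 4000 bytes per second), a 4000-byte chunk would correspond to about one second of audio.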
Step S140: and when the preset audio buffer caches the audio file, synchronously writing the edited data into an output stream, and playing corresponding audio content in real time according to the output stream.
When the audio file for a segment has been cached, editing processing must be performed on that segment. In this embodiment, the segments are processed in sequence under one precondition: only after the previously cached segment has been processed and written into the output stream does the preset audio buffer cache the next segment and process it. In other words, while the previously processed segment is played in real time through the output stream, the next segment is being cached, so the audio file is continuously cached, processed, and played in the form of a file stream.
While the preset audio buffer caches the current segment, the previous segment, whose editing is already complete, can be written synchronously into the output stream and its audio content played in real time. Caching, processing, and playing thus happen concurrently, so personnel who want to re-listen to recorded content do not have to wait a long time for the audio file to be processed.
It should be noted that the traditional flow of converting audio into text, desensitizing the text, and converting it back into audio must store the audio and text data involved at every step in order to rewrite or process it. Here, by caching and processing the audio file in segments through the preset audio buffer and playing the processed data in real time, no additional storage is needed: the data is cached and output in real time, avoiding the space otherwise occupied by the audio file and its corresponding text files.
The step of synchronously writing the edited data into the output stream when the preset audio buffer buffers the audio file includes:
step a: and synchronously converting the edited data into a PCM encoded WAV format real-time audio stream and writing the PCM encoded WAV format real-time audio stream into an output stream when the preset audio buffer caches the audio file.
While the preset audio buffer caches the audio file, the edited data is synchronously converted into a PCM-encoded WAV-format real-time audio stream and written into the output stream; converting the data format to WAV reduces the required storage space.
During caching, processing, and writing to the output stream through the preset audio buffer, the relevant properties of the audio file, such as the channel layout and the sampling rate, must be obtained synchronously to ensure that the audio finally played matches the original audio file. Here, the audio format is VOX, the sampling rate is f_sampling = 8000 Hz, and the channel is mono.
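Writing a PCM WAV stream requires a RIFF header in front of the raw samples. A sketch with the parameters stated above (8000 Hz, mono); the 16-bit sample width is an assumption, since the patent does not state it:

```python
import struct

def wav_header(n_samples, rate=8000, channels=1, sampwidth=2):
    """Build the 44-byte RIFF/WAV header preceding PCM data, using the
    embodiment's stated parameters: 8000 Hz sampling rate, mono channel."""
    data_bytes = n_samples * channels * sampwidth
    byte_rate = rate * channels * sampwidth
    block_align = channels * sampwidth
    return struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF", 36 + data_bytes, b"WAVE",
        b"fmt ", 16, 1,            # fmt chunk size; audio format 1 = PCM
        channels, rate, byte_rate,
        block_align, sampwidth * 8,
        b"data", data_bytes)
```

The header is written to the output stream once, after which each edited segment's PCM bytes can be appended as they are produced.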
In the related art, personnel must wait for the entire processing flow of the audio data to finish before the audio can be played, so the time spent waiting to play the audio is long. In the present application, by contrast, the audio data file to be played is acquired; the audio data file is preprocessed, and its portion to be edited is determined according to the preprocessing result; the audio file in the audio data file is cached in segments according to a preset audio buffer and the portion to be edited, and editing processing is performed on the segments in sequence; and, while the preset audio buffer caches the audio file, the edited data is synchronously written into an output stream and the corresponding audio content is played in real time according to the output stream. Because the portion to be edited is determined in advance, each cached segment can be edited in a targeted manner, and the edited data is written into the output stream while the buffer is still caching, so the already-processed parts of the file are played synchronously as the rest is processed. Personnel therefore do not have to wait for the processing flow to finish, and the waiting time when re-listening to recorded content is reduced.
Referring to fig. 2, an embodiment of the audio real-time playing method according to the first embodiment of the present application is provided, where the step of preprocessing the audio data file includes:
step S210: reading the content of the text file, and analyzing to obtain a text information list of the audio dialogue content;
When the audio data file is preprocessed, analysis is performed mainly on the audio file within it to determine the portion to be edited. The audio data file comprises both the text file and the audio file, and the preprocessing analyzes the audio data file by combining the two.
The text file is obtained by converting the dialogue between the agent and the customer from the audio content of the audio file into text through a speech-to-text technology.
The text file is read and parsed to obtain a text information list of the corresponding audio dialogue content. The text information list includes the text of the dialogue and the time point of each piece of dialogue in the audio file. For example, the agent greets the customer with a fixed script at the start of the call, discusses the relevant business with the customer in the middle, and closes with a fixed script at the end, so dialogue content at different time points can be distinguished. Therefore, when the text information list is parsed, the time data marking where each piece of dialogue occurs in the audio file is needed as well.
Step S220: according to the text information list, analyzing a first audio duration occupied by each word in the audio file and a second audio duration between each adjacent word;
The text information list contains the text information (the dialogue content between the customer and the customer service agent) and the corresponding time information of that text in the audio file. For example, the total length of the audio file is 10 minutes: the start time is 0 and the end time is 600 seconds; within 0-10 seconds the agent greets the customer; within 20-300 seconds the customer asks the agent for details of a service; within 300-350 seconds the agent handles the corresponding service for the customer through online operations; and within 350-580 seconds the customer asks about another service, and so on.
According to the text information list, the gaps between utterances of the customer and the agent during the conversation can be determined. For example, after the agent explains the details of a service and the customer agrees to handle it, the agent needs a certain amount of time to process the service, during which no dialogue is produced. Dialogue audio and silent audio can therefore be distinguished by whether corresponding text content (words) exists in the text information list: silent audio is the audio of waiting periods in which neither the agent nor the customer speaks, and no text or words can be obtained from it by conversion. If the silent audio lasts too long, it degrades the experience of relevant personnel when they later re-listen to the recording, so it needs to be removed.
According to the first audio duration occupied by each word in the audio file and the second audio duration between adjacent words, the overall duration of the audio file can be analyzed. The first audio duration corresponds to the total duration of the effective dialogue, i.e., the total duration of speech between the customer and the agent, and needs to be retained. The second audio duration corresponds to the intervals between words, and includes both normal pauses in the dialogue and the durations of silent audio.
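Illustratively, the analysis of the first and second audio durations can be sketched as follows. This is a minimal sketch, not the patented implementation; the per-word structure with `start`/`end` timestamps is an assumed shape for the text information list.

```python
# Sketch (hypothetical structure): each entry of the text information list
# carries one word plus its start/end time (seconds) in the audio file.
def analyze_durations(word_list):
    """Return (first_duration, gaps): first_duration is the total time
    occupied by words; gaps holds the intervals between adjacent words
    as (start, end) pairs."""
    first_duration = sum(w["end"] - w["start"] for w in word_list)
    gaps = [
        (a["end"], b["start"])
        for a, b in zip(word_list, word_list[1:])
        if b["start"] > a["end"]
    ]
    return first_duration, gaps

words = [
    {"text": "hello", "start": 0.0, "end": 0.5},
    {"text": "there", "start": 0.7, "end": 1.2},
    {"text": "sir",   "start": 6.2, "end": 6.6},
]
total, gaps = analyze_durations(words)
```

Here the long gap (1.2 s to 6.2 s) is a candidate silent-audio interval, while the short one is a normal pause.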
Step S230: predicting the total audio duration after editing the audio file according to the first audio duration and the second audio duration; the preprocessing comprises analyzing the text file, and predicting the total audio duration after editing processing.
By integrating the first audio duration and the second audio duration, the total audio duration of the audio file after editing processing can be predicted in advance; mainly, the total duration after the silent audio is removed from the audio file is predicted. The audio length of each segment cached by the preset audio buffer is then determined according to this total duration, which avoids the situation where each cached segment is too short and relevant personnel have to wait longer.
Meanwhile, the total audio duration can be used to inform relevant personnel of the total playing duration in advance, so a duration bar for the audio can be output according to the total audio duration; relevant personnel can then be provided with the function of selecting a time point on the duration bar and jumping to it.
In addition, the total audio duration provides a basis for the subsequent processing of the audio file: within it, the second audio durations occupied by silent audio and their positions can be marked. That is, the total audio duration is calculated and all time nodes within it are recorded, where the time nodes correspond to dialogue nodes in the call recording.
Illustratively, when determining the total audio duration, the durations of all words and of the intervals between them (the first audio duration and the second audio duration) are summed to obtain t_sum, in seconds.

The audio byte length is then calculated as: t_sum * f_sampling * 2.

Here the factor 2 arises because the finally output PCM-encoded audio uses 16 bits, i.e., 2 bytes, per sample.
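The byte-length formula above can be sketched directly; the 8 kHz sampling rate used as a default is an assumption (typical for telephony recordings), not a value stated in this application.

```python
# Sketch of the byte-length formula: t_sum seconds of 16-bit (2-byte)
# mono PCM sampled at f_sampling Hz.
def pcm_byte_length(t_sum_seconds, f_sampling=8000, bytes_per_sample=2):
    return int(t_sum_seconds * f_sampling * bytes_per_sample)

# 600 s (a 10-minute call) of 8 kHz 16-bit mono PCM:
size = pcm_byte_length(600)
```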
Illustratively, the portion to be edited includes a first portion and a second portion, and the step of determining the portion to be edited in the audio data file according to the result of the preprocessing includes:
Step b: determining a first part which needs to be subjected to desensitization processing in the audio data file according to the text information list;
step c: and determining a second part which needs to be compressed in the audio data file according to the text information list and the total audio duration.
According to the text information list, the part of the audio data file that needs desensitization is taken as the first portion; further, combined with the total audio duration, the second portion of the audio data file that needs compression processing can be determined. Both the first portion and the second portion are portions to be edited: the first portion refers to sensitive information (such as a certificate number or card number), and the second portion refers to silent audio (long stretches without dialogue content, i.e., audio with no corresponding words or text). The two portions require different editing processing.
Illustratively, the step of determining the first portion of the audio data file that needs to be desensitized according to the text information list includes:
step d: acquiring a semantic information list matched with the content in the text information list;
When determining the first portion of the audio file to be desensitized according to the text information list, note that the text content in the list is obtained by converting the speech content of the audio file, so it contains certain conversion errors or wrongly converted characters. For example, a digit in an original certificate number may be converted into its word form, such as "2" being converted into "two"; if the text is then scanned for digit characters only, such a word would be overlooked even though it represents one digit of a certificate number or card number, and the sensitive information containing it could not be desensitized. Therefore, when determining the first portion to be desensitized according to the text information list, a corresponding semantic information list needs to be acquired first.
The semantic information list is mainly used to map the digit content that different speech-to-text conversions express as words or characters back to the same digits. For example, the list includes the mapping of "two" to "2". It also covers special pronunciations of some digits and sets corresponding entries for them: for instance, "1" in a telephone number is commonly pronounced "yao" in Chinese, so in the semantic information list different spoken forms such as "yao" or "one" all need to correspond to the digit 1.
Step e: analyzing digital word contents in the text information list according to the semantic information list, and determining adjacent word contents adjacent to the digital word contents;
Through the semantic information list, some words in the text information list can be replaced with their homophone digits, which enlarges the range over which sensitive information can be found in the text information list and ensures that all sensitive information in it is desensitized.
The digital word content in the text information list is analyzed mainly according to the content described above; during analysis, digit content that was converted into word form is considered and converted back into digit form.
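The normalization step via the semantic information list can be sketched as below. The mapping table itself is hypothetical; a real semantic information list would cover all spoken forms of each digit.

```python
# Sketch (assumed mapping): normalize spoken digit words back to digit
# characters before scanning for sensitive digit runs.
DIGIT_WORDS = {
    "zero": "0", "one": "1", "yao": "1",  # "yao" = spoken form of 1
    "two": "2", "three": "3",
}

def normalize(tokens):
    return [DIGIT_WORDS.get(t, t) for t in tokens]

normalized = normalize(["card", "two", "yao"])
```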
When analyzing digital word content, consider that a card number or certificate number stated during a call is continuous word content: when the sensitive information is explained, the digits are stated consecutively. Therefore, when a digital word is found, the adjacent word content must be analyzed synchronously; if the adjacent word content is a digit, the currently analyzed digital word can be confirmed as sensitive information, and the adjacent word content is sensitive information as well.
Step f: if the adjacent word content is a number, analyzing whether the adjacent word of the adjacent word content is a number or not until the analyzed content is a non-digital content;
step g: words that have been determined to be numerical are taken as the first part of the desensitization process that is required.
When the adjacent word content is a digit, whether the neighbor of that adjacent word is itself a digit is analyzed in turn, and this process repeats until the analyzed content is non-digit content.
Illustratively, for the text content "the card number is 12345678 and the attribution is province A", the first digit analyzed is 1; its adjacent word contents "is" and "2" are then analyzed, and it is determined that 1 and 2 may be sensitive information. The neighbors of "2", namely "1" and "3", are analyzed in turn, and the search repeats until the non-digit content "and" is reached, at which point the analysis stops and the digits 1-8 are treated as sensitive information for desensitization processing.
In addition, only digit runs longer than a certain length are treated as sensitive information, for example longer than 5 digits; a run of fewer than 5 digits is treated as normal information. For instance, a 4-digit number such as the room number 1900 in an address has a low degree of privacy.
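The digit-run scan described above can be sketched as follows; it is a minimal illustration, assuming tokens have already been normalized to digit characters, and the threshold of 5 follows the example in the text.

```python
# Sketch: scan a normalized token list and mark runs of consecutive
# digits longer than a threshold as the first portion (to desensitize).
def find_sensitive_runs(tokens, min_len=5):
    runs, start = [], None
    for i, t in enumerate(tokens + [""]):  # sentinel closes a trailing run
        if t.isdigit():
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))  # [start, i) token indices
            start = None
    return runs

tokens = ["card", "number", "is",
          "1", "2", "3", "4", "5", "6", "7", "8", "attribution"]
runs = find_sensitive_runs(tokens)
```

The 8-digit run is reported, while a short run such as a 4-digit room number would be left untouched.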
The step of predicting the total audio duration after editing the audio file according to the first audio duration and the second audio duration includes:
step h: when the second audio time length is longer than a preset time length, predicting a compressed audio time length after compressing the second audio time length;
step i: and predicting the total audio duration after editing the audio file according to the compressed audio duration and the first audio duration.
When the second audio duration is longer than the preset duration, the silent audio corresponding to it is too long. If it is not processed, relevant personnel will have to listen to long silences when re-listening to the call recording, which degrades the re-listening experience. At the same time, part of the silence is produced while the agent handles the relevant service during the call, and hearing some of that silence when re-listening helps confirm that the agent was handling the service. In this embodiment, therefore, the silent audio is processed by reducing its duration while retaining part of it, so that the silence that occurs while the agent handles the customer's service is still reflected.
Therefore, when the second audio duration is longer than the preset duration, it is determined that the second audio duration needs to be compressed. In this embodiment, the silent audio is compressed by cutting part of it so that the second audio duration is trimmed to a length equal to the preset duration. For example, if the second audio duration is 15 seconds and the preset duration is set in practice to 10 seconds, the second audio duration exceeds the preset duration and needs to be compressed.
In summary, the compressed audio duration after the second audio duration is compressed can be predicted from the preset duration and the second audio duration, and the total audio duration of the audio file after the corresponding editing processing can then be predicted from the compressed audio duration and the first audio duration.
It should be noted that when the compressed audio duration is predicted, the position of the second audio duration in the whole audio file is also marked. For example, if the second audio duration is 15 seconds and the corresponding silent audio lies within 140-155 seconds of the audio file, then with a predicted silence duration of 10 seconds the retained silence occupies 140-150 seconds, while the silence within 150-155 seconds needs to be cut. That is, the silent audio is located in advance and its position is marked, so that the audio to be processed is annotated ahead of time and is convenient for subsequent processing.
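The prediction step can be sketched as below: each silence gap is clamped to the preset maximum, and the span to cut is recorded, matching the 140-155 second example. This is an illustrative sketch, not the patented implementation.

```python
# Sketch: predict the edited total duration by clamping every silence
# gap to a preset maximum, recording which span of each gap gets cut.
def predict_total(first_duration, gaps, preset=10.0):
    """gaps: list of (start, end) silence intervals in seconds."""
    total, cuts = first_duration, []
    for start, end in gaps:
        length = end - start
        if length > preset:
            cuts.append((start + preset, end))  # part to remove
            length = preset
        total += length
    return total, cuts

total, cuts = predict_total(100.0, [(140.0, 155.0), (300.0, 305.0)])
```

With 100 s of dialogue, a 15 s gap clamped to 10 s, and a 5 s gap kept intact, the predicted total is 115 s and the marked cut is 150-155 s.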
The step of buffering the audio files in the audio data file in a segmented manner according to a preset audio buffer and the portion to be edited, and sequentially performing editing processing on the audio files comprises the following steps:
step j: according to the preset audio buffer, determining a first part and a second part which are involved in the current segmented file when the segmented audio file in the audio data file is cached;
step k: converting the first part into sine wave audio with preset fixed frequency, and cutting the second part to realize editing processing of the file cached in a segmented mode.
When the audio file in the audio data file is cached in segments according to the preset audio buffer, the first portion (sensitive information) and the second portion (silent audio) undergo the corresponding desensitization processing and cutting processing. As described above, when the audio file is preprocessed to determine the first and second portions, their positions in the audio file are marked, i.e., their time nodes are determined during preprocessing, so that the preset audio buffer can find the corresponding first and second portions directly and complete their processing quickly.
The preset audio buffer reads the content of the audio file sequentially in segments; that is, the length of each cached segment is fixed, and the start and end points of a segment depend only on the fixed buffer length of the preset audio buffer. For example, referring to fig. 3, the preset audio buffer caches fixed-length segments from the audio file, and a segment may contain silent audio, audio to be desensitized, normal dialogue audio, and so on. The silent audio at the left edge of a cached segment may be only part of a longer silence in the original audio file: measured within the cached segment alone, its length does not exceed the preset length and it would not be compressed, but the total length of that silence after being joined with the adjacent silent parts in neighboring segments is actually greater than the preset duration, so it does need to be compressed.
Therefore, to ensure the accuracy of the editing processing performed by the preset audio buffer, the second portion (the time nodes of the silent audio, and the position and length of the silence to be compressed) must be marked in advance when the total audio duration is calculated during preprocessing.
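Because the cut intervals are marked in absolute positions during preprocessing, a cached segment only needs to drop the samples that fall inside a marked interval, even when the silence spans a segment boundary. A minimal sketch under that assumption:

```python
# Sketch: apply pre-marked cut intervals (absolute sample positions) to
# one fixed-length cached segment, so a silence spanning segment
# boundaries is still trimmed correctly.
def apply_cuts(segment, seg_start, cuts):
    """segment: list of samples; seg_start: absolute index of segment[0];
    cuts: list of (start, end) absolute sample intervals to drop."""
    out = []
    for i, sample in enumerate(segment):
        pos = seg_start + i
        if any(c0 <= pos < c1 for c0, c1 in cuts):
            continue  # sample falls inside a marked silence cut
        out.append(sample)
    return out

seg = list(range(10))               # samples at absolute positions 100..109
kept = apply_cuts(seg, 100, [(103, 106)])
```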
When the audio file is edited through the preset audio buffer, the content of the first portion is converted into sine wave audio of a preset fixed frequency, i.e., the original normal dialogue audio is replaced by a beep, so that the sensitive information corresponding to the first portion is not exposed.
When the audio file is edited through the preset audio buffer, the audio of the second part is correspondingly compressed and cut, so that the audio file is compressed.
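The sine-wave replacement for the first portion can be sketched as below. The 1 kHz tone frequency, 8 kHz sampling rate, and 0.3 amplitude are assumed values, not parameters specified by this application.

```python
import math

# Sketch: generate a fixed-frequency sine "beep" in 16-bit signed PCM
# sample values, used to overwrite a sensitive (first portion) span.
def beep_samples(n, freq=1000.0, rate=8000, amplitude=0.3):
    scale = amplitude * 32767  # fraction of 16-bit signed full scale
    return [int(scale * math.sin(2 * math.pi * freq * i / rate))
            for i in range(n)]

tone = beep_samples(8000)  # one second of beep at the assumed rate
```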
In this embodiment, the content of the text file is read, and a text information list of the audio dialogue content is obtained by parsing; according to the text information list, the first audio duration occupied by each word in the audio file and the second audio duration between adjacent words are analyzed; and the total audio duration after editing processing is predicted from the first audio duration and the second audio duration. The preprocessing thus comprises analyzing the text file and predicting the total audio duration after editing; that is, by preprocessing the audio data file, the portion to be edited is marked in advance, which benefits the accuracy of the subsequent editing processing and improves its efficiency.
In addition, the application also provides an audio real-time playing device, which comprises:
the acquisition module is used for acquiring an audio data file to be played;
the determining module is used for preprocessing the audio data file and determining a part to be edited in the audio data file according to a preprocessing result;
the processing module is used for buffering the audio files in the audio data files in a segmented mode according to a preset audio buffer and the part to be edited, and editing the audio files;
and the playing module is used for synchronously writing the edited data into an output stream when the preset audio buffer caches the audio file, and playing corresponding audio content in real time according to the output stream.
Illustratively, the determining module includes:
the reading sub-module is used for reading the content of the text file and analyzing and obtaining a text information list of the audio dialogue content;
the analysis submodule is used for analyzing a first audio time length occupied by each word in the audio file and a second audio time length between each two adjacent words according to the text information list;
the prediction sub-module is used for predicting the total audio duration after editing the audio file according to the first audio duration and the second audio duration; the preprocessing comprises analyzing the text file, and predicting the total audio duration after editing processing;
A first determining submodule, configured to determine, according to the text information list, a first portion of the audio data file that needs to be desensitized;
and the second determining submodule is used for determining a second part which needs to be compressed in the audio data file according to the text information list and the total audio duration.
Illustratively, the first determining submodule includes:
an acquisition unit, configured to acquire a semantic information list that matches with content in the text information list;
the analysis unit is used for analyzing digital word contents in the text information list according to the semantic information list and determining adjacent word contents adjacent to the digital word contents;
the judging unit is used for analyzing whether the adjacent words of the adjacent word content are numbers or not if the adjacent word content is the numbers, until the analyzed content is non-digital content;
and the determining unit is used for taking the words which are determined to be numbers as a first part which needs to be subjected to desensitization processing.
Illustratively, the prediction submodule includes:
the first prediction unit is used for predicting the compressed audio duration after the second audio duration is compressed when the second audio duration is longer than a preset duration;
The second prediction unit is used for predicting the total audio duration after editing the audio file according to the compressed audio duration and the first audio duration.
Illustratively, the processing module includes:
a third determining submodule, configured to determine, according to the preset audio buffer, a first portion and a second portion related in a file of a current cached segment when the audio file in the audio data file is cached in segments;
and the processing sub-module is used for converting the first part into sine wave audio with a preset fixed frequency and cutting the second part so as to realize editing processing of the file cached in a segmented mode.
Illustratively, the playback module includes:
and the playing sub-module is used for synchronously converting the edited data into a PCM encoded WAV format real-time audio stream and writing the PCM encoded WAV format real-time audio stream into an output stream when the preset audio buffer caches the audio file.
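The conversion of edited data into a PCM-encoded WAV stream, as performed by the playing sub-module, can be sketched with a minimal 44-byte WAV header. The mono 8 kHz format is an assumed default; this is an illustrative sketch, not the claimed implementation.

```python
import io
import struct

# Sketch: wrap raw 16-bit PCM bytes in a minimal WAV (RIFF) header and
# write the result to an output stream.
def write_wav_stream(pcm_bytes, out, rate=8000, channels=1, sampwidth=2):
    byte_rate = rate * channels * sampwidth
    out.write(b"RIFF" + struct.pack("<I", 36 + len(pcm_bytes)) + b"WAVE")
    out.write(b"fmt " + struct.pack("<IHHIIHH",
                                    16, 1, channels, rate, byte_rate,
                                    channels * sampwidth, 8 * sampwidth))
    out.write(b"data" + struct.pack("<I", len(pcm_bytes)))
    out.write(pcm_bytes)

buf = io.BytesIO()
write_wav_stream(b"\x00\x00" * 4, buf)  # 4 silent 16-bit samples
```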
The specific implementation of the audio real-time playing device is basically the same as the above embodiments of the audio real-time playing method, and will not be repeated here.
In addition, the application also provides audio real-time playing equipment. As shown in fig. 4, fig. 4 is a schematic structural diagram of a hardware running environment according to an embodiment of the present application.
Fig. 4 is a schematic structural diagram of a hardware running environment of the audio real-time playing device.
As shown in fig. 4, the audio real-time playing device may include a processor 401, a communication interface 402, a memory 403 and a communication bus 404, where the processor 401, the communication interface 402 and the memory 403 complete communication with each other through the communication bus 404, and the memory 403 is used for storing a computer program; the processor 401 is configured to implement the steps of the audio real-time playing method when executing the program stored in the memory 403.
The communication bus 404 mentioned above for the audio real-time playing device may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. The communication bus 404 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one bold line is shown in the figure, but this does not mean there is only one bus or one type of bus.
The communication interface 402 is used for communication between the audio real-time playing device and other devices.
The memory 403 may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk memory. Optionally, the memory 403 may also be at least one storage device located remotely from the aforementioned processor 401.
The processor 401 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; but also digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
The specific implementation manner of the audio real-time playing device is basically the same as that of each embodiment of the audio real-time playing method, and is not repeated here.
In addition, the embodiment of the application also provides a computer readable storage medium, wherein the computer readable storage medium stores an audio real-time playing program, and the audio real-time playing program realizes the steps of the audio real-time playing method when being executed by a processor.
The specific implementation manner of the computer readable storage medium of the present application is basically the same as the embodiments of the audio real-time playing method described above, and will not be repeated here.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) as described above, comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present application.
The foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the application, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.

Claims (10)

1. The audio real-time playing method is characterized by comprising the following steps of:
acquiring an audio data file to be played;
preprocessing the audio data file, and determining a part to be edited in the audio data file according to a preprocessing result;
according to a preset audio buffer and the part to be edited, the audio files in the audio data files are cached in a segmented mode, and editing processing is sequentially carried out on the audio files;
and when the preset audio buffer caches the audio file, synchronously writing the edited data into an output stream, and playing corresponding audio content in real time according to the output stream.
2. The audio real-time playing method of claim 1, wherein the audio data file comprises an audio file and a text file obtained by text conversion according to the content of the audio file, and the step of preprocessing the audio data file comprises:
reading the content of the text file, and analyzing to obtain a text information list of the audio dialogue content;
according to the text information list, analyzing a first audio duration occupied by each word in the audio file and a second audio duration between each adjacent word;
Predicting the total audio duration after editing the audio file according to the first audio duration and the second audio duration; the preprocessing comprises analyzing the text file, and predicting the total audio duration after editing processing.
3. The audio real-time playing method of claim 2, wherein the portion to be edited includes a first portion and a second portion, and the step of determining the portion to be edited in the audio data file according to the result of the preprocessing includes:
determining a first part which needs to be subjected to desensitization processing in the audio data file according to the text information list;
and determining a second part which needs to be compressed in the audio data file according to the text information list and the total audio duration.
4. The audio real-time playback method of claim 3, wherein the step of determining a first portion of the audio data file that needs to be desensitized based on the text information list comprises:
acquiring a semantic information list matched with the content in the text information list;
analyzing digital word contents in the text information list according to the semantic information list, and determining adjacent word contents adjacent to the digital word contents;
If the adjacent word content is a number, analyzing whether the adjacent word of the adjacent word content is a number or not until the analyzed content is a non-digital content;
words that have been determined to be numerical are taken as the first part of the desensitization process that is required.
5. The method for playing audio in real time according to claim 2, wherein the step of predicting the total audio duration after editing the audio file according to the first audio duration and the second audio duration comprises:
when the second audio time length is longer than a preset time length, predicting a compressed audio time length after compressing the second audio time length;
and predicting the total audio duration after editing the audio file according to the compressed audio duration and the first audio duration.
6. The method for playing audio in real time according to claim 1, wherein the portion to be edited includes a first portion and a second portion, and the step of buffering audio files in the audio data file in segments according to a preset audio buffer and the portion to be edited and sequentially editing the audio files includes:
according to the preset audio buffer, determining a first part and a second part which are involved in the current segmented file when the segmented audio file in the audio data file is cached;
Converting the first part into sine wave audio with preset fixed frequency, and cutting the second part to realize editing processing of the file cached in a segmented mode.
7. The audio real-time playing method as claimed in claim 1, wherein the step of synchronously writing the edited data into the output stream when the preset audio buffer buffers the audio file comprises:
and synchronously converting the edited data into a PCM encoded WAV format real-time audio stream and writing the PCM encoded WAV format real-time audio stream into an output stream when the preset audio buffer caches the audio file.
8. An audio real-time playing device, characterized in that the audio real-time playing device comprises:
an acquisition module, configured to acquire an audio data file to be played;
a determination module, configured to preprocess the audio data file and to determine, according to the preprocessing result, the part of the audio data file to be edited;
a processing module, configured to buffer the audio files in the audio data file in segments according to a preset audio buffer and the part to be edited, and to edit the audio files;
and a playing module, configured to synchronously write the edited data into an output stream while the preset audio buffer caches the audio file, and to play the corresponding audio content in real time from the output stream.
9. An audio real-time playing device, characterized in that the device comprises: a memory, a processor, and an audio real-time playing program stored in the memory and executable on the processor, the audio real-time playing program being configured to implement the steps of the audio real-time playing method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that an audio real-time playing program is stored on the computer-readable storage medium, and the audio real-time playing program, when executed by a processor, implements the steps of the audio real-time playing method according to any one of claims 1 to 7.
CN202310714665.8A 2023-06-15 2023-06-15 Audio real-time playing method, device, equipment and readable storage medium Pending CN116760923A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310714665.8A CN116760923A (en) 2023-06-15 2023-06-15 Audio real-time playing method, device, equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN116760923A true CN116760923A (en) 2023-09-15

Family

ID=87958372

Country Status (1)

Country Link
CN (1) CN116760923A (en)

Similar Documents

Publication Publication Date Title
US10276153B2 (en) Online chat communication analysis via mono-recording system and methods
US6219638B1 (en) Telephone messaging and editing system
US8457964B2 (en) Detecting and communicating biometrics of recorded voice during transcription process
JP5305675B2 (en) Method, system, and computer program for automatically generating and providing auditory archives
JP5042194B2 (en) Apparatus and method for updating speaker template
CN110650250B (en) Method, system, device and storage medium for processing voice conversation
KR20150017662A (en) Method, apparatus and storing medium for text to speech conversion
US8315867B1 (en) Systems and methods for analyzing communication sessions
CN114760387A (en) Method and device for managing maintenance
US20120053937A1 (en) Generalizing text content summary from speech content
JP2007049657A (en) Automatic answering telephone apparatus
JP2020071676A (en) Speech summary generation apparatus, speech summary generation method, and program
US20180183931A1 (en) Unanswered-Call Handling and Routing
CN110460798B (en) Video interview service processing method, device, terminal and storage medium
CN116760923A (en) Audio real-time playing method, device, equipment and readable storage medium
CN110740212A (en) Call answering method and device based on intelligent voice technology and electronic equipment
US6963838B1 (en) Adaptive hosted text to speech processing
CN112435669B (en) Robot multi-wheel dialogue voice interaction method, system and terminal equipment
US11062693B1 (en) Silence calculator
CN112712793A (en) ASR (error correction) method based on pre-training model under voice interaction and related equipment
JPH11175096A (en) Voice signal processor
CN111582708A (en) Medical information detection method, system, electronic device and computer-readable storage medium
JP2011090483A (en) Information processing apparatus and program
CN113763921B (en) Method and device for correcting text
CN112151007B (en) Voice synthesis method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination