CN113921011A - Audio processing method, device and equipment - Google Patents


Publication number
CN113921011A
Authority
CN
China
Prior art keywords
audio
target
text
transcription
translation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111206068.1A
Other languages
Chinese (zh)
Inventor
骆鹏鹏
苏文畅
李全
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Tingjian Technology Co ltd
Original Assignee
Anhui Tingjian Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Tingjian Technology Co ltd
Priority to CN202111206068.1A
Publication of CN113921011A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008 Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16 Vocoder architecture
    • G10L19/167 Audio streaming, i.e. formatting and decoding of an encoded audio signal representation into a data stream for transmission or storage purposes
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/02 Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • G11B27/031 Electronic editing of digitised analogue information signals, e.g. audio or video signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides an audio processing method, an audio processing device and audio processing equipment. The method comprises the following steps: acquiring a recording audio, a transcription text and a translation text of a recording file; encoding the recording audio, the transcription text and the translation text to obtain a target audio frame; acquiring a target audio header, wherein the target audio header comprises information for representing a target audio format; and obtaining a target encoded audio based on the target audio header and the target audio frame, wherein the format of the target encoded audio is the target audio format. By encoding and storing the recorded audio, the transcription text and the translation text together as target encoded audio in a new audio format, the method lets a user obtain the audio and the corresponding transcription and translation results directly by parsing the target encoded audio, which improves recording efficiency, reduces the user's waiting time and improves the user experience.

Description

Audio processing method, device and equipment
Technical Field
The present invention relates to the field of information technologies, and in particular, to an audio processing method, apparatus, and device.
Background
In scenarios such as meetings, interviews, negotiations and teaching, recording equipment is often used to record audio. When a user needs transcription and translation information from the audio, the recorded audio file has to be separately submitted to speech transcription or to recognition and translation.
Obtaining transcription and translation information from audio is time-consuming and involves a long workflow: the user must first record, and then transcribe and translate the audio separately. Because recording itself takes time, the user's wait for the required transcription or translation information is lengthened further.
Disclosure of Invention
The invention provides an audio processing method, device and equipment, which are used for solving the problem that time consumption for acquiring transcription information and translation information is too long in the prior art.
The invention provides an audio processing method, which comprises the following steps:
acquiring a recording audio, a transcription text and a translation text of a recording file;
encoding the recording audio, the transcription text and the translation text to obtain a target audio frame;
acquiring a target audio header, wherein the target audio header comprises information for representing a target audio format;
and obtaining a target encoded audio based on the target audio header and the target audio frame, wherein the format of the target encoded audio is the target audio format.
According to an audio processing method provided by the present invention, the encoding the recording audio, the transcription text, and the translation text to obtain a target audio frame includes:
inserting a first data separator between the recorded audio and the transcribed text, and inserting a second data separator between the transcribed text and the translated text;
inserting a target frame header in front of the recorded audio to obtain a target audio frame;
and the target frame header comprises a transcription data identifier and a translation data identifier.
According to an audio processing method provided by the present invention, the inserting a target frame header before the recorded audio to obtain the target audio frame includes:
acquiring the byte length of the recording file;
and under the condition that the byte length is determined to be larger than the target byte length, inserting the target frame header before the recorded audio to obtain the target audio frame.
According to the audio processing method provided by the invention, the acquiring of the recording audio, the transcription text and the translation text of the recording file comprises the following steps:
acquiring a target number of consecutive recording audios, transcription texts and translation texts in the recording file;
the encoding the recording audio, the transcription text and the translation text to obtain a target audio frame includes:
respectively assembling the target number of consecutive recording audios, transcription texts and translation texts into a target recording audio, a target transcription text and a target translation text;
and encoding the target recording audio, the target transcription text and the target translation text to obtain the target audio frame.
According to an audio processing method provided by the present invention, the target encoded audio includes the target audio frame and other audio frames.
According to an audio processing method provided by the invention, the target audio header comprises a start flag, a sampling rate, a channel number, a bit rate, a file size and target audio format information.
The present invention also provides an audio processing apparatus comprising:
the first acquisition module is used for acquiring the recording audio, the transcription text and the translation text of the recording file;
the first processing module is used for encoding the recording audio, the transcription text and the translation text to obtain a target audio frame;
the second acquisition module is used for acquiring a target audio header, and the target audio header comprises information used for representing a target audio format;
and the second processing module is used for obtaining target encoded audio based on the target audio header and the target audio frame, wherein the format of the target encoded audio is the target audio format.
The present invention also provides audio processing equipment comprising:
the sound pickup device, which is used for picking up and outputting the recording file;
the transcription and translation device, which is electrically connected with the sound pickup device and is used for outputting a recording audio, a transcription text and a translation text based on the recording file;
and the audio processing apparatus described above, which is electrically connected with the transcription and translation device and is used for obtaining the target encoded audio based on the recorded audio, the transcribed text and the translated text.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the audio processing method as described in any one of the above when executing the program.
The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the audio processing method as described in any of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, carries out the steps of the audio processing method as described in any of the above.
According to the audio processing method, the device and the equipment, the recorded audio, the transcription text and the translation text are coded and stored to form the target coded audio in a new audio format, so that a user can obtain the target coded audio and directly analyze the audio and the corresponding transcription translation result, the recording efficiency is improved, the waiting time of the user is effectively reduced, and the use experience of the user is improved.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of an audio processing method provided by the present invention;
FIG. 2 is a schematic diagram of a processing procedure of an audio processing device provided by the present invention;
FIG. 3 is a schematic diagram of the structure of target encoded audio provided by the present invention;
FIG. 4 is a schematic diagram of a target audio header provided in the present invention;
FIG. 5 is a schematic diagram of a structure of a target audio frame provided by the present invention;
FIG. 6 is a schematic structural diagram of a target frame header provided in the present invention;
FIG. 7 is a schematic diagram of an audio processing apparatus according to the present invention;
FIG. 8 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The audio processing method of the present invention is described below with reference to fig. 1 to 6, and an execution subject of the method may be a controller of a device side, or a cloud side, or an edge server.
As shown in fig. 1, the audio processing method of the present invention includes steps 110 to 140.
Step 110: acquire the recording audio, the transcription text and the translation text of the recording file.
In practical implementation, the sound pickup apparatus 10 may pick up and output a recording file, and the basic audio attributes of the recording file may be set during pickup.
The basic audio attributes of the recording file include, but are not limited to, the sampling rate, the number of channels and the number of bits per sample.
With the basic audio attributes set, the sound pickup apparatus 10 can output the recording file in PCM format.
The sound data of a recording file in PCM format is uncompressed. When the recording file is a single-channel file, the pickup samples are stored sequentially in time order.
When the recording file is a two-channel file, the pickup samples of the two channels are stored alternately in time order.
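The storage order described above can be illustrated with a minimal sketch. The 16-bit little-endian sample width and the helper name `pack_pcm16` are assumptions made for illustration; the source does not fix a sample width or byte order.

```python
import struct


def pack_pcm16(channels):
    """Pack per-channel sample lists into raw 16-bit little-endian PCM bytes."""
    if len({len(c) for c in channels}) != 1:
        raise ValueError("all channels must have the same number of samples")
    out = bytearray()
    # zip(*channels) yields one sample per channel for each time step, so a
    # single channel is stored sequentially and two channels alternate.
    for time_step in zip(*channels):
        for sample in time_step:
            out += struct.pack("<h", sample)
    return bytes(out)


mono = pack_pcm16([[1, 2, 3]])
stereo = pack_pcm16([[1, 2, 3], [-1, -2, -3]])
```

In the two-channel case the byte stream holds the samples in the order 1, -1, 2, -2, 3, -3, that is, one sample from each channel per time step.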
As shown in FIG. 2, the sound pickup apparatus 10 picks up the recording file and inputs it to the transcription and translation apparatus 20, which can perform speech recognition for multilingual transcription and multilingual translation.
The transcription and translation apparatus 20 may perform speech recognition on the recording file to obtain the corresponding transcription text, perform semantic recognition on the transcription text, and translate it with a translation engine to obtain the corresponding translation text.
In this embodiment, recording by the sound pickup apparatus 10 and transcription and translation by the transcription and translation apparatus 20 run in separate processes.
Because the three processes of recording, transcription and translation are independent, the transcription and translation apparatus 20 can recognize and transcribe the continuous recording files output by the sound pickup apparatus 10, perform semantic recognition on the transcribed texts, use context analysis to obtain relatively complete sentences, and then translate those sentences.
It is understood that the language of the translation text is set by the user, and the languages that the transcription and translation apparatus 20 can translate include, but are not limited to, Chinese, English, Japanese, Korean, French, German, Russian, Spanish, Portuguese, Vietnamese, Indonesian, Italian, Dutch and Thai.
Step 120: encode the recorded audio, the transcribed text and the translated text to obtain a target audio frame.
The target audio frame comprises recording audio coding data, transcription text data and translation text data.
In practical implementation, the target audio frame may be represented as binary data; that is, the three pieces of information (recorded audio, transcribed text and translated text) are encoded in a binary format.
It can be understood that, because the recorded audio, the transcribed text and the translated text are all encoded into the target audio frame, all three can be read back when the target audio frame is decoded.
It should be noted that the order in which the three pieces of information are encoded can be configured. The encoded data in the resulting target audio frame may be ordered as recorded audio, transcribed text, translated text, or alternatively as transcribed text, recorded audio, translated text.
Step 130: acquire a target audio header, wherein the target audio header comprises information for representing a target audio format.
After the recording audio, the transcription text and the translation text are encoded to obtain the target audio frame, the target audio frame needs to be combined with the target audio header.
The target audio header is a file header, that is, data at the beginning of a file that serves a specific purpose and may include a description of the file's main data.
The target audio header includes target audio format information describing the file format. The target audio format information characterizes the target audio format and identifies the fixed format of the file.
In this embodiment, the target audio format information in the target audio header may be expressed by using 4 bytes of "ATC".
Step 140: obtain the target encoded audio based on the target audio header and the target audio frame.
As shown in FIG. 3, the target encoded audio includes a target audio header 310 followed by encoded audio frames, one after another.
As shown in FIG. 4, the target audio header includes target audio format information 460, and the format of the corresponding target encoded audio is the target audio format.
In this embodiment, the target audio format information may be 4 bytes of "ATC", and the corresponding target audio format of the target encoded audio is ".ATC".
It should be noted that the target audio format information and the target audio format define a file format designed for storing target encoded audio, in which the recorded audio, the transcribed text and the translated text are encoded together.
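A target audio header holding the fields named in the Disclosure (start flag, sampling rate, channel number, bit rate, file size and target audio format information) could be sketched as follows. Only the 4-byte width of the format information comes from the text. The field order, the other field widths, the little-endian byte order, the `b"HEAD"` start flag and the NUL padding of "ATC" to 4 bytes are all assumptions made for illustration.

```python
import struct

# Assumed layout: start flag (4 bytes), sampling rate (4), channel count (2),
# bit rate (4), file size (8), format information (4). "<" disables padding.
HEADER_STRUCT = struct.Struct("<4sIHIQ4s")
START_FLAG = b"HEAD"       # hypothetical start flag
FORMAT_INFO = b"ATC\x00"   # "ATC" padded to 4 bytes with NUL (assumption)


def pack_audio_header(sample_rate, channels, bit_rate, file_size):
    """Serialize the header fields into a fixed-size byte string."""
    return HEADER_STRUCT.pack(
        START_FLAG, sample_rate, channels, bit_rate, file_size, FORMAT_INFO
    )


def unpack_audio_header(data):
    """Parse a header and reject data that is not in the target format."""
    start, sample_rate, channels, bit_rate, file_size, fmt = (
        HEADER_STRUCT.unpack_from(data)
    )
    if start != START_FLAG or fmt != FORMAT_INFO:
        raise ValueError("not a target-format audio header")
    return {
        "sample_rate": sample_rate,
        "channels": channels,
        "bit_rate": bit_rate,
        "file_size": file_size,
    }


header = pack_audio_header(16000, 1, 256000, 1024)
fields = unpack_audio_header(header)
```

A decoder can check the format information before reading any frame, which is one way to recognize a file as being in the target audio format.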
In the related art, audio is usually encoded and stored during recording. When a user needs a speech transcription or translation result, the stored recording file is submitted to speech transcription or to recognition and translation, and the transcription and translation results are stored separately.
Obtaining transcription or translation information from the audio therefore takes a long time and involves a long workflow; since the recording itself also takes time, the user's wait for the required transcription or translation information is lengthened further.
Moreover, in the related art the user has to query and retrieve the audio and the transcription and translation results separately, which adds operation steps, and three files have to be stored, which increases the storage pressure on the device.
The invention carries out multilingual transcription and translation while recording, encodes and stores the recorded audio, the transcribed text and the translated text to form the target encoded audio in a new audio format, so that a user can directly analyze the recorded audio, the transcribed text and the translated text by acquiring the target encoded audio, thereby improving the efficiency of recording, transcribing and translating, reducing the waiting time of the user and improving the use experience of the user.
The target coding audio file comprises three information of a recording audio, a transcription text and a translation text, so that the operation steps of acquiring the transcription text and the translation text by a user are reduced, the equipment only needs to store one target coding audio file, and the storage pressure of the equipment is reduced.
According to the audio processing method provided by the invention, the target coded audio in a new audio format is formed by coding and storing the recorded audio, the transcription text and the translation text, so that a user can directly analyze the audio and the corresponding transcription translation result by acquiring the target coded audio, the waiting time of the user can be effectively reduced, and the use experience of the user is improved.
In some embodiments, step 120 comprises: inserting a first data separator between the recorded audio and the transcribed text, and inserting a second data separator between the transcribed text and the translated text; and inserting a target frame header before the recorded audio to obtain a target audio frame.
The target audio frame belongs to a data frame, and the data frame comprises a frame head, a data part and a frame tail, wherein the frame head and the frame tail contain basic information of the data frame.
The data portion of the target audio frame includes the encoded data of the recorded audio, the transcribed text and the translated text.
As shown in fig. 5, when encoding the data portion of the target audio frame, it is necessary to insert a first data separator between the recording audio and the transcribed text and insert a second data separator between the transcribed text and the translated text.
Where separator 530 is a first data separator and separator 550 is a second data separator.
The first data separator is a division mark between the recorded audio and the transcribed text, when the target audio frame is decoded and the corresponding first data separator is read, the recorded audio is completely read, and the following data content is the transcribed text.
In actual implementation, the position of the first data separator can be queried to obtain the data content after the first data separator, so that the transcribed text can be obtained.
The second data separator is a division identifier between the transcribed text and the translated text, and when the corresponding second data separator is read during decoding of the target audio frame, the fact that the transcribed text is read is indicated, and the following data content is the translated text.
In actual execution, the position of the second data separator can be queried to obtain the data content after the second data separator, so that the translated text can be obtained.
A specific embodiment is described below.
As shown in FIG. 5, the target frame header 510 is located at the head of the target audio frame. The encoded recorded audio data 520 is the binary audio data of the recorded audio, the encoded transcribed text data 540 is the transcription of that recorded audio, and the encoded translated text data 560 is the translation of that transcribed text.
In actual implementation, the first data separator 530 and the second data separator 550 may use different identifiers.
For example, the first data separator between the recorded audio and the transcribed text may be 0xAFF500, and the corresponding second data separator between the transcribed text and the translated text may be 0xAFF501.
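The data portion and its two separators can be sketched as below, reading the example separator values as the 3-byte sequences 0xAFF500 and 0xAFF501. UTF-8 text encoding and the function names are assumptions; the source does not name a text encoding. Raw audio bytes may happen to contain a separator pattern, so this sketch locates the first separator from a known audio size rather than by searching; the byte 0xF5 never occurs in valid UTF-8, so searching for the second separator inside the text region is safe under that assumption.

```python
SEP_AUDIO_TEXT = bytes.fromhex("AFF500")        # first data separator
SEP_TEXT_TRANSLATION = bytes.fromhex("AFF501")  # second data separator


def build_data_portion(audio, transcription, translation):
    """Concatenate audio, transcription and translation with separators."""
    return (
        audio
        + SEP_AUDIO_TEXT
        + transcription.encode("utf-8")
        + SEP_TEXT_TRANSLATION
        + translation.encode("utf-8")
    )


def split_data_portion(data, audio_size):
    """Recover (audio, transcription, translation) from a data portion."""
    audio = data[:audio_size]
    sep_end = audio_size + len(SEP_AUDIO_TEXT)
    if data[audio_size:sep_end] != SEP_AUDIO_TEXT:
        raise ValueError("first data separator not found after the audio")
    text_region = data[sep_end:]
    transcription, sep, translation = text_region.partition(SEP_TEXT_TRANSLATION)
    if not sep:
        raise ValueError("second data separator not found")
    return audio, transcription.decode("utf-8"), translation.decode("utf-8")


# The audio deliberately contains the separator bytes to show why its size
# is taken from metadata instead of being found by searching.
audio = b"\x01\xaf\xf5\x00\x02"
portion = build_data_portion(audio, "你好", "hello")
decoded = split_data_portion(portion, len(audio))
```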
The target audio frame further includes a target frame header that precedes the encoded data of the recorded audio in the target audio frame.
As shown in fig. 6, the target header includes a frame number 610, a frame size 620, an audio data size 630, a transcription data identification 640, and a translation data identification 650.
The frame number starts at 0 and increases sequentially to represent the order of frames in the target encoded audio; the frame number is 4 bytes long.
The frame size identifies the total size of the target audio frame data, including the size of the frame header data and the size of the binary-encoded audio and text data within the frame, and is 4 bytes long.
The size of the audio data is used for identifying the size of pure audio data in the target audio frame, namely the data size of the recording audio in the target audio frame, so that the audio data corresponding to the recording audio can be conveniently obtained by decoding.
The transcription data identification is used for identifying whether the transcription text is stored in the target audio frame, and the translation data identification is used for identifying whether the translation text is stored in the target audio frame.
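The five frame header fields can be sketched as a small fixed-size record. The 4-byte widths of the frame number and frame size come from the text. The 4-byte audio data size, the 1-byte identifiers and the little-endian byte order are assumptions, and `FrameHeader` is a hypothetical name.

```python
import struct
from dataclasses import dataclass

# frame number (4), frame size (4), audio data size (4, assumed),
# transcription identifier (1, assumed), translation identifier (1, assumed)
FRAME_HEADER = struct.Struct("<IIIBB")


@dataclass
class FrameHeader:
    frame_number: int
    frame_size: int
    audio_size: int
    has_transcription: bool
    has_translation: bool

    def pack(self):
        """Serialize the header; identifiers are written as 0x01 or 0x00."""
        return FRAME_HEADER.pack(
            self.frame_number,
            self.frame_size,
            self.audio_size,
            int(self.has_transcription),
            int(self.has_translation),
        )

    @classmethod
    def unpack(cls, data):
        """Parse a header from the start of `data`."""
        number, size, audio_size, transcribed, translated = (
            FRAME_HEADER.unpack_from(data)
        )
        return cls(number, size, audio_size, bool(transcribed), bool(translated))


original = FrameHeader(0, 100, 64, True, False)
raw = original.pack()
restored = FrameHeader.unpack(raw)
```

Storing the audio data size in the header lets a decoder slice out the recorded audio directly, without scanning the payload for a separator.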
It is understood that not every frame of data in the recorded audio generated during recording necessarily has transcribed text or translated text.
For example, when the sound recorded in a frame of data is the user's breathing, a sigh, or another sound that cannot be transcribed, the frame has no corresponding transcribed text and therefore no translated text either.
For another example, when the sound recorded in a frame of data is a word that is difficult to translate, such as the user's interjections "kao", "o" or "er", semantic recognition cannot be performed on the transcribed text and no corresponding translated text can be obtained.
It should be noted that the target frame header includes a transcription data identifier and a translation data identifier, and an audio frame with the target frame header is a target audio frame. When decoding the target encoded audio, whether an audio frame has corresponding transcribed or translated text can be determined by whether that frame has the target frame header.
In some embodiments, the byte length of the recording file is obtained; and under the condition that the byte length is determined to be larger than the target byte length, a target frame header is inserted before the recorded audio to obtain a target audio frame.
In this embodiment, the byte length of the recording file is obtained and used to determine whether the recording file has corresponding transcribed text or translated text.
The byte length of the sound recording file comprises the byte length of all information of the sound recording file.
When the recording file comprises the recording audio, the transcription text and the translation text, the byte length of the recording file is the sum of the byte lengths of the recording audio, the transcription text and the translation text.
When the recording file only has recording audio, the byte length of the recording file is the byte length of the recording audio.
In actual implementation, the byte length of the recording file can be obtained with a calculation of the form sizeof(byte[]).
And under the condition that the byte length of the recording file is greater than the target byte length, inserting a target frame header with a transcription data identifier and a translation data identifier in front of the recording audio to obtain a corresponding target audio frame.
For example, when the byte length of the recording file obtained by sizeof(byte[]) is larger than the target byte length, transcribed text or translated text exists, and the transcription data identifier or translation data identifier is stored as 0x01.
In some embodiments, the target encoded audio includes the target audio frame and other audio frames.
The target audio frame is a coded audio frame comprising a recorded audio, a transcription text and a translation text, and the target audio frame is provided with a target frame header which is provided with a transcription data identifier and a translation data identifier.
The other audio frames are coded audio frames which do not have transcription or translation information and only comprise recorded audio, and the frame headers of the other audio frames do not have transcription data identifications and translation data identifications.
In actual implementation, when the byte length of the recording file is not greater than the target byte length, another frame header that carries no transcription data identifier or translation data identifier is inserted before the recorded audio to obtain the corresponding other audio frame.
For example, when the byte length of the recording file obtained by sizeof(byte[]) is not greater than the target byte length, no transcribed text or translated text exists, which corresponds to a stored value of 0x00, and the other frame header carries no transcription data identifier or translation data identifier.
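The byte-length comparison can be sketched as a single function. The source does not say how the target byte length is chosen. This sketch assumes it equals the fixed size of an audio-only chunk, here a hypothetical 20 ms of 16 kHz, 16-bit mono PCM; the names are illustrative only.

```python
TEXT_PRESENT = 0x01
TEXT_ABSENT = 0x00

# Hypothetical audio-only chunk: 20 ms of 16 kHz, 16-bit mono PCM = 640 bytes.
TARGET_BYTE_LENGTH = 16000 * 2 * 20 // 1000


def text_identifier(record, target_byte_length=TARGET_BYTE_LENGTH):
    """Return 0x01 when the record is longer than the target byte length.

    A record longer than the audio-only length is taken to carry transcribed
    or translated text; otherwise it is treated as audio only.
    """
    return TEXT_PRESENT if len(record) > target_byte_length else TEXT_ABSENT


audio_only = bytes(640)
audio_with_text = bytes(640) + "hello".encode("utf-8")
flags = (text_identifier(audio_only), text_identifier(audio_with_text))
```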
It is understood that the target encoded audio includes a plurality of encoded audio frames, some of which are target audio frames carrying transcription and translation information and some of which are other audio frames without transcription or translation information.
For example, as shown in FIG. 3, the target encoded audio includes target audio header 310, audio frames 320 and 340, which are target audio frames, and audio frame 330, which is another audio frame.
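A decoder can walk a file laid out like FIG. 3 by skipping the audio header and then advancing by each frame's stored size. The sketch below assumes a 26-byte audio header and a 14-byte frame header shared by both kinds of frame, with zeroed identifiers marking the other audio frames; the source instead describes other frames as having a header without identifiers, so this shared layout is a simplification, and all names are hypothetical.

```python
import struct

AUDIO_HEADER_SIZE = 26                  # assumed size of the audio header
FRAME_HEADER = struct.Struct("<IIIBB")  # assumed frame header layout


def make_frame(number, audio, text_payload=b""):
    """Build one frame; an empty text payload yields an audio-only frame."""
    has_text = 1 if text_payload else 0
    size = FRAME_HEADER.size + len(audio) + len(text_payload)
    header = FRAME_HEADER.pack(number, size, len(audio), has_text, has_text)
    return header + audio + text_payload


def iter_frames(encoded):
    """Yield (frame number, audio, text payload, is_target_frame) per frame."""
    offset = AUDIO_HEADER_SIZE
    while offset < len(encoded):
        if offset + FRAME_HEADER.size > len(encoded):
            raise ValueError("truncated frame header")
        number, size, audio_size, transcribed, translated = (
            FRAME_HEADER.unpack_from(encoded, offset)
        )
        if size < FRAME_HEADER.size or offset + size > len(encoded):
            raise ValueError("corrupt frame size")
        body = encoded[offset + FRAME_HEADER.size:offset + size]
        yield number, body[:audio_size], body[audio_size:], bool(transcribed or translated)
        offset += size  # the stored frame size locates the next frame


# Mirrors FIG. 3: two frames with text around one audio-only frame.
encoded = (
    bytes(AUDIO_HEADER_SIZE)
    + make_frame(0, b"aa", b"T")
    + make_frame(1, b"bb")
    + make_frame(2, b"cc", b"U")
)
frames = list(iter_frames(encoded))
```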
In this embodiment, the audio frame data sizes of the encoded audio frames are stored in the respective frame header information.
The data size of the target audio frame is the sum of the data sizes of the target frame header, the recorded audio, the first data separator, the transcribed text, the second data separator and the translated text.
The data size of the other audio frames is the sum of the data sizes of the other frame headers and the recorded audio.
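The two size sums can be written out directly. The 14-byte frame header and 3-byte separators are assumptions carried over for illustration, and the sketch further assumes that the other frame header has the same size as the target frame header, which the source does not state.

```python
FRAME_HEADER_SIZE = 14  # assumed header size for both kinds of frame
SEPARATOR_SIZE = 3      # assumed size of each data separator


def target_frame_size(audio, transcription, translation):
    """Header + audio + separator + transcription + separator + translation."""
    return (
        FRAME_HEADER_SIZE
        + len(audio)
        + SEPARATOR_SIZE
        + len(transcription)
        + SEPARATOR_SIZE
        + len(translation)
    )


def other_frame_size(audio):
    """Header + audio only."""
    return FRAME_HEADER_SIZE + len(audio)


# All arguments are already-encoded byte strings.
with_text = target_frame_size(bytes(640), b"hi", b"yo")
audio_only = other_frame_size(bytes(640))
```

With these assumed sizes, a 640-byte audio chunk with two 2-byte texts gives a 664-byte target audio frame, while the same chunk alone gives a 654-byte frame.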
Because not every frame of data in the recording audio generated in the recording process includes transcribed text or translated text, the data sizes of the encoded audio frames in the target encoded audio may differ from one another.
In some embodiments, a target number of recorded audios, transcription texts, and translation texts may be assembled into a target recorded audio, a target transcription text, and a target translation text, respectively;
and coding the target recording audio, the target transcription text and the target translation text to obtain a target audio frame.
It is understood that, while the sound pickup device 10 picks up sound and outputs the recording file and the recording file is transcribed and translated, a plurality of recorded audios, transcribed texts, and translated texts are output continuously.
In this embodiment, a target number of consecutive recorded audios, transcribed texts, and translated texts in the recording file are obtained and assembled, respectively, into the corresponding target recorded audio, target transcribed text, and target translated text.
The target recorded audio, the target transcribed text, and the target translated text are then encoded to obtain the corresponding target encoded audio.
In actual execution, the obtained recorded audios, transcribed texts, and translated texts can be cached in sequence; when the number of cached recorded audios, transcribed texts, and translated texts reaches the target number, the target number of consecutive recorded audios, transcribed texts, and translated texts are integrated into the corresponding target recorded audio, target transcribed text, and target translated text.
It can be understood that the recording file contains equal numbers of recorded audios, transcribed texts, and translated texts; the cached numbers reaching the target number means that the corresponding cache queues hold the target number of recorded audios, of transcribed texts, and of translated texts.
For example, recording audio 1 through recording audio 10, transcription text 1 through transcription text 10, and translation text 1 through translation text 10 are received and buffered.
The 10 recording audios from the recording audio 1 to the recording audio 10 are integrated into a target recording audio, the 10 transcription texts from the transcription text 1 to the transcription text 10 are integrated into a target transcription text, and the 10 translation texts from the translation text 1 to the translation text 10 are integrated into a target translation text.
In the target encoded audio, the encoded audio frames are arranged in the order in which they were received, and their frame numbers increase linearly.
Caching a target number of consecutive recorded audios, transcribed texts, and translated texts before encoding reduces how often encoding is performed. Moreover, because not every frame of data in the recorded audio includes transcribed or translated text, caching a target number of recorded audios makes it possible to obtain target audio frames that do carry transcribed and translated text.
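The caching step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `SegmentCache`, the target number of 10 (taken from the example of 10 items), and the choice to assemble texts by plain concatenation are all assumptions.

```python
from collections import deque

TARGET_NUMBER = 10  # hypothetical target number, matching the 10-item example


class SegmentCache:
    """Caches recorded audio, transcribed text, and translated text in arrival order."""

    def __init__(self, target_number=TARGET_NUMBER):
        self.target_number = target_number
        self.audios = deque()
        self.transcripts = deque()
        self.translations = deque()

    def push(self, audio, transcript, translation):
        """Cache one triple; return the assembled targets once enough are cached."""
        self.audios.append(audio)
        self.transcripts.append(transcript)
        self.translations.append(translation)
        if len(self.audios) < self.target_number:
            return None  # keep caching, nothing to encode yet
        n = self.target_number
        # Assemble n consecutive items from each queue (concatenation is an assumption).
        target_audio = b"".join(self.audios.popleft() for _ in range(n))
        target_transcript = "".join(self.transcripts.popleft() for _ in range(n))
        target_translation = "".join(self.translations.popleft() for _ in range(n))
        return target_audio, target_transcript, target_translation


cache = SegmentCache()
result = None
for i in range(1, 11):
    result = cache.push(bytes([i]) * 4, f"T{i};", f"E{i};")
```

With this arrangement the encoder is invoked once per 10 received items rather than once per item, which is the reduction in encoding frequency the paragraph refers to.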
In some embodiments, the target audio header includes: a start flag, sampling rate, number of channels, bit rate, file size, and target audio format information.
As shown in fig. 4, the target audio header of the target encoded audio contains basic information of the target encoded audio, including a start flag 410, a sampling rate 420, a channel number 430, a bit rate 440, a file size 450, and target audio format information 460.
The start flag of the target audio header marks the start of the entire target encoded audio; it may occupy 4 bytes and may be the marker "ATTC".
The sampling rate of the target audio header represents the sampling rate of the recorded audio; it may occupy 4 bytes and depends on the output setting of the sound pickup device.
The number of channels of the target audio header represents the number of channels of the recorded audio; it may occupy 2 bytes, with 1 for mono and 2 for two-channel audio, and depends on the output setting of the sound pickup device.
The bit rate of the target audio header represents the audio bit rate of the recorded audio and may occupy 4 bytes.
The audio format information of the target audio header represents the fixed format of the target encoded audio and may occupy 4 bytes, expressed as "ATC".
The file size is the total size of the recorded audio file and may occupy 4 bytes; it is updated after recording is complete.
The target audio header comprises the start flag, sampling rate, number of channels, bit rate, file size, and target audio format information, so the size of the whole target audio header is fixed at 22 bytes.
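The 22-byte total follows from the field widths: 4 + 4 + 2 + 4 + 4 + 4. A minimal Python sketch of packing and reading such a header is shown below. The field order follows the description above; the little-endian byte order, the NUL padding of "ATC" to 4 bytes, and the function names are assumptions, since the text does not state them.

```python
import struct

# start flag (4s), sampling rate (I), channels (H), bit rate (I),
# file size (I), format info (4s). Byte order is an assumption.
HEADER_FORMAT = "<4sIHII4s"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 22
FILE_SIZE_OFFSET = 4 + 4 + 2 + 4              # file size field starts at byte 14


def pack_target_audio_header(sample_rate, channels, bit_rate, file_size=0):
    """Build the fixed-size header; "ATC" is NUL-padded to 4 bytes by struct."""
    return struct.pack(HEADER_FORMAT, b"ATTC", sample_rate, channels,
                       bit_rate, file_size, b"ATC")


def unpack_target_audio_header(data):
    """Read the header fields back, checking the start flag first."""
    start, sample_rate, channels, bit_rate, file_size, fmt = struct.unpack(
        HEADER_FORMAT, data[:HEADER_SIZE])
    if start != b"ATTC":
        raise ValueError("missing start flag")
    return {
        "sample_rate": sample_rate,
        "channels": channels,
        "bit_rate": bit_rate,
        "file_size": file_size,
        "format": fmt.rstrip(b"\x00").decode("ascii"),
    }


# The file size is only known once recording ends, so it is patched in place.
header = bytearray(pack_target_audio_header(16000, 1, 256000))
struct.pack_into("<I", header, FILE_SIZE_OFFSET, 123456)
info = unpack_target_audio_header(bytes(header))
```

Keeping the header at a fixed size means the file size field can be overwritten at a known offset after recording, without rewriting any frames.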
The invention can simultaneously carry out multi-language transcription and translation in the recording process, and synchronously carry out coding storage on the transcription and translation result and the audio to generate the target coding audio in a new audio format, so that a user can directly analyze the audio and the transcription and translation result when acquiring the target coding audio, thereby improving the efficiency.
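Because each frame header stores the data size of its frame, a reader can step through frames of differing sizes without scanning for boundaries. The sketch below illustrates that idea only; the 10-byte frame header layout (frame size, frame number, two identifier bytes), the helper `make_frame`, and the sample payloads are hypothetical, and only the 22-byte target audio header size comes from the description.

```python
import struct

# Hypothetical frame header: frame size (4 bytes), frame number (4 bytes),
# transcription identifier (1 byte), translation identifier (1 byte).
FRAME_HEADER = "<IIBB"
FRAME_HEADER_SIZE = struct.calcsize(FRAME_HEADER)
AUDIO_HEADER_SIZE = 22  # fixed size of the target audio header


def iter_frames(data):
    """Yield (frame_number, has_text, payload) for each encoded audio frame."""
    offset = AUDIO_HEADER_SIZE  # frames start right after the audio header
    while offset + FRAME_HEADER_SIZE <= len(data):
        size, number, has_transcript, has_translation = struct.unpack_from(
            FRAME_HEADER, data, offset)
        if size < FRAME_HEADER_SIZE or offset + size > len(data):
            raise ValueError("corrupt frame size")
        payload = data[offset + FRAME_HEADER_SIZE:offset + size]
        yield number, bool(has_transcript and has_translation), payload
        offset += size  # the stored size leads directly to the next frame


def make_frame(number, payload, has_text):
    """Test helper that writes a frame in the same hypothetical layout."""
    flag = 1 if has_text else 0
    header = struct.pack(FRAME_HEADER, FRAME_HEADER_SIZE + len(payload),
                         number, flag, flag)
    return header + payload


blob = bytes(AUDIO_HEADER_SIZE) + make_frame(1, b"aaaa|hi|yo", True) + make_frame(2, b"bb", False)
frames = list(iter_frames(blob))
```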
The audio processing device provided by the present invention is described below; the audio processing device 30 described below and the audio processing method described above may be referred to in correspondence with each other.
As shown in fig. 7, the audio processing device 30 provided by the present invention includes:
a first obtaining module 710, configured to obtain the recorded audio, the transcribed text, and the translated text of a recording file;
a first processing module 720, configured to encode the recorded audio, the transcribed text, and the translated text to obtain a target audio frame;
a second obtaining module 730, configured to obtain a target audio header, where the target audio header includes information representing a target audio format; and
a second processing module 740, configured to obtain target encoded audio based on the target audio header and the target audio frame, where the format of the target encoded audio is the target audio format.
According to the audio processing device 30 provided by the invention, the target coded audio in a new audio format is formed by coding and storing the recorded audio, the transcription text and the translation text, so that a user can directly analyze the audio and the corresponding transcription translation result by acquiring the target coded audio, the waiting time of the user can be effectively reduced, and the use experience of the user is improved.
In some embodiments, the first processing module 720 is configured to insert a first data separator between the recorded audio and the transcribed text, and insert a second data separator between the transcribed text and the translated text; inserting a target frame header before recording audio to obtain a target audio frame; the target frame header comprises a transcription data identifier and a translation data identifier.
In some embodiments, the first processing module 720 is configured to obtain the byte length of the recording file and, when the byte length is determined to be greater than the target byte length, insert a target frame header before the recorded audio to obtain a target audio frame.
In some embodiments, the first processing module 720 is configured to assemble a target number of consecutive recorded audios, transcribed texts, and translated texts into a target recorded audio, a target transcribed text, and a target translated text, respectively; and coding the target recording audio, the target transcription text and the target translation text to obtain a target audio frame.
In some embodiments, the target encoded audio includes the target audio frame and other audio frames.
In some embodiments, the target audio header includes a start flag, a sampling rate, a number of channels, a bit rate, a file size, and target audio format information.
The invention also provides audio processing equipment.
As shown in fig. 2, the audio processing equipment includes the sound pickup device 10, the transcription and translation device 20, and the audio processing device 30 described above.
The sound pickup device 10 is used for picking up sound and outputting a sound recording file; the transcription and translation device 20 is electrically connected with the sound pickup device 10, and the transcription and translation device 20 is used for outputting recording audio, transcription text and translation text based on the recording file.
As shown in fig. 2, in step 210, the sound pickup device 10 picks up sound and inputs the resulting recording file to the transcription and translation device 20, which is capable of speech recognition processing for multilingual transcription and multilingual translation.
The transcription and translation device 20 may perform speech recognition on the recording file to obtain the corresponding transcribed text, perform semantic recognition on the transcribed text, and translate it with a translation engine to obtain the corresponding translated text.
In step 220, the transcription and translation device 20 inputs the recorded audio, the transcribed text, and the translated text to the audio processing device 30.
The audio processing device 30 is electrically connected to the transcription and translation device 20, and the audio processing device 30 is configured to obtain the target encoded audio based on the recorded audio, the transcribed text, and the translated text.
According to the audio processing equipment provided by the invention, the target coded audio in a new audio format is formed by coding and storing the recorded audio, the transcription text and the translation text, so that a user can directly analyze the audio and the corresponding transcription and translation result by acquiring the target coded audio, the waiting time of the user can be effectively reduced, and the use experience of the user is improved.
Fig. 8 illustrates the physical structure of an electronic device. As shown in fig. 8, the electronic device may include: a processor 810, a communication interface 820, a memory 830, and a communication bus 840, where the processor 810, the communication interface 820, and the memory 830 communicate with one another via the communication bus 840. The processor 810 may invoke logic instructions in the memory 830 to perform an audio processing method that includes: obtaining recorded audio, transcribed text, and translated text of a recording file; encoding the recorded audio, the transcribed text, and the translated text to obtain a target audio frame; acquiring a target audio header, where the target audio header includes information representing a target audio format; and obtaining target encoded audio based on the target audio header and the target audio frame, where the format of the target encoded audio is the target audio format.
In addition, the logic instructions in the memory 830 may be implemented in software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product including a computer program stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can execute the audio processing method provided by the method embodiments above, the method including: obtaining recorded audio, transcribed text, and translated text of a recording file; encoding the recorded audio, the transcribed text, and the translated text to obtain a target audio frame; acquiring a target audio header, where the target audio header includes information representing a target audio format; and obtaining target encoded audio based on the target audio header and the target audio frame, where the format of the target encoded audio is the target audio format.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium storing a computer program that, when executed by a processor, implements the audio processing method provided by the method embodiments above, the method including: obtaining recorded audio, transcribed text, and translated text of a recording file; encoding the recorded audio, the transcribed text, and the translated text to obtain a target audio frame; acquiring a target audio header, where the target audio header includes information representing a target audio format; and obtaining target encoded audio based on the target audio header and the target audio frame, where the format of the target encoded audio is the target audio format.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (11)

1. An audio processing method, comprising:
acquiring recorded audio, a transcribed text, and a translated text of a recording file;
encoding the recorded audio, the transcribed text, and the translated text to obtain a target audio frame;
acquiring a target audio header, wherein the target audio header comprises information representing a target audio format; and
obtaining target encoded audio based on the target audio header and the target audio frame, wherein the format of the target encoded audio is the target audio format.
2. The audio processing method of claim 1, wherein the encoding the recorded audio, the transcribed text, and the translated text to obtain a target audio frame comprises:
inserting a first data separator between the recorded audio and the transcribed text, and inserting a second data separator between the transcribed text and the translated text; and
inserting a target frame header before the recorded audio to obtain the target audio frame;
wherein the target frame header comprises a transcription data identifier and a translation data identifier.
3. The audio processing method of claim 2, wherein the inserting a target frame header before the recorded audio to obtain the target audio frame comprises:
acquiring the byte length of the recording file; and
inserting the target frame header before the recorded audio to obtain the target audio frame when the byte length is determined to be greater than the target byte length.
4. The audio processing method of claim 1, wherein the acquiring recorded audio, a transcribed text, and a translated text of a recording file comprises:
acquiring a target number of consecutive recorded audios, transcribed texts, and translated texts in the recording file;
and the encoding the recorded audio, the transcribed text, and the translated text to obtain a target audio frame comprises:
assembling the target number of consecutive recorded audios, transcribed texts, and translated texts into a target recorded audio, a target transcribed text, and a target translated text, respectively; and
encoding the target recorded audio, the target transcribed text, and the target translated text to obtain the target audio frame.
5. The audio processing method according to any one of claims 1 to 4, wherein the target encoded audio comprises the target audio frame and other audio frames.
6. The audio processing method according to any one of claims 1 to 4, wherein the target audio header includes a start flag, a sampling rate, a number of channels, a bit rate, a file size, and target audio format information.
7. An audio processing device, comprising:
a first obtaining module, configured to obtain recorded audio, a transcribed text, and a translated text of a recording file;
a first processing module, configured to encode the recorded audio, the transcribed text, and the translated text to obtain a target audio frame;
a second obtaining module, configured to obtain a target audio header, wherein the target audio header comprises information representing a target audio format; and
a second processing module, configured to obtain target encoded audio based on the target audio header and the target audio frame, wherein the format of the target encoded audio is the target audio format.
8. Audio processing equipment, comprising:
a sound pickup device, configured to pick up sound and output a recording file;
a transcription and translation device, electrically connected to the sound pickup device and configured to output recorded audio, a transcribed text, and a translated text based on the recording file; and
the audio processing device according to claim 7, electrically connected to the transcription and translation device and configured to obtain the target encoded audio based on the recorded audio, the transcribed text, and the translated text.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the audio processing method according to any of claims 1 to 6 are implemented when the processor executes the program.
10. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the audio processing method according to any one of claims 1 to 6.
11. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the audio processing method according to any of claims 1 to 6 when executed by a processor.
CN202111206068.1A 2021-10-14 2021-10-14 Audio processing method, device and equipment Pending CN113921011A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111206068.1A CN113921011A (en) 2021-10-14 2021-10-14 Audio processing method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111206068.1A CN113921011A (en) 2021-10-14 2021-10-14 Audio processing method, device and equipment

Publications (1)

Publication Number Publication Date
CN113921011A true CN113921011A (en) 2022-01-11

Family

ID=79240748

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111206068.1A Pending CN113921011A (en) 2021-10-14 2021-10-14 Audio processing method, device and equipment

Country Status (1)

Country Link
CN (1) CN113921011A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination