CN110895575A - Audio processing method and device - Google Patents


Info

Publication number
CN110895575A
Authority
CN
China
Prior art keywords
audio
information
text
clip
search
Prior art date
Legal status
Granted
Application number
CN201810974926.9A
Other languages
Chinese (zh)
Other versions
CN110895575B (en)
Inventor
高欣羽
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201810974926.9A priority Critical patent/CN110895575B/en
Publication of CN110895575A publication Critical patent/CN110895575A/en
Application granted granted Critical
Publication of CN110895575B publication Critical patent/CN110895575B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses an audio processing method and device, comprising the following steps: converting first audio information to be processed into text information; searching the converted text information by using search information to obtain a text segment containing the search information; and processing the audio clip corresponding to the text segment containing the search information to obtain second audio information. By visually presenting the audio content and performing tasks such as searching, positioning, editing and splicing on it according to the search information, the method and the device allow audio information to be edited as conveniently, automatically and efficiently as text, greatly reducing the workload of the whole audio processing task.

Description

Audio processing method and device
Technical Field
The present application relates to, but is not limited to, speech recognition technology, and more particularly to an audio processing method and apparatus.
Background
When a user edits and splices multiple sections of audio, the user usually has to listen to each section first, manually mark the content segments to be spliced, and finally splice the audio of the marked segments. This approach clearly involves a large workload. Furthermore, manually recording the start and stop time points of a content segment is error-prone: if the exact stop time point required is 00:01:53 but the user records 00:02:01, extraneous sound is left in the audio piece to be spliced. Moreover, the user ends up with a collection of audio files plus hand-recorded time points and segment notes, and the content of the audio before and after splicing is hard to see at a glance, which greatly increases the difficulty of auditing, rechecking, modifying and re-editing, and forces a large amount of written description to be attached to the audio files when they are archived or handed over.
In summary, the audio processing approach in the related art is time-consuming, inefficient and labor-intensive, has a high error rate in time-point marking, and offers little intelligent automation.
Disclosure of Invention
The application provides an audio processing method and an audio processing device, which can accurately and efficiently process audio information.
The application provides an audio processing method, which comprises the following steps:
converting first audio information to be processed into text information;
searching the converted text information by utilizing the search information to obtain a text segment containing the search information;
and processing the audio clip corresponding to the text clip containing the search information to obtain second audio information.
Optionally, the searching the converted text information by using the search information to obtain the text segment containing the search information includes:
searching in the text information according to the search information to obtain at least one text segment containing the search information;
and respectively determining the start-stop time point information of the audio segment corresponding to each text segment according to the searched start-stop position of at least one text segment.
The start-stop time point information of each text segment containing the search information is thereby marked.
Optionally, the processing according to the audio segment corresponding to the text segment containing the search information includes:
splicing the obtained text segments containing the search information into a text message;
cutting each audio clip from the first audio information according to the start-stop time points of the audio clip corresponding to each text segment in the spliced text information;
and splicing the cut audio segments to obtain the second audio information.
Optionally, the method further comprises:
identifying an audio source of an audio segment corresponding to the text segment;
and adding audio source information to the audio clips.
Optionally, identifying an audio source of an audio segment corresponding to the text segment, and adding audio source information to the audio segment, including:
judging a speaker corresponding to the audio clip through voiceprint recognition;
and adding information of the speaker corresponding to the voiceprint to the text segment.
Optionally, the processing the audio clip corresponding to the text clip containing the search information includes:
converting the text information added with the speaker information into a corresponding audio clip by a speech synthesis technology; and splicing the converted audio segments into the second audio information.
Optionally, the method further comprises:
generating text information containing the additional information, and converting the text information containing the additional information into a system audio clip by using a speech synthesis technology;
the processing of the audio clip corresponding to the obtained text clip containing the search information includes: and splicing the obtained system audio clip and the audio clip corresponding to the text clip containing the search information to form the second audio information.
Optionally, after the splicing into one text message, the method further includes:
and editing the spliced text information according to the operation information from the user.
Optionally, the editing comprises: adding or deleting text and adding annotation and comment information.
The present application further provides an audio processing apparatus, comprising a memory and a processor, wherein the memory stores the following instructions executable by the processor: for performing the steps of the audio processing method of any of the above.
The present application further provides a computer-readable storage medium storing computer-executable instructions for performing any of the audio processing methods described above.
The present application further provides an audio processing apparatus comprising: a conversion unit, a search unit, and a processing unit; wherein:
the conversion unit is used for converting first audio information to be processed into text information;
the search unit is used for searching the converted text information by utilizing the search information to obtain a text segment containing the search information;
and the processing unit is used for processing the audio clip corresponding to the text clip containing the search information to obtain second audio information.
Optionally, the search unit is specifically configured to:
searching in the text information according to the search information to obtain at least one text segment containing the search information;
and respectively determining the start-stop time point information of the audio segment corresponding to each text segment according to the searched start-stop position of at least one text segment.
Optionally, the processing unit is specifically configured to: splicing the text segments containing the search information into a text message;
cutting each audio clip from the first audio information according to the start-stop time points of the audio clip corresponding to each text segment in the spliced text information;
and splicing the cut audio segments to obtain the second audio information.
Optionally, the processing unit is further configured to: and editing the spliced text information according to the operation information from the user.
Optionally, the apparatus further comprises: the adding unit is used for generating text information containing the additional information and converting the text information containing the additional information into a system audio clip;
the processing unit is specifically configured to: and splicing the obtained system audio clip and the obtained audio clip corresponding to the text clip containing the search information to form the second audio information.
The technical solution at least includes: converting first audio information to be processed into text information; searching the converted text information by using preset search information to obtain a text segment containing the search information; and processing the audio clip corresponding to the obtained text segment to form processed second audio information. By visually presenting the audio content and performing searching, positioning, editing and splicing on it according to search information such as keywords, the method and the device allow audio information to be edited as conveniently, automatically and efficiently as text, greatly reducing the workload of the whole audio processing task.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the claimed subject matter and are incorporated in and constitute a part of this specification, illustrate embodiments of the subject matter and together with the description serve to explain the principles of the subject matter and not to limit the subject matter.
FIG. 1 is a flow chart of an audio processing method in an embodiment of the present application;
fig. 2 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more apparent, embodiments of the present application will be described in detail below with reference to the accompanying drawings. It should be noted that the embodiments and features of the embodiments in the present application may be arbitrarily combined with each other without conflict.
In one exemplary configuration of the present application, a computing device includes one or more processors (CPUs), input/output interfaces, a network interface, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media does not include transitory media, such as modulated data signals and carrier waves.
The steps illustrated in the flow charts of the figures may be performed in a computer system, for example as a set of computer-executable instructions. Also, while a logical order is shown in the flow charts, in some cases the steps shown or described may be performed in a different order.
Fig. 1 is a flowchart of an audio processing method in an embodiment of the present application, as shown in fig. 1, including:
step 100: and converting the first audio information to be processed into text information.
The first audio information to be processed may include one or more audio files.
Optionally, in this step, a Speech conversion technology, such as a Speech-to-Text (STT) technology, may be used to convert each audio file into a Text file.
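As a minimal sketch of this step: the `speech_to_text` function below is a hypothetical placeholder standing in for any real STT engine (the application names STT only as an example technology); it is stubbed with canned text purely so the surrounding flow can be illustrated end to end.

```python
# Sketch of step 100: convert each audio file of the first audio
# information into text information. `speech_to_text` is a placeholder
# (an assumption, not an API from the application) for a real STT call.

def speech_to_text(audio_path):
    """Placeholder STT call: return the transcript of one audio file.

    A real engine would decode the audio; this stub returns canned
    text keyed by file name purely for illustration.
    """
    canned = {
        "audio_a.wav": "Aliyun is a relatively mature cloud product provider.",
        "audio_b.wav": "We rely on cloud computing for mass storage.",
    }
    return canned.get(audio_path, "")

def convert_all(audio_paths):
    """Step 100: one piece of text information per audio file."""
    return {path: speech_to_text(path) for path in audio_paths}

texts = convert_all(["audio_a.wav", "audio_b.wav"])
```

In a real deployment the placeholder would be replaced by a call to whatever STT service is available; the rest of the flow only needs the returned text.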
Step 101: and searching the converted text information by utilizing the search information to obtain a text segment containing the search information.
Optionally, the search information may be a preset keyword.
Optionally, this step may include: searching in the text information according to the search information to obtain at least one text segment containing the search information; and respectively determining the start-stop time point information of the audio segment corresponding to each text segment according to the searched start-stop position of at least one text segment.
In an exemplary embodiment, taking the search information as a keyword, the converted text information is searched with the keyword to obtain the text segments containing the search information, and the start-stop time point information of the audio segment corresponding to each such text segment is identified. Specifically: after the keyword search yields a text segment containing the search information, that segment is marked, i.e. a marked segment is a segment containing the search information. The mark carries the segment's start and stop time points; for example, if text segment A spans 00:04:32-00:25:01, it is marked as: audio A 00:04:32-00:25:01.
It should be noted that, according to the voice intelligent conversion technology provided in the related art, preliminary sentence-breaking processing can be performed. Then, the text segment containing the search information in the present application may be defined as: a full sentence containing the search information. In the step, the text segments containing the keywords are searched, so that the context containing the keywords can be quickly acquired, and the positioning effect on the related text information is realized.
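The definition above, a text segment being a full sentence containing the search information, can be sketched as follows (the sentence break here is a simple punctuation split, standing in for the preliminary sentence-breaking of the speech conversion technology):

```python
import re

def find_segments(text, keyword):
    """Step 101 sketch: split the converted text at sentence-ending
    punctuation (the preliminary sentence break) and return every full
    sentence containing the keyword, with its character start/stop
    offsets, which later map to audio start-stop time points."""
    segments = []
    pos = 0
    for sentence in re.split(r"(?<=[.!?\u3002])\s*", text):
        if not sentence:
            continue
        start = text.index(sentence, pos)
        end = start + len(sentence)
        if keyword in sentence:
            segments.append((sentence, start, end))
        pos = end
    return segments

text = ("Gene detection is maturing. Aliyun is a mature cloud provider. "
        "Aliyun provides various products.")
hits = find_segments(text, "Aliyun")  # two full sentences located
```

Returning the offsets alongside each sentence is what makes the later jump from text positions back to audio time points possible.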
Optionally, if the user considers that the sentence break of a text segment containing the search information is inaccurate or incomplete, the user may decide, starting from the marked position, which characters are selected as the text segment to be spliced subsequently. This may be implemented by providing the user with a human-computer interaction interface; the specific implementation form does not limit the scope of the present application.
It should be noted that a segment of speech may have several corresponding forms of information, such as a sound chart (waveform), a time axis, and text. The sound chart and time axis, like an audio track, carry time information. The principle of speech conversion is to turn sound clips into text; when the unit sound clip is small enough, its corresponding time point can be read off the time axis and pushed back to the text. For example: the transcription yields the word 'cloud'; the sound clip transcribed into 'cloud' lies on the voice file's time axis at 00:05:30; therefore the word 'cloud' in the text obtains the corresponding time point 00:05:30. In other words, the converted text information carries corresponding time information.
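The back-mapping just described can be sketched with word-level timestamps; assuming (this is an assumption, not stated in the application) that the STT engine emits one `(word, start_seconds, end_seconds)` triple per transcribed word:

```python
def time_of_word(word_timestamps, word):
    """Back-map a word in the converted text to the time span of the
    sound clip it was transcribed from, using the per-word timestamps
    emitted during transcription."""
    for w, start, end in word_timestamps:
        if w == word:
            return start, end
    return None

# e.g. the word 'cloud' was transcribed from the clip at 00:05:30
stamps = [("the", 328.0, 328.2), ("cloud", 330.0, 330.5)]
span = time_of_word(stamps, "cloud")
```

With such a table, any character position found by the keyword search can be converted into a time point on the audio's time axis.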
Step 102: and processing the audio clip corresponding to the text clip containing the search information to obtain second audio information.
Optionally, this step includes: splicing the obtained text segments containing the search information into one complete piece of text information; cutting each audio clip from the first audio information to be processed according to the start-stop time points of the audio clip corresponding to each text segment in the spliced text information; and splicing the cut audio clips to obtain the second audio information.
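A dependency-free sketch of the cut-and-splice: a library such as pydub would normally slice real audio, but here the audio is modeled as a plain list of PCM samples so the time arithmetic is visible (sample values and rate are illustrative only).

```python
def cut(samples, sample_rate, start_s, end_s):
    """Cut one audio clip [start_s, end_s) out of a PCM sample list,
    using the start-stop time points of the corresponding text segment."""
    return samples[int(start_s * sample_rate):int(end_s * sample_rate)]

def splice(clips):
    """Concatenate the cut clips into the second audio information."""
    out = []
    for clip in clips:
        out.extend(clip)
    return out

rate = 4                            # toy rate: 4 samples per second
audio_a = list(range(40))           # 10 seconds of stand-in 'audio'
clip1 = cut(audio_a, rate, 1, 3)    # clip for one marked text segment
clip2 = cut(audio_a, rate, 5, 6)    # clip for another marked segment
second_audio = splice([clip1, clip2])
```

The same two operations, slice by time, then concatenate, are what a real audio library performs on actual waveform data.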
It should be noted that, after the text segments containing the search information have been spliced in this step, the present application may further include: editing the spliced text information according to operation information from the user. The editing includes, but is not limited to: adding or deleting text, adding annotations and comments, and so on. For example, the user may edit the spliced text directly, such as inserting a label like 'explained by Mr. Wang' where appropriate. Such editing is difficult to achieve on a bare audio file; by operating on the spliced text, the audio is edited with the convenience of text editing, and blind editing of the audio information is avoided.
According to the audio processing method described above, by visually presenting the audio content and performing tasks such as keyword search and positioning, editing, and audio splicing on it, audio information can be edited as conveniently, automatically and efficiently as text (copy, paste, cut, and so on), greatly reducing the workload of the whole audio processing task.
Optionally, after step 101 and before step 102, the audio processing method of the present application further includes:
and identifying the audio source of the audio clip corresponding to the text clip containing the search information, and adding audio source information to the audio clip. Therefore, the audio source information is added to the audio texts with different audio sources.
Wherein, the audio source refers to information of a speaker, and identifying the audio source of the audio segment is to determine whether the speakers of different audio segments are the same person.
Optionally, identifying an audio source of an audio segment corresponding to the text segment, and adding audio source information to the audio segment, including:
determining the speaker corresponding to the audio clip through voiceprint recognition; and adding information of the speaker corresponding to the voiceprint to the text segment.
In an exemplary embodiment, a voiceprint recognition technique may be used to identify whether speakers of different audio segments corresponding to text segments containing search information are the same person, i.e., whether the audio sources are the same; if the audio sources are not the same speaker, namely different audio sources exist, the user can be prompted whether the information of the speaker needs to be added or not;
if an instruction to select to add speaker information is received from a user, the speaker information is directly added to text information containing search information.
Accordingly, step 102 comprises: converting the text information with the added speaker information into audio clips carrying the speaker information, using a speech synthesis technology; and splicing the converted audio clips into the second audio information.
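The text-side half of this flow, prepending the identified speaker to each segment before speech synthesis, can be sketched as follows (the speaker labels are given here; real voiceprint recognition is out of scope for the sketch):

```python
def label_speakers(segments):
    """Add speaker information (as determined by voiceprint
    recognition; supplied directly here) in front of each text
    segment, in the form 'User A says: ...'."""
    return [f"{speaker} says: {text}" for speaker, text in segments]

segments = [("User A", "Aliyun is a mature cloud provider."),
            ("User B", "We rely on Aliyun products.")]
labelled = label_speakers(segments)
```

The labelled text would then be fed to a TTS engine so the spliced audio announces each speaker.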
Optionally, the method further includes:
generating text information containing the additional information, and converting it into a system audio clip using a speech synthesis technology such as Text-to-Speech (TTS); correspondingly, step 102 specifically includes: splicing the obtained system audio clip with the audio clips corresponding to the text segments containing the search information to form the second audio information.
The audio processing method of the present application is described in detail below with reference to specific embodiments.
Suppose the user is a product manager who has visited customers and collected four audio files, audio A, audio B, audio C and audio D, from four customers respectively. In this embodiment, it is assumed that the raw audio material collected includes: audio file A for speaker A, audio file B for speaker B, audio file C for speaker C, and audio file D for speaker D. The customers' views on 'Aliyun' now need to be sorted out from the collected audio files, that is, all segments mentioning 'Aliyun' need to be found in the four audio files and spliced into one audio file of comments about 'Aliyun'. The audio processing method provided by the application specifically comprises the following steps:
first, the audio file a, the audio file B, the audio file C, and the audio file D are all converted into text information using a voice conversion technique. Take the example of converting audio file a into text file a:
"We are a company that focuses on human genome data analysis and genetic information application development. With the maturity of gene sequencing technology and the rapid decrease in its cost, gene detection is gradually moving from scientific research into ordinary families. The United States has introduced a 'Precision Medicine' initiative, the United Kingdom has launched a 100,000-person genome program, and it is believed that China will soon launch a corresponding program. When one person undergoes whole-genome sequencing, a data volume of 90 Gbp is generated. If tens of thousands, millions or even tens of millions of people undergo whole-genome sequencing, the resulting mass of data cannot be handled by setting up a few servers oneself; it requires the large-scale computation and mass storage of cloud computing. Aliyun is currently a relatively mature cloud product provider in China, covering computation, storage, security and other aspects; it saves the labor and material cost of building one's own machine room, and has good elasticity. Our massive genomic data analysis relies on multiple products such as ECS, OSS, OTS and BatchCompute. Fastq data generated by the Hiseq X Ten sequencer is transmitted directly to OSS over a high-speed network, which solves the problems of data storage and backup. In data analysis, ECS and BatchCompute read the genome data on OSS directly from the intranet, analyze multiple genomes concurrently, and quickly return gene interpretation results.
Aliyun provides a wide variety of products, so that we do not need to spend too much manpower and material resources on server deployment and maintenance; we can concentrate our strength on our products to the greatest extent and provide users with the best genome data interpretation and analysis service. We look forward to working hand in hand with Aliyun and to the early arrival of personalized medicine."
Then, the converted text information is searched with the keyword 'Aliyun', and the relevant content is quickly located. Taking text file A as an example: first, a preliminary sentence break can be performed using the intelligent speech conversion technology provided in the related art. The quickly located text segment containing the search information in this embodiment can then be defined as a complete sentence containing the search information, i.e. 'Aliyun is currently a relatively mature cloud product provider in China' and 'Aliyun provides a wide variety of products'. Next, the user decides, based on the two quickly located positions, how much context to select as the text segments to be spliced. For example, the user selects the following text of file A as the searched related content: "... it requires the large-scale computation and mass storage of cloud computing. Aliyun is currently a relatively mature cloud product provider in China, covering computation, storage, security and other aspects; it saves the labor and material cost of building one's own machine room, and has good elasticity. Our massive genomic data analysis relies on multiple products such as ECS, OSS, OTS and BatchCompute. Fastq data generated by the Hiseq X Ten sequencer is transmitted directly to OSS over a high-speed network, which solves the problems of data storage and backup. In data analysis, ECS and BatchCompute read the genome data on OSS directly from the intranet, analyze multiple genomes concurrently, and quickly return gene interpretation results. Aliyun provides a wide variety of products, so that we do not need to spend too much manpower and material resources on server deployment and maintenance; we can concentrate our strength on our products to the greatest extent and provide users with the best genome data interpretation and analysis service. We look forward to working hand in hand with Aliyun and to the early arrival of personalized medicine."
In this way, the user quickly locates the desired segment. Suppose the start and stop time points of the located text segment are 00:04:32-00:25:01; then this piece of text information A is automatically marked as: audio file A 00:04:32-00:25:01.
In this embodiment, it is assumed that the converted text information B, text information C and text information D are searched, located and marked in the same way, as: audio file B 00:05:45-00:35:06, audio file C 00:01:22-00:15:03, and audio file D 00:34:01-00:46:45.
Then, the marked audio clips are cut from the four audio files collected by the product manager and spliced into the processed audio file: "audio A 00:04:32-00:25:01 + audio B 00:05:45-00:35:06 + audio C 00:01:22-00:15:03 + audio D 00:34:01-00:46:45".
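The marks used above, start-stop ranges such as "00:04:32-00:25:01", must be turned into numeric offsets before any cutting can happen; a small parser sketch:

```python
def parse_mark(mark):
    """Parse a marked start-stop range like '00:04:32-00:25:01' into a
    (start, stop) pair of offsets in seconds, the form needed when
    cutting the corresponding audio clip."""
    def to_seconds(hms):
        h, m, s = (int(x) for x in hms.split(":"))
        return h * 3600 + m * 60 + s
    start, end = mark.split("-")
    return to_seconds(start), to_seconds(end)

span = parse_mark("00:04:32-00:25:01")  # (272, 1501) seconds
```

Each marked text segment in the embodiment carries exactly one such range, so the full splice is just `parse_mark` applied to every mark in order.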
Furthermore, voiceprint recognition can identify that the four audio segments to be spliced come from four different speakers, and the user can then be prompted whether the speakers' information needs to be added. If the user chooses to add it, the addition can be made directly in the text information, such as: "User A says: text information of audio file A 00:04:32-00:25:01. User B says: text information of audio file B 00:05:45-00:35:06. User C says: text information of audio file C 00:01:22-00:15:03. User D says: text information of audio file D 00:34:01-00:46:45." Then, the text information with the added information is converted into audio clips carrying that information by speech synthesis, and finally the converted clips are spliced into the processed audio file.
Optionally, the additional information may also be a description of the spliced audio file. For the four audio segments identified and cut out in the above embodiment, for example, a title or other description may be added, such as: "The content of this audio is the evaluation of Aliyun by four customers. I visited these four users in Beijing on December 3, 2016." In this case the additional information can stand alone as a piece of text information representing the additional information. In the subsequent splicing, only this text information is converted into an audio clip of the additional information by speech synthesis; finally, that clip is spliced with the marked audio clips cut from the four audio files collected by the product manager, giving the processed audio file: "audio clip of additional information + audio A 00:04:32-00:25:01 + audio B 00:05:45-00:35:06 + audio C 00:01:22-00:15:03 + audio D 00:34:01-00:46:45".
The audio processing method provided by the embodiments of the present application presents audio content visually, so that sound can be 'read'; this improves the speed and transparency of cognitive processing and greatly improves friendliness to listeners and to subsequent audio processing personnel.
An embodiment of the present application further provides an audio processing apparatus, including a memory and a processor, where the memory stores the following instructions executable by the processor: the executable instructions are for performing the steps of the audio processing method described in one or more of the embodiments above.
The embodiment of the present application further provides a computer-readable storage medium, which stores computer-executable instructions for executing the audio processing method described in one or more embodiments above.
Fig. 2 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present application, which at least includes: a conversion unit, a search unit, and a processing unit; wherein:
the conversion unit is used for converting the first audio information to be processed into text information;
the search unit is used for searching the converted text information by utilizing the search information to obtain a text segment containing the search information;
and the processing unit is used for processing the audio clip corresponding to the obtained text clip containing the search information to obtain second audio information.
Optionally, the search unit is specifically configured to:
searching in the text information according to the search information to obtain at least one text segment containing the search information;
and respectively determining the start-stop time point information of the audio segment corresponding to each text segment according to the searched start-stop position of at least one text segment.
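The search unit's behavior can be sketched as follows, assuming the speech-to-text step produces word-level timestamps (many ASR engines can emit these). The `Word` type, the `find_clips` name, and the fixed context window are illustrative assumptions, not part of the claims:

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the first audio information
    end: float

def find_clips(transcript, keyword, context_words=2):
    """Return (text_segment, clip_start, clip_end) for every occurrence
    of `keyword`, padded with a few words of context on each side; the
    start-stop times of each text segment locate the audio clip."""
    clips = []
    for i, w in enumerate(transcript):
        if keyword in w.text:
            lo = max(0, i - context_words)
            hi = min(len(transcript), i + context_words + 1)
            segment = transcript[lo:hi]
            clips.append((
                " ".join(x.text for x in segment),
                segment[0].start,   # start time point of the audio clip
                segment[-1].end,    # stop time point of the audio clip
            ))
    return clips
```

Searching a five-word transcript for "cloud" would return one text segment together with the start-stop time points of its corresponding audio clip.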
Optionally, the processing unit is specifically configured to:
splicing the obtained text segments containing the search information into a text message;
cutting each audio clip from the first audio information according to the start-stop time points of the audio clips corresponding to the text clips in the spliced text information;
and splicing the cut audio segments to obtain second audio information.
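Given the start-stop time points determined by the search unit, cutting and splicing reduces to slicing sample arrays. A minimal sketch, assuming mono PCM samples and time points in seconds (the function name and signature are illustrative):

```python
def splice_clips(samples, sample_rate, time_ranges):
    """Cut each (start_s, end_s) clip out of the mono PCM `samples` of
    the first audio information and concatenate the clips in order,
    producing the second audio information."""
    out = []
    for start_s, end_s in time_ranges:
        a = int(start_s * sample_rate)  # start-stop time points -> sample indices
        b = int(end_s * sample_rate)
        out.extend(samples[a:b])
    return out
```

A production implementation would typically operate on an audio container (e.g. WAV frames) rather than a bare sample list, but the index arithmetic is the same.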
Optionally, the processing unit is further configured to edit the spliced text information according to operation information from the user, where editing includes, but is not limited to, adding or deleting text, adding annotations and comments, and the like.
Optionally, the processing unit is further configured to:
and identifying the audio source of the audio clip corresponding to the text clip containing the search information, and adding audio source information for the audio clips with different audio sources.
Optionally, the audio processing apparatus of the present application further includes: an adding unit, used for generating text information containing the additional information and converting the text information containing the additional information into a system audio clip;
correspondingly, the processing unit is specifically configured to: and splicing the obtained system audio clip and the obtained audio clip corresponding to the text clip containing the search information to form the second audio information.
Alternatively, the search information may include keywords.
Although the embodiments disclosed in the present application are described above, the descriptions are only for the convenience of understanding the present application, and are not intended to limit the present application. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the disclosure as defined by the appended claims.

Claims (16)

1. An audio processing method, comprising:
converting first audio information to be processed into text information;
searching the converted text information by utilizing the search information to obtain a text segment containing the search information;
and processing the audio clip corresponding to the text clip containing the search information to obtain second audio information.
2. The audio processing method of claim 1, wherein the searching the converted text information by using the search information to obtain the text segment containing the search information comprises:
searching in the text information according to the search information to obtain at least one text segment containing the search information;
respectively determining the start-stop time point information of the audio clip corresponding to each text clip according to the searched start-stop position of at least one text clip;
and identifying the start-stop time point information of the text segment containing the search information.
3. The audio processing method according to claim 1, wherein the processing the audio clip corresponding to the text clip containing the search information comprises:
splicing the obtained text segments containing the search information into a text message;
cutting each audio clip from the first audio information according to the start-stop time points of the audio clips corresponding to the text clips in the spliced text information;
and splicing the cut audio segments to obtain the second audio information.
4. The audio processing method according to claim 1 or 3, characterized in that the method further comprises:
identifying an audio source of an audio segment corresponding to the text segment;
and adding audio source information to the audio clips.
5. The audio processing method according to claim 4, wherein identifying an audio source of an audio segment corresponding to the text segment and adding audio source information to the audio segment includes:
judging a speaker corresponding to the audio clip through voiceprint recognition;
and adding information of a speaker corresponding to the voiceprint in the text fragment.
6. The audio processing method according to claim 5, wherein the processing the audio segment corresponding to the text segment containing the search information comprises:
converting the text information added with the information of the speaker into a corresponding audio clip through voice synthesis;
and splicing the converted audio segments to obtain the second audio information.
7. The audio processing method according to claim 1 or 3, characterized in that the method further comprises:
generating text information containing the additional information, and converting the text information containing the additional information into a system audio clip through voice synthesis;
the processing of the audio clip corresponding to the obtained text clip containing the search information includes: and splicing the system audio clip and the audio clip corresponding to the text clip containing the search information to form the second audio information.
8. The audio processing method of claim 3, wherein after said concatenating into a text message, the method further comprises:
and editing the spliced text information according to the operation information from the user.
9. The audio processing method of claim 8, wherein the editing comprises: adding or deleting text and adding annotation and comment information.
10. An audio processing apparatus comprising a memory and a processor, wherein the memory has stored therein instructions executable by the processor for performing the steps of the audio processing method of any one of claims 1 to 9.
11. A computer-readable storage medium storing computer-executable instructions for performing the audio processing method of any one of claims 1 to 9.
12. An audio processing apparatus, comprising: a conversion unit, a search unit, and a processing unit; wherein:
the conversion unit is used for converting first audio information to be processed into text information;
the search unit is used for searching the converted text information by utilizing the search information to obtain a text segment containing the search information;
and the processing unit is used for processing the audio clip corresponding to the text clip containing the search information to obtain second audio information.
13. The audio processing apparatus according to claim 12, wherein the search unit is specifically configured to:
searching in the text information according to the search information to obtain at least one text segment containing the search information;
and respectively determining the start-stop time point information of the audio segment corresponding to each text segment according to the searched start-stop position of at least one text segment.
14. The audio processing device according to claim 12, wherein the processing unit is specifically configured to: splicing the text segments containing the search information into a text message;
cutting each audio clip from the first audio information according to the start-stop time points of the audio clips corresponding to the text clips in the spliced text information;
and splicing the cut audio segments to obtain the second audio information.
15. The audio processing device of claim 14, wherein the processing unit is further configured to: and editing the spliced text information according to the operation information from the user.
16. The audio processing apparatus according to claim 12 or 14, characterized in that the apparatus further comprises: the adding unit is used for generating text information containing the additional information and converting the text information containing the additional information into a system audio clip;
the processing unit is specifically configured to: and splicing the obtained system audio clip and the obtained audio clip corresponding to the text clip containing the search information to form the second audio information.
CN201810974926.9A 2018-08-24 2018-08-24 Audio processing method and device Active CN110895575B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810974926.9A CN110895575B (en) 2018-08-24 2018-08-24 Audio processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810974926.9A CN110895575B (en) 2018-08-24 2018-08-24 Audio processing method and device

Publications (2)

Publication Number Publication Date
CN110895575A true CN110895575A (en) 2020-03-20
CN110895575B CN110895575B (en) 2023-06-23

Family

ID=69784964

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810974926.9A Active CN110895575B (en) 2018-08-24 2018-08-24 Audio processing method and device

Country Status (1)

Country Link
CN (1) CN110895575B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113204668A (en) * 2021-05-21 2021-08-03 广州博冠信息科技有限公司 Audio clipping method and device, storage medium and electronic equipment

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130041892A1 (en) * 2006-10-13 2013-02-14 Syscom Inc. Method and system for converting audio text files originating from audio files to searchable text and for processing the searchable text
US20130158992A1 (en) * 2011-12-17 2013-06-20 Hon Hai Precision Industry Co., Ltd. Speech processing system and method
DE102014203818A1 (en) * 2014-03-03 2015-09-03 Sennheiser Electronic Gmbh & Co. Kg Method and device for converting speech signals into text
CN106095799A (en) * 2016-05-30 2016-11-09 广州多益网络股份有限公司 The storage of a kind of voice, search method and device
CN106782545A (en) * 2016-12-16 2017-05-31 广州视源电子科技股份有限公司 A kind of system and method that audio, video data is changed into writing record
CN107644646A (en) * 2017-09-27 2018-01-30 北京搜狗科技发展有限公司 Method of speech processing, device and the device for speech processes
CN107798143A (en) * 2017-11-24 2018-03-13 珠海市魅族科技有限公司 A kind of information search method, device, terminal and readable storage medium storing program for executing


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
R. Ani et al.: "Smart Specs: Voice assisted text reading system for visually impaired persons using TTS method" *
Niu Songfeng et al.: "Design of an intelligent editing system for Chinese speech and text based on artificial intelligence" *


Also Published As

Publication number Publication date
CN110895575B (en) 2023-06-23

Similar Documents

Publication Publication Date Title
US10977299B2 (en) Systems and methods for consolidating recorded content
TW202008349A (en) Speech labeling method and apparatus, and device
US10410615B2 (en) Audio information processing method and apparatus
CN108305632A (en) A kind of the voice abstract forming method and system of meeting
US20200126583A1 (en) Discovering highlights in transcribed source material for rapid multimedia production
US20200126559A1 (en) Creating multi-media from transcript-aligned media recordings
JP2003289387A (en) Voice message processing system and method
TW201209804A (en) Digital media voice tags in social networks
WO2019169794A1 (en) Method and device for displaying annotation content of teaching system
US20200302112A1 (en) Speech to text enhanced media editing
CN111798833A (en) Voice test method, device, equipment and storage medium
WO2020182042A1 (en) Keyword sample determining method, voice recognition method and apparatus, device, and medium
US20160189107A1 (en) Apparatus and method for automatically creating and recording minutes of meeting
WO2017080235A1 (en) Audio recording editing method and recording device
US20160189103A1 (en) Apparatus and method for automatically creating and recording minutes of meeting
EP4322029A1 (en) Method and apparatus for generating video corpus, and related device
CN107680584B (en) Method and device for segmenting audio
CN112259083A (en) Audio processing method and device
CN110889266A (en) Conference record integration method and device
KR102036721B1 (en) Terminal device for supporting quick search for recorded voice and operating method thereof
CN110895575B (en) Audio processing method and device
KR20060100646A (en) Method and system for searching the position of an image thing
WO2023226726A1 (en) Voice data processing method and apparatus
CN114420125A (en) Audio processing method, device, electronic equipment and medium
JP5713782B2 (en) Information processing apparatus, information processing method, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40025596

Country of ref document: HK

GR01 Patent grant