CN111415651A - Audio information extraction method, terminal and computer readable storage medium - Google Patents
- Publication number
- CN111415651A (application number CN202010094370.1A)
- Authority
- CN
- China
- Prior art keywords
- audio
- track
- text
- extraction method
- information extraction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
Abstract
The invention discloses an audio information extraction method, a terminal, and a computer-readable storage medium. The method comprises: determining audio to be extracted; determining the audio tracks of the audio, the audio having at least two tracks; parsing the audio to obtain the audio text generated by each track; and storing each audio text on the basis of its track as the extracted audio information. Because the audio text generated by each track can be obtained directly through computer analysis, and the extracted audio information is obtained by storing each audio text on the basis of its track, the technical scheme of the invention obtains the information in the audio without the user having to play and listen to it manually. This makes the audio information convenient to consult and improves the user experience.
Description
Technical Field
The present invention relates to the field of information technology, and more particularly, to an audio information extraction method, a terminal, and a computer-readable storage medium.
Background
Existing intelligent terminals can generate audio through a wide range of recording software, and the audio is stored in sound formats such as WAV, MIDI, CDA, and MP3. When a user wants to obtain the information in a recording, the user must listen to it one or more times, which is time-consuming and degrades the user experience.
Disclosure of Invention
The invention provides an audio information extraction method, a terminal, and a computer-readable storage medium, solving the technical problem in the prior art that audio information can only be acquired by playing and listening, which degrades the user experience.
The invention provides an audio information extraction method, which comprises the following steps:
determining audio to be extracted, and determining tracks of the audio based on the audio, wherein the audio has at least two tracks;
analyzing the audio to obtain audio texts generated by each audio track;
and storing each audio text in the audio on the basis of its audio track, as the extracted audio information.
Optionally, determining the track of the audio based on the audio comprises:
parsing the audio to obtain its voiceprint features, wherein the audio comprises at least two sound waves and has at least two voiceprint features;
and determining that sound waves with the same voiceprint features belong to the same audio track, thereby obtaining each audio track in the audio.
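As an illustrative sketch only (not part of the patent text), this voiceprint-based grouping can be modeled in Python. The fixed-length `voiceprint` embedding attached to each sound wave and the similarity threshold are assumptions made for the example:

```python
import math

def cosine_similarity(a, b):
    # Similarity between two voiceprint embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def group_waves_into_tracks(waves, threshold=0.85):
    """Assign each sound wave to a track whose reference voiceprint it matches.

    `waves` is a chronological list of dicts carrying a precomputed
    `voiceprint` embedding; a new track is opened whenever no existing
    track is similar enough.
    """
    tracks = []  # each track: {"voiceprint": reference vector, "waves": [...]}
    for wave in waves:
        for track in tracks:
            if cosine_similarity(wave["voiceprint"], track["voiceprint"]) >= threshold:
                track["waves"].append(wave)
                break
        else:
            tracks.append({"voiceprint": wave["voiceprint"], "waves": [wave]})
    return tracks
```

In practice the embeddings would come from a speaker-verification model; the sketch only shows the grouping rule (same voiceprint, same track).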
Optionally, determining the track of the audio based on the audio comprises:
determining the sound sources of the audio, wherein the audio comprises at least two sound waves and at least two sound sources;
and determining that sound waves from the same sound source belong to the same audio track, thereby obtaining each audio track in the audio.
Optionally, determining the sound source of the audio according to the audio includes:
determining the social application that generated the audio;
and looking up, in the social application, each contact who participated in generating the audio, each such contact being determined as a sound source of the audio.
Optionally, after storing each audio text in the audio based on each audio track, the audio information extraction method further includes:
acquiring the text messages to be extracted;
determining the chronological order of the text messages and the audio texts;
and inserting the text messages between the audio texts in chronological order.
Optionally, after the audio is analyzed to obtain the audio text generated by each audio track, the audio information extraction method further includes:
displaying a track selection box, the track selection box comprising the audio tracks of the audio and the audio text generated by each track;
and detecting a selection operation on an audio track in the track selection box, and modifying the audio text generated by the selected track.
Optionally, storing each audio text in the audio based on each audio track includes:
determining the storage language of the audio and the audio texts in the audio that differ from the storage language;
translating the audio texts that differ from the storage language into the storage language, so that all audio texts of the audio are in one language;
and determining the chronological order of the audio texts in the audio, and storing the audio text recognized from each track in that chronological order.
Optionally, after storing each audio text in the audio based on each audio track to serve as the extracted audio information, the audio information extracting method further includes:
sending the audio information to a preset mailbox account and/or social application account;
or sending the audio text of a preselected audio track in the audio to a preset mailbox account and/or social application account;
or selecting at least one conference element from the audio information according to a preset conference record template, and filling the conference element into the template in a preset format to generate a conference record.
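The conference-record option can be illustrated with a minimal sketch. The field names and the line-per-field template shape are hypothetical, since the patent does not fix a template format:

```python
def generate_meeting_record(audio_info, template_fields):
    """Fill a preset conference record template from extracted audio information.

    `audio_info` maps conference element names (e.g. participants, agenda,
    resolutions) to values extracted from the audio; elements missing from
    the audio are left blank in the record.
    """
    lines = []
    for field in template_fields:
        lines.append(f"{field}: {audio_info.get(field, '')}")
    return "\n".join(lines)
```

A real implementation would pull the elements out of the stored audio texts (for example, by matching speaker tracks to participant names); the sketch only shows the template-filling step.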
Furthermore, the invention also provides a terminal, which comprises a processor, a memory and a communication bus;
the communication bus is used for realizing connection communication between the processor and the memory;
the processor is configured to execute one or more programs stored in the memory to implement the steps of the audio information extraction method as described in any one of the above.
Further, the present invention also provides a computer-readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps of the audio information extraction method as described in any one of the above.
To address the prior-art defect that audio information can only be obtained by playing and listening to the audio, the technical scheme provided by the invention parses the audio to be extracted into the audio text generated by each audio track and stores each audio text on the basis of its track, so that the audio information of the audio is obtained directly.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a sound wave diagram of a segment of audio provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of sound waves of the audio tracks of FIG. 1;
FIG. 3 is a flowchart illustrating an audio information extraction method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a first interaction interface of a social application according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an interaction interface of another social application provided in an embodiment of the present invention;
FIG. 6 is a diagram illustrating parsing of the audio of FIG. 1 to obtain audio texts;
FIG. 7 is a diagram illustrating an interface with a track selection box according to an embodiment of the present invention;
FIG. 8 is a second interactive interface of the social application shown in FIG. 4;
FIG. 9 is a diagram illustrating an interface with a track selection box according to an embodiment of the present invention;
fig. 10 is a preset conference recording template according to an embodiment of the present invention;
fig. 11 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
Detailed Description
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Before the audio information extraction method provided by the invention is introduced, some basic descriptions are carried out:
Audio refers to recorded sound audible to human beings, including speech, singing, the sounds of musical instruments, and meaningless noise. It may be stored in file formats such as WAV, MIDI, CDA, and MP3.
Sound propagates as sound waves, and the recorded sound in audio can be represented by sound waves as shown in fig. 1. Fig. 1 shows an audio Q, i.e., a recorded sound comprising a plurality of sound waves. The duration of the audio shown in fig. 1 is 23 seconds; Q5 denotes the fifth sound wave in the audio Q.
It should be understood that a piece of audio may include sound waves from one sound source or from multiple sound sources. In the embodiment of the invention, sound waves generated by the same sound source belong to the same track, and each track differs from the others in specific attributes such as timbre and voiceprint features.
In other embodiments, the specific attributes of a track may further include the timbre library, number of channels, input/output ports, volume, etc., used in audio editing devices such as sequencers and digital music software. A track can be displayed as one parallel "lane" in the audio editing device, such as the first track A and the second track B in fig. 2.
It should be noted that the sound waves in a track appear in time sequence; that is, a track carries not only the information in all of its sound waves but also the time at which each sound wave appears on the track.
Referring to fig. 2, the audio of fig. 1 has a first track A and a second track B. The first track A includes a first sound wave (A1 in fig. 2, occurring from the 0th to the 10th second) and a second sound wave (A2 in fig. 2, occurring from the 18th to the 23rd second). The second track B includes a third sound wave (B3 in fig. 2, occurring from the 18th to the 23rd second).
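The track structure just described (sound waves plus their times of appearance) can be sketched as a small data model. The class and field names are illustrative only, and the timings mirror the fig. 2 description:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SoundWave:
    label: str    # e.g. "A1"
    start: float  # seconds from the beginning of the audio
    end: float

@dataclass
class Track:
    name: str
    waves: List[SoundWave] = field(default_factory=list)

    def timeline(self):
        """Return the track's sound waves in order of appearance."""
        return sorted(self.waves, key=lambda w: w.start)

# The two tracks of fig. 2:
track_a = Track("A", [SoundWave("A2", 18, 23), SoundWave("A1", 0, 10)])
track_b = Track("B", [SoundWave("B3", 18, 23)])
```

The `timeline` method makes explicit that a track preserves the time information of its sound waves, which is what later allows the audio texts to be stored in chronological order.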
The following will describe the audio information extraction method provided by the present invention based on the above description, please refer to fig. 3, which is a schematic flow chart of the audio information extraction method provided by the embodiment of the present invention, and the method includes:
s301, determining the audio to be extracted, and determining the audio track of the audio based on the audio.
Here the audio has at least two tracks. It should be clear that the method provided by the embodiment of the invention can also extract audio having only one track.
The audio to be extracted here may be one complete, uninterrupted piece of audio, or audio composed of multiple pieces. There are many cases in which multiple pieces of audio are combined into one:
for example, referring to fig. 4, which shows that the user M and the user N use the social application to perform online voice chat, it can be known that the voice chat generates five voices with durations of 25 seconds, 32 seconds, 12 seconds, 8 seconds and 2 seconds, respectively, and the user can select the five voices to store, in this example, the five voices can be combined into one audio.
For another example, referring to fig. 5, nine users (user C through user K) use conference software for an online voice or video conference. Whether the conference is voice or video, the software can store the speech uttered by each user, and it then stores that speech in chronological order to obtain the conference audio of the conference.
In the embodiment of the invention, the number of audios to be extracted may be one. In other examples it may be multiple: for example, when a recording is interrupted and multiple pieces of audio are received in the same batch, the audio information of all the pieces must be extracted simultaneously and integrated into one complete record. This multiple-audio case is described in more detail later.
Determining the tracks of the audio may be done on the basis of voiceprint features. Voiceprint features of a sound, such as its pitch and frequency, are preserved when the sound is stored: as long as the stored sound is unchanged, the voiceprint features of the sound source (the object that produced the sound) remain, so sound waves with the same voiceprint features can be determined to belong to the same track. This voiceprint-based extraction can be used with recording software that does not distinguish sound sources, as well as with social applications that host multiple users.
In other embodiments, determining the tracks may be based on the sound sources of the audio. It should be understood that a piece of audio is a combination of sound waves emitted by multiple sources; strictly speaking, the audio itself has no single source. "Sound source" here means the origin of each individual sound wave in the audio, and sound waves with the same source can be determined to belong to the same track. This source-based extraction can be used with social applications that host multiple users.
It should be noted that in this embodiment the sound source of the audio is not determined by analyzing the audio for voiceprint features; instead, the source of each sound wave can be traced back through the social application. Specifically, during a voice conference, video conference, voice chat, or in-game team voice chat, the social application records the sound waves generated by each user, and the relationship between a user and the sound waves that user generates is uniquely determined. Once the source of each sound wave is known, waves with the same source can be determined to belong to the same track.
Referring to fig. 4, of the five voices generated in the voice chat, those with durations of 25, 12, and 8 seconds were generated by user M, and those with durations of 32 and 2 seconds by user N. After the user merges and stores the five voices as one audio, the sources (users) of the voices in the conversation can be traced back through the social application, thereby determining the sound sources of the merged audio.
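Tracing sources through the social application's own records, as in the fig. 4 example, amounts to grouping voice messages by sender with no voiceprint analysis at all. A minimal sketch (the `(sender, duration)` log format is an assumption):

```python
def trace_sound_sources(chat_log):
    """Group voice segments by sender using the social application's own
    record of who sent each voice message.

    `chat_log` is a chronological list of (sender, duration_seconds)
    voice messages, as in the fig. 4 chat between users M and N.
    Each sender's segments form one track.
    """
    tracks = {}
    for sender, duration in chat_log:
        tracks.setdefault(sender, []).append(duration)
    return tracks
```

Because the application knows the sender of every message, this grouping is exact, whereas voiceprint matching is probabilistic.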
S302, analyzing the audio to obtain audio texts generated by each audio track.
It is to be understood that a piece of audio includes one or more sound waves, and one or more consecutive sound waves may constitute a sound segment with semantics, which can be parsed into audio text. Referring to fig. 6, fig. 6 shows the audio texts obtained by parsing the audio Q of fig. 2. The first sound wave A1 of the first track A can be parsed into a first audio text, "I recently found a restaurant with really good food. Shall we try it together?"; the second sound wave A2 into a second audio text, "Great, let's go this evening!"; and the third sound wave B3 of the second track B into a third audio text, "Sure, I've been wanting to eat something nice lately!".
It should be understood that each audio text obtained by parsing the audio is generated from one of the audio tracks. On their own, these audio texts are unordered, semantically incoherent fragments and cannot serve as the information extracted from the audio. Continuing with fig. 6, parsing the audio Q yields the first, second, and third audio texts, which have no order and no coherent meaning. The next step is therefore:
s303, storing each audio text in the audio based on each audio track as the extracted audio information.
As described above, a track carries not only the information in all of its sound waves but also the time at which each sound wave appears on the track. This step may therefore be performed as follows:
(1) Storing each audio text in the audio according to its time information to obtain a semantically coherent text; that text is the audio information extracted from the audio.
(2) Or storing only the audio text generated by one particular track according to its time information to obtain a semantically coherent text. Note that in this case the extracted information is the information of that one track, not of the whole audio (which has multiple tracks).
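The two storage cases can be sketched together: filtering to one track gives case (2), and applying no filter gives case (1). The `(track, start_time, text)` triple format is an assumption made for the example:

```python
def extract_audio_information(audio_texts, track=None):
    """Store audio texts in chronological order as the extracted information.

    `audio_texts` is a list of (track, start_time, text) triples produced
    by parsing; pass `track` to extract only that track's text (case 2),
    or leave it None to extract the whole conversation (case 1).
    """
    selected = [t for t in audio_texts if track is None or t[0] == track]
    selected.sort(key=lambda t: t[1])  # restore the time order of the waves
    return [f"[{trk}] {text}" for trk, _, text in selected]
```

Sorting by the waves' start times is what turns the unordered fragments into a semantically coherent text.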
In other embodiments, the resulting semantic text may be translated into a single language; that unified-language text is then the audio information extracted from the audio.
In still other embodiments, translation may be performed right after the audio texts are obtained; the texts in the unified language are then stored in time order, and the resulting semantic text serves as the information extracted from the audio.
After the audio information is extracted, it can be put to use, for example by sending it to a preset mailbox account and/or a social application account. In some examples only the audio text of a particular track is sent. In other examples the extracted information is used to generate a conference record.
The embodiment of the invention thus provides an audio information extraction method. Whereas in the prior art audio information could only be obtained by manually playing and listening to the audio, the technical scheme of the invention obtains the audio text generated by each track directly through computer analysis, and obtains the extracted audio information by storing each audio text on the basis of its track, without any manual listening.
Further embodiments based on the above audio information extraction method are described below.
The present invention also provides an embodiment, in which the audio information extraction method includes steps S1001 to S1013:
s1001, determining an audio to be extracted;
in the present embodiment, the audio to be extracted here is a complete uninterrupted piece of audio. In other examples, the audio to be extracted here is audio composed of a plurality of pieces of audio.
And S1002, analyzing the audio to obtain the voiceprint characteristics of the audio.
In the embodiment of the invention, the audio comprises at least two sound waves and has at least two voiceprint features. It should be noted that the embodiment can also be used when the audio has only one sound wave and one voiceprint feature.
It should be understood that human vocal organs (vocal cords, soft palate, tongue, teeth, lips, etc.) differ in size, shape, and function, and that resonators (pharyngeal cavity, oral cavity, nasal cavity) differ between people. These small differences change the airflow of speech, producing differences in sound quality and timbre. In addition, people's speaking habits differ in speed and force, producing differences in sound intensity and duration.
With a spectrograph, sound (audio) can be converted into an electrical signal, and the variation of that signal can be drawn as a spectrogram, forming a voiceprint diagram that carries voiceprint features. The audio can therefore be analyzed to obtain its voiceprint features.
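As a rough illustration of how such a voiceprint diagram is derived (assuming NumPy is available; the frame and hop sizes are arbitrary choices), a magnitude spectrogram can be computed with a short-time Fourier transform:

```python
import numpy as np

def spectrogram(signal, frame_size=256, hop=128):
    """Compute a magnitude spectrogram, the basis of a voiceprint diagram,
    from a mono signal via a short-time Fourier transform."""
    window = np.hanning(frame_size)
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = signal[start:start + frame_size] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)  # shape: (num_frames, frame_size // 2 + 1)

# A pure 440 Hz tone sampled at 8 kHz concentrates its energy
# in one frequency bin of every frame.
sr = 8000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
```

Real voiceprint systems go further, deriving speaker embeddings from such time-frequency representations, but the spectrogram is the first step.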
As noted above, different people have different voiceprint features. In some examples a voiceprint feature can be characterized at three levels of a piece of audio: the syllable, the sentence, and the paragraph:
the syllable dimension covers physiological characteristics such as vocal cord shape and vocal tract length and size;
the sentence dimension covers speech characteristics such as tone, intensity, rhythm, and pauses;
the paragraph dimension covers behavioral characteristics such as word usage, accent, and pronunciation.
In some examples of embodiments of the present invention, the voiceprint features include primarily a sentence dimension and a paragraph dimension.
S1003, determining that sound waves with the same voiceprint features belong to the same audio track, thereby obtaining each audio track in the audio.
The audio comprises a plurality of sound waves; the sound waves can be divided according to the rule that consecutive waves with the same voiceprint features form one sound segment, thereby obtaining each track in the audio.
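The segmentation rule, consecutive sound waves with the same voiceprint feature forming one sound segment, can be sketched as a run-length grouping. The `(voiceprint_id, wave)` pair format is an assumption made for the example:

```python
def segment_by_voiceprint(waves):
    """Divide a chronological sequence of sound waves into sound segments,
    merging consecutive waves that share the same voiceprint feature.

    `waves` is a list of (voiceprint_id, wave_label) pairs; each run of
    identical voiceprint ids becomes one segment belonging to one track.
    """
    segments = []
    for vp, wave in waves:
        if segments and segments[-1][0] == vp:
            segments[-1][1].append(wave)  # extend the current run
        else:
            segments.append((vp, [wave]))  # a new speaker starts a new segment
    return segments
```

Note that the same voiceprint id can open several segments (a speaker who talks, pauses while another talks, then talks again), which matches the alternating sound waves of the fig. 2 tracks.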
And S1004, analyzing the audio to obtain audio texts generated by each audio track.
It is to be understood that a piece of audio includes one or more sound waves, and one or more consecutive sound waves may constitute a sound segment with semantics, which can be parsed into audio text.
It should be understood that each audio text obtained by parsing the audio is generated from one of the audio tracks. On their own, these audio texts are unordered, semantically incoherent fragments and cannot serve as the information extracted from the audio.
When the user believes a parsed audio text is wrong, the text can be modified. In the embodiment of the invention there are at least two modification cases: (1) modifying the audio texts generated by all tracks in the audio; (2) a user whose voice carries a given voiceprint feature modifies only the audio text of the track determined by that voiceprint feature.
The embodiment of the present invention mainly introduces a modification condition (1), and after step S1004, the audio information extraction method provided by the embodiment of the present invention further includes:
s1005, displaying the track selection frame.
The track selection box includes the audio tracks of the audio and the audio text generated by each track. The user can review the text generated by each track in the interface containing the selection box.
Referring to fig. 7, an interface with a track selection box is shown, which includes first through fourth tracks of audio, and audio text generated by the four tracks.
After review, if the user finds a problem in the audio text generated by some track, that text can be modified. For example, on finding that the audio text of the second track in fig. 7 is incorrect, the user can press the selection button z corresponding to the second track.
S1006, detecting the selection operation of the audio track in the audio track selection frame, and modifying the audio text generated by the selected audio track.
The terminal detects the user's selection of an audio track in the track selection box and modifies the audio text generated by the selected track.
The modifications include: discarding all or part of the audio text generated by the selected track, and/or adding or deleting characters within an audio text.
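The modification operations just listed can be sketched as follows; the function signature and the index-based addressing of texts are illustrative assumptions, not part of the patent disclosure:

```python
def modify_track_text(texts, discard_indices=(), replacements=None):
    """Apply the modifications described above to one track's audio texts:
    discard whole entries, and/or replace characters within an entry.

    `replacements` maps a text index to a list of (old, new) substring
    substitutions to perform on that text.
    """
    replacements = replacements or {}
    result = []
    for i, text in enumerate(texts):
        if i in discard_indices:
            continue  # discard this entry entirely
        for old, new in replacements.get(i, []):
            text = text.replace(old, new)
        result.append(text)
    return result
```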
The audio information extraction method provided by the embodiment of the invention can also unify the language of the audio texts, comprising the following steps:
s1007, determining the storage language of the audio and the audio text different from the storage language in the audio.
The storage language here is the language in which the audio texts are stored; it may be any national language or a local dialect. For example, when one section of the audio is in Japanese, the parsed audio text is also in Japanese; if the user wants to unify the languages of the audio, the audio texts that differ from the storage language must be identified.
There are two ways to determine the storage language of the audio:
(1) Determined by the terminal language set by the user, i.e., the language used by the terminal is taken as the storage language of the audio.
(2) Determined by the user's selection. Referring to fig. 8, fig. 8 shows an interface with a language selection box in which the user can choose a language as the storage language of the audio.
S1008, translating the audio texts in the audio that differ from the storage language into the storage language, so that all audio texts of the audio are in one language.
Continuing with fig. 8, after the user selects Mandarin as the storage language, the audio texts that are not Mandarin must be translated into Mandarin; that is, texts in other languages and in regional dialects are translated so that all the resulting audio texts are in Mandarin.
S1009, determining the chronological order of the audio texts in the audio, and storing the audio text recognized from each track in that chronological order.
A track in the audio comprises one or more sound waves, and one or more consecutive waves form a sound segment with semantics; each such segment can be parsed into one audio text, so parsing the audio yields multiple audio texts. Unordered, these texts cannot serve as the information extracted from the audio.
As described above, a track carries not only the information (audio text) in all of its sound waves but also the time at which each wave appears on the track, so the audio texts can be stored in time order to obtain a semantically coherent text; that text is the audio information extracted from the audio.
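Chronological storage with an optional translation step can be sketched as below. The dictionary fields and the `translate` callback are assumptions, since the patent does not specify a translation service:

```python
def store_audio_texts(texts_with_time, storage_language="Mandarin", translate=None):
    """Store each audio text in chronological order; texts whose language
    differs from the storage language are first run through `translate`,
    a hypothetical hook into which any translation service could be plugged.
    """
    ordered = sorted(texts_with_time, key=lambda t: t["start"])
    lines = []
    for item in ordered:
        text = item["text"]
        if translate and item.get("language") != storage_language:
            text = translate(text, storage_language)
        lines.append(f'{item["track"]}: {text}')
    return "\n".join(lines)
```

The sort key is the start time that each sound wave carries on its track, so the stored file reads as one coherent conversation in a single language.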
In the embodiment of the invention, the file obtained by storing the audio texts directly in time sequence is the extracted audio information. In other examples, after the audio information extracted from the audio is obtained, the following steps are further performed:
S1010, obtaining characters to be extracted.
When social communication is conducted through a social application, there are situations in which a voice or video call is inconvenient because no audio access is available. The user may then communicate by sending text, and those characters are the characters to be extracted from the communication.
Referring to fig. 9, user M and user N hold an online voice exchange through the social application; three voice clips (25 seconds, 32 seconds and 2 seconds long, respectively) and two text messages (characters) are generated in the exchange. To extract the audio and characters generated in this communication, it is necessary to:
S1011, determining the time sequence of the characters and each audio text.
S1012, adding the characters between the audio texts according to the time sequence; the result serves as the extracted audio information.
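Steps S1011 and S1012 can be sketched as a time-ordered merge of the two streams. This assumes both lists carry a `start` timestamp (an illustrative field name) and are each already sorted, which `heapq.merge` requires:

```python
import heapq

def merge_audio_and_characters(audio_texts, characters):
    # Both inputs are lists of {"start": ..., "text": ...} records,
    # each already sorted by start time; heapq.merge interleaves them
    # without re-sorting.
    merged = heapq.merge(audio_texts, characters,
                         key=lambda r: r["start"])
    return [r["text"] for r in merged]
```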
In other examples, the audio information may also be extracted:
S1013, sending the audio text of a preselected audio track in the audio to a preset mailbox account or/and a social application account.
In this example, each user participating in a voice call may modify only the words (audio text) that he or she personally spoke; in that case, only the audio text of each audio track is sent to the user to whom that track corresponds.
The audio text may be sent by mail to the preset mailbox account of the corresponding user, or to the social application account of the corresponding user.
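Step S1013 reduces to grouping the audio texts by track and delivering each group only to the owner of that track. A sketch under assumed names; `deliver` stands in for the mail or in-app sending mechanism, which the patent leaves unspecified:

```python
from collections import defaultdict

def send_per_track(audio_texts, track_to_account, deliver):
    # audio_texts: list of {"track": ..., "text": ...} records.
    # Each participant receives only the text of his or her own track,
    # so each can modify only what he or she personally said.
    by_track = defaultdict(list)
    for t in audio_texts:
        by_track[t["track"]].append(t["text"])
    for track, texts in by_track.items():
        deliver(track_to_account[track], "\n".join(texts))
```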
In other examples, the step of extracting the audio information further includes the following two ways.
(1) Sending the audio information to a preset mailbox account or/and a social application account.
The difference from the previous example is that here all the audio text in the audio (discarded audio text may be removed) is sent to some or all of the users.
(2) Selecting at least one conference element from the audio information according to a preset conference record template, and filling the conference element into the template in a preset format to generate a conference record.
In this example, the conference elements may include one or more of the meeting subject, time, location, participants, moderator and meeting summary.
The preset conference record template is a template prepared in advance, and may be in a format such as Word, Excel or PDF. Referring to fig. 10, which shows a preset conference record template provided by the embodiment of the present invention and containing the above-mentioned conference elements, it should be understood that the extracted audio information is recorded in the conference summary part of the template.
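Filling the preset conference record template reduces to substituting the meeting elements into fixed slots. A sketch; the template layout and element names only loosely follow fig. 10, and the upstream extraction of the elements themselves is assumed to have happened already:

```python
TEMPLATE = (
    "Subject: {subject}\n"
    "Time: {time}\n"
    "Location: {location}\n"
    "Participants: {participants}\n"
    "Moderator: {moderator}\n"
    "Summary:\n{summary}"
)

def make_meeting_record(elements, summary_text):
    # Missing elements are left blank rather than raising an error.
    fields = {k: elements.get(k, "") for k in
              ("subject", "time", "location", "participants", "moderator")}
    return TEMPLATE.format(summary=summary_text, **fields)
```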
The audio information extraction method provided by the embodiment of the invention can determine the audio tracks based on voiceprint characteristics, parse the audio to obtain the audio text generated by each track, and store each audio text in the audio to obtain the audio information extracted from the audio. In other words, the method obtains the audio information without the user having to play the audio and listen to it manually, which makes the audio information convenient to consult and gives the user a better experience.
The audio information extraction method provided by the invention also has other examples, comprising the following steps S1201 to S1214:
S1201, determining the audio to be extracted.
As in the previous embodiment, the audio to be extracted here is one complete, uninterrupted piece of audio. In other examples, it is audio composed of a plurality of pieces. A piece of audio may include sound waves from one sound source or from multiple sound sources.
In the present embodiment, the audio to be extracted has at least two sound sources, but the audio information extraction method provided by the present embodiment can also be used in the case where the audio has only one sound source.
It should be noted that many steps of this embodiment are the same as those described above; to avoid repetition, such steps are described only briefly, and the earlier description may be consulted where anything is unclear.
S1202, determining the social application generating the audio.
This embodiment traces the sound source of each sound wave in the audio through the social application. Specifically, when a social application is used for a voice conference, video conference, voice chat or team gaming session, the social application used by each user records the sound waves generated by that user, so the relationship between a user and the sound waves that user generates is uniquely determined.
The social application may store the audio left over from social communication conducted through it, and from that audio the social application that generated it can be determined.
S1203, searching each contact person participating in audio generation in the social application, and determining each contact person as a sound source of the audio.
When the social application is used for a voice conference, video conference, voice chat or team gaming session, the participants are known and recorded in the social application. Once the social application is determined, the contacts that participated in generating the audio are found from the application's history, and these contacts can be determined as the sound sources of the audio to be extracted.
S1204, sound waves with the same sound source in the audio are judged to belong to the same audio track, and each audio track in the audio is obtained.
It should be understood that a piece of audio is obtained by combining sound waves emitted from a plurality of sound sources; strictly speaking, the audio itself has no single sound source.
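Step S1204 is then a grouping of sound waves by their traced sound source. A minimal sketch; the wave records and the `source` field name are illustrative assumptions:

```python
from collections import defaultdict

def group_waves_into_tracks(waves):
    # waves: list of {"source": contact, ...} records; all waves traced
    # to the same contact are judged to belong to the same track.
    tracks = defaultdict(list)
    for w in waves:
        tracks[w["source"]].append(w)
    return dict(tracks)
```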
S1205, parsing the audio to obtain the audio text generated by each audio track.
A piece of audio comprises one or more sound waves; one or more consecutive sound waves can form a sound segment with semantics, and such a segment can be parsed into an audio text.
S1206, displaying an audio track selection box.
The audio track selection box includes the audio tracks of the audio and the audio text generated by each track. The user can review the audio text generated by each track in the interface bearing the selection box.
S1207, detecting the selection operation of the track in the track selection box, and modifying the audio text generated by the selected track.
The modification here includes discarding all or part of the audio text generated by the selected track, or/and adding or deleting characters within an audio text.
S1208, determining the storage language of the audio and the audio text different from the storage language in the audio.
The storage language here is the language in which each audio text is stored, and may be the language of any country or a regional dialect. There are two ways to determine the storage language of the audio:
(1) Determined according to the terminal language set by the user; that is, the language the terminal is set to use is taken as the storage language of the audio.
(2) The selected language is determined as the stored language of the audio according to the user's selection.
S1209, translating the audio text in the audio that differs from the storage language into the storage language, to obtain the audio texts unified into a single language.
S1210, determining the time sequence of each audio text in the audio, and storing each audio text identified by each audio track according to the time sequence.
As noted in the foregoing description of the audio track, the information contained in one audio track includes not only the information (audio text) carried by all the sound waves in that track, but also the time at which each sound wave occurs on the track. Each audio text in the audio can therefore be stored in order according to this time information, yielding a text with coherent semantics; this text is the audio information extracted from the audio.
In the embodiment of the present invention, storing the audio texts in time sequence yields the extracted audio information. In some other examples of this embodiment, the audio information extraction method provided by the invention further includes the following steps:
S1211, obtaining the characters to be extracted.
When social communication is conducted through a social application, there are situations in which a voice or video call is inconvenient because no audio access is available. The user may then communicate by sending text, and those characters are the characters to be extracted from the communication.
S1212, determining the time sequence of the characters and each audio text.
S1213, adding characters between the audio texts in time sequence to obtain the extracted audio information.
S1214, sending the audio information to a preset mailbox account or/and a social application account.
In other examples, only audio text of some tracks may be sent to a preset mailbox account or/and a social application account. In some other examples, the extracted audio information may also be used to generate a conference recording.
The embodiment of the invention provides an audio information extraction method addressing the defect in the prior art that audio information can be obtained only by playing the audio and listening to it. The social application is determined from the audio, the sound sources of the audio are then determined within the social application, the tracks in the audio are determined from the sound sources, the audio text generated by each track is obtained by parsing, and the audio information extracted from the audio is obtained by storing each audio text.
It is emphasized that, unlike the previous example in which the tracks are determined from voiceprint characteristics, the present example determines the sound sources of the audio from the social application and thereby determines the tracks. This consumes fewer computing resources and takes less time to determine the tracks, giving the user a better experience.
The embodiment of the present invention further provides a terminal, please refer to fig. 11, which includes a processor 1101, a memory 1102 and a communication bus 1103;
the communication bus 1103 is used for implementing connection communication between the processor 1101 and the memory 1102;
the processor 1101 is configured to execute one or more programs stored in the memory 1102 to implement the steps of the audio information extraction method mentioned in any one of the above-described embodiments.
Embodiments of the present invention also provide a computer-readable storage medium, which stores one or more programs that can be executed by one or more processors to implement the steps of the audio information extraction method mentioned in any one of the above-described embodiments.
It should be noted that, for simplicity of description, the above method embodiments are presented as a series of action combinations; those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, since some steps may be performed in other orders or simultaneously. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not necessarily all required by the invention.
In the above embodiments, the description of each embodiment has its own emphasis; for parts of an embodiment not described in detail, reference may be made to the related descriptions of other embodiments. The serial numbers of the above embodiments are merely for description and do not represent the relative merit of the embodiments. Those skilled in the art can devise many other forms without departing from the spirit and scope of the present invention as claimed, and these forms all fall within the protection of the present invention.
Claims (10)
1. An audio information extraction method, characterized in that the audio information extraction method comprises:
determining audio to be extracted, determining tracks of the audio based on the audio, wherein the audio has at least two tracks;
analyzing the audio to obtain audio texts generated by each audio track;
and storing each audio text in the audio on the basis of each audio track to serve as the extracted audio information.
2. The audio information extraction method of claim 1, wherein the determining the track of the audio based on the audio comprises:
analyzing the audio to obtain voiceprint characteristics of the audio, wherein the audio comprises at least two sound waves and at least has two voiceprint characteristics;
and judging that sound waves with the same voiceprint characteristics in the audio belong to the same audio track, to obtain each audio track in the audio.
3. The audio information extraction method of claim 1, wherein the determining the track of the audio based on the audio comprises:
determining a sound source of the audio according to the audio, wherein the audio comprises at least two sound waves and the audio has at least two sound sources;
and judging that sound waves with the same sound source in the audio belong to the same audio track, to obtain each audio track in the audio.
4. The audio information extraction method of claim 3, wherein the determining a sound source of the audio from the audio comprises:
determining a social application that generated the audio;
and searching each contact person participating in the audio generation in the social application, and determining each contact person as a sound source of the audio.
5. The audio information extraction method of any one of claims 1 to 4, wherein after storing each audio text in the audio based on each of the tracks, the audio information extraction method further comprises:
acquiring characters to be extracted;
determining a time sequence of the characters and each of the audio texts;
adding said characters between each of said audio texts in said time sequence.
6. The audio information extraction method of any one of claims 1-4, wherein after parsing the audio to obtain audio text generated by each of the audio tracks, the audio information extraction method further comprises:
displaying an audio track selection box, wherein the audio track selection box comprises an audio track of the audio and audio text generated by the audio track;
and detecting the selection operation of the audio track in the audio track selection frame, and modifying the audio text generated by the selected audio track.
7. The audio information extraction method of any one of claims 1 to 4, wherein the storing of each audio text in the audio based on each of the tracks comprises:
determining a storage language of the audio and audio text in the audio different from the storage language;
translating the audio text different from the storage language in the audio into the storage language to obtain each audio text after the audio is unified into the language;
determining a temporal order of audio texts in the audio, and storing each audio text identified by each audio track according to the temporal order.
8. The audio information extraction method according to any one of claims 1 to 4, wherein, after storing each audio text in the audio on a per-track basis as the extracted audio information, the audio information extraction method further comprises:
sending the audio information to a preset mailbox account or/and a social application account;
or sending an audio text of a preselected audio track in the audio to a preset mailbox account or/and a social application account;
or selecting at least one conference element from the audio information according to a preset conference recording template, and filling the conference element into the preset conference recording template according to a preset format to generate a conference record.
9. A terminal, characterized in that the terminal comprises a processor, a memory and a communication bus;
the communication bus is used for realizing connection communication between the processor and the memory;
the processor is configured to execute one or more programs stored in the memory to implement the steps of the audio information extraction method of any of claims 1-8.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores one or more programs which are executable by one or more processors to implement the steps of the audio information extraction method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010094370.1A CN111415651A (en) | 2020-02-15 | 2020-02-15 | Audio information extraction method, terminal and computer readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010094370.1A CN111415651A (en) | 2020-02-15 | 2020-02-15 | Audio information extraction method, terminal and computer readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111415651A true CN111415651A (en) | 2020-07-14 |
Family
ID=71492792
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010094370.1A Pending CN111415651A (en) | 2020-02-15 | 2020-02-15 | Audio information extraction method, terminal and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111415651A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113488006A (en) * | 2021-07-05 | 2021-10-08 | 功夫(广东)音乐文化传播有限公司 | Audio processing method and system |
CN113823250A (en) * | 2021-11-25 | 2021-12-21 | 广州酷狗计算机科技有限公司 | Audio playing method, device, terminal and storage medium |
CN113823250B (en) * | 2021-11-25 | 2022-02-22 | 广州酷狗计算机科技有限公司 | Audio playing method, device, terminal and storage medium |
WO2023236794A1 (en) * | 2022-06-06 | 2023-12-14 | 华为技术有限公司 | Audio track marking method and electronic device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10789290B2 (en) | Audio data processing method and apparatus, and computer storage medium | |
US8027837B2 (en) | Using non-speech sounds during text-to-speech synthesis | |
KR101274961B1 (en) | music contents production system using client device. | |
US11942093B2 (en) | System and method for simultaneous multilingual dubbing of video-audio programs | |
CN111415651A (en) | Audio information extraction method, terminal and computer readable storage medium | |
CN108242238B (en) | Audio file generation method and device and terminal equipment | |
TW201214413A (en) | Modification of speech quality in conversations over voice channels | |
CN111739556A (en) | System and method for voice analysis | |
Kostov et al. | Emotion in user interface, voice interaction system | |
CN111213200A (en) | System and method for automatically generating music output | |
Cooper | Text-to-speech synthesis using found data for low-resource languages | |
CN110741430A (en) | Singing synthesis method and singing synthesis system | |
CN112185341A (en) | Dubbing method, apparatus, device and storage medium based on speech synthesis | |
Bharadwaj et al. | Analysis of Prosodic features for the degree of emotions of an Assamese Emotional Speech | |
JP2011186143A (en) | Speech synthesizer, speech synthesis method for learning user's behavior, and program | |
Mitsui et al. | Towards human-like spoken dialogue generation between AI agents from written dialogue | |
JP2005215888A (en) | Display device for text sentence | |
Aylett et al. | Combining statistical parameteric speech synthesis and unit-selection for automatic voice cloning | |
Wu et al. | Modeling the expressivity of input text semantics for Chinese text-to-speech synthesis in a spoken dialog system | |
CN115472185A (en) | Voice generation method, device, equipment and storage medium | |
JP4409279B2 (en) | Speech synthesis apparatus and speech synthesis program | |
JP2003099089A (en) | Speech recognition/synthesis device and method | |
Abdo et al. | Building Audio-Visual Phonetically Annotated Arabic Corpus for Expressive Text to Speech. | |
Ferris | Techniques and challenges in speech synthesis | |
WO2022041177A1 (en) | Communication message processing method, device, and instant messaging client |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |