CN114595356A - Text and audio presentation processing method and system - Google Patents


Info

Publication number
CN114595356A
CN114595356A (application CN202210089590.4A)
Authority
CN
China
Prior art keywords: sound, track, audio, editing, text
Prior art date
Legal status
Pending
Application number
CN202210089590.4A
Other languages
Chinese (zh)
Inventor
范梓野
朱风云
Current Assignee
Dalian Real Time Intelligent Technology Co ltd
Original Assignee
Dalian Real Time Intelligent Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Dalian Real Time Intelligent Technology Co ltd
Priority to CN202210089590.4A
Publication of CN114595356A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60: Information retrieval of audio data
    • G06F 16/63: Querying
    • G06F 16/638: Presentation of query results
    • G06F 16/68: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/683: Retrieval using metadata automatically derived from the content
    • G06F 16/685: Retrieval using an automatically derived transcript of audio data, e.g. lyrics
    • G06F 16/686: Retrieval using information manually generated, e.g. tags, keywords, comments, title or artist information, time, location or usage information, user ratings
    • G06F 16/687: Retrieval using geographical or spatial information, e.g. location

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Management Or Editing Of Information On Record Carriers (AREA)

Abstract

The invention discloses a text and audio presentation processing method comprising the following steps: a script editing device generates a script; a post-production device acquires the script from the script editing device and creates a post-production project; the post-production device performs post-production processing on sound clips through the post-production project to generate and output a post-production result; and an audio presentation device presents the post-production result. The invention also discloses a text and audio presentation processing system. With the invention, a book's text and audio can be presented in an integrated listen-and-read manner: the audio data is structured by the script, and the relationship between audio and text is established. An editing user can conveniently perform audiobook post-production and efficiently produce the finished work from the materials. The materials and the script are associated with each other, a visual multi-track non-linear editing interface is provided, and post-production can be completed quickly and accurately even when materials are missing; when lines in the script are updated, the burden of manually adjusting material timing is avoided.

Description

Text and audio presentation processing method and system
Technical Field
The invention relates to the technical field of recording and sound mixing, in particular to a text and audio presentation processing method and system.
Background
With the spread of computer and Internet technologies, books are no longer distributed only as traditional paper reading material. A large number of books, especially novels, now have both electronic text versions and audio versions, the audio versions being audiobooks. In the prior art, however, reading software can present only the text version, and audiobook software can present only the audio version.
As audiobooks become popular, users need a new mode that can switch seamlessly between reading and listening. In a typical daily scenario, a user reads the text version of a book at night before sleep, continues listening to the audio version the next morning during the commute from where the reading stopped the night before, and continues reading the text version in the afternoon from where the listening stopped in the morning.
However, the inventors have found through research that traditional audiobook production suffers from a strictly serial workflow: script editing, then narrator recording, then final editing. The narrator's recording is the core link of this process, and if it cannot be completed the whole production is blocked. Moreover, during post-production editing, the object being edited consists only of audio, so the editor cannot intuitively understand it from the text of the script. To solve this problem, optimize the production workflow, and improve production efficiency, a production scheme is needed that fully exploits the association between audio and text, so that text and audio can be presented in an integrated listen-and-read manner.
Disclosure of Invention
To solve the above technical problems in the prior art, a text and audio presentation processing method is provided, which comprises:
the script editing equipment generates a script; the script comprises one or more paragraphs;
the post-production equipment acquires the script from the script editing equipment connected with the post-production equipment and creates a post-production project according to the script; the post-production project comprises sound clips, and the sound clips correspond to paragraphs in the script;
the post-production equipment performs post-production processing on the sound clips through the post-production project to generate a post-production result and outputs the post-production result to audio presentation equipment connected with the post-production equipment;
the audio presentation equipment presents the post-production result.
In one embodiment, the script comprises a recording material, an audio material, a sound effect processing mode and a paragraph presentation sequence corresponding to the paragraphs;
the post-production equipment comprises a sound effect processor, a sound mixing processor, a material manager and an editing display;
the material manager comprises a material library, wherein the material library locally stores materials; the material manager organizes and manages the materials in a hierarchical structure in a material library, wherein the material types in the material library comprise audio materials and effect materials;
The sound effect processor acquires audio materials through the material manager connected with it, applies sound effect processing operations to the audio materials according to the sound effect processing mode set by the script, and outputs the processed audio materials to the sound mixing processor connected with the sound effect processor;
the audio mixing processor comprises a main track and an auxiliary track, wherein the main track and the auxiliary track are used for bearing audio clips; the sound mixing processor performs sound mixing processing operation on the sound clips in the main track and the auxiliary track according to the script to obtain a sound mixing processing result;
wherein the script paragraphs comprise text paragraphs and audio paragraphs; the sound segments comprise sound segments in a main track corresponding to the text paragraphs and sound segments in an auxiliary track corresponding to the audio paragraphs;
the position of the sound fragment in the main track is determined by the paragraph presentation sequence set by the script, and the position of the sound fragment in the auxiliary track is set by an editing user;
the editing display displays an editing view in the process of mixing processing performed by the mixing processor, wherein the editing view comprises a text editing view and a multi-track editing view; the cursor positions in the text editing view and the multi-track editing view are bound with each other;
In the editing display, the text editing view and the multi-track editing view are presented simultaneously, or an editing user selects one of the views for presentation and can switch between them.
In one embodiment, the segment content of the sound segment in the main track includes an association relationship between the sound segment in the main track and a text segment in the script, the text content of the associated text segment, an association relationship between the sound segment in the main track and a recording material, the associated recording material, and editing information of the sound segment in the main track;
when the sound clip in the main track is associated with recording material, the clip content of the sound clip in the main track also comprises text alignment information between the text content and the recording material; for a script whose recording material has been fully recorded, the sound clips associated with all text paragraphs are arranged on the main track according to the paragraph presentation sequence set by the script and are separated from each other by leading silence intervals;
the multi-track editing view presents a primary track and a secondary track, and a time axis of the multi-track editing view is correlated with the text editing view through text alignment information of sound clips in the primary track;
the editing information of the sound clip in the main track comprises the leading silence duration, the material playback start position, the duration of the sound clip in the main track, and effect setting information;
the clip content of the sound clip in the main track comprises one or more anchor points; an anchor point is set at a position in the text alignment information, or at a specific position of the sound clip in the main track (such as its beginning or end), or at a position relative to the recording material set by the editing user; an anchor point carries semantic information, which comprises a word or phrase from the text, an audio event description, a start time, and an end time.
In one embodiment, the segment content of the sound segment in the secondary track includes an association relationship between the sound segment in the secondary track and the audio material, the associated audio material, and editing information of the sound segment in the secondary track;
the editing information of the sound clip in the auxiliary track comprises a material playing start position, the duration of the sound clip in the auxiliary track, the start position of the sound clip in the auxiliary track, circulating playing setting information and track description information;
the positioning mode of the starting position of the sound clip in the auxiliary track comprises an absolute time positioning mode and a relative binding positioning mode; in the absolute time positioning mode, the starting position of the sound clip in the auxiliary track is located at a specific time defined by the global time axis; in the relative binding positioning mode, the starting position of the sound clip in the auxiliary track is at an offset moment relative to the anchor point, and the anchor point is provided by the sound clip in the main track;
When the loop playing is effective, the loop playing setting information comprises a loop ending position; setting the cycle end position comprises setting by the absolute time length after the cycle or setting by an anchor point provided by a sound segment in the main track;
in the process of sound mixing processing, the sound mixing processor acquires the audio material subjected to sound effect processing through the sound effect processor and adds the acquired audio material to an editing view of the editing display, and the added audio material is placed at a position set by an editing user.
In one embodiment, the post-production project includes project default configuration information; the project default configuration information comprises the default interval between sound clips in the main track, the target gain settings for sound mixing, and the narrator's estimated speech rate; the default interval between sound clips in the main track is the default leading silence duration, and the target gain settings for sound mixing comprise target volume values for voice, music, and sound effects;
during sound mixing, when a text paragraph has not yet been recorded and the corresponding main-track sound clip lacks associated recording material, the duration of that clip on the main track is determined from the narrator's estimated speech rate in the project default configuration and the paragraph's word count, or the editing user selects a speech synthesis result to serve temporarily as the recording material.
In addition, in order to solve the technical problems in the prior art, the text and audio presentation processing system is particularly provided, and comprises script editing equipment, post-production equipment and audio presentation equipment which are sequentially connected with one another;
the script editing equipment generates a script; the script comprises one or more paragraphs;
the post-production equipment acquires a script and creates a post-production project according to the script; the post production project comprises sound segments, wherein the sound segments correspond to paragraphs in the script;
the post-production equipment carries out post-production processing on the sound fragments through a post-production project to generate a post-production result and outputs the post-production result to the audio presentation equipment;
the audio presentation device presents the post-production results.
In one embodiment, the script comprises a recording material, an audio material, a sound effect processing mode and a paragraph presentation sequence corresponding to the paragraphs;
the post-production equipment comprises a sound effect processor, a sound mixing processor, a material manager and an editing display; the sound effect processor is connected with the material manager; the sound effect processor is connected with the sound mixing processor; the mixing processor is connected with the editing display;
The material manager comprises a material library, and the material library locally stores materials; the material manager organizes and manages the materials in a hierarchical structure in a material library, wherein the material types in the material library comprise audio materials and effect materials;
the sound effect processor acquires audio materials through the material manager, applies sound effect processing operation to the audio materials according to a sound effect processing mode set by the script, and outputs the audio materials subjected to sound effect processing to the sound mixing processor;
the sound mixing processor comprises a main track and an auxiliary track, wherein the main track and the auxiliary track are used for bearing sound fragments; the sound mixing processor performs sound mixing processing operation on the sound clips in the main track and the auxiliary track according to the script to obtain a sound mixing processing result;
wherein the paragraphs of the script comprise a text paragraph and an audio paragraph; the sound segments comprise sound segments in a main track corresponding to the text paragraphs and sound segments in an auxiliary track corresponding to the audio paragraphs;
the position of the sound fragment in the main track is determined by the paragraph presentation sequence set by the script, and the position of the sound fragment in the auxiliary track is set by an editing user;
The editing display displays an editing view in the process of mixing sound by the sound mixing processor, wherein the editing view comprises a text editing view and a multi-track editing view; the positions of the cursors in the text editing view and the multi-track editing view are bound with each other;
in the editing display, the text editing view and the multi-track editing view are presented simultaneously, or one of the views is selected by an editing user for presentation and can be switched with each other.
In one embodiment, the segment content of the sound segment in the main track includes an association relationship between the sound segment in the main track and a text segment in the script, text content of the associated text segment, an association relationship between the sound segment in the main track and a recording material, the associated recording material, and editing information of the sound segment in the main track;
when the sound clip in the main track is associated with the recording material, the clip content of the sound clip in the main track also comprises text alignment information between the text content and the recording material; for the script which finishes recording the recording material, sound segments associated with all text paragraphs are arranged on the main track according to the paragraph presentation sequence set by the script and are separated from each other by taking leading silence as an interval;
The multi-track editing view presents a primary track and a secondary track, and a time axis of the multi-track editing view is correlated with the text editing view through text alignment information of sound clips in the primary track;
the editing information of the sound clip in the main track comprises the leading silence duration, the material playback start position, the duration of the sound clip in the main track, and effect setting information;
the segment content of the sound segment in the main track comprises one or more anchors; wherein the anchor point is set at a position in the text alignment information; or the anchor point is arranged at a specific position of the sound segment in the main track, and the specific position comprises the beginning and the end of the sound segment in the main track; or the anchor point is arranged at the relative position based on the recording material set by the editing user; the anchor point comprises semantic information, wherein the semantic information comprises words or phrases in the text, audio event description, start time and end time.
In one embodiment, the segment content of the sound segment in the secondary track includes an association relationship between the sound segment in the secondary track and the audio material, the associated audio material, and editing information of the sound segment in the secondary track;
Editing information of the sound clip in the auxiliary track comprises a material playing starting position, sound clip duration in the auxiliary track, a sound clip starting position in the auxiliary track, circulating playing setting information and track description information;
the positioning mode of the starting position of the sound clip in the auxiliary track comprises an absolute time positioning mode and a relative binding positioning mode; in the absolute time positioning mode, the starting position of the sound clip in the auxiliary track is located at a specific time defined by the global time axis; in the relative binding positioning mode, the starting position of the sound clip in the auxiliary track is at an offset moment relative to the anchor point, and the anchor point is provided by the sound clip in the main track;
when the loop playing is effective, the loop playing setting information comprises a loop ending position; the setting of the loop ending position comprises setting by absolute time length after loop or setting by an anchor point provided by a sound segment in the main track;
in the sound mixing process, the sound mixing processor acquires audio materials subjected to sound effect processing through the sound effect processor and adds the acquired audio materials to an editing view of an editing display, the added audio materials being placed at editing user set positions.
In one embodiment, the post-production project includes project default configuration information; the project default configuration information comprises the default interval between sound clips in the main track, the target gain settings for sound mixing, and the narrator's estimated speech rate; the default interval between sound clips in the main track is the default leading silence duration, and the target gain settings for sound mixing comprise target volume values for voice, music, and sound effects;
during sound mixing, when a text paragraph has not yet been recorded and the corresponding main-track sound clip lacks associated recording material, the duration of that clip on the main track is determined from the narrator's estimated speech rate in the project default configuration and the paragraph's word count, or the editing user selects a speech synthesis result to serve temporarily as the recording material.
The embodiment of the invention has the following beneficial effects:
With the method and system, a book's text and audio can be presented in an integrated listen-and-read manner: the audio data is structured through the script, and the relationship between the audio and the text is established. The editing user can conveniently perform audiobook post-production according to the script and efficiently complete the finished work using the recording material recorded by the narrator together with audio materials such as music and sound effects. The recording material, the audio material, and the script are associated with one another, a visual multi-track audio non-linear editing interface is provided, and fast and accurate post-editing can be completed even when recording material is missing. When the recording material of a line in the script is updated, the user is spared the burden of manually adjusting material timing.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Wherein:
FIG. 1 is a schematic diagram of a text and audio rendering system according to the present invention;
FIG. 2 is a flowchart illustrating a text and audio rendering processing method according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention discloses a text and audio presentation processing system, which comprises script editing equipment, post-production equipment and audio presentation equipment which are sequentially connected with one another;
The script editing equipment generates a script; the script comprises one or more paragraphs;
the script comprises a recording material, an audio material, a sound effect processing mode and a paragraph presenting sequence corresponding to the paragraphs;
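By way of illustration only, a minimal Python sketch of how a script and its paragraphs might be modeled (all class and field names here are hypothetical, not taken from the disclosure):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Paragraph:
    """One script paragraph; kind is 'text' or 'audio' (hypothetical field names)."""
    kind: str                                    # 'text' -> main track, 'audio' -> secondary track
    text: str = ""                               # text content for text paragraphs
    recording_path: Optional[str] = None         # associated recording material, if already recorded
    audio_material_path: Optional[str] = None    # associated audio material (music, sound effects)
    effect_chain: List[str] = field(default_factory=list)  # sound effect processing mode

@dataclass
class Script:
    title: str
    paragraphs: List[Paragraph] = field(default_factory=list)  # stored in presentation order

    def presentation_order(self) -> List[int]:
        """Paragraph presentation sequence: here simply the stored order."""
        return list(range(len(self.paragraphs)))
```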
the post-production equipment acquires a script and creates a post-production project according to the script; the post production project comprises sound segments, wherein the sound segments correspond to paragraphs in the script;
in particular, the post-production project includes project default configuration information; the project default configuration information comprises the default interval between sound clips in the main track, the target gain settings for sound mixing, and the narrator's estimated speech rate; the default interval between sound clips in the main track is the default leading silence duration, and the target gain settings for sound mixing comprise target volume values for voice, music, and sound effects;
particularly, the post-production equipment comprises a sound effect processor, a sound mixing processor, a material manager and an editing display; the sound effect processor is connected with the material manager; the sound effect processor is connected with the sound mixing processor; the mixing processor is connected with the editing display;
The material manager comprises a material library, and the material library locally stores materials; the materials manager organizes and manages materials in a hierarchical structure in a materials library; the material types in the material library comprise audio materials and effect materials;
the audio materials include but are not limited to sound effect materials and music materials; the effect material comprises effect setting parameters, such as cave environment effect, broadcast effect and the like;
the sound effect processor acquires audio materials through the material manager, applies sound effect processing operation to the audio materials according to a sound effect processing mode set by the script, and outputs the audio materials subjected to sound effect processing to the sound mixing processor;
wherein, the sound effects include but are not limited to gain, sound channel mixing, fade-in and fade-out, equalization, environment, noise reduction, compression, time scaling, and three-dimensional effects;
in the sound effect processing mode, a plurality of sound effects are connected in series and the sequence is adjustable; the sound effect comprises one or more adjustable operating parameters, and the adjustable operating parameters of the sound effect are set to change along with the playing time according to a specific curve;
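The serial, reorderable effect chain with time-varying parameters could be sketched as follows (Python with NumPy; the effect functions and their parameters are illustrative assumptions, not the disclosed sound effect processor):

```python
import numpy as np

def gain_effect(samples: np.ndarray, sr: int, gain_db) -> np.ndarray:
    """Apply gain; gain_db may be a constant or a callable curve of playback time in seconds."""
    t = np.arange(len(samples)) / sr
    db = gain_db(t) if callable(gain_db) else gain_db
    return samples * (10.0 ** (np.asarray(db) / 20.0))

def fade_in(samples: np.ndarray, sr: int, seconds: float) -> np.ndarray:
    """Linear fade-in over the first `seconds` of the clip."""
    n = min(len(samples), int(seconds * sr))
    out = samples.copy()
    out[:n] *= np.linspace(0.0, 1.0, n)
    return out

def apply_effect_chain(samples, sr, chain):
    """Effects run in series; reordering `chain` changes the processing order."""
    for effect, kwargs in chain:
        samples = effect(samples, sr, **kwargs)
    return samples

# Example chain: fade in over 2 s, then a gain ramping from -6 dB to 0 dB over 10 s.
chain = [
    (fade_in, {"seconds": 2.0}),
    (gain_effect, {"gain_db": lambda t: -6.0 + 0.6 * np.clip(t, 0, 10)}),
]
# processed = apply_effect_chain(samples, 44100, chain)
```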
the sound mixing processor comprises a main track and an auxiliary track, wherein the main track and the auxiliary track are used for bearing sound fragments; the sound mixing processor performs sound mixing processing operation on the sound clips in the main track and the auxiliary track according to the script to obtain a sound mixing processing result;
Wherein the paragraphs of the script comprise a text paragraph and an audio paragraph; the sound segments comprise sound segments in a main track corresponding to the text paragraphs and sound segments in an auxiliary track corresponding to the audio paragraphs;
the positions of the sound segments in the main track are determined by the presentation sequence of the paragraphs set by the script, and the positions of the sound segments in the auxiliary track are set by an editing user;
the main track is positioned according to the paragraph presentation sequence, the paragraph presentation duration and the leading silence of the sound clip carried by the main track;
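A minimal sketch of this main-track positioning rule, assuming clip durations and leading silences are known in seconds (the helper name is hypothetical):

```python
def layout_main_track(clips):
    """clips: list of dicts with 'leading_silence' and 'duration' in seconds,
    already ordered by the script's paragraph presentation sequence.
    Returns the start time of each clip on the global timeline."""
    starts, cursor = [], 0.0
    for clip in clips:
        cursor += clip["leading_silence"]   # leading silence precedes the clip
        starts.append(cursor)
        cursor += clip["duration"]          # next clip begins after this one ends
    return starts

# layout_main_track([{"leading_silence": 0.5, "duration": 12.0},
#                    {"leading_silence": 0.8, "duration": 7.5}])  -> [0.5, 13.3]
```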
specifically, the clip content of the sound clip in the main track includes an association relationship between the sound clip in the main track and a text paragraph in the script, the text content of the associated text paragraph, an association relationship between the sound clip in the main track and a recording material, the associated recording material, and editing information of the sound clip in the main track;
the text paragraphs for which the recording materials are not recorded do not have corresponding recording materials, and the fragment contents of the sound fragments in the corresponding main track do not include the association relationship between the sound fragments in the main track and the recording materials;
when the sound clip in the main track is associated with the recording material, namely when the clip content of the sound clip in the main track comprises the association relation between the sound clip in the main track and the recording material, the clip content of the sound clip in the main track also comprises text alignment information between the text content and the recording material;
For the script which finishes recording the recording material, sound segments associated with all text segments are arranged on the main track according to the segment presentation sequence set by the script and are separated from each other by taking leading silence as an interval;
the editing information of the sound clip in the main track comprises the leading silence duration, the material playback start position, the duration of the sound clip in the main track, and effect setting information;
in particular, the segment content of the sound segment in the main track comprises one or more anchor points;
wherein the anchor point is set at a position in the text alignment information; or, the anchor point is arranged at a specific position of the sound clip in the main track, and the specific position includes but is not limited to the beginning and the end of the sound clip in the main track; or the anchor point is arranged at the relative position based on the recording material set by the editing user;
the anchor point comprises semantic information, wherein the semantic information comprises words or phrases in a text, audio event description, start time and end time;
the anchor point is used for text content positioning based on semantics; positioning a corresponding text paragraph from the audio time according to the anchor point, or positioning a corresponding audio time from the text paragraph;
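The bidirectional, semantics-based lookup that anchor points enable might be sketched as follows (assumed data structures; the disclosed anchor format is not specified at this level of detail):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Anchor:
    paragraph_index: int     # which text paragraph the anchor belongs to
    phrase: str              # word or phrase from the text
    audio_event: str         # audio event description, e.g. "door slams"
    start_time: float        # seconds on the global timeline
    end_time: float

def paragraph_at_time(anchors: List[Anchor], t: float) -> Optional[int]:
    """Locate the text paragraph being spoken at audio time t."""
    for a in anchors:
        if a.start_time <= t <= a.end_time:
            return a.paragraph_index
    return None

def time_of_paragraph(anchors: List[Anchor], paragraph_index: int) -> Optional[float]:
    """Locate the audio time at which a text paragraph begins."""
    times = [a.start_time for a in anchors if a.paragraph_index == paragraph_index]
    return min(times) if times else None
```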
the sound clips in the main track are the primary presentation objects of audiobook post-production, carrying the recording material that corresponds to the text paragraphs of the script; that is, the main track usually holds only recording material corresponding to text paragraphs in the script, recorded by the narrator or obtained through speech synthesis;
the sound segments in the primary track may also be used to present audio material, i.e. to place audio material in the primary track;
when audio material is inserted between sound clips of the main track, the audio material is converted into a main-track sound clip and positioned according to the main track's positioning method;
specifically, the clip content of the sound clip in the secondary track includes an association relationship between the sound clip in the secondary track and the audio material, the associated audio material, and editing information of the sound clip in the secondary track;
particularly, the editing information of the sound clip in the auxiliary track includes a material playing start position, the duration of the sound clip in the auxiliary track, the start position of the sound clip in the auxiliary track, the circular playing setting information, and the track description information;
the positioning mode of the starting position of the sound clip in the auxiliary track comprises an absolute time positioning mode and a relative binding positioning mode; in the absolute time positioning mode, the starting position of the sound clip in the auxiliary track is located at a specific time defined by the global time axis; in the relative binding positioning mode, the starting position of the sound clip in the auxiliary track is at an offset moment relative to the anchor point, and the anchor point is provided by the sound clip in the main track;
When the loop playing is effective, the loop playing setting information comprises a loop ending position; setting the cycle end position comprises setting by the absolute time length after the cycle or setting by an anchor point provided by a sound segment in the main track;
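A sketch of the two positioning modes and the loop-end rule for secondary-track clips, under the assumption that clips are plain dictionaries with hypothetical keys and that anchors is a mapping from anchor id to the Anchor structure sketched earlier:

```python
def secondary_clip_start(clip, anchors):
    """Resolve the start of a secondary-track clip on the global timeline.
    clip['mode'] is 'absolute' (fixed global time) or 'relative'
    (an offset from an anchor point provided by a main-track clip)."""
    if clip["mode"] == "absolute":
        return clip["start_time"]
    anchor = anchors[clip["anchor_id"]]
    return anchor.start_time + clip["offset"]

def loop_end(clip, start, anchors):
    """Loop end is either start + an absolute duration, or an anchor time from the main track."""
    if not clip.get("loop_enabled"):
        return None
    if "loop_duration" in clip:
        return start + clip["loop_duration"]
    return anchors[clip["loop_end_anchor_id"]].start_time
```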
the editing display displays an editing view in the process of mixing processing performed by the mixing processor, wherein the editing view comprises a text editing view and a multi-track editing view; the positions of the cursors in the text editing view and the multi-track editing view are bound with each other;
wherein, in the editing display, the text editing view and the multi-track editing view are presented simultaneously, or the editing user selects one of the views for presentation and can switch between them;
the multi-track editing view provides a visual multi-track audio nonlinear editing interface for an editing user;
wherein the multi-track editing view presents a primary track and a secondary track, and a timeline of the multi-track editing view is correlated with the text editing view by text alignment information of sound clips in the primary track;
in the sound mixing process, the sound mixing processor acquires audio materials subjected to sound effect processing through the sound effect processor, and adds the acquired audio materials into an editing view of an editing display, wherein the added audio materials are placed at a position set by an editing user;
Specifically, when the audio mixing processor adds the acquired audio material to the multi-track editing view, the added audio material is placed at an editing user set position in an absolute time positioning manner;
after the audio materials are added to the multi-track editing view, the positioning mode of the audio materials can be selected to be modified into a relative binding positioning mode;
specifically, when the audio mixing processor adds the acquired audio material into the text editing view, the added audio material is placed at a position set by an editing user in a relative binding positioning mode;
displaying the audio material positioned in an absolute time positioning mode on a side bar of a text editing view in a floating mode in the text editing view;
in particular, in a text editing view or a multi-track editing view, the editing user can edit the properties of the added audio material;
particularly, during sound mixing, when a text paragraph has not yet been recorded and the corresponding main-track sound clip lacks associated recording material, the duration of that clip on the main track is determined from the narrator's estimated speech rate in the project default configuration and the paragraph's word count, or the editing user selects a speech synthesis result to serve temporarily as the recording material;
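The placeholder-duration rule for unrecorded paragraphs reduces to a simple estimate; a sketch, assuming the speech rate is expressed in characters per second:

```python
def placeholder_duration(char_count: int, speech_rate_cps: float) -> float:
    """Estimated clip duration when no recording material is associated yet.
    speech_rate_cps comes from the project default configuration."""
    return char_count / speech_rate_cps

# e.g. a 300-character paragraph at 4.5 characters/second reserves about 66.7 s on the main track
```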
The sound mixing processor performs sound mixing operation on the sound clip according to the script to obtain a sound mixing processing result; the post-production equipment generates a post-production result according to the sound mixing processing result;
the post-production equipment generates a post-production result through the created post-production engineering and outputs the post-production result to the audio presentation equipment;
the post-production result comprises the audio file of the sound mixing result, the script, and the anchor points;
the audio presentation device presents the post-production results;
specifically, the audio presentation device plays the recording material and the audio material in the main track and the auxiliary track according to the paragraph presentation sequence set by the script.
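Since the post-production result carries the audio file, the script, and the anchor points, the presentation side can support switching between reading and listening; one way this could work is sketched below, using the Anchor structure from the earlier sketch (hypothetical helpers, not the disclosed implementation):

```python
def resume_listening(anchors, last_read_paragraph: int) -> float:
    """Audio time to seek to when the user switches from reading to listening."""
    times = [a.start_time for a in anchors if a.paragraph_index == last_read_paragraph]
    return min(times) if times else 0.0

def resume_reading(anchors, last_listened_time: float) -> int:
    """Paragraph to scroll to when the user switches from listening to reading."""
    candidates = [a for a in anchors if a.start_time <= last_listened_time]
    return max(candidates, key=lambda a: a.start_time).paragraph_index if candidates else 0
```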
The invention discloses a text and audio presentation processing method, which comprises the following steps:
the script editing equipment generates a script; the script comprises one or more paragraphs;
particularly, the script comprises a recording material, an audio material, a sound effect processing mode and a paragraph presentation sequence corresponding to the paragraphs;
the post-production equipment acquires the script from the script editing equipment connected with the post-production equipment and creates a post-production project according to the script; the post production project comprises sound segments, wherein the sound segments correspond to paragraphs in the script;
in particular, the post-production project includes project default configuration information; the project default configuration information comprises the default interval between sound clips in the main track, the target gain settings for sound mixing, and the narrator's estimated speech rate; the default interval between sound clips in the main track is the default leading silence duration, and the target gain settings for sound mixing comprise target volume values for voice, music, and sound effects;
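A sketch of what this project default configuration might look like as a data structure (the field names and numeric defaults are illustrative assumptions):

```python
from dataclasses import dataclass

@dataclass
class ProjectDefaults:
    default_leading_silence_s: float = 0.6   # default interval between main-track sound clips
    target_gain_voice_db: float = -16.0      # mixing target volume for voice
    target_gain_music_db: float = -24.0      # mixing target volume for music
    target_gain_effects_db: float = -20.0    # mixing target volume for sound effects
    narrator_speech_rate_cps: float = 4.5    # estimated speech rate, characters per second
```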
particularly, the post-production equipment comprises a sound effect processor, a sound mixing processor, a material manager and an editing display;
the sound effect processor is connected with the material manager; the sound effect processor is connected with the sound mixing processor; the mixing processor is connected with the editing display;
the material manager comprises a material library, wherein the material library locally stores materials;
the material manager organizes and manages the materials in a hierarchical structure in a material library, wherein the material types in the material library comprise audio materials and effect materials; the audio materials include but are not limited to sound effect materials and music materials; the effect material comprises effect setting parameters, such as cave environment effect, broadcast effect and the like;
The sound effect processor acquires audio materials through the material manager connected with it, applies sound effect processing operations to the audio materials according to the sound effect processing mode set by the script, and outputs the processed audio materials to the sound mixing processor connected with the sound effect processor;
wherein, the sound effects include, but are not limited to, gain, sound channel mixing, fade-in/fade-out, equalization, environment, noise reduction, compression, time scaling, and three-dimensional effects;
in the sound effect processing mode, a plurality of sound effects are connected in series and the sequence is adjustable; the sound effect comprises one or more adjustable operating parameters, and the adjustable operating parameters of the sound effect are set to change along with the playing time according to a specific curve;
the audio mixing processor comprises a main track and an auxiliary track, wherein the main track and the auxiliary track are used for bearing audio clips; the sound mixing processor performs sound mixing processing operation on the sound clips in the main track and the auxiliary track according to the script to obtain a sound mixing processing result;
wherein the paragraphs of the script comprise a text paragraph and an audio paragraph; the sound segments comprise sound segments in a main track corresponding to the text paragraphs and sound segments in an auxiliary track corresponding to the audio paragraphs;
The positions of the sound segments in the main track are determined by the presentation sequence of the paragraphs set by the script, and the positions of the sound segments in the auxiliary track are set by an editing user;
the main track is positioned according to the paragraph presentation sequence, the paragraph presentation duration and the leading silence of the sound clip carried by the main track;
specifically, the clip content of the sound clip in the main track includes an association relationship between the sound clip in the main track and a text paragraph in the script, the text content of the associated text paragraph, an association relationship between the sound clip in the main track and a recording material, the associated recording material, and editing information of the sound clip in the main track;
the text paragraphs for which the recording materials are not recorded do not have corresponding recording materials, and the fragment contents of the sound fragments in the corresponding main track do not include the association relationship between the sound fragments in the main track and the recording materials;
when the sound clip in the main track is associated with the recording material, namely when the clip content of the sound clip in the main track comprises the association relation between the sound clip in the main track and the recording material, the clip content of the sound clip in the main track also comprises text alignment information between the text content and the recording material;
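The text alignment information could, for example, take the form of word- or phrase-level timing entries; a sketch under that assumption (the disclosure does not fix a concrete format):

```python
# Each entry maps a span of the paragraph text to a time span in the recording material.
alignment = [
    {"text": "Once upon a time", "char_start": 0,  "char_end": 16, "t_start": 0.00, "t_end": 1.35},
    {"text": "in a small town",  "char_start": 17, "char_end": 32, "t_start": 1.35, "t_end": 2.60},
]

def time_for_char(alignment, char_offset: int) -> float:
    """Map a cursor position in the text to a time in the recording (linear interpolation)."""
    for e in alignment:
        if e["char_start"] <= char_offset <= e["char_end"]:
            frac = (char_offset - e["char_start"]) / max(1, e["char_end"] - e["char_start"])
            return e["t_start"] + frac * (e["t_end"] - e["t_start"])
    return alignment[-1]["t_end"]
```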
For the script which finishes recording the recording material, sound fragments associated with all text paragraphs are arranged on the main track according to the paragraph presentation sequence set by the script and are separated from each other by taking leading silence as an interval;
the editing information of the sound clip in the main track comprises the leading silence duration, the material playback start position, the duration of the sound clip in the main track, and effect setting information;
in particular, the segment content of a sound segment in the main track comprises one or more anchors;
wherein the anchor point is set at a position in the text alignment information; or, the anchor point is arranged at a specific position of the sound clip in the main track, and the specific position includes but is not limited to the beginning and the end of the sound clip in the main track; or the anchor point is arranged at the relative position based on the recording material set by the editing user;
wherein the anchor point comprises semantic information; the semantic information comprises characters or words or phrases in the text, audio event description, start time and end time;
the anchor point is used for text content positioning based on semantics; positioning a corresponding text paragraph from the audio time according to the anchor point, or positioning a corresponding audio time from the text paragraph;
the sound clips in the main track are the primary presentation objects of audiobook post-production, carrying the recording material that corresponds to the text paragraphs of the script; that is, the main track usually holds only recording material corresponding to text paragraphs in the script, recorded by the narrator or obtained through speech synthesis;
the sound segments in the primary track may also be used to present audio material, i.e. to place audio material in the primary track;
when audio material is inserted between sound clips of the main track, the audio material is converted into a main-track sound clip and positioned according to the main track's positioning method;
specifically, the clip content of the sound clip in the secondary track includes an association relationship between the sound clip in the secondary track and the audio material, the associated audio material, and editing information of the sound clip in the secondary track;
particularly, the editing information of the sound clip in the auxiliary track includes a material playing start position, the duration of the sound clip in the auxiliary track, the start position of the sound clip in the auxiliary track, the circular playing setting information, and the track description information;
the positioning mode of the starting position of the sound clip in the auxiliary track comprises an absolute time positioning mode and a relative binding positioning mode; in the absolute time positioning mode, the starting position of the sound clip in the auxiliary track is located at a specific time defined by the global time axis; in the relative binding positioning mode, the starting position of the sound clip in the auxiliary track is at an offset moment relative to the anchor point, and the anchor point is provided by the sound clip in the main track;
When the loop playing is effective, the loop playing setting information comprises a loop ending position; setting the cycle end position comprises setting by the absolute time length after the cycle or setting by an anchor point provided by a sound segment in the main track;
the editing display displays an editing view in the process of mixing sound by the sound mixing processor, wherein the editing view comprises a text editing view and a multi-track editing view; the positions of the cursors in the text editing view and the multi-track editing view are bound with each other;
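The mutual binding of cursor positions between the text editing view and the multi-track editing view can be illustrated as two mappings, reusing the time_for_char helper sketched above (the per-clip fields are assumed):

```python
def timeline_cursor_from_text_cursor(clips, paragraph_index: int, char_offset: int) -> float:
    """Text editing view -> multi-track editing view: map a text cursor to a timeline position."""
    clip = clips[paragraph_index]
    return clip["timeline_start"] + time_for_char(clip["alignment"], char_offset)

def text_cursor_from_timeline(clips, t: float):
    """Multi-track editing view -> text editing view: map a timeline position to (paragraph, char offset)."""
    for i, clip in enumerate(clips):
        local = t - clip["timeline_start"]          # time within this clip's recording material
        if 0 <= local <= clip["duration"]:
            for e in clip["alignment"]:
                if e["t_start"] <= local <= e["t_end"]:
                    return i, e["char_start"]
            return i, 0
    return None
```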
wherein, in the editing display, the text editing view and the multi-track editing view are presented simultaneously, or the editing user selects one of the views for presentation and can switch between them;
the multi-track editing view provides a visual multi-track audio nonlinear editing interface for an editing user;
wherein the multi-track editing view presents a primary track and a secondary track, and a timeline of the multi-track editing view is correlated with the text editing view by text alignment information of sound clips in the primary track;
in the sound mixing process, the sound mixing processor acquires audio materials subjected to sound effect processing through the sound effect processor, and adds the acquired audio materials into an editing view of an editing display, wherein the added audio materials are placed at a position set by an editing user;
Specifically, when the audio mixing processor adds the acquired audio material to the multi-track editing view, the added audio material is placed at an editing user set position in an absolute time positioning manner;
after the audio material is added to the multi-track editing view, the positioning mode of the audio material can be selected to be changed into a relative binding positioning mode;
specifically, when the audio mixing processor adds the acquired audio material into the text editing view, the added audio material is placed at a position set by an editing user in a relative binding and positioning manner;
displaying the audio material positioned in an absolute time positioning mode on a side bar of a text editing view in a floating mode in the text editing view;
in particular, in a text editing view or a multitrack editing view, the editing user can edit the properties of the added audio material;
particularly, during sound mixing, when a text paragraph has not yet been recorded and the corresponding main-track sound clip lacks associated recording material, the duration of that clip on the main track is determined from the narrator's estimated speech rate in the project default configuration and the paragraph's word count, or the editing user selects a speech synthesis result to serve temporarily as the recording material;
The sound mixing processor performs sound mixing operation on the sound clip according to the script to obtain a sound mixing processing result; the post-production equipment generates a post-production result according to the sound mixing processing result;
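A sketch of mixing toward the configured target volumes, using simple peak normalization per track class (the disclosure does not specify the loudness measure, so this is an assumption):

```python
import numpy as np

def mix_tracks(tracks, target_gain_db):
    """tracks: {'voice': np.ndarray, 'music': np.ndarray, ...} at the same sample rate.
    target_gain_db: per-class target peak level, e.g. from the project defaults."""
    out = np.zeros(max(len(t) for t in tracks.values()))
    for name, samples in tracks.items():
        peak = np.max(np.abs(samples)) or 1.0               # avoid division by zero on silence
        target = 10.0 ** (target_gain_db[name] / 20.0)       # dBFS target -> linear amplitude
        padded = np.pad(samples, (0, len(out) - len(samples)))
        out += padded * (target / peak)
    return np.clip(out, -1.0, 1.0)
```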
the post-production equipment performs post-production processing on the sound clips through the post-production project to generate a post-production result and outputs the post-production result to the audio presentation equipment connected with the post-production equipment;
the post-production result comprises the audio file of the sound mixing result, the script, and the anchor points;
the audio presentation device presents the post-production results;
specifically, the audio presentation device plays the recording material and the audio material in the main track and the auxiliary track according to the paragraph presentation sequence set by the script.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (10)

1. A text and audio presentation processing method, comprising:
the script editing equipment generates a script; the script comprises one or more paragraphs;
the post-production equipment acquires scripts from script editing equipment connected with the post-production equipment and creates a post-production project according to the scripts; the post production project comprises sound segments, wherein the sound segments correspond to paragraphs in the script;
the post-production equipment performs post-production processing on the sound clips through the post-production project to generate a post-production result and outputs the post-production result to audio presentation equipment connected with the post-production equipment;
the audio presentation device presents the post-production results.
2. The text and audio presentation processing method of claim 1,
the script comprises a recording material, an audio material, a sound effect processing mode and a paragraph presenting sequence corresponding to the paragraphs;
the post-production equipment comprises a sound effect processor, a sound mixing processor, a material manager and an editing display;
the material manager comprises a material library, wherein the material library locally stores materials; the material manager organizes and manages the materials in a hierarchical structure in a material library, wherein the material types in the material library comprise audio materials and effect materials;
The sound effect processor acquires audio materials through the material manager connected with it, applies sound effect processing operations to the audio materials according to the sound effect processing mode set by the script, and outputs the processed audio materials to the sound mixing processor connected with the sound effect processor;
the audio mixing processor comprises a main track and an auxiliary track, wherein the main track and the auxiliary track are used for bearing audio clips; the sound mixing processor performs sound mixing processing operation on the sound clips in the main track and the auxiliary track according to the script to obtain a sound mixing processing result;
wherein the script paragraphs comprise text paragraphs and audio paragraphs; the sound fragments comprise sound fragments in a main track corresponding to the text paragraphs and sound fragments in an auxiliary track corresponding to the audio paragraphs;
the position of the sound fragment in the main track is determined by the paragraph presentation sequence set by the script, and the position of the sound fragment in the auxiliary track is set by an editing user;
the editing display displays an editing view in the process of mixing processing performed by the mixing processor, wherein the editing view comprises a text editing view and a multi-track editing view; the cursor positions in the text editing view and the multi-track editing view are bound with each other;
In the editing display, the text editing view and the multi-track editing view are presented simultaneously, or an editing user selects one of the views for presentation and can switch between them.
3. The text and audio presentation processing method of claim 2,
the segment content of the sound segment in the main track comprises an incidence relation between the sound segment in the main track and a text paragraph in the script, the text content of the associated text paragraph, the incidence relation between the sound segment in the main track and a recording material, the associated recording material and editing information of the sound segment in the main track;
when the sound clip in the main track is associated with the recording material, the clip content of the sound clip in the main track also comprises text alignment information between the text content and the recording material; for the script which finishes recording the recording material, sound segments associated with all text paragraphs are arranged on the main track according to the paragraph presentation sequence set by the script and are separated from each other by taking leading silence as an interval;
wherein the multi-track editing view presents a primary track and a secondary track, and a timeline of the multi-track editing view is correlated with the text editing view by text alignment information of sound clips in the primary track;
the editing information of the sound clip in the main track comprises the leading silence duration, the material playback start position, the duration of the sound clip in the main track, and effect setting information;
the segment content of the sound segment in the main track comprises one or more anchors; wherein the anchor point is set at a position in the text alignment information; or the anchor point is arranged at a specific position of the sound segment in the main track, and the specific position comprises the beginning and the end of the sound segment in the main track; or the anchor point is arranged at the relative position based on the recording material set by the editing user; wherein the anchor point comprises semantic information, and the semantic information comprises words or phrases in the text, audio event description, start time and end time.
4. The text and audio presentation processing method of claim 3,
the clip content of the sound clip in the auxiliary track comprises an association relation between the sound clip in the auxiliary track and an audio material, the associated audio material, and editing information of the sound clip in the auxiliary track;
the editing information of the sound clip in the auxiliary track comprises a material playback start position, the duration of the sound clip in the auxiliary track, the start position of the sound clip in the auxiliary track, loop playback setting information, and track description information;
The positioning mode of the start position of the sound clip in the auxiliary track comprises an absolute time positioning mode and a relative binding positioning mode; in the absolute time positioning mode, the start position of the sound clip in the auxiliary track is located at a specific time defined on the global timeline; in the relative binding positioning mode, the start position of the sound clip in the auxiliary track is at an offset relative to an anchor point, and the anchor point is provided by a sound clip in the main track;
when loop playback is enabled, the loop playback setting information comprises a loop end position; setting the loop end position comprises specifying an absolute duration after which looping ends, or specifying an anchor point provided by a sound clip in the main track;
in the sound mixing processing, the sound mixing processor acquires the audio materials subjected to sound effect processing through the sound effect processor and adds the acquired audio materials to the editing view of the editing display, the added audio materials being placed at positions set by the editing user.
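Claim 4's two positioning modes for an auxiliary-track clip (absolute time versus relative binding to an anchor point in the main track) reduce to a small resolution step; the following sketch, with assumed names such as Positioning and AuxClipPlacement, shows how the start position on the global timeline could be computed.

```python
# Sketch of resolving an auxiliary-track clip's start position; names and fields are assumptions.
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Positioning(Enum):
    ABSOLUTE = "absolute"            # start at a specific time on the global timeline
    RELATIVE_TO_ANCHOR = "anchor"    # start at an offset from an anchor provided by a main-track clip


@dataclass
class AuxClipPlacement:
    mode: Positioning
    absolute_start_s: float = 0.0    # used in absolute time positioning mode
    anchor_time_s: float = 0.0       # global time of the referenced main-track anchor
    offset_s: float = 0.0            # offset relative to that anchor
    loop: bool = False
    loop_end_s: Optional[float] = None   # loop end, as an absolute duration or derived from an anchor


def resolve_start(placement: AuxClipPlacement) -> float:
    """Return the clip's start position on the global timeline under either mode."""
    if placement.mode is Positioning.ABSOLUTE:
        return placement.absolute_start_s
    return placement.anchor_time_s + placement.offset_s
```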
5. The text and audio presentation processing method of claim 2,
wherein the post-production project comprises project default configuration information; the project default configuration information comprises a default interval value between sound clips in the main track, a target gain setting value for the sound mixing processing, and an estimated speech rate of the narrator; the default interval value between sound clips in the main track is the default leading silence duration, and the target gain setting value for the sound mixing processing comprises target volume values for voice, music and sound effects;
In the sound mixing processing, when a text paragraph has not yet been recorded and the corresponding sound clip in the main track lacks an associated recording material, the duration of that sound clip on the main track is determined by the estimated speech rate of the narrator in the project default configuration information and the word count of the paragraph, or the editing user selects a speech synthesis result to serve temporarily as the recording material.
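The fallback duration described in claim 5 amounts to dividing the paragraph word count by the speech rate estimate and adding the leading silence; a minimal sketch follows, in which the function name and the words-per-second convention are assumptions rather than claimed values.

```python
# Hypothetical helper for estimating the main-track duration of an unrecorded paragraph.
def estimate_clip_duration_s(word_count: int, speech_rate_wps: float,
                             leading_silence_s: float = 0.5) -> float:
    """Estimate how long an unrecorded paragraph will occupy on the main track."""
    if speech_rate_wps <= 0:
        raise ValueError("speech rate must be positive")
    return leading_silence_s + word_count / speech_rate_wps


# e.g. a 120-word paragraph at roughly 2.5 words per second -> about 48.5 seconds including the gap
print(estimate_clip_duration_s(120, 2.5))
```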
6. A text and audio presentation processing system is characterized by comprising script editing equipment, post-production equipment and audio presentation equipment which are connected in sequence;
the script editing equipment generates a script; the script comprises one or more paragraphs;
the post-production equipment acquires the script and creates a post-production project according to the script; the post-production project comprises sound clips, and the sound clips correspond to paragraphs in the script;
the post-production equipment performs post-production processing on the sound clips through the post-production project to generate a post-production result and outputs the post-production result to the audio presentation equipment;
the audio presentation equipment presents the post-production result.
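The script editing, post-production and audio presentation equipment of claim 6 form a linear pipeline from script to presented audio; the sketch below models that flow with assumed interfaces (ScriptEditingDevice, PostProductionDevice, AudioPresentationDevice) that are illustrative only and not part of the claim.

```python
# Pipeline sketch: script editing -> post-production -> audio presentation; interfaces are assumptions.
from typing import Any, Protocol


class ScriptEditingDevice(Protocol):
    def generate_script(self) -> Any: ...


class PostProductionDevice(Protocol):
    def create_project(self, script: Any) -> Any: ...
    def produce(self, project: Any) -> bytes: ...      # post-production result, e.g. mixed audio


class AudioPresentationDevice(Protocol):
    def present(self, result: bytes) -> None: ...


def run_pipeline(editor: ScriptEditingDevice,
                 post: PostProductionDevice,
                 player: AudioPresentationDevice) -> None:
    """The three devices are connected in sequence: script -> project -> result -> presentation."""
    script = editor.generate_script()
    project = post.create_project(script)
    result = post.produce(project)
    player.present(result)
```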
7. The text and audio presentation processing system of claim 6,
The script comprises a recording material, an audio material, a sound effect processing mode and a paragraph presentation sequence corresponding to the paragraphs;
the post-production equipment comprises a sound effect processor, a sound mixing processor, a material manager and an editing display; the sound effect processor is connected with the material manager; the sound effect processor is connected with the sound mixing processor; the mixing processor is connected with the editing display;
the material manager comprises a material library, wherein the material library locally stores materials; the material manager organizes and manages the materials in a hierarchical structure in a material library, wherein the material types in the material library comprise audio materials and effect materials;
the sound effect processor acquires audio materials through the material manager, applies sound effect processing operation to the audio materials according to a sound effect processing mode set by the script, and outputs the audio materials subjected to sound effect processing to the sound mixing processor;
the sound mixing processor comprises a main track and an auxiliary track, wherein the main track and the auxiliary track are used for bearing sound clips; the sound mixing processor performs sound mixing processing operation on the sound clips in the main track and the auxiliary track according to the script to obtain a sound mixing processing result;
Wherein the script paragraphs comprise text paragraphs and audio paragraphs; the sound clips comprise sound clips in the main track corresponding to the text paragraphs and sound clips in the auxiliary track corresponding to the audio paragraphs;
the position of a sound clip in the main track is determined by the paragraph presentation sequence set by the script, and the position of a sound clip in the auxiliary track is set by an editing user;
the editing display displays an editing view during the sound mixing processing performed by the sound mixing processor, wherein the editing view comprises a text editing view and a multi-track editing view; the cursor positions in the text editing view and the multi-track editing view are bound with each other;
in the editing display, the text editing view and the multi-track editing view are presented simultaneously, or the editing user selects one of the views for presentation and can switch between them.
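The material manager in claim 7 organizes audio and effect materials in a hierarchical, locally stored library; the sketch below assumes a simple tree node (MaterialNode) with a depth-first lookup, which is one possible way to realize such a hierarchy and is not the claimed structure.

```python
# Hypothetical hierarchical material library; node layout and material kinds are assumptions.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MaterialNode:
    name: str
    kind: Optional[str] = None               # "audio" or "effect" for leaf materials, None for folders
    path: Optional[str] = None               # local storage location of the material file
    children: List["MaterialNode"] = field(default_factory=list)

    def find(self, name: str) -> Optional["MaterialNode"]:
        """Depth-first lookup of a material by name anywhere in the hierarchy."""
        if self.name == name:
            return self
        for child in self.children:
            hit = child.find(name)
            if hit is not None:
                return hit
        return None


# e.g. a small library with one folder per material type
library = MaterialNode("root", children=[
    MaterialNode("music", children=[
        MaterialNode("intro_theme", kind="audio", path="materials/intro.wav"),
    ]),
    MaterialNode("effects", children=[
        MaterialNode("small_room_reverb", kind="effect", path="materials/reverb.json"),
    ]),
])
print(library.find("intro_theme"))
```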
8. The text and audio presentation processing system of claim 7,
the clip content of the sound clip in the main track comprises an association relation between the sound clip in the main track and a text paragraph in the script, the text content of the associated text paragraph, an association relation between the sound clip in the main track and a recording material, the associated recording material, and editing information of the sound clip in the main track;
When the sound clip in the main track is associated with a recording material, the clip content of the sound clip in the main track further comprises text alignment information between the text content and the recording material; for a script whose recording materials have been fully recorded, the sound clips associated with all text paragraphs are arranged on the main track according to the paragraph presentation sequence set by the script and are separated from one another by leading silence intervals;
wherein the multi-track editing view presents the main track and the auxiliary track, and the timeline of the multi-track editing view is correlated with the text editing view through the text alignment information of the sound clips in the main track;
the editing information of the sound clip in the main track comprises a leading silence duration, a material playback start position, the duration of the sound clip in the main track, and effect setting information;
the clip content of the sound clip in the main track comprises one or more anchor points; wherein an anchor point is set at a position in the text alignment information; or an anchor point is set at a specific position of the sound clip in the main track, the specific position comprising the beginning and the end of the sound clip in the main track; or an anchor point is set at a relative position within the recording material specified by the editing user; wherein an anchor point comprises semantic information, and the semantic information comprises a word or phrase in the text, an audio event description, a start time and an end time.
9. The text and audio presentation processing system of claim 8,
the clip content of the sound clip in the auxiliary track comprises an association relation between the sound clip in the auxiliary track and an audio material, the associated audio material, and editing information of the sound clip in the auxiliary track;
the editing information of the sound clip in the auxiliary track comprises a material playback start position, the duration of the sound clip in the auxiliary track, the start position of the sound clip in the auxiliary track, loop playback setting information, and track description information;
the positioning mode of the start position of the sound clip in the auxiliary track comprises an absolute time positioning mode and a relative binding positioning mode; in the absolute time positioning mode, the start position of the sound clip in the auxiliary track is located at a specific time defined on the global timeline; in the relative binding positioning mode, the start position of the sound clip in the auxiliary track is at an offset relative to an anchor point, and the anchor point is provided by a sound clip in the main track;
when loop playback is enabled, the loop playback setting information comprises a loop end position; setting the loop end position comprises specifying an absolute duration after which looping ends, or specifying an anchor point provided by a sound clip in the main track;
In the sound mixing processing, the sound mixing processor acquires the audio materials subjected to sound effect processing through the sound effect processor and adds the acquired audio materials to the editing view of the editing display, the added audio materials being placed at positions set by the editing user.
10. The text and audio presentation processing system of claim 7,
wherein the post-production project comprises project default configuration information; the project default configuration information comprises a default interval value between sound clips in the main track, a target gain setting value for the sound mixing processing, and an estimated speech rate of the narrator; the default interval value between sound clips in the main track is the default leading silence duration, and the target gain setting value for the sound mixing processing comprises target volume values for voice, music and sound effects;
in the sound mixing processing, when a text paragraph has not yet been recorded and the corresponding sound clip in the main track lacks an associated recording material, the duration of that sound clip on the main track is determined by the estimated speech rate of the narrator in the project default configuration information and the word count of the paragraph, or the editing user selects a speech synthesis result to serve temporarily as the recording material.
CN202210089590.4A 2022-01-26 2022-01-26 Text and audio presentation processing method and system Pending CN114595356A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210089590.4A CN114595356A (en) 2022-01-26 2022-01-26 Text and audio presentation processing method and system

Publications (1)

Publication Number Publication Date
CN114595356A true CN114595356A (en) 2022-06-07

Family

ID=81806507

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210089590.4A Pending CN114595356A (en) 2022-01-26 2022-01-26 Text and audio presentation processing method and system

Country Status (1)

Country Link
CN (1) CN114595356A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination