CN113115103A

CN113115103A - System and method for realizing real-time audio-to-text conversion in network live broadcast

Info

Publication number: CN113115103A
Application number: CN202110252429.XA
Authority: CN
Inventors: 王旭辉; 易长安; 王旭伟; 李洋; 张玉龙
Original assignee: Hangzhou Maiqu Network Technology Co ltd
Current assignee: Hangzhou Maiqu Network Technology Co ltd
Priority date: 2021-03-09
Filing date: 2021-03-09
Publication date: 2021-07-13

Abstract

The invention relates to a system and a method for realizing real-time audio-to-text conversion in network live broadcast. The system comprises: a system configuration module, used for configuring live broadcast parameters such as code stream (bitrate), definition (resolution), and language type; a voice acquisition module, connected to the live broadcast equipment through a line and collecting the audio signals in the equipment; a voice conversion module, which receives the voice data of the voice acquisition module in real time and computes, recognizes, and converts it into text at high speed; and a subtitle processing module, which acquires the text converted by the voice conversion module in real time and edits it onto the video screen for display. The method is based on audio recognition technology and aims to provide a new way of displaying subtitles in real time during a live video broadcast, overcoming the defects of the traditional subtitle display method, which consumes money and manpower and cannot adapt in real time to changes on stage.

Description

System and method for realizing real-time audio-to-text conversion in network live broadcast

Technical Field

The invention relates to the technical field of audio-to-text conversion, and in particular to a system and a method for realizing real-time audio-to-text conversion in network live broadcast.

Background

With the rapid development of China's economy, living standards keep improving, and regions, theatres, enterprises, and other organizations increasingly need to showcase their own characteristics by organizing performances. China has a large population, and not everyone can attend in person, so network live broadcast lets people watch performances from all over the country at home. Performances take many spoken forms, such as songs, operas, sketches, crosstalk, and recitations, and they use many languages, some of which the audience cannot understand. To promote cultural exchange, viewers want to see the content directly, so it needs to be displayed as text. Live subtitles are therefore very important, yet they remain a weak point of the prior art: an expensive machine is used, the text is arranged in advance against the script, and the subtitles are switched manually to follow what happens on stage. This approach can display subtitles, but it costs money and manpower, and because the situation on stage sometimes changes, the subtitles may then fail to match or be missing altogether.

Disclosure of Invention

In view of the defects in the background art, the invention provides a system and a method for realizing real-time audio-to-text conversion in network live broadcast. The method is based on audio recognition technology and aims to provide a new way of displaying subtitles in real time during a live video broadcast, overcoming the defects of the traditional subtitle display method, which consumes money and manpower and cannot adapt in real time to changes on stage.

The system for realizing real-time audio-to-text conversion in network live broadcast comprises: a system configuration module, used for configuring live broadcast parameters such as code stream (bitrate), definition (resolution), and language type; a voice acquisition module, connected to the live broadcast equipment through a line and collecting the audio signals in the equipment; a voice conversion module, which receives the voice data of the voice acquisition module in real time and computes, recognizes, and converts it into text at high speed; and a subtitle processing module, which acquires the text converted by the voice conversion module in real time and edits it onto the video screen for display.
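As a minimal sketch of what the system configuration module might hold, the three parameters named above can be kept in a small validated record. The class name `LiveConfig`, the field names, the units, and the set of supported language codes are all assumptions made for illustration; the patent does not specify them.

```python
from dataclasses import dataclass

# Hypothetical set of language codes; the patent does not enumerate them.
SUPPORTED_LANGUAGES = {"zh", "en", "yue"}


@dataclass(frozen=True)
class LiveConfig:
    """Parameters set in the system configuration module before a broadcast."""
    bitrate_kbps: int   # code stream (bitrate) of the live video
    resolution: str     # definition, written as WIDTHxHEIGHT
    language: str       # expected language type of the programme

    def validate(self) -> None:
        if self.bitrate_kbps <= 0:
            raise ValueError("bitrate must be positive")
        width, _, height = self.resolution.partition("x")
        if not (width.isdigit() and height.isdigit()):
            raise ValueError("resolution must look like WIDTHxHEIGHT")
        if self.language not in SUPPORTED_LANGUAGES:
            raise ValueError(f"unsupported language: {self.language}")


config = LiveConfig(bitrate_kbps=4000, resolution="1920x1080", language="zh")
config.validate()  # raises ValueError if any parameter is malformed
```

Validating once, before the broadcast starts, matches the idea that configuration happens before the equipment is used.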

By adopting this scheme, the invention applies audio recognition technology and big data cloud analysis and, combined with current 5G technology, greatly improves transmission efficiency.

Further, the system comprises a voice recognition module, which acquires the data of the voice acquisition module, recognizes the language type, and sends the recognized language type to the voice conversion module to facilitate coding by the voice conversion module.

By adopting this scheme, the module uses an established voice recognition platform for voice recognition, which ensures accuracy during the recognition process.
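The hand-off of the recognized language type to the voice conversion module can be sketched as a simple dispatch table. Everything here is hypothetical: the converters are stubs that only report what they received, the language tags are illustrative, and the fallback to a default language is an assumption rather than something the patent states.

```python
from typing import Callable, Dict

# A converter takes raw audio bytes and returns text.
# These are placeholders, not real recognition engines.
Converter = Callable[[bytes], str]


def make_stub_converter(language: str) -> Converter:
    def convert(audio: bytes) -> str:
        return f"[{language} transcript of {len(audio)} bytes]"
    return convert


CONVERTERS: Dict[str, Converter] = {
    "zh": make_stub_converter("zh"),
    "en": make_stub_converter("en"),
}


def route(audio: bytes, language_tag: str, default: str = "zh") -> str:
    """Pick the converter that matches the recognized language tag."""
    converter = CONVERTERS.get(language_tag, CONVERTERS[default])
    return converter(audio)
```

For example, `route(b"\x00" * 4, "en")` goes to the English stub, while an unknown tag falls back to the default converter.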

Further, the voice acquisition module processes the voice audio, performing noise reduction, restoration, and splitting on the audio data, so that the audio signal can be converted into a text signal with punctuation marks.

By adopting this scheme, punctuation marks can be added during text conversion.
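The patent does not say how noise reduction or splitting is carried out. As a rough illustration only, the sketch below uses a plain noise gate and fixed-length framing as stand-ins for those two steps; the function names, the threshold, and the frame length are assumptions, and the restoration step is not covered.

```python
from typing import List, Sequence


def noise_gate(samples: Sequence[int], threshold: int) -> List[int]:
    """Zero out samples whose magnitude is below the threshold."""
    return [s if abs(s) >= threshold else 0 for s in samples]


def split_frames(samples: Sequence[int], frame_len: int) -> List[List[int]]:
    """Cut the sample stream into consecutive frames of at most frame_len samples."""
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    return [list(samples[i:i + frame_len]) for i in range(0, len(samples), frame_len)]


raw = [3, -2, 900, -1200, 5, 700, -4, 1]
frames = split_frames(noise_gate(raw, threshold=100), frame_len=3)
# frames == [[0, 0, 900], [-1200, 0, 700], [0, 0]]
```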

Further, the subtitle processing module obtains the converted text signal from the voice conversion module and uses the context to detect and correct wrongly written characters, sentence breaks, and punctuation marks in the text signal.

By adopting this scheme, the accuracy of text conversion is improved and erroneous sentences and words are reduced.
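Context-based correction is not detailed in the patent. The sketch below substitutes a much simpler rule-based clean-up, a lookup table for known misspellings plus punctuation normalization, to show where such a step would sit. The table `KNOWN_FIXES` and the function `tidy_subtitle` are hypothetical, and the example uses English text purely for readability.

```python
import re

# Hypothetical correction table; a real system would derive corrections from context.
KNOWN_FIXES = {"teh": "the", "recieve": "receive"}


def tidy_subtitle(text: str) -> str:
    # Replace known misspellings word by word.
    words = [KNOWN_FIXES.get(w.lower(), w) for w in text.split()]
    text = " ".join(words)
    # Collapse repeated punctuation such as ",," or "..".
    text = re.sub(r"([,.!?])\1+", r"\1", text)
    # Remove spaces that precede punctuation.
    text = re.sub(r"\s+([,.!?])", r"\1", text)
    # Ensure the line ends with terminal punctuation.
    if text and text[-1] not in ".!?":
        text += "."
    return text
```

Calling `tidy_subtitle("teh show starts now ,, please recieve our thanks")` returns `"the show starts now, please receive our thanks."`.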

Further, the voice recognition module can also recognize blank voice and mark the blank voice signal; after the voice conversion module receives the mark signal, it stops transmitting signals to the subtitle processing module.

By adopting this scheme, blank audio is identified and the power consumed by the equipment running idle is avoided.
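One common way to flag blank audio is to compare the energy of each frame against a threshold. The patent does not name a detection technique, so the root-mean-square test below is an assumed stand-in, and the threshold value is arbitrary.

```python
import math
from typing import Sequence


def rms(frame: Sequence[int]) -> float:
    """Root-mean-square amplitude of one frame of PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def is_blank(frame: Sequence[int], threshold: float = 50.0) -> bool:
    """Mark a frame as blank when its energy stays below the threshold."""
    return rms(frame) < threshold


frames = [[0, 3, -2, 1], [800, -950, 1020, -700], [0, 0, 0, 0]]
marks = [is_blank(f) for f in frames]
# marks == [True, False, True]
```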

Further, the voice recognition module can also recognize the speech rate: after two segments of speech are collected, it records the corresponding data on sentence breaks and pauses, so that pauses in speech are distinguished from blank voice, the module adjusts adaptively, and its recognition efficiency is improved.

By adopting this scheme, the speech rate, sentence breaks, and pause intervals of each speaker are recognized to form an acquisition profile for a single live broadcast, so that recognition is adjusted for different speakers and sentence breaks are quickly distinguished from blank audio, effectively reducing system power consumption.
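The adaptive adjustment could work along the following lines: pauses measured in a speaker's first segments set a per-speaker threshold, and any later silence longer than that threshold is treated as blank audio rather than a sentence break. The margin factor, the floor value, and both function names are assumptions; the patent only says that sentence-break and pause data are recorded.

```python
from statistics import mean
from typing import List


def pause_threshold(pauses_s: List[float], margin: float = 2.0, floor_s: float = 0.5) -> float:
    """Silence longer than the returned value is treated as blank audio."""
    if not pauses_s:
        return floor_s
    return max(floor_s, margin * mean(pauses_s))


def classify_silence(duration_s: float, threshold_s: float) -> str:
    return "blank" if duration_s > threshold_s else "sentence_break"


# Pauses (in seconds) measured in the first two speech segments of a speaker.
observed = [0.5, 0.5, 0.25, 0.75]
threshold = pause_threshold(observed)  # 2.0 * 0.5 = 1.0 second
```

With this profile, a 0.7 second silence counts as a sentence break and a 3 second silence counts as blank audio; a slower speaker with longer pauses would get a proportionally higher threshold.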

Further, the voice conversion module comprises a big data cloud analysis database.

By adopting this scheme, conversion accuracy is high and erroneous sentences are identified quickly.

Further, the system adopts 5G transmission, which improves conversion efficiency, cost, and speed.

By adopting this scheme, efficient 5G transmission, accurate voice recognition, and the subtitle display module together provide strong support for the live broadcast industry in terms of efficiency, cost, and accuracy.

A method for realizing real-time audio-to-text conversion in network live broadcast comprises the following steps:

S1: the live broadcast equipment is assembled, and before use the live broadcast parameters, such as code stream (bitrate), definition (resolution), and language type, are configured in the system configuration module;

S2: the live broadcast equipment is connected to an audio and video cable; during the live broadcast the audio signal is transmitted through the line to the voice acquisition module of the system, the data of the voice acquisition module is submitted to the voice recognition module in real time, the voice recognition module recognizes the language type from the collected data, and the recognized voice is transmitted to the voice conversion module;

S3: the voice conversion module computes, recognizes, and converts the voice into text at high speed, and the resulting text is passed to the subtitle processing module;

S4: the subtitle processing module displays the processed subtitles on the live video screen, completing the whole operation.

By adopting this scheme, a new method of displaying subtitles in real time during a live video broadcast is provided, overcoming the defects of the traditional subtitle display method, which consumes money and manpower and cannot adapt in real time to changes on stage.
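The flow from acquisition through conversion to display can be sketched as a chain of small functions. This is an illustration of the data flow only: `convert` merely decodes bytes so that the example runs, standing in for real speech-to-text, and every function name is hypothetical.

```python
from typing import Callable, Iterable, Iterator, List


def acquire(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Acquisition: pass along non-empty audio chunks from the live feed."""
    for chunk in chunks:
        if chunk:
            yield chunk


def convert(chunk: bytes) -> str:
    """Conversion: placeholder for speech-to-text; decodes bytes so the sketch runs."""
    return chunk.decode("utf-8")


def process_subtitle(text: str) -> str:
    """Subtitle processing: minimal clean-up before display."""
    return text.strip()


def run_pipeline(chunks: Iterable[bytes], display: Callable[[str], None]) -> None:
    for chunk in acquire(chunks):
        display(process_subtitle(convert(chunk)))


shown: List[str] = []
run_pipeline([b" hello ", b"", b"welcome to the show"], shown.append)
# shown == ["hello", "welcome to the show"]
```

Passing the display step in as a callback keeps the pipeline independent of how the subtitles are finally rendered on the video screen.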

Drawings

The invention is further illustrated with reference to the following figures and examples.

Fig. 1 is a schematic block diagram of embodiment 1 of the present invention.

Fig. 2 is a schematic block diagram according to embodiment 2 of the present invention.

Reference numerals: 1. system configuration module; 2. voice acquisition module; 3. voice conversion module; 4. subtitle processing module; 5. voice recognition module.

Detailed Description

While the embodiments of the present invention will be described and illustrated in detail with reference to the accompanying drawings, it is to be understood that the invention is not limited to the specific embodiments disclosed, but is intended to cover various modifications, equivalents, and alternatives falling within the scope of the invention as defined by the appended claims.

To facilitate understanding of the embodiments of the present invention, specific embodiments are further explained below with reference to the drawings; these embodiments are not to be construed as limiting the embodiments of the present invention.

Embodiment 1 of the present invention is shown in fig. 1 and includes four modules: a system configuration module 1, a voice acquisition module 2, a voice conversion module 3, and a subtitle processing module 4. The technical scheme applies audio recognition technology and big data cloud analysis and, combined with current 5G technology, greatly improves transmission efficiency. It uses a voice recognition platform, which ensures accuracy during voice recognition. As a national-level high-tech enterprise, the company has strong research and development capability and has developed its own live broadcast system. Efficient 5G transmission, accurate voice recognition, and the subtitle display module together provide strong support for the live broadcast industry in terms of efficiency, cost, and accuracy.

The audio-to-text method of the present embodiment 1 is as follows:

S1: the live broadcast equipment is assembled, and before use the live broadcast parameters, such as code stream (bitrate), definition (resolution), and language type, are configured in the system configuration module 1;

S2: the live broadcast equipment is connected to an audio and video cable; during the live broadcast the audio signal is transmitted through the line to the voice acquisition module 2 of the system, and the voice acquisition module 2 transmits the recognized voice to the voice conversion module 3;

S3: the voice conversion module 3 computes, recognizes, and converts the voice into text at high speed, and the resulting text is passed to the subtitle processing module 4;

S4: the subtitle processing module 4 displays the processed subtitles on the live video screen, completing the whole operation.
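For the display step, one practical concern is fitting converted text onto the screen. The sketch below wraps text to a maximum line width and keeps only the most recent lines. The width and line limits and the function name are assumptions, and the patent does not describe how text is laid out on the video screen.

```python
import textwrap
from typing import List


def to_subtitle_lines(text: str, max_chars: int = 20, max_lines: int = 2) -> List[str]:
    """Wrap text to the screen width and keep only the most recent lines."""
    lines = textwrap.wrap(text, width=max_chars)
    return lines[-max_lines:]


lines = to_subtitle_lines("welcome to tonight's live opera performance and recital")
```

Keeping only the last lines means older text scrolls off as new speech arrives, which suits a live feed.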

Further, the voice acquisition module 2 processes the voice audio, performing noise reduction, restoration, and splitting on the audio data, so that the audio signal can be converted into a text signal with punctuation marks. The subtitle processing module 4 obtains the converted text signal from the voice conversion module 3 and uses the context to detect and correct wrongly written characters, sentence breaks, and punctuation marks in the text signal.

Embodiment 2 of the present invention is shown in fig. 2 and includes five modules: a system configuration module 1, a voice acquisition module 2, a voice conversion module 3, a subtitle processing module 4, and a voice recognition module 5. The system configuration module 1 sets the parameters of the live broadcast equipment. The live broadcast equipment is electrically connected to the voice acquisition module 2; the voice acquisition module 2 is electrically connected to the voice conversion module 3 and to the voice recognition module 5; the voice conversion module 3 and the voice recognition module 5 are each electrically connected to the subtitle processing module 4; and the subtitle processing module 4 is electrically connected to the live video screen. The modules are electrically connected to one another by wires, and audio is transmitted over 5G.

The voice recognition module 5 can also recognize blank voice and mark the blank voice signal; after the voice conversion module 3 receives the mark signal, it stops transmitting signals to the subtitle processing module 4. The voice recognition module 5 can also recognize the speech rate: after two segments of speech are collected, it records the corresponding data on sentence breaks and pauses, so that pauses in speech are distinguished from blank voice, the module adjusts adaptively, and its recognition efficiency is improved.
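The effect of the mark signal can be illustrated with a small generator: frames flagged as blank produce no output toward the subtitle stage. The `Frame` type, its fields, and the function name are hypothetical, and the text field stands in for an audio payload that would really be converted.

```python
from typing import Iterable, Iterator, NamedTuple


class Frame(NamedTuple):
    text: str    # placeholder for the audio payload
    blank: bool  # mark signal set by the voice recognition module


def forward_to_subtitles(frames: Iterable[Frame]) -> Iterator[str]:
    """Yield output only for frames that are not marked blank."""
    for frame in frames:
        if frame.blank:
            continue  # nothing is sent while the mark signal is present
        yield frame.text


stream = [Frame("good evening", False), Frame("", True), Frame("let us begin", False)]
sent = list(forward_to_subtitles(stream))
# sent == ["good evening", "let us begin"]
```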

The audio-to-text method of the present embodiment 2 is as follows:

This embodiment overcomes the defects of the traditional subtitle display method, which consumes money and manpower and cannot adapt in real time to changes on stage.

Finally, it should be noted that the above embodiments are only specific embodiments of the present invention, used to illustrate its technical solutions and not to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that, within the technical scope of the present disclosure, anyone skilled in the art can still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or substitute equivalents for some of their technical features. Such modifications, changes, or substitutions do not depart from the spirit and scope of the embodiments of the present invention and should be construed as being included therein. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A system for realizing real-time audio-to-text conversion in network live broadcast, characterized by comprising: a system configuration module, used for configuring live broadcast parameters such as code stream, definition, and language type;

a voice acquisition module, connected to the live broadcast equipment through a line and collecting the audio signals in the equipment;

a voice conversion module, which receives the voice data of the voice acquisition module in real time and computes, recognizes, and converts it into text at high speed;

and a subtitle processing module, which acquires the text converted by the voice conversion module in real time and edits it onto the video screen for display.

2. The system for realizing real-time audio-to-text conversion in network live broadcast according to claim 1, characterized in that: the system further comprises a voice recognition module, which acquires the data of the voice acquisition module, recognizes the language type, and sends the recognized language type to the voice conversion module to facilitate coding by the voice conversion module.

3. The system for realizing real-time audio-to-text conversion in network live broadcast according to claim 2, characterized in that: the voice acquisition module processes the voice audio, performing noise reduction, restoration, and splitting on the audio data, so that the audio signal can be converted into a text signal with punctuation marks.

4. The system for realizing real-time audio-to-text conversion in network live broadcast according to claim 3, characterized in that: the subtitle processing module obtains the converted text signal from the voice conversion module and uses the context to detect and correct wrongly written characters, sentence breaks, and punctuation marks in the text signal.

5. The system for realizing real-time audio-to-text conversion in network live broadcast according to claim 4, characterized in that: the voice recognition module can also recognize blank voice and mark the blank voice signal, and after the voice conversion module receives the mark signal, it stops transmitting signals to the subtitle processing module.

6. The system for realizing real-time audio-to-text conversion in network live broadcast according to claim 5, characterized in that: the voice recognition module can also recognize the speech rate, and after two segments of speech are collected, it records the corresponding data on sentence breaks and pauses, so that pauses in speech are distinguished from blank voice, the module adjusts adaptively, and its recognition efficiency is improved.

7. The system for realizing real-time audio-to-text conversion in network live broadcast according to claim 6, characterized in that: the voice conversion module comprises a big data cloud analysis database.

8. The system for realizing real-time audio-to-text conversion in network live broadcast according to claim 7, characterized in that: the system adopts 5G transmission, which improves conversion efficiency, cost, and speed.

9. A method for realizing real-time audio-to-text conversion in network live broadcast, characterized in that the method comprises the following steps: