CN115002502B - Data processing method and server - Google Patents


Info

Publication number: CN115002502B (application CN202210908987.1A)
Authority: CN (China)
Prior art keywords: audio data, anchor, carrying, subtitle text, server
Legal status: Active (assumed status; Google has not performed a legal analysis)
Application number: CN202210908987.1A
Other languages: Chinese (zh)
Other versions: CN115002502A
Inventors: 康凯 (Kang Kai), 朱基锋 (Zhu Jifeng), 周辉 (Zhou Hui)
Current assignee: Guangzhou Qianjun Network Technology Co., Ltd. (listed assignees may be inaccurate)
Original assignee: Guangzhou Qianjun Network Technology Co., Ltd.
Priority date: (assumption, not a legal conclusion)
Application CN202210908987.1A filed by Guangzhou Qianjun Network Technology Co., Ltd.
Publication of CN115002502A; application granted; publication of CN115002502B
Legal status: Active

Classifications

    • H04N 21/2187: Live feed (under H04N 21/00 selective content distribution, e.g. interactive television or video on demand [VOD]; H04N 21/20 servers for content distribution; H04N 21/218 source of audio or video content)
    • G10L 15/26: Speech-to-text systems (under G10L 15/00 speech recognition)
    • H04N 21/233: Processing of audio elementary streams (under H04N 21/23 processing of content or additional data)
    • H04N 21/235: Processing of additional data, e.g. scrambling of additional data or processing content descriptors
    • H04N 21/4884: Data services for displaying subtitles (under H04N 21/40 client devices; H04N 21/47 end-user applications; H04N 21/488 data services, e.g. news ticker)

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention discloses a data processing method and a server. The server can respectively obtain the audio data, each piece carrying an anchor identifier, that a plurality of anchor terminals send during an audio mic-linking interaction; respectively preprocess each piece of audio data in a predefined preprocessing manner to obtain processed audio data that still carries the anchor identifier; respectively perform speech recognition on each piece of processed audio data to generate subtitle texts carrying the anchor identifier; and distribute each subtitle text to the user side, so that the user side displays each subtitle text in a differentiated manner according to the anchor identifier. The server can thereby effectively ensure subtitle accuracy.

Description

Data processing method and server
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data processing method and a server.
Background
With the development of scientific technology, data processing technology is continuously improved.
In a network live-streaming scenario, the prior art can generate subtitles in real time from the audio content of a live broadcast and add them to the live picture, meeting the subtitle needs of specific groups of viewers.
However, when multiple anchors broadcast audio and video together, the subtitle-generation accuracy of the prior art is low.
Disclosure of Invention
In view of the above problems, the present invention provides a data processing method and a server to overcome, or at least partially solve, the above problems. The technical solution is as follows:
A data processing method, applied to a server, comprising the following steps:
respectively obtaining audio data carrying an anchor identifier, sent by a plurality of anchor terminals during an audio mic-linking interaction;
respectively preprocessing each piece of audio data in a predefined preprocessing manner to obtain processed audio data carrying the anchor identifier;
respectively performing speech recognition on each piece of processed audio data to generate subtitle texts carrying the anchor identifier;
and distributing each subtitle text to a user side, so that the user side displays each subtitle text in a differentiated manner according to the anchor identifier.
Optionally, preprocessing each piece of audio data in a predefined preprocessing manner includes:
transcoding each piece of audio data in a predefined transcoding manner;
and obtaining the corresponding processed audio data carrying the anchor identifier includes:
acquiring each piece of processed audio data in a target format, wherein the target format is the format required for speech recognition.
Optionally, respectively obtaining audio data carrying an anchor identifier sent by a plurality of anchor terminals during an audio mic-linking interaction includes:
acquiring the audio data from a content delivery network in which the audio data is stored.
Optionally, each piece of audio data is live audio data.
A server, comprising: a first obtaining unit, a preprocessing unit, a second obtaining unit, a speech recognition unit, a first generating unit, and a distribution unit; wherein:
the first obtaining unit is configured to respectively obtain audio data carrying an anchor identifier, sent by a plurality of anchor terminals during an audio mic-linking interaction;
the preprocessing unit is configured to respectively preprocess each piece of audio data in a predefined preprocessing manner;
the second obtaining unit is configured to obtain each piece of processed audio data carrying the anchor identifier;
the speech recognition unit is configured to respectively perform speech recognition on each piece of processed audio data;
the first generating unit is configured to generate each subtitle text carrying the anchor identifier;
the distribution unit is configured to distribute each subtitle text to a user side, so that the user side displays each subtitle text in a differentiated manner according to the anchor identifier.
Optionally, the preprocessing unit is configured to transcode each piece of audio data in a predefined transcoding manner;
the second obtaining unit is configured to obtain each piece of processed audio data whose data format is the target format, where the target format is the format required for speech recognition.
Optionally, the first obtaining unit is configured to acquire each piece of audio data from a content delivery network in which it is stored.
Optionally, each piece of audio data is live audio data.
The data processing method and the server provided by this embodiment can respectively obtain the audio data carrying an anchor identifier that a plurality of anchor terminals send during an audio mic-linking interaction; respectively preprocess each piece of audio data in a predefined preprocessing manner to obtain processed audio data carrying the anchor identifier; respectively perform speech recognition on each piece of processed audio data to generate subtitle texts carrying the anchor identifier; and distribute each subtitle text to the user side, so that the user side displays each subtitle text in a differentiated manner according to the anchor identifier. Because the server can determine the anchor to which each piece of audio data belongs directly from the anchor identifier it carries, no sound-source identification is needed, which effectively avoids sound-source identification errors.
The foregoing is only an overview of the technical solution of the present invention. To make the technical means of the present invention clearer, and to make the above and other objects, features, and advantages of the present invention more comprehensible, a detailed description of specific embodiments of the invention follows.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only embodiments of the present application; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart illustrating a first data processing method according to an embodiment of the present invention;
fig. 2 is a schematic diagram illustrating differentiated display of the subtitle texts of different anchors on the interface of a user side according to an embodiment of the present invention;
FIG. 3 is a flow chart of a second data processing method according to an embodiment of the present invention;
FIG. 4 is a flow chart of a third data processing method according to the embodiment of the present invention;
fig. 5 is a schematic structural diagram of a server according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
As shown in fig. 1, the present embodiment proposes a first data processing method, which is applicable to a server, and the method may include the following steps:
s101, respectively obtaining audio data which are sent by a plurality of anchor terminals in an audio microphone connecting interaction process and carry anchor identification;
the anchor terminal may be an electronic device used by the anchor in the audio microphone interaction process with other anchors, such as a mobile phone, a tablet computer, a desktop computer, and the like.
Optionally, the audio microphone interaction may be a live scene in which a plurality of anchor broadcasters only perform audio microphone connection without sharing a live picture;
optionally, the audio-microphone interaction may also be an audio-video-microphone live broadcast scene performed by multiple anchor broadcasts.
It should be noted that the present invention is not limited to specific scene types of audio-microphone interaction.
The anchor identification may be identification information for identifying an anchor, such as a name or a live room number. The anchor identification may uniquely identify an anchor.
Specifically, the anchor mark may be formed of at least one of a number, a chinese character, a letter, and a symbol. The present invention is not limited to the specific configuration of the anchor identifier.
The audio data may be voice data output by the anchor in the audio microphone interaction process.
During the mic-linking interaction, each anchor terminal may capture the audio data of its own anchor, add that anchor's identifier to the captured audio data, and then send the audio data carrying the anchor identifier to the server. For example, during an audio-and-video mic-linking session between a first anchor terminal and a second anchor terminal, the first anchor terminal may capture the audio data of the first anchor, add the first anchor's identifier to it, and send the tagged audio data to the server; likewise, the second anchor terminal may capture the audio data of the second anchor, add the second anchor's identifier to it, and send the tagged audio data to the server.
Specifically, each anchor terminal may send the audio data captured in the corresponding period to the server at predefined time intervals, to improve the synchronization of data processing. The predefined interval may be set by technicians according to actual needs; the present invention does not limit it.
Specifically, the server may instead actively fetch the captured audio data from each anchor terminal at predefined time intervals.
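The tagging described above can be sketched as follows. This is an illustrative Python sketch, not the patent's implementation; the names `AudioChunk` and `tag_with_anchor_id`, and the example identifiers, are assumptions.

```python
from dataclasses import dataclass

@dataclass
class AudioChunk:
    anchor_id: str   # anchor identifier, e.g. an anchor name or live-room number
    payload: bytes   # raw captured audio

def tag_with_anchor_id(raw_audio: bytes, anchor_id: str) -> AudioChunk:
    """Attach the sending anchor's identifier to a captured audio chunk."""
    return AudioChunk(anchor_id=anchor_id, payload=raw_audio)

# Two anchor terminals each tag their own audio before sending it to the server.
chunk_a = tag_with_anchor_id(b"\x00\x01", "anchor_A")
chunk_b = tag_with_anchor_id(b"\x02\x03", "room_1024")
print(chunk_a.anchor_id)  # anchor_A
print(chunk_b.anchor_id)  # room_1024
```

Because the identifier travels with every chunk, the server never needs to infer who is speaking from the audio itself.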
Optionally, each piece of audio data is live audio data. It can be understood that when the predefined interval is short enough, the server obtains the audio data sent by each anchor terminal essentially in real time; that audio data is live audio data, and the server can perform real-time speech recognition on it, so that subtitles are displayed in real time during a multi-anchor mic-linking live broadcast.
Specifically, when the mic-linking interaction of the anchor terminals is recorded rather than broadcast live, the server need not obtain the audio data in real time. In that case, each anchor terminal can send the audio data captured during the mic-linking interaction to a target storage space, and the server can fetch each anchor terminal's audio data from that storage space when needed.
S102, respectively preprocessing each audio data according to a predefined preprocessing mode;
the preprocessing mode may be a data processing mode for preprocessing the audio data to obtain processed audio data meeting the corresponding format requirement, and better performing subsequent processing.
It should be noted that the present invention is not limited to specific processing types of the preprocessing manner, such as data cleansing, data enhancement, transcoding, format conversion, or restoring the compressed audio data.
Specifically, the server may respectively pre-process the audio data carrying the anchor identifier at each anchor terminal in a pre-processing manner, so as to obtain corresponding processed audio data carrying the anchor identifier. It can be understood that, after the server preprocesses an audio data carrying a certain anchor identifier, the generated processed audio data may also carry the anchor identifier.
Specifically, the server may perform preprocessing on the audio data carrying the anchor identifier of one anchor end by using a preprocessing mode each time the audio data carrying the anchor identifier of the anchor end is obtained; for example, the server may perform preprocessing on the audio data carrying the anchor identifier of the first anchor by using a preprocessing method when obtaining the audio data carrying the anchor identifier of the first anchor, and may perform preprocessing on the audio data carrying the anchor identifier of the second anchor by using a preprocessing method when obtaining the audio data carrying the anchor identifier of the second anchor.
Specifically, the server may also perform preprocessing on the audio data carrying the anchor identifier of each anchor after obtaining the audio data carrying the anchor identifier of each anchor. For example, after obtaining the audio data carrying the anchor identifier of the first anchor and the second anchor, the server may first pre-process the audio data carrying the anchor identifier of the first anchor, and then pre-process the audio data carrying the anchor identifier of the second anchor.
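The identifier-preserving preprocessing of S102 can be sketched like this. The concrete preprocessing step (normalizing a format tag to `"pcm_16k_mono"`) is a stand-in assumption; the patent leaves the preprocessing mode open.

```python
def preprocess(item: dict) -> dict:
    """Preprocess one piece of audio data, preserving its anchor identifier."""
    processed = dict(item)                # copy so the original item is untouched
    processed["format"] = "pcm_16k_mono"  # assumed format required downstream
    processed["preprocessed"] = True
    return processed

audio_items = [
    {"anchor_id": "anchor_A", "format": "aac", "payload": b"..."},
    {"anchor_id": "anchor_B", "format": "opus", "payload": b"..."},
]
processed_items = [preprocess(item) for item in audio_items]
# Every processed item still carries the anchor identifier of its source item.
print([p["anchor_id"] for p in processed_items])  # ['anchor_A', 'anchor_B']
```

Whether the server processes each item on arrival or in a batch, the per-item function is the same; only the call schedule differs.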
S103, acquiring corresponding processed audio data carrying the anchor identification;
the processed audio data is the audio data obtained after preprocessing the audio data carrying the anchor identification at an anchor terminal.
Specifically, the server may obtain a corresponding one of the processed audio data after preprocessing each pair of audio data carrying the anchor identifier of one anchor terminal. For example, the server may obtain corresponding first processed audio data after preprocessing the audio data carrying the anchor identifier at the first anchor, and may obtain corresponding second processed audio data after preprocessing the audio data carrying the anchor identifier at the second anchor.
S104, respectively carrying out voice recognition on the processed audio data;
specifically, the server may perform speech recognition on each processed audio data to obtain each corresponding subtitle text.
It can be understood that after the server performs speech recognition on processed audio data carrying a certain anchor identifier, the generated subtitle text also carries that anchor identifier.
Specifically, the server may perform speech recognition on each piece of processed audio data as soon as it is obtained, producing the corresponding subtitle text carrying the anchor identifier; for example, upon obtaining the first processed audio data, the server may recognize it to obtain a first subtitle text carrying the first anchor identifier, and upon obtaining the second processed audio data, recognize it to obtain a second subtitle text.
Specifically, the server may instead perform speech recognition after obtaining all the processed audio data, producing the corresponding subtitle texts carrying anchor identifiers; for example, after obtaining the first and second processed audio data, the server may first recognize the first to obtain a first subtitle text carrying the first anchor identifier, and then recognize the second to obtain a second subtitle text carrying the second anchor identifier.
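The per-anchor recognition step can be sketched as below. The recognizer here is a stub that only reports the payload length; a real system would call an ASR engine at that point, and all names are illustrative assumptions.

```python
def recognize(processed: dict) -> dict:
    """Recognize one processed chunk; the result keeps the anchor identifier."""
    text = f"<transcript of {len(processed['payload'])} bytes>"  # stand-in for ASR
    return {"anchor_id": processed["anchor_id"], "subtitle": text}

processed_items = [
    {"anchor_id": "anchor_A", "payload": b"hello"},
    {"anchor_id": "anchor_B", "payload": b"hi"},
]
subtitles = [recognize(p) for p in processed_items]
print(subtitles[0]["anchor_id"])  # anchor_A
```

Each subtitle text is produced from exactly one anchor's audio, so no sound-source separation is ever required.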
S105, generating each subtitle text carrying the anchor identification;
and the caption text carrying the anchor identification is the caption text obtained after the voice recognition processing is carried out on the processed voice data carrying the anchor identification.
And S106, distributing each subtitle text to the user side so that the user side displays each subtitle text in a distinguishing mode according to the anchor identification.
Specifically, the server may generate each subtitle text carrying an anchor identifier and then transmit all of them to the user side for differentiated display. For example, after generating the subtitle texts carrying anchor identifiers A, B, and C, the server may send them together to the user side for differentiated display as shown in fig. 2. In fig. 2, the user side simultaneously displays the live pictures of anchor A, anchor B, and anchor C; after receiving the three subtitle texts, it displays the subtitle texts carrying different anchor identifiers on separate rows at the bottom of the interface. Here, "xxxxx" in fig. 2 stands for the subtitle text.
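The differentiated display on the user side can be sketched as one row per anchor, each row prefixed with its anchor identifier, as in fig. 2. This rendering is an illustrative assumption, not the patent's UI code.

```python
def render_rows(subtitles):
    """Render each subtitle on its own row, prefixed by its anchor identifier."""
    return [f"[{s['anchor_id']}] {s['subtitle']}" for s in subtitles]

subtitles = [
    {"anchor_id": "A", "subtitle": "xxxxx"},
    {"anchor_id": "B", "subtitle": "xxxxx"},
    {"anchor_id": "C", "subtitle": "xxxxx"},
]
for row in render_rows(subtitles):
    print(row)
# [A] xxxxx
# [B] xxxxx
# [C] xxxxx
```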
It should be noted that in the prior art, the audio data captured by multiple anchor terminals during a mic-linking interaction is mixed into one audio stream, which is then sent to a server for speech recognition. In the prior art, the audio data of each anchor terminal carries no anchor identifier, so the prior-art server must perform sound-source recognition on the mixed audio to distinguish the voices of different anchors. This can cause sound-source recognition errors, and when several anchors speak simultaneously and interfere with each other, the accuracy of speech recognition and of the transcribed subtitle texts drops. By executing steps S101 to S106 in fig. 1, the server of the present invention can respectively obtain the audio data carrying an anchor identifier sent by each anchor terminal and respectively perform speech recognition on it to generate subtitle texts carrying the anchor identifier. The server can determine the anchor to which each piece of audio data belongs directly from the anchor identifier it carries, without sound-source recognition, effectively avoiding sound-source recognition errors.
The data processing method provided by this embodiment can be applied to a server. The invention can respectively obtain the audio data carrying an anchor identifier sent by a plurality of anchor terminals during an audio mic-linking interaction; respectively preprocess each piece of audio data in a predefined preprocessing manner to obtain processed audio data carrying the anchor identifier; respectively perform speech recognition on each piece of processed audio data to generate subtitle texts carrying the anchor identifier; and distribute each subtitle text to the user side, so that the user side displays each subtitle text in a differentiated manner according to the anchor identifier. The server can determine the anchor to which each piece of audio data belongs directly from the anchor identifier, without sound-source recognition, effectively avoiding sound-source recognition errors.
Based on fig. 1, the present embodiment proposes a second data processing method as shown in fig. 3. In the method, step S102 may include step S201, and step S103 may include step S202, wherein:
s201, transcoding each audio data according to a predefined transcoding processing mode;
specifically, the transcoding processing mode may be a data processing mode for converting the format of the audio data into the target format.
The target format may be a data format which can meet the requirements of subsequent voice recognition and is set by technicians according to actual conditions, and the specific data format type of the target format is not limited by the invention.
S202, obtaining each processed audio data with a target format, wherein the target format is a format required by voice recognition.
Specifically, after transcoding the audio data sent by one anchor terminal, the server may obtain one piece of processed audio data whose data format is the target format.
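Steps S201 and S202 can be sketched as a format-tag conversion. The target format `"wav_16k_mono"` is an assumed example since the patent fixes no concrete format, and mutating a tag is a stand-in for real transcoding (which would use an audio codec such as FFmpeg).

```python
TARGET_FORMAT = "wav_16k_mono"  # assumed target; the patent fixes no concrete format

def transcode(item: dict, target: str = TARGET_FORMAT) -> dict:
    """Convert one audio item's format tag to the target format (stand-in transcode)."""
    out = dict(item)
    out["format"] = target
    return out

item = {"anchor_id": "anchor_A", "format": "aac", "payload": b"..."}
out = transcode(item)
print(out["format"], out["anchor_id"])  # wav_16k_mono anchor_A
```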
It should be noted that, by executing steps S201 and S202, the present invention can effectively ensure that the processed audio data can satisfy the format required by the voice recognition, and ensure the normal processing and processing efficiency of the voice recognition, thereby effectively ensuring the subtitle generation efficiency and accuracy.
The data processing method provided by the embodiment can effectively ensure that the processed audio data can meet the format required by voice recognition, and ensure the normal processing and processing efficiency of the voice recognition, thereby effectively ensuring the generation efficiency and accuracy of the subtitles.
Based on fig. 1, the present embodiment proposes a third data processing method as shown in fig. 4. In the method, step S101 may include step S301, wherein:
s301, obtaining each audio data from a Content Delivery Network (CDN) storing each audio data.
Specifically, each anchor terminal can send its captured audio data to the content delivery network during the mic-linking interaction, and the server can then pull each anchor's audio data from the content delivery network as needed. This avoids needless occupation of the server's storage space by audio data and improves the utilization of the server's data-storage resources.
Specifically, once each anchor terminal has sent its audio data to the content delivery network, the server can obtain each anchor terminal's audio data from the network in real time and process it accordingly, providing a real-time subtitle display service to users during the anchors' mic-linking interaction, thereby ensuring service quality and enhancing user experience and stickiness.
Optionally, each anchor terminal may send its captured audio and video data to a mic-linking CDN; the mic-linking CDN may forward the audio and video data to a live-broadcast CDN, and the live-broadcast CDN may send it to the user side for live video playback. The server can pull each anchor terminal's audio and video data from the mic-linking CDN, thereby obtaining each anchor terminal's audio data.
The data processing method provided by this embodiment can avoid needless occupation of storage space by audio data, improve the utilization of the server's data-storage resources, and provide a real-time subtitle display service to users during the anchors' mic-linking interaction, ensuring service quality and enhancing user experience and stickiness.
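The pull-from-CDN flow can be sketched with a toy in-memory stand-in for the mic-linking CDN, keyed by anchor identifier. The class and method names are illustrative assumptions, not a real CDN API.

```python
class MicLinkCDN:
    """Toy stand-in for the mic-linking CDN: buffers each anchor's pushed chunks."""

    def __init__(self):
        self._store = {}

    def push(self, anchor_id: str, chunk: bytes) -> None:
        """An anchor terminal uploads one captured chunk under its identifier."""
        self._store.setdefault(anchor_id, []).append(chunk)

    def pull(self, anchor_id: str) -> list:
        """The server pulls (and drains) the buffered chunks for one anchor."""
        chunks = self._store.get(anchor_id, [])
        self._store[anchor_id] = []
        return chunks

cdn = MicLinkCDN()
cdn.push("anchor_A", b"\x01")
cdn.push("anchor_B", b"\x02")
print(cdn.pull("anchor_A"))  # [b'\x01']
print(cdn.pull("anchor_A"))  # []  (already drained)
```

Because the CDN buffers the data, the server holds no audio until it actually pulls, which matches the storage-saving point above.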
Corresponding to the method shown in fig. 1, as shown in fig. 5, the present embodiment proposes a server. The server may include: a first obtaining unit 101, a preprocessing unit 102, a second obtaining unit 103, a voice recognition unit 104, a first generating unit 105, and a distribution unit 106; wherein:
a first obtaining unit 101, configured to respectively obtain audio data carrying an anchor identifier, sent by a plurality of anchor terminals during an audio mic-linking interaction;
the preprocessing unit 102 is configured to respectively preprocess each audio data according to a predefined preprocessing manner;
a second obtaining unit 103, configured to obtain each piece of processed audio data carrying the anchor identifier;
a voice recognition unit 104, configured to perform voice recognition on each processed audio data;
a first generating unit 105, configured to generate each subtitle text carrying an anchor identifier;
a distributing unit 106, configured to distribute each subtitle text to the user side, so that the user side performs differentiated display on each subtitle text according to the anchor identifier.
The specific processing procedures of the first obtaining unit 101, the preprocessing unit 102, the second obtaining unit 103, the speech recognition unit 104, the first generating unit 105, and the distribution unit 106 and the technical effects thereof may refer to the related descriptions of steps S101 to S106 in fig. 1 in this embodiment, and are not described herein again.
Optionally, the preprocessing unit 102 is configured to perform transcoding processing on each audio data according to a predefined transcoding processing manner;
a second obtaining unit 103, configured to obtain each processed audio data with a target format, where the target format is a format required for performing voice recognition.
Optionally, the first obtaining unit 101 is configured to obtain each piece of audio data from a content distribution network in which each piece of audio data is stored.
Optionally, each audio data is live audio data.
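The six units above can be tied together in one toy pipeline sketch. This is an illustrative assumption about how the units might compose, with stub preprocessing and recognition, not the patented implementation.

```python
class SubtitleServer:
    """Toy pipeline mirroring the six units: obtain, preprocess, obtain processed
    data, recognize, generate subtitle texts, distribute."""

    def __init__(self):
        self.distributed = []  # subtitle texts already sent to the user side

    def _preprocess(self, item: dict) -> dict:
        out = dict(item)
        out["preprocessed"] = True  # stub for units 102-103
        return out

    def _recognize(self, item: dict) -> dict:
        # Stub for units 104-105: ASR result keeps the anchor identifier.
        return {"anchor_id": item["anchor_id"],
                "subtitle": f"<text from {len(item['payload'])} bytes>"}

    def handle(self, audio_items: list) -> list:
        processed = [self._preprocess(i) for i in audio_items]
        subtitles = [self._recognize(p) for p in processed]
        self.distributed.extend(subtitles)  # stub for distribution unit 106
        return subtitles

server = SubtitleServer()
subs = server.handle([{"anchor_id": "A", "payload": b"abc"},
                      {"anchor_id": "B", "payload": b"de"}])
print([s["anchor_id"] for s in subs])  # ['A', 'B']
```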
The server provided by this embodiment can respectively obtain the audio data carrying an anchor identifier sent by a plurality of anchor terminals during an audio mic-linking interaction; respectively preprocess each piece of audio data in a predefined preprocessing manner to obtain processed audio data carrying the anchor identifier; respectively perform speech recognition on each piece of processed audio data to generate subtitle texts carrying the anchor identifier; and distribute each subtitle text to the user side, so that the user side displays each subtitle text in a differentiated manner according to the anchor identifier. The server can determine the anchor to which each piece of audio data belongs directly from the anchor identifier, without sound-source recognition, effectively avoiding sound-source recognition errors.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
The above are merely embodiments of the present application and are not intended to limit it. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present application shall fall within the scope of the claims of the present application.

Claims (6)

1. A data processing method, applied to a server, the data processing method comprising:
obtaining, from a microphone-connection content delivery network (CDN), audio data carrying an anchor identifier sent by a plurality of anchor ends during an audio microphone-connection interaction, wherein each piece of audio data is sent by each anchor end to the microphone-connection CDN at a uniform predefined interval; each anchor end further sends audio and video data to a live-streaming CDN, so that the live-streaming CDN delivers the audio and video data to each user end; and the anchor identifier comprises an anchor name or a live-room number;
preprocessing each piece of audio data according to a predefined preprocessing manner to obtain processed audio data carrying the anchor identifier;
performing voice recognition on each piece of processed audio data to generate subtitle texts carrying the anchor identifier; and
distributing each subtitle text to a user end so that the user end displays the subtitle texts distinguishably according to the anchor identifier, wherein, when each subtitle text is displayed, the anchor identifier is displayed together with the corresponding subtitle text, and subtitle texts with different anchor identifiers are displayed on different lines.
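The display rule at the end of claim 1 (the anchor identifier shown together with its subtitle text, and different anchors on different lines) can be sketched as a simple client-side formatter; the colon-separated layout below is an illustrative assumption.

```python
# Sketch: render subtitles so each anchor identifier appears together with
# its own subtitle text, and different anchors occupy different lines.

def render_subtitles(subtitles: list) -> str:
    lines = []
    for sub in subtitles:
        # anchor identifier (anchor name or live-room number) + its subtitle
        lines.append(f'{sub["anchor_id"]}: {sub["text"]}')
    return "\n".join(lines)

out = render_subtitles([
    {"anchor_id": "AnchorA", "text": "hello everyone"},
    {"anchor_id": "room-1002", "text": "welcome to the stream"},
])
print(out)
```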
2. The data processing method according to claim 1, wherein preprocessing each piece of audio data according to a predefined preprocessing manner comprises:
transcoding each piece of audio data according to a predefined transcoding manner;
and obtaining the corresponding processed audio data carrying the anchor identifier comprises:
obtaining each piece of processed audio data in a target format, wherein the target format is the format required for voice recognition.
3. The data processing method according to claim 1, wherein each piece of audio data is live audio data.
4. A server, comprising: a first obtaining unit, a preprocessing unit, a second obtaining unit, a voice recognition unit, a first generating unit, and a distribution unit; wherein:
the first obtaining unit is configured to obtain, from a microphone-connection CDN, audio data carrying an anchor identifier sent by a plurality of anchor ends during an audio microphone-connection interaction, wherein each piece of audio data is sent by each anchor end to the microphone-connection CDN at a uniform predefined interval; each anchor end further sends audio and video data to a live-streaming CDN, so that the live-streaming CDN delivers the audio and video data to each user end; and the anchor identifier comprises an anchor name or a live-room number;
the preprocessing unit is configured to preprocess each piece of audio data according to a predefined preprocessing manner;
the second obtaining unit is configured to obtain each piece of processed audio data carrying the anchor identifier;
the voice recognition unit is configured to perform voice recognition on each piece of processed audio data;
the first generating unit is configured to generate each subtitle text carrying the anchor identifier; and
the distribution unit is configured to distribute each subtitle text to a user end so that the user end displays the subtitle texts distinguishably according to the anchor identifier, wherein, when each subtitle text is displayed, the anchor identifier is displayed together with the corresponding subtitle text, and subtitle texts with different anchor identifiers are displayed on different lines.
5. The server according to claim 4, wherein the preprocessing unit is configured to transcode each piece of audio data according to a predefined transcoding manner;
and the second obtaining unit is configured to obtain each piece of processed audio data in a target format, wherein the target format is the format required for voice recognition.
6. The server according to claim 4, wherein each piece of audio data is live audio data.
CN202210908987.1A 2022-07-29 2022-07-29 Data processing method and server Active CN115002502B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210908987.1A CN115002502B (en) 2022-07-29 2022-07-29 Data processing method and server


Publications (2)

Publication Number Publication Date
CN115002502A CN115002502A (en) 2022-09-02
CN115002502B true CN115002502B (en) 2023-01-03

Family

ID=83022153

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210908987.1A Active CN115002502B (en) 2022-07-29 2022-07-29 Data processing method and server

Country Status (1)

Country Link
CN (1) CN115002502B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106954100A (en) * 2017-03-13 2017-07-14 网宿科技股份有限公司 Live broadcasting method and system, company's wheat management server
CN107819833A (en) * 2017-10-20 2018-03-20 贵州白山云科技有限公司 A kind of method and device for accessing live even wheat
CN109712612A (en) * 2018-12-28 2019-05-03 广东亿迅科技有限公司 A kind of voice keyword detection method and device
CN114242058A (en) * 2021-12-22 2022-03-25 广州繁星互娱信息科技有限公司 Voice subtitle generating method, system, device, storage medium and electronic device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110324723B (en) * 2018-03-29 2022-03-08 华为技术有限公司 Subtitle generating method and terminal
CN114079797A (en) * 2020-08-14 2022-02-22 阿里巴巴集团控股有限公司 Live subtitle generation method and device, server, live client and live system
CN113115103A (en) * 2021-03-09 2021-07-13 杭州麦趣网络科技有限公司 System and method for realizing real-time audio-to-text conversion in network live broadcast
CN114554238B (en) * 2022-02-23 2023-08-11 北京有竹居网络技术有限公司 Live broadcast voice simultaneous transmission method, device, medium and electronic equipment



Similar Documents

Publication Publication Date Title
CN112399133B (en) Conference sharing method and device
US8386255B2 (en) Providing descriptions of visually presented information to video teleconference participants who are not video-enabled
CN100452874C (en) Method for broadcastin stream media caption and its stream media player
US20120259924A1 (en) Method and apparatus for providing summary information in a live media session
CN111711853B (en) Information processing method, system, device, electronic equipment and storage medium
US10084829B2 (en) Auto-generation of previews of web conferences
CN111479124A (en) Real-time playing method and device
CN101262611A (en) A stream media player
CN110933485A (en) Video subtitle generating method, system, device and storage medium
CN115002502B (en) Data processing method and server
CN111757187A (en) Multi-language subtitle display method, device, terminal equipment and storage medium
Patrick The human factors of MBone videoconferences: Recommendations for improving sessions and software
CN111405230A (en) Conference information processing method and device, electronic equipment and storage medium
CN112786053B (en) Intelligent public service-based hearing assistance method, storage medium and electronic device
CN113766165A (en) Interactive mode, device, terminal and storage medium for realizing barrier-free video chat
CN111192571A (en) Voice broadcasting method and device
CN115086753A (en) Live video stream processing method and device, electronic equipment and storage medium
JP2019169976A (en) Live distribution server, live distribution method, and program for assisting live distributor and the like
JP7284204B2 (en) Information processing device, information processing method and information processing program
CN113139392B (en) Conference summary generation method, device and storage medium
KR101695020B1 (en) Method and Apparatus for providing community service
US11916981B1 (en) Evaluating listeners who request to join a media program
CN112995568B (en) Customer service system based on video and construction method
KR20120057723A (en) System and method for providing interactive communication between star and fan
CN115714847A (en) Method, device, equipment and medium for showing speaker in conference

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant