CN108829751B - Method and device for generating lyrics and displaying lyrics, electronic equipment and storage medium - Google Patents
- Publication number
- CN108829751B (application CN201810513535.7A)
- Authority
- CN
- China
- Prior art keywords
- character
- characters
- marked
- pronunciation
- lyrics
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Reverberation, Karaoke And Other Acoustics (AREA)
Abstract
The invention discloses a method, a device, electronic equipment and a storage medium for generating and displaying lyrics, and belongs to the technical field of the internet. The method comprises the following steps: acquiring lyrics of a target song and an audio file of the target song; determining candidate pronunciations of a character to be marked according to the character to be marked among the characters in the lyrics and the audio segments corresponding to the characters in the audio file; if the character to be marked has one candidate pronunciation, determining that candidate pronunciation as the pronunciation corresponding to the character to be marked in the target song; if the character to be marked has at least two candidate pronunciations, determining the candidate pronunciation matching the semantics of the character to be marked as the pronunciation corresponding to the character to be marked in the target song; and generating a first lyric file of the target song according to the characters and the pronunciations corresponding to the characters to be marked in the target song, so that the pronunciations can be displayed synchronously when the lyrics are displayed subsequently, ensuring that the user sings the correct pronunciation of each character.
Description
Technical Field
The invention relates to the technical field of internet, in particular to a method, a device, electronic equipment and a storage medium for generating and displaying lyrics.
Background
With the development of internet technology, many music players not only support online playing of a massive number of songs, but can also provide the user with a karaoke service, in which the music player plays the accompaniment of a song and displays the lyrics on the current playing interface, so that the user can sing along with the accompaniment while reading the lyrics.
At present, on the current playing interface, a terminal generally displays the line of lyrics corresponding to the current playing time in a karaoke manner, and dynamically indicates, through font color, which character of the lyrics is currently being played. When actually singing a song, in order to better convey the emotion the song expresses, the original singer usually replaces the standard pronunciation of some characters in the lyrics with a more literary pronunciation. When the user is unfamiliar with the original recording of the song, the user tends to sing the standard pronunciation of such characters, which does not match the original recording and therefore leads to singing errors. Therefore, a method for generating and displaying lyrics is needed to ensure that the user can accurately sing the original reading of each character.
Disclosure of Invention
The embodiment of the invention provides a method, a device, electronic equipment and a storage medium for generating lyrics and displaying lyrics, which can solve the problem in the related art that users sing the wrong pronunciation of characters in lyrics. The technical scheme is as follows:
in a first aspect, a method for generating lyrics is provided, the method comprising:
acquiring lyrics of a target song and an audio file of the target song;
determining candidate pronunciations of a character to be marked according to the character to be marked among the characters in the lyrics and the audio segments corresponding to the characters in the audio file;
if the character to be marked has a candidate pronunciation, determining the candidate pronunciation as the corresponding pronunciation of the character to be marked in the target song;
if the character to be marked has at least two candidate pronunciations, determining the candidate pronunciation matched with the semantics of the character to be marked as the corresponding pronunciation of the character to be marked in the target song;
and generating a first lyric file of the target song according to the characters and the pronunciations corresponding to the characters to be marked in the target song.
In a second aspect, a method for displaying lyrics is provided, and the method is applied to a terminal, and the method includes:
when a lyric display instruction is received, acquiring a first lyric file of a target song, wherein the lyric display instruction is used for displaying the lyrics of the target song;
acquiring lyrics of the target song and corresponding pronunciation of characters to be labeled in the target song from the first lyric file;
displaying a plurality of characters of the lyrics;
when the character to be marked is displayed, marking the pronunciation corresponding to the character to be marked in the target song at the target position of the character to be marked, wherein the target position is above the character to be marked.
In a third aspect, an apparatus for generating lyrics is provided, the apparatus comprising:
the acquisition module is used for acquiring lyrics of a target song and an audio file of the target song;
the determining module is used for determining candidate pronunciations of the character to be marked according to the character to be marked among the characters in the lyrics and the audio segments corresponding to the characters in the audio file;
the determining module is further configured to determine, if the character to be annotated has a candidate pronunciation, the candidate pronunciation as a pronunciation corresponding to the character to be annotated in the target song;
the determining module is further configured to determine, if the character to be annotated has at least two candidate pronunciations, a candidate pronunciation that matches the semantics of the character to be annotated as a pronunciation corresponding to the character to be annotated in the target song;
and the generating module is used for generating a first lyric file of the target song according to the characters and the pronunciations corresponding to the characters to be marked in the target song.
In a fourth aspect, an apparatus for displaying lyrics is provided, the apparatus being applied to a terminal, the apparatus comprising:
the device comprises an acquisition module, a storage module and a display module, wherein the acquisition module is used for acquiring a first lyric file of a target song when receiving a lyric display instruction, and the lyric display instruction is used for displaying the lyric of the target song;
the obtaining module is further configured to obtain lyrics of the target song from the first lyric file, and corresponding reading of characters to be labeled in the target song from among a plurality of characters of the lyrics;
the display module is used for displaying a plurality of characters of the lyrics;
and the marking module is used for marking the corresponding pronunciation of the character to be marked in the target song at the target position of the character to be marked when the character to be marked is displayed, wherein the target position is above the character to be marked.
In a fifth aspect, an electronic device is provided, and the electronic device includes a processor and a memory, where at least one instruction is stored in the memory, and the instruction is loaded and executed by the processor to implement the operations performed by the method for generating lyrics according to the first aspect or the method for displaying lyrics according to the second aspect.
In a sixth aspect, a computer-readable storage medium is provided, in which at least one instruction is stored, the instruction being loaded and executed by a processor to implement the operations performed by the method for generating lyrics according to the first aspect or the method for displaying lyrics according to the second aspect.
The technical scheme provided by the embodiment of the invention has the following beneficial effects:
in the embodiment of the invention, the server can determine, based on the lyrics of the target song and the audio file of the target song, the pronunciation corresponding, in the target song, to each character to be marked in the lyrics; and further generate a first lyric file of the target song according to the plurality of characters and the pronunciations of the characters to be marked, so that a corresponding pronunciation is bound to each character to be marked, the pronunciation can be displayed synchronously when the lyrics are displayed subsequently, and the user is ensured to sing the correct pronunciation of each character of the target song. Moreover, when the terminal displays the lyrics, the pronunciation can be marked above the corresponding character to be marked, so that the pronunciation is clearly visible, the user can accurately and quickly find the pronunciation corresponding to the character to be marked, and the accuracy of lyric display is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic illustration of an implementation environment provided by an embodiment of the invention;
FIG. 2 is a flow chart of a method for generating lyrics according to an embodiment of the present invention;
FIG. 3 is a flow chart of a method for displaying lyrics according to an embodiment of the present invention;
FIG. 4 is a schematic interface diagram of a lyric display according to an embodiment of the present invention;
FIG. 5 is a schematic interface diagram of a lyric display according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an apparatus for generating lyrics according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an apparatus for displaying lyrics according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a terminal according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of a server according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present invention, where the implementation environment includes: a terminal 101 and a server 102. The terminal 101 and the server 102 are connected via a network. An application program is installed on the terminal 101, and the terminal 101 can acquire a target song from the server 102 based on the application program and perform playing, lyric display and the like of the target song.
The server 102 may determine, in advance, based on a plurality of characters to be labeled in lyrics of a target song and an audio file of the target song, a reading of the character to be labeled in the plurality of characters, so as to obtain a reading corresponding to the character to be labeled in the target song, store the lyrics and the reading corresponding to the character to be labeled in the target song, and send the lyrics of the target song and the reading corresponding to the character to be labeled to the terminal 101. When the terminal 101 plays the target song, the lyrics of the target song are synchronously displayed, and the pronunciation corresponding to the character to be marked is marked above the character to be marked.
The above process of obtaining the readings of the characters to be labeled with respect to the plurality of characters may also be executed by the terminal 101, that is, after the terminal 101 obtains the lyrics of the target song from the server 102, the readings are determined for the characters to be labeled in the lyrics and labeled.
It should be noted that in some songs, in order to better convey the emotion the song expresses, the standard pronunciation of some characters in the lyrics is usually changed to a more literary pronunciation. For example, in some Japanese songs, the characters meaning "future" are often sung with the reading "あす", which is the reading of the characters meaning "tomorrow", rather than with their standard reading. For the sake of the song's artistic expression, "future" is thus sung with the reading "あす".
In this application, the pronunciation used for the character to be marked in the original recording of the target song is taken as the correct singing pronunciation and written into the lyric file of the target song for storage, so that this original-singing pronunciation of the character to be marked can subsequently be displayed based on the lyric file.
The application program may be a music player or an application program installed with a music playing plug-in, and the terminal may be a mobile phone terminal, a PAD (Portable Android Device) terminal, a computer terminal, or the like. Server 102 is the backend server for the application. The server 102 may be a server, a server cluster composed of several servers, or a cloud computing server center.
Fig. 2 is a flowchart of a method for generating lyrics according to an embodiment of the present invention. The execution subject of this embodiment of the present invention is a server. Referring to fig. 2, the method includes:
201. the server obtains the lyrics of the target song and the audio file of the target song.
Wherein, the audio file comprises audio segments corresponding to a plurality of characters in the lyrics. The target song is any song for which lyrics need to be generated, and is generally a song whose lyrics include Chinese characters, where the Chinese characters include but are not limited to: simplified Chinese characters, traditional Chinese characters, and the like. The plurality of characters may be textual symbols in the target language, such as kanji and kana in Japanese. In the embodiment of the invention, the server can determine the pronunciation of each character to be marked in the lyrics based on the pronunciation used in the original recording of the target song, thereby accurately obtaining pronunciations consistent with the original recording of the target song.
In this step, the server may first obtain a second lyric file of the target song and an original audio file, obtain lyrics of the target song from the second lyric file, extract audio segments corresponding to the plurality of characters from the original audio file, and generate the audio file according to the audio segments corresponding to the plurality of characters.
The audio clips may be audio clips in the human vocal frequency band. In this step, the server may obtain the audio file by eliminating the accompaniment. The step of the server obtaining the audio file of the target song may be: the server acquires an original audio file of the target song, wherein the original audio file comprises the audio collected over all frequency bands, extracts the audio clips in the human vocal frequency band from the original audio file, and generates the audio file of the target song from the audio clips in the human vocal frequency band, thereby eliminating the accompaniment of the target song. In general, the server extracts from the original audio file the audio located directly in front in the sound field (where the vocals usually are), and selects from it the audio clips in the human vocal frequency range. Generally, the human vocal range covers the middle and high frequencies.
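As a rough illustration only (the disclosure does not specify how the vocal band is extracted), the sketch below keeps a mid/high frequency band with a simple band-pass filter; the cutoff frequencies and the adequacy of plain filtering for removing accompaniment are assumptions, not the actual extraction method.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def extract_vocal_band(samples: np.ndarray, sample_rate: int,
                       low_hz: float = 300.0, high_hz: float = 3400.0) -> np.ndarray:
    """Keep only a mid/high band where most vocal energy lies (illustrative cutoffs);
    real accompaniment removal is considerably more involved than this filter."""
    sos = butter(4, [low_hz, high_hz], btype="bandpass", fs=sample_rate, output="sos")
    return sosfiltfilt(sos, samples)
```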
Some song lyric files also include the display times of the characters, which indicate when each character should be displayed during playback of the song. Moreover, in such a lyric file, each character and its display time are stored interleaved, with a display time between adjacent characters; for example, the lyric "rain located at the air port" is actually stored in the lyric file in the form "(39112)rain(39803)located at(40356)air(40606)port(41176)", where the numbers in parentheses represent the display times of the characters and the characters are separated by them. The server may obtain a second lyric file of the target song, extract the lyrics of the target song and the display times of the characters in the lyrics from the second lyric file, and establish an index relationship between each character and its display time, so as to ensure that subsequent display can be performed based on the display times. The second lyric file is the original lyric file of the target song and comprises the lyrics of the target song and the display times of the characters of the lyrics.
In one possible design, the step of the server obtaining the lyrics of the target song may be: the server acquires a second lyric file of the target song, and establishes a first array and a third array, wherein the first array is used for storing a plurality of characters in the lyrics, and the third array is used for storing the display time of the characters. The server writes the characters into the first array, writes the display time of the characters into the third array, and determines the characters in the first array as the lyrics of the target song. In the third array, the server may store the display time of each character in the third array according to the storage order of each character in the first array, and based on the storage order, each character in the first array corresponds to the display time of the character in the third array one to one, thereby improving the accuracy of the lyrics and the display time.
It should be noted that the server extracts the characters to enable subsequent determination of the pronunciation based on the characters, and meanwhile, the server extracts the lyrics by establishing an index relationship between each character and the display time of each character on the premise of not destroying the original content of the second lyric file, thereby ensuring that an accurate lyric file can be finally obtained.
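As an illustrative sketch (not part of the original disclosure), the interleaved character/display-time format described above could be split into the first array (characters) and third array (display times) roughly as follows; the regular expression, function name and the treatment of each between-timestamp token as one lyric character are assumptions for illustration.

```python
import re

def parse_lyric_line(line: str):
    """Split an interleaved lyric line such as "(39112)A(39803)B(40356)C" into a
    first array of characters and a third array of display times."""
    first_array = []   # characters of the lyric
    third_array = []   # display time of each character, index-aligned with first_array
    for time_str, token in re.findall(r"\((\d+)\)([^()]*)", line):
        if token:
            first_array.append(token)
            third_array.append(int(time_str))
    return first_array, third_array

chars, times = parse_lyric_line("(39112)rain(39803)located at(40356)air(40606)port(41176)")
print(chars)   # ['rain', 'located at', 'air', 'port']
print(times)   # [39112, 39803, 40356, 40606]
```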
202. The server determines the Chinese characters among the plurality of characters as the characters to be labeled.
In the embodiment of the invention, the server can identify the Chinese characters in the characters through a preset identification algorithm, and the Chinese characters are determined as the characters to be marked. The Chinese characters include but are not limited to: simplified Chinese characters, traditional Chinese characters, etc. The server generally stores encoded data of the plurality of characters after encoding, and the preset identification algorithm may be an encoding table based on an encoding mode of the plurality of characters to identify the plurality of characters. Of course, the embodiment of the present invention does not limit the encoding method of the character, and the encoding table may be an encoding table of a Unicode (uniform code) encoding method, for example.
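A minimal sketch of such an encoding-table based check, assuming that the Unicode CJK Unified Ideographs blocks are an adequate approximation of "Chinese characters" for this purpose; the function names are illustrative.

```python
def is_chinese_character(ch: str) -> bool:
    """Return True if ch falls in the main CJK ideograph blocks of Unicode."""
    cp = ord(ch)
    return (0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
            or 0x3400 <= cp <= 0x4DBF   # CJK Unified Ideographs Extension A
            or 0xF900 <= cp <= 0xFAFF)  # CJK Compatibility Ideographs

def characters_to_annotate(chars):
    """Select the characters to be labeled from the plurality of characters."""
    return [ch for ch in chars if is_chinese_character(ch)]
```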
It should be noted that the server identifies the characters to be annotated by first identifying the characters, so that the pronunciation of the characters to be annotated in the characters is determined directly based on the characters to be annotated, the number of the characters to be processed subsequently is reduced, and the efficiency of determining the pronunciation is further improved. In addition, the embodiment of the invention takes the Chinese characters as the characters to be labeled, and the pronunciation of the Chinese characters in the lyrics is labeled subsequently, so that for the user whose mother language is not Chinese or Japanese, the user can conveniently learn a foreign language by singing, thereby satisfying the hearing interest, achieving the effect of learning the foreign language, greatly enriching the user experience and greatly improving the practical value of the method.
203. The server identifies the pronunciations of the characters according to the audio segments corresponding to the characters, and determines the candidate pronunciations of the characters to be marked according to the characters to be marked in the characters.
In the embodiment of the present invention, the server may identify the multiple pronunciations based on the audio segments corresponding to the multiple characters by using a preset speech recognition algorithm, and determine a candidate pronunciation of the character to be labeled according to a relative position between the character to be labeled and the character other than the character to be labeled in the multiple characters.
However, noise may exist in the audio segment corresponding to the characters in the audio file, or during actual singing, the syllable of a certain character may be elongated, so that when speech recognition is performed, some fuzzy readings may be included in the readings of the characters, and the server may temporarily use intermediate readings to represent the readings that are not uniquely determined, so this step includes the following two cases.
In the first case, if the pronunciation of the characters identified based on the audio clip does not include the ambiguous pronunciation, that is, when each pronunciation is uniquely determined, the server determines the candidate pronunciation of the character to be labeled from the pronunciations of the characters according to the character to be labeled in the characters.
The server determines the candidate pronunciation corresponding to the character to be marked among the recognized pronunciations according to the positions, within those pronunciations, of the characters other than the character to be marked. Of course, when there are multiple characters to be marked, the server may also determine the candidate pronunciation corresponding to each character to be marked according to the positions, within the recognized pronunciations, of the other characters adjacent to that character.
It should be noted that, in this step, the number of the candidate pronunciations corresponding to the character to be annotated determined by the server is only one. Subsequently, the corresponding pronunciation of the character to be marked in the target song can be determined through the following step 204.
In the second case, if the pronunciation of the characters identified based on the audio clip includes fuzzy pronunciation, the server determines the middle pronunciation of the character to be labeled from the pronunciations of the characters according to the character to be labeled in the characters, and identifies at least two candidate pronunciations of the character to be labeled according to the middle pronunciation.
The intermediate pronunciation may be a roman (romanized) reading, and the server may match, based on the roman reading, at least two candidate pronunciations corresponding to the character to be labeled from a dictionary engine. The at least two candidate pronunciations are kana representing the pronunciation of the Chinese character in Japanese. For example, taking Japanese as an example, for the audio clip corresponding to "ペガサス fantasy そうさ 夢 だけは" in the lyrics, the recognized pronunciations may be "ペガサス fantaji そうさ ゆめ だけは", where "fantaji" is a roman reading, and the server may determine, based on the roman reading "fantaji", that the 3 candidate pronunciations corresponding to the character to be labeled are: "ふぁんたじ", "ファンタジ" and "ファンタジー".
It should be noted that, in this step, the number of candidate pronunciations corresponding to the character to be annotated determined by the server is at least two. Subsequently, the corresponding pronunciation of the character to be marked in the target song can be determined through the following step 205.
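As an illustrative sketch of this second case, the mapping below from an ambiguous roman reading to candidate kana readings is a hard-coded stand-in for the dictionary engine mentioned above; a real system would query an actual dictionary.

```python
# Hypothetical mapping from an ambiguous roman (intermediate) reading to candidate
# kana readings; stand-in data for illustration only.
ROMAJI_TO_KANA_CANDIDATES = {
    "fantaji": ["ふぁんたじ", "ファンタジ", "ファンタジー"],
}

def candidate_readings(intermediate_reading: str):
    """Return the candidate kana readings for an intermediate (roman) reading."""
    return ROMAJI_TO_KANA_CANDIDATES.get(intermediate_reading, [])
```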
It should be noted that the above steps 202-203 are one specific implementation of "the server determines the candidate reading of the character to be labeled according to the character to be labeled among the plurality of characters in the lyrics and the audio segments corresponding to the plurality of characters in the audio file". In this implementation, the character to be labeled in the lyrics is determined first and speech recognition is then performed based on the audio file; however, the server may also perform speech recognition based on the audio file first to determine the readings of the plurality of characters, and then determine the character to be labeled in the lyrics so as to determine the candidate reading. This is not limited in the embodiment of the present invention.
204. If the character to be marked has a candidate pronunciation, the server determines the candidate pronunciation as the corresponding pronunciation of the character to be marked in the target song.
When the server identifies that the candidate pronunciation corresponding to the character to be marked is the only one, for each character to be marked, the server can directly determine the candidate pronunciation corresponding to the character to be marked as the pronunciation corresponding to the character to be marked in the target song.
In Japanese, for example, the server may represent the pronunciations of the characters by kana. Based on the plurality of characters 誰, 男女 and 人 in the lyric line "誰でももし男女が人でありたいなら" and the plurality of pronunciations "だれでももしひとがひとでありたいなら" recognized from the audio segments corresponding to the characters, the process of determining the readings of 誰, 男女 and 人 among the plurality of characters may be:
a. Match the plurality of pronunciations with the plurality of characters, and determine where the kana fragments でももし, が and でありたいなら, which appear unchanged in the lyrics, are located within "だれでももしひとがひとでありたいなら".
b. From the relative positions of the characters 誰, 男女 and 人 with respect to the recognized fragments でももし, が and でありたいなら, determine the readings of 誰, 男女 and 人 in "だれでももしひとがひとでありたいなら":
誰 corresponds to the reading "だれ";
男女 corresponds to the reading "ひと";
人 corresponds to the reading "ひと";
it should be noted that, because only the pronunciation of the lyrics in the japanese song needs to be determined, when performing speech recognition based on the audio clip, only the pronunciation of each kana in the japanese is needed to be recognized, thereby narrowing the range of the pronunciation to be recognized. Moreover, the server can further define a voice interval corresponding to each character to be marked in a plurality of pronunciations based on each character to be marked and a plurality of shakanas, so that the accuracy of the subsequent determination of the pronunciations of the characters to be marked is improved.
205. If the character to be annotated has at least two candidate pronunciations, the server determines the candidate pronunciation matched with the semantic meaning of the character to be annotated as the pronunciation corresponding to the character to be annotated in the target song.
In this step, the server may create and store dictionary engines in advance, and query, based on the dictionary engines, the candidate pronunciation that matches the semantics of the character to be annotated. For each candidate pronunciation, the server searches the first dictionary engine for the intermediate character corresponding to that candidate pronunciation, searches the second dictionary engine for the semantics of the intermediate character, selects, from the at least two candidate pronunciations and according to the semantics of the character to be annotated, the candidate pronunciation whose intermediate character's semantics matches the semantics of the character to be annotated, and determines the selected candidate pronunciation as the pronunciation corresponding to the character to be annotated in the target song.
The first dictionary engine comprises a plurality of candidate pronunciations and the intermediate characters corresponding to the candidate pronunciations, and the second dictionary engine comprises a plurality of intermediate characters and the semantics corresponding to the intermediate characters. The pronunciation corresponding to the character to be marked in the target song can be represented by kana in Japanese. The intermediate characters may be word symbols in any language, for example English, Chinese, German, Japanese or French, and the first dictionary engine may accordingly be a Japanese-Chinese dictionary, a Japanese-Japanese dictionary, a Japanese-English dictionary, a Japanese-German dictionary, a Japanese-French dictionary, or the like, which is not limited in this embodiment of the present invention.
Taking Japanese as an example, for the plurality of readings "ペガサス fantaji そうさ ゆめ だけは" corresponding to "ペガサス fantasy そうさ 夢 だけは" in the lyrics, and taking the character "fantasy" as an example, the procedure for determining its reading is as follows:
a. First, the plurality of characters in the lyrics are matched with the plurality of readings, so that the following is obtained: the reading corresponding to "fantasy" is an intermediate reading, the roman reading "fantaji";
b. The 3 candidate readings corresponding to the roman reading "fantaji" are determined to be: "ふぁんたじ", "ファンタジ" and "ファンタジー";
c. For each candidate reading, the intermediate character corresponding to that candidate reading is queried in the first dictionary engine; for example, for the candidate reading "ファンタジー", the corresponding intermediate character found in a Japanese-English dictionary is the English word "Fantasy";
d. The semantics of each intermediate character is looked up in the second dictionary engine and matched against the semantics of the character "fantasy" in the lyrics; the semantics of the English word "Fantasy", which corresponds to the candidate reading "ファンタジー", is found in the second dictionary engine to be the same as the semantics of "fantasy".
e. The reading corresponding to "fantasy" is therefore determined to be "ファンタジー".
Through the above steps a-e, the dictionary engines established in advance are searched using a foreign language as the intermediate characters, and the kana "ファンタジー", which has the same semantics as "fantasy", is determined from the plurality of candidate kana.
It should be noted that the server uses a foreign language as intermediate characters: starting from the multiple possible candidate readings corresponding to the ambiguous roman reading, the reading of the character to be annotated is first mapped to an intermediate character, the semantics corresponding to that intermediate character is queried, and the candidate reading closest to the semantics of the character to be annotated is selected from the multiple candidate readings. In this way the artistically altered reading actually used when the character to be annotated is sung can be resolved, greatly improving the accuracy of determining the reading corresponding to the character to be annotated.
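A minimal sketch of the two-dictionary lookup in step 205; the dictionary contents and the plain-string representation of semantics below are stand-ins for illustration, not the actual dictionary engines.

```python
# Stand-in "dictionary engines" for illustration.
FIRST_DICTIONARY = {            # candidate kana reading -> intermediate character
    "ファンタジー": "fantasy",
}
SECOND_DICTIONARY = {           # intermediate character -> semantics
    "fantasy": "an imagined, unreal scene or idea",
}

def pick_reading(candidates, target_semantics):
    """Choose the candidate reading whose intermediate character's semantics matches
    the semantics of the character to be annotated; None if no candidate matches."""
    for reading in candidates:
        intermediate = FIRST_DICTIONARY.get(reading)
        if intermediate and SECOND_DICTIONARY.get(intermediate) == target_semantics:
            return reading
    return None

print(pick_reading(["ふぁんたじ", "ファンタジ", "ファンタジー"],
                   "an imagined, unreal scene or idea"))   # -> ファンタジー
```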
206. And the server generates a first lyric file of the target song according to the plurality of characters and the corresponding pronunciation of the character to be marked in the target song.
In this step, the server creates the first lyric file, stores the plurality of characters and the pronunciations corresponding to the characters to be marked in the target song in the first lyric file, and establishes an index relationship between each character to be marked and its corresponding pronunciation in the target song. The server can establish the index relationship between characters and pronunciations based on arrays: it establishes a second array, writes the pronunciation corresponding to each character to be marked in the target song into the second array, and adds the first array and the second array to the first lyric file. For each character to be marked, the server stores the pronunciation corresponding to that character at a second storage location in the second array according to the first storage location of the character in the first array, where the first storage location and the second storage location may be associated positions in the first array and the second array, for example both being the first element.
In step 201, the server may further obtain a display time of the plurality of characters, and in this step, the server may further add the display time to the first lyric file. Therefore, the step of generating, by the server, the first lyric file of the target song according to the corresponding pronunciation of the plurality of characters and the character to be labeled in the target song may further be: the server creates the first lyric file, establishes a second array and writes the corresponding pronunciation of the character to be marked in the target song into the second array; the server adds the first array, the second array and the third array to the first lyric file.
In one possible design, in some languages there may be cases where the number of characters in a word is not equal to the number of its pronunciations; for example, in Japanese, the word 欠片 comprises two characters but actually corresponds to three kana of pronunciation (かけら). Therefore, when a target word exists in the lyrics, the server may also determine the display time of the target word according to the display times of the at least two adjacent characters to be marked included in the target word, and update the third array according to the display time of the target word; finally, the server adds the first array, the second array and the updated third array to the first lyric file. The target word comprises at least two adjacent characters to be marked, and the number of characters to be marked it comprises is not equal to the number of its pronunciations. The server determines the display time of the target word as follows: the display times of the at least two adjacent characters to be marked included in the target word are merged, and the merged display time is determined as the display time of the target word.
In Japanese text, the number of kana in the pronunciation of many words is not equal to the number of characters the word comprises, 欠片 being one example. Taking 欠片 as an example, the display times stored in the second lyric file before the third array is updated are shown in Table 1 below:
TABLE 1
Third array | 39112 | 39803 | 40356 | 41176 |
First array | 欠 | 片 | の | 色 |
Obviously, the "default" and the "piece" correspond to one display time respectively, however, in actual display, since the default corresponds to three readings together, the "default" and the "piece" are displayed separately according to the two display times, which obviously causes display errors, and the updated display time stored in the third array obtained by combining the two display times is as shown in the following table 2:
TABLE 2
Third array | 39112 | 40356 | 41176 |
First array | 欠片 | の | 色 |
It should be noted that the server merges the display times of the adjacent characters in the target word, so that these adjacent characters can be treated as one word corresponding to a single display time. This avoids the problem that, when the song is actually played, displaying each character according to its own display time would not match the pronunciation actually sung. Instead, the display time of each character matches its actual pronunciation, the display time of each word matches the pronunciation of that word, and the display times of the lyrics in the first lyric file accurately match the pronunciations of the lyrics, further improving the accuracy of the finally obtained first lyric file.
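A minimal sketch of the merge described in Tables 1 and 2, assuming each entry of the third array is the display (start) time of the corresponding character and that the merged target word keeps the first character's display time; function and variable names are illustrative.

```python
def merge_target_word(first_array, third_array, start, length):
    """Merge `length` adjacent characters beginning at index `start` into one target
    word, keeping a single display time (the first character's time) for the word."""
    word = "".join(first_array[start:start + length])
    merged_chars = first_array[:start] + [word] + first_array[start + length:]
    merged_times = third_array[:start] + [third_array[start]] + third_array[start + length:]
    return merged_chars, merged_times

chars, times = merge_target_word(["欠", "片", "の", "色"], [39112, 39803, 40356, 41176], 0, 2)
print(chars, times)   # ['欠片', 'の', '色'] [39112, 40356, 41176]
```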
In the embodiment of the invention, the server can determine, based on the lyrics of the target song and the audio file of the target song, the pronunciation corresponding, in the target song, to each character to be marked in the lyrics, and further generate the first lyric file of the target song according to the plurality of characters and the pronunciations of the characters to be marked, so that a corresponding pronunciation is bound to each character to be marked, the pronunciation can be displayed synchronously when the lyrics are displayed subsequently, and the user is ensured to sing the correct pronunciation of each character of the target song.
Fig. 3 is a flowchart of a method for displaying lyrics according to an embodiment of the present invention. The execution subject of the embodiment of the present invention is a terminal, and referring to fig. 3, the method includes:
301. when a lyric display instruction is received, the terminal acquires a first lyric file of a target song.
In the embodiment of the invention, the lyric display instruction is used for displaying the lyrics of the target song. When the terminal receives the lyric display instruction, the terminal can acquire the first lyric file of the target song locally or from the server according to the identifier of the target song. The lyric display instruction may be triggered when the user instructs the terminal to play the target song, or when the user triggers display of the lyric file.
It should be noted that the first lyric file is a lyric file generated in advance based on the lyrics and the target readings of the plurality of characters, the specific generation process being as described in steps 201-206 above. The first lyric file at least includes the lyrics of the target song and the readings corresponding, in the target song, to the characters to be labeled among the plurality of characters of the lyrics.
302. And the terminal acquires the lyrics of the target song and the corresponding pronunciation of the characters to be marked in the target song from the first lyric file.
In the embodiment of the invention, the lyrics and the pronunciation of the character to be marked can be respectively stored in the first lyric file in an array form. The terminal obtains a first array and a second array from the first lyric file, reads a plurality of characters in the lyric from the first array, and reads the pronunciation of the character to be marked in the plurality of characters from the second array. The terminal can determine a second storage position associated in the second array based on the first storage position of each character to be marked in the first array, and read the pronunciation of the character to be marked from the second storage position.
In a possible design, the first lyric file may further include a display time of a plurality of characters in the lyric, and the terminal may further obtain the display time of the plurality of characters from the first lyric file. The process may be: and the terminal acquires a third array from the first lyric file and acquires the display time of the characters from the third array. The display time of the plurality of characters includes a display time of a target word in the lyric and a display time of each character other than the target word.
The target word comprises at least two adjacent characters to be labeled, the number of the included characters to be labeled is not equal to the number of pronunciations of the target word, and the display time of the target word is the display time obtained by combining the display times of the at least two adjacent characters to be labeled included in the target word.
303. The terminal displays a plurality of characters of the lyric.
In the embodiment of the invention, the terminal highlights the currently played target word in a plurality of characters according to the display time of the target word; and highlighting the currently played character in the characters except the target word according to the display time of each character except the target word. Wherein the terminal can highlight the character being played or the target word by the font color of the character. For example, characters or target words that have already been played are rendered in a first color, and characters or target words that have not yet been played are rendered in a second color.
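A minimal sketch of deciding, at a given playback time, which characters or target words should already be rendered in the first color, assuming the display times in the third array are start times; names and the millisecond unit are illustrative assumptions.

```python
def rendered_in_first_color(chars, display_times, playback_time_ms):
    """Return the characters (or merged target words) whose display time has been
    reached, i.e. the part of the line to render in the first color."""
    return [ch for ch, t in zip(chars, display_times) if t <= playback_time_ms]

print(rendered_in_first_color(["欠片", "の", "色"], [39112, 40356, 41176], 40500))
# -> ['欠片', 'の']
```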
304. And when the character to be marked is displayed, the terminal marks the corresponding pronunciation of the character to be marked in the target song at the target position of the character to be marked.
In the embodiment of the invention, the target position is above the character to be marked, and when the terminal displays the character to be marked, the currently played character to be marked and the pronunciation of the currently played character to be marked in the target song are highlighted according to the display time of the character to be marked. Of course, the character to be labeled may include a target word, and the terminal may highlight the currently played target word and the pronunciation of the currently played target word in the target song according to the display time of the target word.
It should be noted that, in some prior art, in order to simplify the logic, the unique reading of a lyric character is annotated in parentheses after it, as shown in fig. 4, for example "誰でももし男女(ひと)が人(ひと)でありたいなら". Such annotation can mislead the user, for example into thinking that only 女 is read "ひと" while the reading of 男 is unknown; this is especially true for compound words, e.g. "time travel (タイムトリップ)", where after browsing the user cannot tell whether the reading in the parentheses corresponds to "travel" alone or to "time travel" as a whole. Meanwhile, when the lyrics are highlighted, there is no guarantee that the characters and the readings can be highlighted with the same display rhythm: for example, 男女(ひと) is sung with only two syllables but occupies six characters (including the parentheses) in the displayed line, so when 男 should be highlighted based on its display time, part of "(ひと)" also has to be rendered in the first color together with 女, causing display errors and a poor user experience.
In the embodiment of the invention, as shown in fig. 5, the terminal can display the reading above the character to be marked, so that when the user browses the character to be marked, the user can clearly and accurately find its reading. This improves the efficiency with which the user obtains the reading of the character to be marked and improves the accuracy of lyric display. Meanwhile, the user sings based on the displayed reading, ensuring that the reading of each character is sung accurately.
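A minimal text-only sketch of placing the reading above the character to be marked, as in fig. 5; an actual terminal UI would use its own layout primitives, and the full-width-space padding, function name and sample readings below are assumptions used only for illustration.

```python
def layout_with_readings(chars, readings):
    """Build a two-line text layout with readings above the characters they annotate.
    `readings` maps the index of each character to be marked to its reading."""
    top, bottom = [], []
    for i, ch in enumerate(chars):
        reading = readings.get(i, "")
        width = max(len(ch), len(reading))
        top.append(reading.center(width, "　"))     # full-width space for CJK alignment
        bottom.append(ch.center(width, "　"))
    return "\n".join(["".join(top), "".join(bottom)])

print(layout_with_readings(["欠片", "の", "色"], {0: "かけら", 2: "いろ"}))
```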
In the embodiment of the invention, when a lyric display instruction is received, the terminal can display the lyrics of the target song and the readings corresponding, in the target song, to the characters to be marked among the plurality of characters of the lyrics, thereby ensuring that the user sings the reading of each character accurately. The terminal can mark the reading above the corresponding character to be marked, so that the reading is clearly visible, the user can accurately and quickly find the reading corresponding to the character to be marked, and the accuracy of lyric display is improved.
Fig. 6 is a schematic structural diagram of an apparatus for generating lyrics according to an embodiment of the present invention. Referring to fig. 6, the apparatus includes: an acquisition module 601, a determination module 602, and a generation module 603.
An obtaining module 601, configured to obtain lyrics of a target song and an audio file of the target song, where the audio file includes audio segments corresponding to multiple characters in the lyrics;
a determining module 602, configured to determine, according to a character to be labeled in the multiple characters and an audio segment corresponding to the multiple characters, a candidate pronunciation of the character to be labeled;
the determining module 602 is further configured to determine, if the character to be annotated has a candidate pronunciation, the candidate pronunciation as a pronunciation corresponding to the character to be annotated in the target song;
the determining module 602 is further configured to determine, if the character to be annotated has at least two candidate pronunciations, a candidate pronunciation that matches the semantics of the character to be annotated as a pronunciation corresponding to the character to be annotated in the target song;
the generating module 603 is configured to generate a first lyric file of the target song according to the corresponding pronunciation of the plurality of characters and the character to be labeled in the target song.
Optionally, the determining module 602 includes:
the determining unit is used for determining the Chinese characters among the plurality of characters as the characters to be marked;
the recognition unit is used for recognizing the pronunciations of the characters according to the audio clips corresponding to the characters;
the determining unit is further configured to determine, according to a character to be annotated in the plurality of characters, a candidate pronunciation of the character to be annotated from the pronunciations of the plurality of characters, or determine, according to the character to be annotated in the plurality of characters, an intermediate pronunciation of the character to be annotated from the pronunciations of the plurality of characters, and identify, according to the intermediate pronunciation, at least two candidate pronunciations of the character to be annotated.
Optionally, the determining module 602 includes:
the searching unit is further used for searching an intermediate character corresponding to each candidate reading from a first dictionary engine and searching the semanteme of the intermediate character from a second dictionary engine, wherein the first dictionary engine comprises a plurality of candidate readings and intermediate characters corresponding to the candidate readings, and the second dictionary engine comprises a plurality of intermediate characters and a plurality of semantemes corresponding to the intermediate characters;
and the selecting unit is used for selecting a candidate pronunciation of which the semantic of the middle character is matched with the semantic of the character to be annotated from the at least two candidate pronunciations according to the semantic of the character to be annotated, and determining the selected candidate pronunciation as the pronunciation of the character to be annotated corresponding to the target song.
Optionally, the obtaining module 601 is configured to obtain a second lyric file of the target song, where the second lyric file includes lyrics of the target song and display times of multiple characters of the lyrics; establishing a first array and a third array, writing the characters into the first array, writing the display time of the characters into the third array, and determining the characters in the first array as the lyrics of the target song; and acquiring an original audio file of the target song, extracting an audio clip of a human sound frequency band in the original audio file, and generating the audio file.
Optionally, the generating module 603 is configured to establish a second array, and write the pronunciation of the character to be labeled in the target song into the second array; when a target word exists in the lyric, determining the display time of the target word according to the display time of at least two adjacent characters to be labeled included in the target word, wherein the target word includes at least two adjacent characters to be labeled, and the number of the included characters to be labeled is not equal to the number of pronunciations of the target word; writing the display time of the target word into the third array; adding a first array, the second array and the third array to the first lyric file;
the mode of determining the display time of the target word is as follows: and merging the display time of at least two adjacent characters to be marked included in the target word, and determining the merged display time as the display time of the target word.
Optionally, the generating module 603 is configured to establish a second array, and write the pronunciation of the character to be labeled in the target song into the second array; adding a first array and the second array to the first lyric file, the first array for storing a plurality of characters of the lyric.
In the embodiment of the invention, the server can determine, based on the lyrics of the target song and the audio file of the target song, the pronunciation corresponding, in the target song, to each character to be marked in the lyrics, and further generate the first lyric file of the target song according to the plurality of characters and the pronunciations of the characters to be marked, so that a corresponding pronunciation is bound to each character to be marked, the pronunciation can be displayed synchronously when the lyrics are displayed subsequently, and the user is ensured to sing the correct pronunciation of each character of the target song.
Fig. 7 is a schematic structural diagram of an apparatus for displaying lyrics according to an embodiment of the present invention. The apparatus is applied to a terminal, see fig. 7, and includes: an acquisition module 701, a display module 702 and a labeling module 703.
An obtaining module 701, configured to obtain a first lyric file of a target song when a lyric display instruction is received, where the lyric display instruction is used to display lyrics of the target song;
the obtaining module 701 is further configured to obtain lyrics of the target song from the first lyric file, and corresponding reading of a character to be labeled in the target song from among a plurality of characters of the lyrics;
a display module 702 for displaying a plurality of characters of the lyric;
the marking module 703 is configured to mark, when the character to be marked is displayed, the corresponding pronunciation of the character to be marked in the target song at the target position of the character to be marked, where the target position is above the character to be marked.
Optionally, the display module 702 includes:
the first display unit is used for highlighting the currently played target word in the plurality of characters according to the display time of the target word in the plurality of characters;
a second display unit, configured to highlight a currently played character of the characters other than the target word according to a display time of each character of the plurality of characters other than the target word;
the target word comprises at least two adjacent characters to be labeled, the number of the included characters to be labeled is not equal to the number of pronunciations of the target word, and the display time of the target word is the display time obtained by combining the display times of the at least two adjacent characters to be labeled included in the target word.
Optionally, the second display unit is configured to highlight, when the character to be annotated is displayed, the currently played character to be annotated and the pronunciation of the currently played character to be annotated in the target song according to the display time of the character to be annotated.
In the embodiment of the invention, when a lyric display instruction is received, the terminal can display the lyrics of the target song and the readings corresponding, in the target song, to the characters to be marked among the plurality of characters of the lyrics, thereby ensuring that the user sings the reading of each character accurately. The terminal can mark the reading above the corresponding character to be marked, so that the reading is clearly visible, the user can accurately and quickly find the reading corresponding to the character to be marked, and the accuracy of lyric display is improved.
All the above optional technical solutions may be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.
It should be noted that: in the above embodiment, when the apparatus for generating lyrics generates lyrics, or when the apparatus for displaying lyrics displays lyrics, only the division of the above functional modules is used as an example, in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the above described functions. In addition, the embodiments of the apparatus and the method for generating lyrics, and the apparatus and the method for displaying lyrics provided by the foregoing embodiments belong to the same concept, and specific implementation processes thereof are detailed in the embodiments of the methods and will not be described herein again.
Fig. 8 is a schematic structural diagram of a terminal according to an embodiment of the present invention. The terminal 800 may be: a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III, motion video Experts compression standard Audio Layer 3), an MP4 player (Moving Picture Experts Group Audio Layer IV, motion video Experts compression standard Audio Layer 4), a notebook computer, or a desktop computer. The terminal 800 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, desktop terminal, etc.
In general, the terminal 800 includes: a processor 801 and a memory 802.
The processor 801 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so forth. The processor 801 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 801 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 801 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 801 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
In some embodiments, the terminal 800 may further include: a peripheral interface 803 and at least one peripheral. The processor 801, memory 802 and peripheral interface 803 may be connected by bus or signal lines. Various peripheral devices may be connected to peripheral interface 803 by a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 804, a touch screen display 805, a camera 806, an audio circuit 807, a positioning component 808, and a power supply 809.
The peripheral interface 803 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 801 and the memory 802. In some embodiments, the processor 801, memory 802, and peripheral interface 803 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 801, the memory 802, and the peripheral interface 803 may be implemented on separate chips or circuit boards, which are not limited by this embodiment.
The radio frequency circuit 804 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 804 communicates with communication networks and other communication devices via electromagnetic signals, converting an electrical signal into an electromagnetic signal for transmission, or converting a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 804 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 804 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or Wi-Fi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 804 may further include an NFC (Near Field Communication) related circuit, which is not limited in this application.
The display screen 805 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display 805 is a touch display, the display 805 also has the ability to capture touch signals on or above its surface. The touch signal may be input to the processor 801 as a control signal for processing. At this point, the display 805 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display 805, disposed on the front panel of the terminal 800; in other embodiments, there may be at least two displays 805, respectively disposed on different surfaces of the terminal 800 or in a folded design; in still other embodiments, the display 805 may be a flexible display disposed on a curved surface or a folded surface of the terminal 800. The display 805 may even be arranged in a non-rectangular irregular pattern, i.e., an irregularly shaped screen. The display 805 may be an LCD (Liquid Crystal Display), an OLED (Organic Light-Emitting Diode) display, or the like.
The camera assembly 806 is used to capture images or video. Optionally, the camera assembly 806 includes a front camera and a rear camera. Generally, the front camera is disposed on the front panel of the terminal, and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting, VR (Virtual Reality) shooting, or other fused shooting functions. In some embodiments, the camera assembly 806 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation at different color temperatures.
The audio circuit 807 may include a microphone and a speaker. The microphone is used for collecting sound waves of the user and the environment, converting the sound waves into electrical signals, and inputting them to the processor 801 for processing, or to the radio frequency circuit 804 to realize voice communication. For the purpose of stereo sound collection or noise reduction, a plurality of microphones may be provided at different portions of the terminal 800. The microphone may also be an array microphone or an omni-directional pickup microphone. The speaker is used to convert electrical signals from the processor 801 or the radio frequency circuit 804 into sound waves. The speaker may be a conventional diaphragm speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can convert an electrical signal into sound waves audible to humans, or into sound waves inaudible to humans for purposes such as distance measurement. In some embodiments, the audio circuit 807 may also include a headphone jack.
The positioning component 808 is used to locate the current geographic position of the terminal 800 for navigation or LBS (Location Based Service). The positioning component 808 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
In some embodiments, terminal 800 also includes one or more sensors 810. The one or more sensors 810 include, but are not limited to: acceleration sensor 811, gyro sensor 812, pressure sensor 813, fingerprint sensor 814, optical sensor 815 and proximity sensor 816.
The acceleration sensor 811 may detect the magnitude of acceleration in three coordinate axes of the coordinate system established with the terminal 800. For example, the acceleration sensor 811 may be used to detect the components of the gravitational acceleration in three coordinate axes. The processor 801 may control the touch screen 805 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 811. The acceleration sensor 811 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 812 may detect a body direction and a rotation angle of the terminal 800, and the gyro sensor 812 may cooperate with the acceleration sensor 811 to acquire a 3D motion of the user with respect to the terminal 800. From the data collected by the gyro sensor 812, the processor 801 may implement the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
Pressure sensors 813 may be disposed on the side bezel of the terminal 800 and/or underneath the touch display 805. When the pressure sensor 813 is disposed on the side frame of the terminal 800, a holding signal of the user on the terminal 800 can be detected, and the processor 801 performs left/right-hand recognition or a shortcut operation according to the holding signal collected by the pressure sensor 813. When the pressure sensor 813 is disposed at the lower layer of the touch display screen 805, the processor 801 controls an operable control on the UI according to a pressure operation performed by the user on the touch display screen 805. The operable control includes at least one of a button control, a scroll bar control, an icon control, and a menu control.
The fingerprint sensor 814 is used for collecting a fingerprint of the user, and the processor 801 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 814, or the fingerprint sensor 814 itself identifies the identity of the user according to the collected fingerprint. Upon identifying that the user's identity is a trusted identity, the processor 801 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, changing settings, and the like. The fingerprint sensor 814 may be disposed on the front, back, or side of the terminal 800. When a physical button or a vendor logo is provided on the terminal 800, the fingerprint sensor 814 may be integrated with the physical button or the vendor logo.
The optical sensor 815 is used to collect the ambient light intensity. In one embodiment, the processor 801 may control the display brightness of the touch screen 805 based on the ambient light intensity collected by the optical sensor 815: when the ambient light intensity is high, the display brightness of the touch display screen 805 is increased; when the ambient light intensity is low, the display brightness of the touch display screen 805 is decreased. In another embodiment, the processor 801 may also dynamically adjust the shooting parameters of the camera assembly 806 based on the ambient light intensity collected by the optical sensor 815.
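As a purely illustrative sketch of the brightness adjustment described above (not the patent's implementation), the snippet below maps an ambient light reading to a display brightness; the lux range and the linear mapping are assumptions.

```python
# Illustrative only: clamp the measured ambient light into an assumed lux range
# and map it linearly to a normalized display brightness in [0.0, 1.0].
def brightness_from_ambient_light(lux, max_lux=10000.0):
    lux = max(0.0, min(lux, max_lux))
    return lux / max_lux

if __name__ == "__main__":
    for lux in (5, 300, 20000):
        print(lux, "->", round(brightness_from_ambient_light(lux), 3))
```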
A proximity sensor 816, also known as a distance sensor, is typically provided on the front panel of the terminal 800. The proximity sensor 816 is used to collect the distance between the user and the front surface of the terminal 800. In one embodiment, when the proximity sensor 816 detects that the distance between the user and the front surface of the terminal 800 gradually decreases, the processor 801 controls the touch display 805 to switch from the screen-on state to the screen-off state; when the proximity sensor 816 detects that the distance between the user and the front surface of the terminal 800 gradually increases, the processor 801 controls the touch display 805 to switch from the screen-off state to the screen-on state.
Those skilled in the art will appreciate that the configuration shown in fig. 8 is not intended to be limiting of terminal 800 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.
Fig. 9 is a schematic structural diagram of a server according to an embodiment of the present invention. The server 900 may vary considerably in configuration or performance, and may include one or more processors (CPUs) 901 and one or more memories 902, where the memory 902 stores at least one instruction, and the at least one instruction is loaded and executed by the processor 901 to implement the method for generating lyrics provided by the above method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for performing input and output, and the server may also include other components for implementing device functions, which are not described herein again.
In an exemplary embodiment, a computer-readable storage medium, such as a memory, including instructions executable by a processor in a terminal to perform the method of generating lyrics or the method of displaying lyrics in the above embodiments is also provided. For example, the computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
Claims (13)
1. A method of generating lyrics, the method comprising:
acquiring a second lyric file of a target song, wherein the second lyric file comprises lyrics of the target song and display time of a plurality of characters of the lyrics;
establishing a first array and a third array, writing the characters into the first array, writing the display time of the characters into the third array, and determining the characters in the first array as the lyrics of the target song;
acquiring an original audio file of the target song, extracting an audio clip of a human sound frequency band in the original audio file, and generating the audio file;
determining a candidate pronunciation of the character to be marked according to the character to be marked in the plurality of characters of the lyrics and the audio segment corresponding to the plurality of characters in the audio file;
if the character to be marked has a candidate pronunciation, determining the candidate pronunciation as the corresponding pronunciation of the character to be marked in the target song;
if the character to be marked has at least two candidate pronunciations, determining the candidate pronunciation matched with the semantics of the character to be marked as the corresponding pronunciation of the character to be marked in the target song;
establishing a second array, and writing the corresponding pronunciation of the character to be marked in the target song into the second array;
when a target word exists in the lyrics, determining the display time of the target word according to the display time of at least two adjacent characters to be labeled included in the target word, wherein the target word includes at least two adjacent characters to be labeled, and the number of the included characters to be labeled is not equal to the number of pronunciations of the target word;
writing the display time of the target word into the third array;
adding a first array, the second array and the third array to a first lyric file;
the manner of determining the display time of the target word is as follows: merging the display times of the at least two adjacent characters to be marked included in the target word, and determining the merged display time as the display time of the target word; and storing the pronunciation corresponding to each character to be marked at a second storage position of the second array according to a first storage position of each character to be marked in the first array, wherein the first storage position and the second storage position are associated bytes in the first array and the second array.
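For readability only, the following Python sketch illustrates one possible layout of the first, second and third arrays recited in claim 1, including the merging of display times for a target word; the function name, the timestamps and the sample characters are assumptions introduced for illustration and do not limit the claim.

```python
# Illustrative sketch (all names and sample values are assumptions): a first array
# of characters, a second array of readings stored at positions associated with the
# characters, and a third array of display times in which the times of the characters
# forming a target word are merged.
def build_first_lyric_file(chars, display_times, readings, target_words):
    """
    chars:         lyric characters (first array)
    display_times: list of (start_ms, duration_ms), one entry per character (third array)
    readings:      dict {char_index: reading} for the characters to be annotated
    target_words:  list of (start_index, end_index) spans whose number of readings
                   differs from their number of characters
    """
    first_array = list(chars)
    second_array = [readings.get(i) for i in range(len(chars))]   # associated positions
    third_array = list(display_times)

    # Merge the display times of the adjacent annotated characters of each target word.
    for start, end in target_words:
        begin = third_array[start][0]
        total = sum(duration for _, duration in third_array[start:end + 1])
        for i in range(start, end + 1):
            third_array[i] = (begin, total)

    return {"characters": first_array, "readings": second_array, "times": third_array}

if __name__ == "__main__":
    # Hypothetical target word of two characters sung as a single reading.
    chars = ["花", "儿", "开"]
    times = [(0, 300), (300, 200), (500, 400)]
    readings = {0: "huar1"}                     # sample reading, illustration only
    print(build_first_lyric_file(chars, times, readings, target_words=[(0, 1)]))
```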
2. The method of claim 1, wherein the determining the candidate reading of the character to be labeled according to the character to be labeled in the plurality of characters in the lyric and the audio segment corresponding to the plurality of characters in the audio file comprises:
determining Chinese characters in the characters as the characters to be labeled;
recognizing the pronunciations of the plurality of characters according to the audio segments corresponding to the plurality of characters;
and determining, according to the character to be labeled in the plurality of characters, a candidate pronunciation of the character to be labeled from the pronunciations of the plurality of characters; or determining, according to the character to be labeled in the plurality of characters, an intermediate pronunciation of the character to be labeled from the pronunciations of the plurality of characters, and identifying at least two candidate pronunciations of the character to be labeled according to the intermediate pronunciation.
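As a non-limiting sketch of the selection described in claim 2, the snippet below treats Chinese characters as the characters to be labeled and keeps the reading recognized from the audio segment as the candidate reading, falling back to all known readings when the recognized reading is ambiguous; the recognizer output and the reading dictionary are hypothetical.

```python
# Illustrative only: pick candidate readings for the characters to be labeled.
def is_chinese_char(ch):
    return "\u4e00" <= ch <= "\u9fff"            # CJK Unified Ideographs (basic block)

def candidate_readings(chars, recognized_readings, known_readings):
    """
    chars:                lyric characters
    recognized_readings:  dict {index: reading recognized from the audio segment}
    known_readings:       dict {character: set of readings the character may take}
    Returns {index: list of candidate readings} for the characters to be labeled.
    """
    candidates = {}
    for i, ch in enumerate(chars):
        if not is_chinese_char(ch):
            continue                             # only Chinese characters are labeled
        recognized = recognized_readings.get(i)
        possible = known_readings.get(ch, set())
        if recognized in possible:
            candidates[i] = [recognized]         # one unambiguous candidate reading
        else:
            candidates[i] = sorted(possible)     # at least two candidates to disambiguate
    return candidates
```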
3. The method according to claim 1, wherein when there are at least two candidate readings of the character to be annotated, determining the candidate reading that matches the semantic meaning of the character to be annotated as the reading that the character to be annotated corresponds to in the target song comprises:
for each candidate reading, searching a first dictionary engine for an intermediate character corresponding to each candidate reading, and searching a second dictionary engine for a semantic meaning of the intermediate character, wherein the first dictionary engine comprises a plurality of candidate readings and intermediate characters corresponding to the plurality of candidate readings, and the second dictionary engine comprises a plurality of intermediate characters and a plurality of semantic meanings corresponding to the plurality of intermediate characters;
and selecting, from the at least two candidate readings and according to the semantics of the character to be annotated, a candidate reading whose intermediate character has semantics matching the semantics of the character to be annotated, and determining the selected candidate reading as the reading corresponding to the character to be annotated in the target song.
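Again purely for illustration (the dictionary engines and the demo data below are hypothetical), one way to realize the disambiguation in claim 3 is to map each candidate reading to its intermediate character, look up that character's semantics, and keep the candidate whose semantics match those of the character to be annotated.

```python
# Illustrative only: choose the candidate reading whose intermediate character's
# semantics overlap with the semantics of the character to be annotated.
def choose_reading(candidates, char_semantics, first_dictionary, second_dictionary):
    """
    candidates:        iterable of candidate readings
    char_semantics:    set of semantic tags of the character to be annotated
    first_dictionary:  dict {candidate reading: intermediate character}
    second_dictionary: dict {intermediate character: set of semantic tags}
    """
    for reading in candidates:
        intermediate = first_dictionary.get(reading)
        semantics = second_dictionary.get(intermediate, set())
        if semantics & char_semantics:           # semantic overlap counts as a match
            return reading
    return None                                  # no candidate matched

if __name__ == "__main__":
    # Entirely made-up demo data.
    first = {"reading_a": "char_a", "reading_b": "char_b"}
    second = {"char_a": {"joy"}, "char_b": {"music"}}
    print(choose_reading(["reading_a", "reading_b"], {"music"}, first, second))  # reading_b
```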
4. The method of claim 1, wherein the generating a first lyric file of the target song according to the plurality of characters and the corresponding reading of the character to be labeled in the target song comprises:
establishing a second array, and writing the corresponding pronunciation of the character to be marked in the target song into the second array;
adding a first array and the second array to the first lyric file, the first array for storing a plurality of characters of the lyric.
5. A method for displaying lyrics is applied to a terminal, and the method comprises the following steps:
when a lyric display instruction is received, acquiring a first lyric file of a target song, wherein the lyric display instruction is used for displaying the lyrics of the target song, and the first lyric file is generated based on the method for generating the lyrics in claim 1;
acquiring lyrics of the target song and corresponding pronunciation of characters to be labeled in the target song from the first lyric file;
displaying a plurality of characters of the lyrics;
when the character to be marked is displayed, marking the corresponding pronunciation of the character to be marked in the target song on the target position of the character to be marked, wherein the target position is above the character to be marked.
6. The method of claim 5, wherein the first lyric file further comprises a display time of the plurality of characters, and wherein correspondingly, the displaying the plurality of characters of the lyric comprises:
highlighting the currently played target word in the plurality of characters according to the display time of the target word in the plurality of characters;
highlighting the currently played character in the characters except the target word according to the display time of each character except the target word in the characters;
the target word comprises at least two adjacent characters to be labeled, the number of the included characters to be labeled is not equal to the number of pronunciations of the target word, and the display time of the target word is the display time obtained by combining the display times of the at least two adjacent characters to be labeled included in the target word.
7. The method according to claim 6, wherein the marking, at the target position of the character to be marked, the corresponding pronunciation of the character to be marked in the target song when the character to be marked is displayed comprises:
and when the character to be marked is displayed, highlighting the currently played character to be marked and the corresponding pronunciation of the currently played character to be marked in the target song according to the display time of the character to be marked.
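As a final illustrative sketch (the timing values are assumed, not taken from the patent), the snippet below applies the highlighting rule of claims 6 and 7: every character whose display interval contains the current playback position is highlighted, so the characters of a target word, which share a merged display time, light up together.

```python
# Illustrative only: return the indices of the characters to highlight at a given
# playback position; characters of a target word share one merged display interval.
def indices_to_highlight(display_times, position_ms):
    return [i for i, (start, duration) in enumerate(display_times)
            if start <= position_ms < start + duration]

if __name__ == "__main__":
    # The first two characters form a target word with a merged display time.
    times = [(0, 500), (0, 500), (500, 400)]
    for pos in (100, 450, 700):
        print(pos, "->", indices_to_highlight(times, pos))
```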
8. An apparatus for generating lyrics, the apparatus comprising:
the acquisition module is used for acquiring a second lyric file of a target song, wherein the second lyric file comprises lyrics of the target song and display time of a plurality of characters of the lyrics; establishing a first array and a third array, writing the characters into the first array, writing the display time of the characters into the third array, and determining the characters in the first array as the lyrics of the target song; acquiring an original audio file of the target song, extracting an audio clip of a human sound frequency band in the original audio file, and generating the audio file;
the determining module is used for determining a candidate pronunciation of the character to be marked according to the character to be marked in the plurality of characters of the lyrics and the audio segment corresponding to the plurality of characters in the audio file;
the determining module is further configured to determine, if the character to be annotated has a candidate pronunciation, the candidate pronunciation as a pronunciation corresponding to the character to be annotated in the target song;
the determining module is further configured to determine, if the character to be annotated has at least two candidate pronunciations, a candidate pronunciation that matches the semantics of the character to be annotated as a pronunciation corresponding to the character to be annotated in the target song;
the generating module is used for establishing a second array and writing the corresponding pronunciation of the character to be marked in the target song into the second array; when a target word exists in the lyrics, determining the display time of the target word according to the display times of at least two adjacent characters to be labeled included in the target word, wherein the target word includes at least two adjacent characters to be labeled, and the number of the included characters to be labeled is not equal to the number of pronunciations of the target word; writing the display time of the target word into the third array; and adding the first array, the second array and the third array to a first lyric file; wherein the manner of determining the display time of the target word is as follows: merging the display times of the at least two adjacent characters to be marked included in the target word, and determining the merged display time as the display time of the target word; and storing the pronunciation corresponding to each character to be marked at a second storage position of the second array according to a first storage position of each character to be marked in the first array, wherein the first storage position and the second storage position are associated bytes in the first array and the second array.
9. The apparatus of claim 8, wherein the determining module comprises:
the determining unit is used for determining Chinese characters in the characters as the characters to be labeled;
the recognition unit is used for recognizing the pronunciation of the characters according to the audio clips corresponding to the characters;
the determining unit is further configured to determine, according to a character to be labeled in the plurality of characters, a candidate pronunciation of the character to be labeled from the pronunciations of the plurality of characters, or determine, according to the character to be labeled in the plurality of characters, an intermediate pronunciation of the character to be labeled from the pronunciations of the plurality of characters, and identify, according to the intermediate pronunciation, at least two candidate pronunciations of the character to be labeled.
10. The apparatus of claim 9, wherein the determining module comprises:
the searching unit is used for searching, for each candidate reading, an intermediate character corresponding to the candidate reading from a first dictionary engine and searching the semantics of the intermediate character from a second dictionary engine, wherein the first dictionary engine comprises a plurality of candidate readings and intermediate characters corresponding to the plurality of candidate readings, and the second dictionary engine comprises a plurality of intermediate characters and a plurality of semantics corresponding to the plurality of intermediate characters;
and the selecting unit is used for selecting, from the at least two candidate pronunciations and according to the semantics of the character to be annotated, a candidate pronunciation whose intermediate character has semantics matching the semantics of the character to be annotated, and determining the selected candidate pronunciation as the pronunciation corresponding to the character to be annotated in the target song.
11. An apparatus for displaying lyrics, the apparatus being applied to a terminal, the apparatus comprising:
an obtaining module, configured to obtain a first lyric file of a target song when a lyric display instruction is received, where the lyric display instruction is used to display lyrics of the target song, and the first lyric file is generated based on the method for generating lyrics in claim 1;
the obtaining module is further configured to obtain lyrics of the target song from the first lyric file, and corresponding reading of characters to be labeled in the target song from among a plurality of characters of the lyrics;
the display module is used for displaying a plurality of characters of the lyrics;
and the marking module is used for marking the corresponding pronunciation of the character to be marked in the target song at the target position of the character to be marked when the character to be marked is displayed, wherein the target position is above the character to be marked.
12. An electronic device, comprising a processor and a memory, wherein at least one instruction is stored in the memory, and wherein the instruction is loaded and executed by the processor to implement the operations performed by the method for generating lyrics of any one of claims 1 to 4 or the method for displaying lyrics of any one of claims 5 to 7.
13. A computer-readable storage medium having stored therein at least one instruction, which is loaded and executed by a processor, to implement the operations performed by the method of generating lyrics of any one of claims 1 to 4 or the method of displaying lyrics of any one of claims 5 to 7.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810513535.7A CN108829751B (en) | 2018-05-25 | 2018-05-25 | Method and device for generating lyrics and displaying lyrics, electronic equipment and storage medium |
PCT/CN2019/076817 WO2019223394A1 (en) | 2018-05-25 | 2019-03-04 | Method and apparatus for generating lyrics, method and apparatus for displaying lyrics, electronic device, and storage medium |
SG11202011715PA SG11202011715PA (en) | 2018-05-25 | 2019-03-04 | Method and apparatus for generating lyrics, method and apparatus for displaying lyrics, electronic device, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810513535.7A CN108829751B (en) | 2018-05-25 | 2018-05-25 | Method and device for generating lyrics and displaying lyrics, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108829751A CN108829751A (en) | 2018-11-16 |
CN108829751B true CN108829751B (en) | 2022-02-25 |
Family
ID=64146008
Family Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---|
CN201810513535.7A Active CN108829751B (en) | 2018-05-25 | 2018-05-25 | Method and device for generating lyrics and displaying lyrics, electronic equipment and storage medium |
Country Status (3)
Country | Link |
---|---|
CN (1) | CN108829751B (en) |
SG (1) | SG11202011715PA (en) |
WO (1) | WO2019223394A1 (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108829751B (en) * | 2018-05-25 | 2022-02-25 | 腾讯音乐娱乐科技(深圳)有限公司 | Method and device for generating lyrics and displaying lyrics, electronic equipment and storage medium |
CN110264994B (en) * | 2019-07-02 | 2021-08-20 | 珠海格力电器股份有限公司 | Voice synthesis method, electronic equipment and intelligent home system |
CN111797252A (en) * | 2020-06-05 | 2020-10-20 | 福建星网视易信息系统有限公司 | Song auxiliary information display method and computer-readable storage medium |
CN112133327B (en) * | 2020-09-17 | 2024-02-13 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio sample extraction method, device, terminal and storage medium |
CN113011127A (en) * | 2021-02-08 | 2021-06-22 | 杭州网易云音乐科技有限公司 | Text phonetic notation method and device, storage medium and electronic equipment |
CN113359998B (en) * | 2021-05-24 | 2023-11-21 | 维沃移动通信有限公司 | Information query method and device |
CN113920786B (en) * | 2021-09-07 | 2024-02-23 | 北京小唱科技有限公司 | Singing teaching method and device |
CN114760493B (en) * | 2022-03-25 | 2024-09-03 | 腾讯音乐娱乐科技(深圳)有限公司 | Method, device and storage medium for adding lyric progress image |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102881309A (en) * | 2012-09-24 | 2013-01-16 | 广东欧珀移动通信有限公司 | Lyric file generating and correcting method and device |
CN106340291A (en) * | 2016-09-27 | 2017-01-18 | 广东小天才科技有限公司 | Bilingual subtitle making method and system |
CN106570001A (en) * | 2016-10-24 | 2017-04-19 | 广州酷狗计算机科技有限公司 | Method and device for transliterating characters |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP4277697B2 (en) * | 2004-01-23 | 2009-06-10 | ヤマハ株式会社 | SINGING VOICE GENERATION DEVICE, ITS PROGRAM, AND PORTABLE COMMUNICATION TERMINAL HAVING SINGING VOICE GENERATION FUNCTION |
CN107943405A (en) * | 2016-10-13 | 2018-04-20 | 广州市动景计算机科技有限公司 | Sound broadcasting device, method, browser and user terminal |
CN107122493B (en) * | 2017-05-19 | 2020-04-28 | 北京金山安全软件有限公司 | Song playing method and device |
CN108829751B (en) * | 2018-05-25 | 2022-02-25 | 腾讯音乐娱乐科技(深圳)有限公司 | Method and device for generating lyrics and displaying lyrics, electronic equipment and storage medium |
- 2018-05-25: CN CN201810513535.7A patent/CN108829751B/en (active)
- 2019-03-04: SG SG11202011715PA patent/SG11202011715PA/en (status unknown)
- 2019-03-04: WO PCT/CN2019/076817 patent/WO2019223394A1/en (active, Application Filing)
Also Published As
Publication number | Publication date |
---|---|
SG11202011715PA (en) | 2021-01-28 |
WO2019223394A1 (en) | 2019-11-28 |
CN108829751A (en) | 2018-11-16 |
Similar Documents
Publication | Title
---|---
CN108829751B (en) | Method and device for generating lyrics and displaying lyrics, electronic equipment and storage medium
CN111564152B (en) | Voice conversion method and device, electronic equipment and storage medium
CN110322760B (en) | Voice data generation method, device, terminal and storage medium
CN112735429B (en) | Method for determining lyric timestamp information and training method of acoustic model
CN110992927B (en) | Audio generation method, device, computer readable storage medium and computing equipment
CN111524501A (en) | Voice playing method and device, computer equipment and computer readable storage medium
CN108763441B (en) | Method and device for generating lyrics and displaying lyrics, electronic equipment and storage medium
CN112632445A (en) | Webpage playing method, device, equipment and storage medium
CN111428079B (en) | Text content processing method, device, computer equipment and storage medium
CN111081277B (en) | Audio evaluation method, device, equipment and storage medium
CN111368136A (en) | Song identification method and device, electronic equipment and storage medium
CN108763521B (en) | Method and device for storing lyric phonetic notation
CN112069350A (en) | Song recommendation method, device, equipment and computer storage medium
CN112786025B (en) | Method for determining lyric timestamp information and training method of acoustic model
CN111028823B (en) | Audio generation method, device, computer readable storage medium and computing equipment
CN111640432B (en) | Voice control method, voice control device, electronic equipment and storage medium
CN108831423A (en) | Extract method, apparatus, terminal and the storage medium of theme track in audio data
CN108763182B (en) | Method and device for rendering lyrics
CN111125424B (en) | Method, device, equipment and storage medium for extracting core lyrics of song
CN113343022A (en) | Song teaching method, device, terminal and storage medium
CN111428523A (en) | Translation corpus generation method and device, computer equipment and storage medium
CN112380380B (en) | Method, device, equipment and computer readable storage medium for displaying lyrics
CN109981893B (en) | Lyric display method and device
CN114760493B (en) | Method, device and storage medium for adding lyric progress image
CN113160781B (en) | Audio generation method, device, computer equipment and storage medium
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant