CN106528714A

CN106528714A - Method and device for obtaining character prompt file

Info

Publication number: CN106528714A
Application number: CN201610951816.1A
Authority: CN
Inventors: 刘勇; 庄正中; 刘翠; 陈传艺; 李祖辉
Original assignee: Guangzhou Kugou Computer Technology Co Ltd
Current assignee: Guangzhou Kugou Computer Technology Co Ltd
Priority date: 2016-10-26
Filing date: 2016-10-26
Publication date: 2017-03-22
Anticipated expiration: 2036-10-26
Also published as: CN106528714B

Abstract

The invention discloses a method and a device for obtaining a character prompt file and belongs to the technical field of networks. The method comprises the following steps of obtaining a first character prompt file and at least one second character prompt file; if the similarity between each row in multiple rows of prompt messages in the first character prompt file and at least one row of any one second character prompt file is greater than a first value, determining at least one row of the second character prompt file as a first target row and determining the second character prompt file where the first target row is located as a first target character prompt file; if the ratio of the number of the first target character prompt file to the number of at least one second character prompt file is greater than a second value, determining the first target row having the highest similarity with the row as a to-be-synthesized row; and synthesizing the character prompt file according to the to-be-synthesized row corresponding to the multiple rows of prompt messages in the first character prompt file. The invention provides the method for obtaining the character prompt file more accurately.

Description

Method and device for acquiring text prompt file

Technical Field

The present invention relates to the field of network technologies, and in particular, to a method and an apparatus for acquiring a text prompt file.

Background

With the development of network technology, the multimedia resources available on networks, such as audio files and video files, are becoming richer. To fully represent the voice content of a multimedia file, a multimedia file circulating on the network is usually accompanied by a text prompt file corresponding to that voice content; for example, an audio file is often accompanied by a lyric file. Typically, these text prompt files are uploaded by users. However, because the text prompt files are created by the users themselves, the prompt information they contain often has errors; for example, a lyric file may include a user's greetings, advertisement text, or advertisement links.

Since the prompt information included in the text prompt file uploaded by the user usually has errors, which may result in poor accuracy of the text prompt file, a method for accurately acquiring the text prompt file is urgently needed.

Disclosure of Invention

In order to solve the problems in the prior art, embodiments of the present invention provide a method and an apparatus for obtaining a text prompt file. The technical scheme is as follows:

In one aspect, a method for acquiring a text prompt file is provided, the method comprising:

acquiring a first text prompt file and at least one second text prompt file, wherein the first text prompt file and the second text prompt file both correspond to the same multimedia file;

for each of a plurality of lines of prompt information in the first text prompt file,

if the similarity between the line and at least one line of any second text prompt file is greater than a first numerical value, determining the at least one line of the second text prompt file as a first target line, and determining the second text prompt file in which the first target line is located as a first target text prompt file;

if the ratio of the number of the first target text prompt files to the number of the at least one second text prompt file is greater than a second numerical value, determining the first target line with the highest similarity to the line as a line to be synthesized;

and synthesizing the text prompt file according to the lines to be synthesized corresponding to the plurality of lines of prompt information in the first text prompt file.

The technical scheme provided by the embodiment of the invention has the following beneficial effects:

In one possible implementation, the method further includes:

for each of a plurality of lines of prompt information in the first text prompt file and for each of the at least one second text prompt file,

comparing the number of characters included in the line of the first text prompt file with the number of characters included in at least one line of the second text prompt file;

when the number of characters included in the line of the first text prompt file is not less than the number of characters included in at least one line of the second text prompt file, determining the number of characters included in the line as a target number of characters; or,

when the number of characters included in the line of the first text prompt file is smaller than the number of characters included in at least one line of the second text prompt file, determining the number of characters included in at least one line of the second text prompt file as a target number of characters;

determining the number of identical characters shared by the line of the first text prompt file and the at least one line of the second text prompt file;

and acquiring the ratio of the same number of characters to the target number of characters as the similarity between the line and at least one line of the second text prompt file.
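The similarity calculation described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `line_similarity` is hypothetical, and counting "identical characters" as a multiset intersection is an assumption, since the text does not specify how shared characters are matched.

```python
from collections import Counter


def line_similarity(line_a: str, line_b: str) -> float:
    """Ratio of shared characters to the target number of characters."""
    # Target number of characters: the larger of the two character counts.
    target = max(len(line_a), len(line_b))
    if target == 0:
        return 0.0
    # Assumption: identical characters are matched as a multiset intersection,
    # so a character that repeats in both lines is counted once per matched pair.
    shared = sum((Counter(line_a) & Counter(line_b)).values())
    return shared / target


print(line_similarity("hello world", "hello word"))
```

Dividing by the larger count keeps the ratio at or below 1 and penalizes a candidate that is much longer or shorter than the line being compared.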

In one possible implementation manner, the obtaining the first text prompt file and the at least one second text prompt file includes:

acquiring a plurality of versions of text prompt files, wherein the plurality of versions of text prompt files all correspond to the same multimedia file;

determining the number of file characters included in each text prompt file and the median of a plurality of file characters in the text prompt files of the plurality of versions;

and, among the multiple versions of text prompt files, acquiring the text prompt file whose file character count is closest to the median as the first text prompt file, and acquiring the remaining text prompt files as the second text prompt files.

In one possible implementation manner, the obtaining multiple versions of the text prompt file includes:

acquiring multiple versions of text prompt files to be detected, wherein the multiple versions of text prompt files to be detected all correspond to the same multimedia file;

if English characters exist among the characters included in the multiple versions of text prompt files to be detected, converting the English characters into English characters of preset word forms; or,

if traditional Chinese characters exist among the characters included in the multiple versions of text prompt files to be detected, converting the traditional Chinese characters into simplified Chinese characters;

and acquiring the multiple versions of text prompt files to be detected, after character conversion, as the multiple versions of text prompt files.

In one possible implementation, the method further includes:

if the similarity between the line and at least one line of any second text prompt file is not greater than the first numerical value, or if the ratio of the number of the first target text prompt files to the number of the at least one second text prompt file is not greater than the second numerical value, merging the line with its next line, and comparing the similarity of the merged line with the at least one line of the second text prompt file;

if the similarity between the merged line and at least one line of the second text prompt file is greater than the first numerical value, determining the at least one line of the second text prompt file as a second target line, and determining the second text prompt file in which the second target line is located as a second target text prompt file;

and if the ratio of the number of the second target text prompt files to the number of the at least one second text prompt file is greater than the second numerical value, determining the second target line with the highest similarity to the merged line as the line to be synthesized.

In one possible implementation, for each line of the plurality of lines of prompt information in the first text prompt file, the at least one line of any second text prompt file refers to: the first line in the second text prompt file on which similarity calculation has not been performed; that first line together with its previous line; or that first line together with its next line. Alternatively, if it is determined that, for a third numerical value of lines of the first text prompt file, the similarity between each such line and the corresponding at least one line of the second text prompt file is not greater than the first numerical value, the at least one line of the second text prompt file corresponding to the next line refers to: the second line in the second text prompt file on which similarity calculation has not been performed; that second line together with its previous line; or that second line together with its next line.

In another aspect, a device for acquiring a text prompt file is provided, the device comprising:

an acquisition module, configured to acquire a first text prompt file and at least one second text prompt file, wherein the first text prompt file and the second text prompt file both correspond to the same multimedia file;

a determining module, configured to, for each of a plurality of lines of prompt information in the first text prompt file,

and the synthesis module is used for synthesizing the character prompt file according to the lines to be synthesized corresponding to the lines of prompt information in the first character prompt file.

In one possible implementation, the apparatus further includes:

a similarity calculation module for calculating a similarity between each of a plurality of lines of prompt information in the first text prompt document and each of the at least one second text prompt document,

In one possible implementation, the obtaining module is configured to:

In one possible implementation, the determining module is further configured to:

Drawings

To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and a person skilled in the art may obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of a method for obtaining a text prompt file according to an embodiment of the present invention;

Fig. 2 is a flowchart of a method for obtaining a text prompt file according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of prompt information according to an embodiment of the present invention;

Fig. 4A is a schematic structural diagram of an apparatus for obtaining a text prompt file according to an embodiment of the present invention;

Fig. 4B is a schematic structural diagram of an apparatus for obtaining a text prompt file according to an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of a terminal according to an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of a server according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

Fig. 1 is a flowchart of a method for obtaining a text prompt file according to an embodiment of the present invention. Referring to fig. 1, the method includes:

101. Obtain a first text prompt file and at least one second text prompt file, wherein the first text prompt file and the second text prompt file correspond to the same multimedia file.

102A. For each line of the plurality of lines of prompt information in the first text prompt file, if the similarity between the line and at least one line of any second text prompt file is greater than a first numerical value, determine the at least one line of the second text prompt file as a first target line, and determine the second text prompt file in which the first target line is located as a first target text prompt file.

102B. If the ratio of the number of the first target text prompt files to the number of the at least one second text prompt file is greater than a second numerical value, determine the first target line with the highest similarity to the line as the line to be synthesized.

103. Synthesize the text prompt file according to the lines to be synthesized corresponding to the plurality of lines of prompt information in the first text prompt file.

In the embodiment of the present invention, for each line of the plurality of lines of prompt information in the first text prompt file, a line to be synthesized is determined when two conditions hold: the similarity between the line and at least one line of a second text prompt file is greater than a first numerical value, and the ratio of the number of second text prompt files containing such a matching line to the total number of second text prompt files is greater than a second numerical value. The matching line with the highest similarity to the line is then determined as the line to be synthesized, and the lines to be synthesized are combined into a new text prompt file. Each line of the new text prompt file is thus derived from prompt information jointly confirmed by multiple text prompt files, which provides a more accurate method for acquiring a text prompt file.
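The two-threshold decision in steps 102A and 102B can be sketched as below. This is an illustrative sketch under stated assumptions: the names `similarity` and `choose_line` are hypothetical, the similarity measure (shared characters over the longer length) is one possible choice, and the default `second_value` of 0.5 is an assumed value that the text does not specify.

```python
from collections import Counter
from typing import List, Optional


def similarity(a: str, b: str) -> float:
    # Assumed measure: shared characters divided by the longer line's length.
    target = max(len(a), len(b))
    return sum((Counter(a) & Counter(b)).values()) / target if target else 0.0


def choose_line(line: str,
                candidates_per_file: List[List[str]],
                first_value: float = 0.9,
                second_value: float = 0.5) -> Optional[str]:
    """Return the line to be synthesized for `line`, or None if the vote fails.

    candidates_per_file[i] holds the candidate lines taken from the i-th
    second text prompt file.
    """
    best_score, best_text, target_files = 0.0, None, 0
    for candidates in candidates_per_file:
        # First target lines: candidates whose similarity exceeds first_value.
        scored = [(similarity(line, c), c) for c in candidates]
        hits = [(s, c) for s, c in scored if s > first_value]
        if hits:
            target_files += 1  # this file is a first target text prompt file
            score, text = max(hits, key=lambda pair: pair[0])
            if score > best_score:
                best_score, best_text = score, text
    if not candidates_per_file:
        return None
    # Accept only if enough of the second files contain a matching line.
    if target_files / len(candidates_per_file) > second_value:
        return best_text
    return None


files = [["abcdefghij"], ["abcdefghij!"], ["qqqq"]]
print(choose_line("abcdefghij", files))
```

Step 103 would then collect the non-None results for every line of the first text prompt file, in order, to form the synthesized file.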

In one possible implementation of the method according to the invention,

for each of a plurality of lines of prompt information in the first text prompt file and for each of the at least one second text prompt file,

and obtaining the similarity between the line and at least one line of the second text prompt file according to the ratio of the same number of characters to the target number of characters.

In one possible implementation, the obtaining the first text prompt file and the at least one second text prompt file includes:

determining the number of file characters included in each text prompt file and the median of the number of the plurality of file characters in the text prompt files of the plurality of versions;

among the multiple versions of text prompt files, the text prompt file whose file character count is closest to the median is acquired as the first text prompt file, and the remaining text prompt files are acquired as the second text prompt files.

In one possible implementation, obtaining multiple versions of the text prompt file includes:

acquiring multiple versions of text prompt files to be detected, wherein the multiple versions of text prompt files to be detected all correspond to the same multimedia file;

if English characters exist among the characters included in the multiple versions of text prompt files to be detected, converting the English characters into English characters of preset word forms; or,

if traditional Chinese characters exist among the characters included in the multiple versions of text prompt files to be detected, converting the traditional Chinese characters into simplified Chinese characters;

and acquiring the multiple versions of text prompt files to be detected, after character conversion, as the multiple versions of text prompt files.

In one possible implementation, the method further comprises:

for each of a plurality of lines of prompt information in the first text prompt file,

if the similarity between the line and at least one line of any second text prompt file is not greater than the first numerical value, or if the ratio of the number of the first target text prompt files to the number of the at least one second text prompt file is not greater than the second numerical value, merging the line with its next line, and comparing the similarity of the merged line with at least one line of the second text prompt file;

if the similarity between the merged line and at least one line of the second text prompt file is greater than the first numerical value, determining the at least one line of the second text prompt file as a second target line, and determining the second text prompt file in which the second target line is located as a second target text prompt file;

and if the ratio of the number of the second target text prompt files to the number of the at least one second text prompt file is greater than the second numerical value, determining the second target line with the highest similarity to the merged line as the line to be synthesized.
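The merge-and-retry fallback described above can be sketched as follows. The helper names `vote` and `choose_with_merge` are hypothetical, the similarity measure and the 0.5 default for the second numerical value are assumptions, and joining the two lines by plain concatenation is also an assumption, since the text does not say how merged lines are joined.

```python
from collections import Counter


def similarity(a, b):
    # Assumed measure: shared characters divided by the longer text's length.
    target = max(len(a), len(b))
    return sum((Counter(a) & Counter(b)).values()) / target if target else 0.0


def vote(text, candidates_per_file, first_value, second_value):
    """Return the best-matching candidate if enough files agree, else None."""
    best, hits = (0.0, None), 0
    for candidates in candidates_per_file:
        top = max(((similarity(text, c), c) for c in candidates),
                  key=lambda pair: pair[0], default=(0.0, None))
        if top[0] > first_value:
            hits += 1
            best = max(best, top, key=lambda pair: pair[0])
    if not candidates_per_file:
        return None
    return best[1] if hits / len(candidates_per_file) > second_value else None


def choose_with_merge(lines, i, candidates_per_file,
                      first_value=0.9, second_value=0.5):
    """Try line i alone; if the vote fails, merge it with line i + 1 and retry."""
    chosen = vote(lines[i], candidates_per_file, first_value, second_value)
    if chosen is None and i + 1 < len(lines):
        merged = lines[i] + lines[i + 1]  # assumption: plain concatenation
        chosen = vote(merged, candidates_per_file, first_value, second_value)
    return chosen


first_file = ["abcde", "fghij"]
others = [["abcdefghij"], ["abcdefghij"]]
print(choose_with_merge(first_file, 0, others))
```

In this example the first line alone scores only 0.5 against each candidate, so the vote fails; the merged text matches both candidates exactly and is accepted.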

In one possible implementation, for each line of the plurality of lines of prompt information in the first text prompt file, the at least one line of any second text prompt file refers to: the first line in the second text prompt file on which similarity calculation has not been performed; that first line together with its previous line; or that first line together with its next line. Alternatively, if it is determined that, for a third numerical value of lines of the first text prompt file, the similarity between each such line and the corresponding at least one line of the second text prompt file is not greater than the first numerical value, the at least one line of the second text prompt file corresponding to the next line refers to: the second line in the second text prompt file on which similarity calculation has not been performed; that second line together with its previous line; or that second line together with its next line.

In practice, the embodiment of the present invention can be applied to any device capable of acquiring multiple versions of text prompt files, for example a server or a terminal. Fig. 2 is a flowchart of a method for obtaining a text prompt file according to an embodiment of the present invention. Referring to Fig. 2, this embodiment specifically includes:

201. Acquire multiple versions of text prompt files, wherein the multiple versions of text prompt files all correspond to the same multimedia file.

The multimedia file may be, but is not limited to, an audio file or a video file. The text prompt file indicates the voice content of a multimedia file, for example a lyric file of an audio file or a subtitle file of a video file. Because the text prompt files circulating on the network are usually uploaded by different users, the content and content format of the text prompt files made by different users for the same multimedia file may differ, so multiple versions of the text prompt file can be obtained for one multimedia file. A difference in content format means that the same piece of prompt information occupies one line in one text prompt file and multiple lines in another.

When acquiring the multiple versions of text prompt files, text prompt files matching a known multimedia file identifier and a known text prompt file format are searched for, and the files found are acquired as the multiple versions of text prompt files. The multimedia file identifier may be, but is not limited to, the name and author of the multimedia file, and the text prompt file format may be, but is not limited to, the extension of the text prompt file; for example, the extension of a lyric file is typically lrc (lyrics).

In this step, the embodiment of the present invention does not limit the source from which the multiple versions of text prompt files are acquired. For example, a server providing multimedia services (e.g., a server for music or video applications) often already has a large number of text prompt files configured in its database. As another example, in practice there are also many text prompt file resources on the network. Based on these two sources, the files can be acquired in the following two modes:

Mode 1: acquire the multiple versions of text prompt files from the configured database of the server.

In this mode, the search and acquisition can be performed in the configured database of the server using the multimedia file identifier and the text prompt file format.

Mode 2: acquire the multiple versions of text prompt files from the network.

In this mode, the search and acquisition can be performed automatically on the network by a web crawler or a search engine tool, using the multimedia file identifier and the text prompt file format.

It should be noted that step 201 is an optional step in the embodiment of the present invention. In practice, the word forms in the acquired multiple versions of text prompt files may differ. To prevent word form from affecting the line similarity calculated in step 204 and to improve calculation accuracy, the multiple versions of text prompt files may be acquired through the following steps 201A to 201C:

201A. Acquire multiple versions of text prompt files to be detected, wherein the multiple versions of text prompt files to be detected all correspond to the same multimedia file.

The process of acquiring the multiple versions of text prompt files to be detected in step 201A is similar to the process of acquiring the multiple versions of text prompt files in step 201.

201B. If English characters exist among the characters included in the multiple versions of text prompt files to be detected, convert the English characters into English characters of preset word forms; or, if traditional Chinese characters exist among the characters included in the multiple versions of text prompt files to be detected, convert the traditional Chinese characters into simplified Chinese characters.

In step 201B, the English conversion is applied if English characters are detected, and the traditional-to-simplified conversion is applied if traditional Chinese characters are detected.

For the case of English characters: because the form of an English word is affected by grammar (abbreviations, plural forms) and tense, two English words of different forms may express the same meaning. Such English characters can therefore be converted into English characters of the same preset word form, which facilitates the subsequent calculation of line similarity. It should be noted that the correspondence between English characters with the same meaning and the preset word form needs to be configured locally in advance. The embodiment of the present invention does not specifically limit the preset word forms; Table 1 below is taken as an example:

TABLE 1

English characters with the same meaning	Preset word form
’ve, had, has	have
’s, is, are, was, were, am	be
did, does	do
n’t	not

According to Table 1, taking the detected text "I've been live a lie, there's not leading inside" as an example, the converted English characters are "I, ve, be, live, a, lie, there, be, not leading, inside".

For the case of traditional Chinese characters: because traditional and simplified Chinese characters are different written forms of the same Chinese characters, traditional Chinese characters can be converted into simplified Chinese characters through a configured traditional-to-simplified conversion relation. Taking Table 2 as an example:

TABLE 2

Traditional Chinese character	Simplified Chinese character
A	A
Golden character	Sample (A)
	Tear
Inside lining	Lining (Chinese character of 'li')

According to the above Table 2, taking the detection of "pain of golden character of A" as an example, the converted simplified Chinese character is "the same pain as tear".

201C. Acquire the multiple versions of text prompt files to be detected, after character conversion, as the multiple versions of text prompt files.

Steps 201A and 201B complete the preprocessing of the multiple versions of text prompt files to be detected. This avoids errors in the subsequent similarity calculation caused by differing word forms (e.g., grammar and tense) and written forms of characters, and thereby improves the accuracy of the obtained similarity.
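The preprocessing in steps 201A to 201C can be sketched as follows. This is a minimal sketch, not the configured conversion of the embodiment: the names `WORD_FORMS`, `TRAD_TO_SIMP`, `normalize_english` and `to_simplified` are hypothetical, the English mapping follows Table 1, the four traditional-to-simplified pairs are illustrative examples chosen here rather than the contents of Table 2, and lowercasing plus regex tokenization are assumptions.

```python
import re

# English forms with the same meaning mapped to a preset word form (per Table 1).
WORD_FORMS = {
    "'ve": "have", "had": "have", "has": "have",
    "'s": "be", "is": "be", "are": "be", "was": "be", "were": "be", "am": "be",
    "did": "do", "does": "do",
    "n't": "not",
}

# Illustrative traditional-to-simplified pairs (assumed, not taken from Table 2).
TRAD_TO_SIMP = {"個": "个", "樣": "样", "淚": "泪", "裏": "里"}

# Split contractions so that "there's" yields "there" and "'s",
# and "isn't" yields "is" and "n't".
TOKEN = re.compile(r"[a-z]+(?=n't)|n't|'[a-z]+|[a-z]+")


def normalize_english(text: str) -> list:
    """Tokenize English text and map each token to its preset word form."""
    text = text.lower().replace("\u2019", "'")  # unify curly apostrophes
    return [WORD_FORMS.get(tok, tok) for tok in TOKEN.findall(text)]


def to_simplified(text: str) -> str:
    """Replace each traditional character that has a configured mapping."""
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)


print(normalize_english("There's nothing I haven't had"))
print(to_simplified("一樣的淚"))
```

A production system would need a complete conversion table for Chinese and a fuller set of English word forms; the dictionaries here only show the shape of the lookup.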

It should be noted that, the time for triggering the process of acquiring the text prompt file is not specifically limited in the embodiment of the present invention. For example, the acquisition process may be performed automatically when a multimedia file playing in an application is detected. As another example, the acquisition process is performed periodically based on locally downloaded multimedia files.

202. Determine the number of file characters included in each of the multiple versions of text prompt files, and the median of these file character counts.

In this step, the number of file characters of each version of the text prompt file is calculated, and a median is determined from these file character counts. The median is defined as follows: after the file character counts of the multiple versions are sorted by size, if there is an odd number of counts, the count in the middle position is taken as the median; if there is an even number of counts, the average of the two counts in the middle positions is taken as the median. Taking file character counts of 100, 130, 120, 100 and 100 as an example, there are 5 counts, which sort to 100, 100, 100, 120 and 130, so the determined median is 100.

203. Among the multiple versions of text prompt files, the text prompt file whose file character count is closest to the median is acquired as the first text prompt file, and the remaining text prompt files are acquired as the second text prompt files.

The text prompt file whose file character count is closest to the median is the text prompt file with the smallest absolute difference between its file character count and the median. Continuing the example in step 202, it is a text prompt file with a file character count of 100 (the absolute value of the difference between 100 and 100 is 0).

The inventors have recognized that the file character counts of multiple versions of a text prompt file often differ, most likely because some versions are missing characters or include extra characters. This suggests that the version whose character count lies in the middle may be closer to the correct text prompt file. In addition, compared with using a text prompt file with more file characters, using the text prompt file whose file character count is closest to the median as the first text prompt file saves computing resources during similarity calculation and improves calculation efficiency.

In step 203, the file character count of each version is compared with the median; the text prompt file with the smallest absolute difference is acquired as the first text prompt file, and all other versions are acquired as second text prompt files. The first text prompt file serves as the reference: each second text prompt file is compared against it.
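Steps 202 and 203 can be sketched as below. The function name `split_reference` is hypothetical, each file is represented simply as a string, the file character count is assumed to be the string length, and ties are broken by taking the earliest file, which the text does not specify. The character counts in the usage lines are illustrative values chosen for this sketch.

```python
from statistics import median


def split_reference(files):
    """Return (first_file, second_files) chosen by closeness to the median count."""
    # Assumption: the file character count is the length of the file's text.
    counts = [len(text) for text in files]
    mid = median(counts)  # averages the two middle values for an even count
    # Index of the file whose count has the smallest absolute difference
    # from the median; min() keeps the earliest file on a tie (assumed rule).
    idx = min(range(len(files)), key=lambda i: abs(counts[i] - mid))
    return files[idx], files[:idx] + files[idx + 1:]


versions = ["a" * 100, "b" * 130, "c" * 120, "d" * 150, "e" * 110]
first, seconds = split_reference(versions)
print(len(first), [len(s) for s in seconds])
```

`statistics.median` matches the definition given in step 202: it returns the middle value for an odd number of counts and the mean of the two middle values for an even number.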

It should be noted that the above-mentioned steps 201-203 are optional steps in the embodiment of the present invention, and are only used as a possible implementation manner for obtaining the first text prompt file and the at least one second text prompt file. In fact, it is also possible to directly obtain multiple versions of the text prompt file (the obtaining process in step 201), use any one version of the text prompt file as the first text prompt file, use the remaining text prompt files as the second text prompt file, and combine steps 204 and 205 described below to obtain the text prompt file accurately.

204A. For each line of the plurality of lines of prompt information in the first text prompt file, if the similarity between the line and at least one line of any second text prompt file is greater than a first numerical value, determine the at least one line of the second text prompt file as a first target line, and determine the second text prompt file in which the first target line is located as a first target text prompt file.

The inventors have recognized that one line of a text prompt file may correspond to one line of another text prompt file, or to multiple lines of another text prompt file. For example, one line of lyric file A is "willing to get one heart only", while two lines of lyric file B are respectively "willing to get one heart" and "the first is not to get away". Therefore, to improve the accuracy of the similarity calculation, the embodiment of the present invention calculates the similarity between each line of the first text prompt file and at least one line of any second text prompt file.

In step 204A, the line is the first line of the first text prompt file on which similarity calculation has not yet been performed. The at least one line of any second text prompt file refers to at least one line of that second text prompt file on which similarity calculation has not been performed during the current acquisition process. The following takes as an example the case where the at least one line of the second text prompt file is: the first line on which similarity calculation has not been performed; that line together with its previous line; or that line together with its next line.

As shown in Fig. 3, A is the first text prompt file, and B, C and D are second text prompt files. The line refers to the second line of A. The at least one line of B refers to: the third line of B; the third and second lines of B; and the third and fourth lines of B. The at least one line of C refers to: the second line of C; the second and first lines of C; and the second and third lines of C. The at least one line of D refers to: the second line of D; the second and first lines of D; and the second and third lines of D. The similarity is therefore calculated between the second line of A and each of these groups in B, C and D. If the similarity with the second line of A is greater than the first numerical value only for the third and fourth lines of B and for the second line of C, this indicates that B and C contain prompt information similar to that of the second line of A; the third and fourth lines of B and the second line of C are then determined as first target lines, and B and C are determined as first target text prompt files.
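The three candidate groups examined in each second text prompt file can be sketched as follows. The function name `candidate_groups` is hypothetical, indices are 0-based, and joining adjacent lines by plain concatenation is an assumption. The sample lines are placeholders, not the lyrics of Fig. 3.

```python
def candidate_groups(lines, k):
    """Candidates around line k: line k alone, k-1 with k, and k with k+1."""
    groups = [lines[k]]
    if k > 0:
        # Line k together with its previous line (assumed plain concatenation).
        groups.append(lines[k - 1] + lines[k])
    if k + 1 < len(lines):
        # Line k together with its next line.
        groups.append(lines[k] + lines[k + 1])
    return groups


# For file B in Fig. 3, the first uncompared line is the third line (index 2).
file_b = ["intro", "la", "first half ", "second half"]
print(candidate_groups(file_b, 2))
```

Each returned group would then be compared with the current line of the first text prompt file, and the groups whose similarity exceeds the first numerical value become first target lines.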

The first numerical value is used for judging whether prompt information which is consistent with the line of prompt information exists in the second character prompt file or not. The first value is not particularly limited in the embodiment of the present invention, and for example, the first value is 0.9.

The way of calculating the similarity in step 204A is not limited in the embodiment of the present invention. For example, for each line of the plurality of lines of prompt information in the first text prompt file and for each second text prompt file in the at least one second text prompt file, the following calculation steps (1) to (4) may be used:

(1) Compare the number of characters included in the line of the first text prompt file with the number of characters included in the at least one line of the second text prompt file.

In step (1), each character included in the line of the first text prompt file and each character included in the at least one line of the second text prompt file are extracted one by one, the number of characters on each side is counted, and the two counts are compared. To avoid the influence of punctuation, punctuation marks may be excluded from the character count.

It should be noted that identical characters occurring more than once in a line are each extracted and counted separately, so as to improve the accuracy of the similarity calculation.

For example, in fig. 3, it is assumed that the second line of A is "wish only to leave one's head without leaving one's head", the third line of B is "wish only to leave one's head", the fourth line of B is "leave one's head without leaving one's head", the second line of C is "wish only to leave one's head without leaving one's head", and the second line of D is "wish only to leave one's head without leaving one's head". It is further assumed that the second line of B, the first and third lines of C, and the first and third lines of D are all "la". Then:

the number of characters of the second line of A is 11;

the number of characters of the third line of B is 6, of the third and second lines is 7, and of the third and fourth lines is 11;

the number of characters of the second line of C is 11, of the second and first lines is 12, and of the second and third lines is 12;

the number of characters of the second line of D is 13, of the second and first lines is 14, and of the second and third lines is 14.

(2) When the number of the characters included in the line of the first text prompt file is not less than the number of the characters included in at least one line of the second text prompt file, determining the number of the characters included in the line as a target number of characters; or when the number of characters included in the line of the first text prompt file is smaller than the number of characters included in at least one line of the second text prompt file, determining the number of characters included in at least one line of the second text prompt file as the target number of characters.

According to the example in (1), for A and B, the target number of characters in each of the 3 comparison cases is 11;

for A and C, the target numbers of characters in the 3 comparison cases are 11, 12 and 12, respectively;

for A and D, the target numbers of characters in the 3 comparison cases are 13, 14 and 14, respectively.

(3) Determine the number of identical characters in the line of the first text prompt file and the at least one line of the second text prompt file.

In (3), the characters that exist both in the line of the first text prompt file and in the at least one line of the second text prompt file are determined. The manner of determination is not specifically limited in the embodiment of the present invention; for example, it may be determined by taking a difference set. Continuing the example in (2), suppose the characters of the second line of A, extracted one by one, are added to list a. For B, the characters of the third line are added to list b1, those of the third and second lines to list b2, and those of the third and fourth lines to list b3. For C, the characters of the second line are added to list c1, those of the second and first lines to list c2, and those of the second and third lines to list c3. For D, the characters of the second line are added to list d1, those of the second and first lines to list d2, and those of the second and third lines to list d3. Of a and d1 (the same applies to b1, b2, b3, c1, c2, c3, d2 and d3), the list containing fewer characters is differenced against the other list, and the number of characters of the shorter list that are not in the resulting difference set is determined as the number of identical characters. Here, taking the difference of a and d1, that is, { one, heart, white, first, no, phase, and leave } - { one, desired, obtained, one, human, heart, go, white, head, no, leave, and leave } = { first, phase }, the remaining 9 characters of a are taken as the number of identical characters. The numbers of identical characters in the above example are shown in table 3:

TABLE 3

(4) Obtain the similarity between the line and the at least one line of the second text prompt file according to the ratio of the number of identical characters to the target number of characters.

From the above table 3, the similarity is obtained as in table 4:

TABLE 4

According to table 4, when the first value is 0.9, since only 1 and 10/11 are greater than 0.9, the third and fourth lines of B and the second line of C may be determined as first target lines.
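Calculation steps (1) to (4) can be sketched as below. This is a hedged illustration, not the definitive method: the names `visible_chars` and `line_similarity` are hypothetical, punctuation is detected through Unicode general categories (an assumption), and a multiset difference is used so that repeated characters are counted separately. The ASCII strings are stand-ins chosen only to reproduce the character counts of the example (11, 11 and 13 characters), not the actual lyric text.

```python
import unicodedata
from collections import Counter
from typing import List


def visible_chars(text: str) -> List[str]:
    """Extract characters one by one, skipping punctuation and whitespace."""
    return [ch for ch in text
            if not ch.isspace()
            and not unicodedata.category(ch).startswith("P")]


def line_similarity(first_line: str, second_lines: str) -> float:
    a, b = visible_chars(first_line), visible_chars(second_lines)
    # Steps (1)-(2): the larger character count is the target number.
    target = max(len(a), len(b))
    if target == 0:
        return 0.0
    # Step (3): difference the shorter list against the longer one; the
    # characters of the shorter list that survive the difference do not match.
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    leftover = Counter(shorter) - Counter(longer)
    same = len(shorter) - sum(leftover.values())
    # Step (4): ratio of identical characters to the target number.
    return same / target


a_line = "abcdefghijk"            # stand-in for the 11-character line of A
b_rows_3_4 = "abcdef" + "ghijk"   # stand-in for rows 3 and 4 of B merged
c_row_2 = "abcdefghijx"           # 11 characters, one differs from A
d_row_2 = "abcdefghiwxyz"         # 13 characters, 9 shared with A

print(line_similarity(a_line, b_rows_3_4))
print(line_similarity(a_line, c_row_2))
print(line_similarity(a_line, d_row_2))
```

With these stand-ins the three calls yield the ratios 11/11, 10/11 and 9/13, so only the first two exceed a first value of 0.9.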

204B. If the ratio of the number of the first target text prompt files to the number of the at least one second text prompt file is greater than a second value, determine the first target line with the highest similarity to the line as the line to be synthesized.

In step 204B, to improve the accuracy of obtaining the line to be synthesized, it is determined whether the proportion of first target text prompt files among the second text prompt files exceeds the second value; that is, the line to be synthesized is determined by a "vote" among the second text prompt files. The second value is used to judge whether a sufficient number of second text prompt files include at least one line of prompt information that is consistent with the line. The second value is not limited in the embodiment of the present invention; generally, in order to extract the first target line more accurately, the second value should be not less than 0.5. For example, the second value may be 0.6, indicating that the prompt information (the at least one line) of multiple second text prompt files is consistent with the prompt information (the line) of the first text prompt file.

Thus, according to the example of step 204A, the at least one second text prompt file is B, C and D, so their number is 3; the first target text prompt files are B and C, so their number is 2. The ratio of the number of the first target text prompt files to the number of the at least one second text prompt file is therefore 2/3, which is greater than 0.6. Since the first target line with the highest similarity to the line is the third and fourth lines of B (similarity 1), the third and fourth lines of B may be determined as the line to be synthesized.
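The "voting" of step 204B can be sketched as follows. The function name `vote_for_line` and the dictionary layout (second file name mapped to its best candidate text and similarity) are assumptions for illustration; the default thresholds 0.9 and 0.6 are the example values given in the description.

```python
from typing import Dict, Optional, Tuple


def vote_for_line(best_per_file: Dict[str, Tuple[str, float]],
                  first_value: float = 0.9,
                  second_value: float = 0.6) -> Optional[str]:
    """Return the line to be synthesized, or None if the vote fails.

    `best_per_file` maps each second file to (best candidate text,
    similarity of that candidate to the current line of the first file).
    """
    if not best_per_file:
        return None
    # First target lines: candidates whose similarity exceeds the first value.
    targets = [(text, sim) for text, sim in best_per_file.values()
               if sim > first_value]
    if not targets:
        return None
    # Share of second files that contain a first target line.
    if len(targets) / len(best_per_file) <= second_value:
        return None
    # Keep the first target line with the highest similarity.
    return max(targets, key=lambda item: item[1])[0]


# Values from the fig. 3 example: B scores 1, C scores 10/11, D scores 9/13.
example = {
    "B": ("rows 3 and 4 of B", 1.0),
    "C": ("row 2 of C", 10 / 11),
    "D": ("row 2 of D", 9 / 13),
}
print(vote_for_line(example))
```

Here two of the three second files pass the first value, the ratio 2/3 exceeds 0.6, and the candidate from B is selected.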

204C. If the similarity between the line and at least one line of any second text prompt file is not greater than the first value, or if the ratio of the number of the first target text prompt files to the number of the at least one second text prompt file is not greater than the second value, merge the line with its next line and compare the merged line for similarity with at least one line of the second text prompt file.

In the above steps 204A and 204B, the similarity is calculated on the assumption that the line corresponds to at least one line of a second text prompt file. In fact, the line together with its next line may correspond to a single line of a second text prompt file, which leads to the situation handled in this step 204C. Therefore, to improve the accuracy of the similarity calculation, the line may be merged with its next line, and the merged line may be compared for similarity with at least one line of the second text prompt file. The process of comparing the similarity is the same as in the above step 204A.

It should be noted that, because the similarity between the previous line of the line and the second text prompt files has already been calculated and the corresponding line to be synthesized has been successfully determined, only the next line of the line needs to be considered for merging here.

204D. If the similarity between the merged line and at least one line of a second text prompt file is greater than the first value, determine the at least one line of the second text prompt file as a second target line, and determine the second text prompt file in which the second target line is located as a second target text prompt file.

This step is performed in the same way as step 204A described above.

204E. If the ratio of the number of the second target text prompt files to the number of the at least one second text prompt file is greater than the second value, determine the second target line with the highest similarity to the merged line as the line to be synthesized.

This step is performed in the same way as step 204B.
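The fallback of steps 204C to 204E can be sketched as a thin wrapper around whatever procedure performs steps 204A and 204B. In this illustration that procedure is passed in as a hypothetical callable `vote`, which returns the line to be synthesized or `None`. The return value also reports how many lines of the first file were consumed; what to do when both attempts fail is not specified in the description, so returning `None` and advancing by one line is an assumption.

```python
from typing import Callable, List, Optional, Tuple


def select_line(first_lines: List[str], i: int,
                vote: Callable[[str], Optional[str]]) -> Tuple[Optional[str], int]:
    """Try line i alone, then line i merged with line i + 1.

    Returns (line to be synthesized or None, number of first-file lines used).
    """
    chosen = vote(first_lines[i])                  # steps 204A and 204B
    if chosen is not None:
        return chosen, 1
    if i + 1 < len(first_lines):                   # step 204C: merge with next line
        merged = first_lines[i] + first_lines[i + 1]
        chosen = vote(merged)                      # steps 204D and 204E
        if chosen is not None:
            return chosen, 2
    return None, 1                                 # assumption: give up on this line


# Stub vote for demonstration: the second files agree only on these texts.
agreed = {"hello world", "goodbye"}


def stub_vote(text: str) -> Optional[str]:
    return text if text in agreed else None


print(select_line(["hello ", "world", "goodbye"], 0, stub_vote))
```

When the merged line is accepted, the caller would continue with the second line below the current one, as the description notes.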

It should be noted that, when the line has completed step 204, the next line of the line proceeds to step 204 (if the line and its next line went through steps 204C to 204E, the second line below the line proceeds to step 204), and the corresponding at least one line of each second text prompt file is at least one line of that file for which similarity calculation has not been performed. Taking the example in step 204 as an illustration, the third line of A continues with step 204, and corresponds to the fifth line of B, the third line of C, and the third line of D.

Of course, there may be a case where the similarity between at least one line of a second text prompt file and the line never exceeds the first value. For example, the first line of lyric file 1 is lyrics while the first three lines of lyric file 2 are advertisements, so that the lines compared between lyric file 1 and lyric file 2 are likely to differ every time. Therefore, to avoid such situations and improve the comparison efficiency, if it is determined that, for a third-value number of lines of the first text prompt file, the similarity with the corresponding at least one line of the second text prompt file is not greater than the first value, then the at least one line of the second text prompt file corresponding to the next line refers to: the second line in the second text prompt file for which similarity calculation has not been performed; that second line together with its previous line; or that second line together with its next line. In fact, once the next line, or a later line, of the first text prompt file has a similarity greater than the first value with at least one line of the second text prompt file, the following similarity comparison may again be performed between the current line of the first text prompt file and the first line of the second text prompt file for which similarity calculation has not been performed, that first line together with its previous line, or that first line together with its next line.

The third value is not specifically limited in the embodiment of the present invention. For example, the third value is 3. For example, in the above step 204, if the similarities between the second line of A and the second line, the second and first lines, and the second and third lines of D; between the third line of A and the third line and the second and third lines of D; and between the fourth line of A and the fourth line, the fourth and third lines, and the fourth and fifth lines of D are all not greater than the first value, then the at least one line of D corresponding to the fifth line of A may be the sixth line, the fifth and sixth lines, or the sixth and seventh lines of D.
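The line-skipping rule can be sketched as below. This is an interpretation under stated assumptions: the function name `skip_aware_groups` is hypothetical, rows are 1-based to match the example, and the rule is assumed to trigger when the count of consecutive failed lines of the first file reaches the third value.

```python
from typing import List, Tuple


def skip_aware_groups(first_uncompared_row: int,
                      consecutive_misses: int,
                      row_count: int,
                      third_value: int = 3) -> List[Tuple[int, ...]]:
    """Return the 1-based row groups of a second file to compare next."""
    base = first_uncompared_row
    if consecutive_misses >= third_value:
        base += 1                   # use the second row not yet compared
    if not 1 <= base <= row_count:
        return []
    groups = [(base,)]              # the row on its own
    if base - 1 >= 1:
        groups.append((base - 1, base))   # with its previous row
    if base + 1 <= row_count:
        groups.append((base, base + 1))   # with its next row
    return groups


# Example from the description: lines 2, 3 and 4 of A all failed against D,
# so for line 5 of A the comparison starts from row 6 of D instead of row 5.
print(skip_aware_groups(5, consecutive_misses=3, row_count=8))
print(skip_aware_groups(5, consecutive_misses=0, row_count=8))
```

A caller would presumably reset `consecutive_misses` to zero after a successful match, which corresponds to returning to the first not-yet-compared row as described above.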

205. Synthesize the text prompt file according to the lines to be synthesized corresponding to the multiple lines of prompt information in the first text prompt file.

Based on the above step 204, the lines to be synthesized corresponding to each line of prompt information in the first text prompt file can be obtained, so that in the step 205, each line to be synthesized can be synthesized into the correct text prompt file according to the line arrangement order.

In the related art, since a prompt message included in a text prompt file uploaded by a user is usually wrong, which may result in poor accuracy of the text prompt file, a method for accurately acquiring the text prompt file is urgently needed.

In addition, of the number of characters included in the line of the first text prompt file and the number of characters included in the at least one line of the second text prompt file, the one that is not less than the other is determined as the target number of characters, and the ratio of the number of identical characters in the line of the first text prompt file and the at least one line of the second text prompt file to the target number of characters is used as the similarity, so that a specific method for calculating the similarity is provided.

In addition, multiple versions of the text prompt file are acquired, the number of file characters included in each text prompt file and the median of these numbers are determined, the text prompt file whose number of file characters is closest to the median is determined as the first text prompt file, and the text prompt files other than the first text prompt file are taken as the second text prompt files. Since a given version of the text prompt file may lack some characters or include redundant characters, among the multiple versions the version whose number of characters lies in the middle is often closer to the correct version and has a certain reference value. Using that version as the first text prompt file and comparing it for similarity with the second text prompt files makes the finally obtained text prompt file more accurate.
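The median-based choice of the first text prompt file can be sketched as follows. The function name `split_first_and_second` is hypothetical, and two details are assumptions not settled by the description: the file character count is taken as the raw string length, and a tie is resolved in favor of the version listed first.

```python
import statistics
from typing import Dict, List, Tuple


def split_first_and_second(versions: Dict[str, str]) -> Tuple[str, List[str]]:
    """Pick the version whose character count is closest to the median."""
    counts = {name: len(text) for name, text in versions.items()}
    median = statistics.median(counts.values())
    # min() keeps the earliest entry on ties (assumed tie-breaking rule).
    first = min(counts, key=lambda name: abs(counts[name] - median))
    second = [name for name in versions if name != first]
    return first, second


# Placeholder versions of one lyric file with 90, 100 and 130 characters.
versions = {"v1": "a" * 90, "v2": "a" * 100, "v3": "a" * 130}
print(split_first_and_second(versions))
```

The shortest version may be missing lines and the longest may carry extra material such as advertisements, which is why the middle-sized version is used as the reference.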

In addition, the multiple versions of the text prompt file to be detected are acquired and checked for English characters or Chinese characters; a method for uniformly converting English characters of different forms into English characters of a preset form is provided, and a method for converting traditional Chinese characters into simplified Chinese characters is also provided, so that the similarity calculation is not affected by character form and the calculated similarity is more accurate.
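The character-form normalization can be sketched as below. The description does not say what the preset form is, so lowercase half-width letters are assumed here. The traditional-to-simplified table `TRADITIONAL_TO_SIMPLIFIED` is a tiny hypothetical sample for illustration only; a real system would need a complete mapping.

```python
import unicodedata

# Illustrative sample only, not a complete conversion table.
TRADITIONAL_TO_SIMPLIFIED = {"離": "离", "願": "愿", "頭": "头"}


def normalize_line(text: str) -> str:
    """Bring English and Chinese characters into one assumed preset form."""
    # Fold full-width letters and digits to their half-width equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Assumed preset form for English characters: lowercase.
    text = text.lower()
    # Replace traditional characters that appear in the sample table.
    return "".join(TRADITIONAL_TO_SIMPLIFIED.get(ch, ch) for ch in text)


print(normalize_line("ＬＯＶＥ Love"))
print(normalize_line("白頭不離"))
```

Applying the same normalization to every version before step 204 keeps differences in case, width or script from lowering the computed similarity.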

In addition, the line is merged with its next line and the merged line is compared for similarity with at least one line of any second text prompt file to obtain the line to be synthesized. This addresses the case in which a line to be synthesized cannot be obtained from the at least one second text prompt file for a single line of the first text prompt file, and provides another way of obtaining the line to be synthesized.

In addition, each line of the first text prompt file is compared with at least one line of the second text prompt file, which takes into account the case where one line of prompt information corresponds to multiple lines of prompt information in another text prompt file, making the similarity comparison more comprehensive. Furthermore, if a third-value number of lines of the first text prompt file each have a similarity with the corresponding at least one line of the second text prompt file that is not greater than the first value, which indicates that the second text prompt file probably has several more lines of prompt information at the front than the first text prompt file, a line-skipping comparison mechanism is provided: at the next similarity comparison, the at least one line of the second text prompt file is taken to be the second line for which similarity calculation has not been performed, that second line and its previous line, or that second line and its next line, which improves both the comparison efficiency and the comparison success rate.

Fig. 4A is a schematic diagram of an apparatus for acquiring a text prompt file according to an embodiment of the present invention. Referring to fig. 4A, the apparatus specifically includes:

an obtaining module 401, configured to obtain a first text prompt file and at least one second text prompt file, where the first text prompt file and the second text prompt file both correspond to a same multimedia file;

a determining module 402, configured to, for each line of the plurality of lines of prompt information in the first text prompt file,

if the ratio of the number of the first target text prompt files to the number of the at least one second text prompt file is larger than a second numerical value, determining a first target row with the highest similarity with the row as a row to be synthesized;

the synthesizing module 403 is configured to synthesize a text prompt file according to lines to be synthesized corresponding to multiple lines of prompt information in the first text prompt file.

In the embodiment of the invention, for each line of a plurality of lines of prompt messages in a first text prompt file, when the similarity between at least one line of a second text prompt file and the line is greater than a first numerical value, and the proportion of the number of second text prompt files where at least one line meeting the similarity is located to the number of at least one second text prompt file is greater than a second numerical value, at least one line with the highest similarity to the line is determined as a line to be synthesized, and the line to be synthesized is synthesized into a new text prompt file, so that each line in the new text prompt file is derived from one line of prompt messages commonly determined by a plurality of text prompt files, and a device for acquiring the text prompt files more accurately is provided.

In one possible implementation, based on the apparatus composition shown in fig. 4A, referring to fig. 4B, the apparatus further includes:

a similarity calculation module 404 for calculating, for each line of the plurality of lines of prompt information in the first text prompt document and for each second text prompt document in the at least one second text prompt document,

In one possible implementation, the obtaining module 401 is configured to:

In one possible implementation, the determining module 402 is further configured to:

All the above optional technical solutions may be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.

It should be noted that: in the device for acquiring a text prompt file provided in the above embodiment, when the text prompt file is acquired, only the division of the above functional modules is taken as an example, and in practical applications, the function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the apparatus for acquiring a text prompt file and the method embodiment for acquiring a text prompt file provided in the above embodiments belong to the same concept, and specific implementation processes thereof are described in detail in the method embodiments and are not described herein again.

Fig. 5 is a schematic structural diagram of a terminal according to an embodiment of the present invention, where the terminal may be configured to execute the method for obtaining the text prompt file in each embodiment. Referring to fig. 5, the terminal 500 includes:

the terminal 500 may include RF (Radio Frequency) circuitry 110, memory 120 including one or more computer-readable storage media, an input unit 130, a display unit 140, a sensor 150, an audio circuit 160, a WiFi (Wireless Fidelity) module 170, a processor 180 including one or more processing cores, and a power supply 190. Those skilled in the art will appreciate that the terminal structure shown in fig. 5 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:

the RF circuit 110 may be used for receiving and transmitting signals during information transmission and reception or during a call, and in particular, receives downlink information from a base station and then sends the received downlink information to the one or more processors 180 for processing; in addition, data relating to uplink is transmitted to the base station. In general, the RF circuitry 110 includes, but is not limited to, an antenna, at least one Amplifier, a tuner, one or more oscillators, a Subscriber Identity Module (SIM) card, a transceiver, a coupler, an LNA (Low Noise Amplifier), a duplexer, and the like. In addition, the RF circuitry 110 may also communicate with networks and other devices via wireless communications. The wireless communication may use any communication standard or protocol, including but not limited to GSM (Global System for Mobile communications), GPRS (General Packet Radio Service), CDMA (Code Division Multiple Access), WCDMA (Wideband Code Division Multiple Access), LTE (Long Term Evolution), e-mail, SMS (short messaging Service), etc.

The memory 120 may be used to store software programs and modules, and the processor 180 executes various functional applications and data processing by operating the software programs and modules stored in the memory 120. The memory 120 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the terminal 500, and the like. Further, the memory 120 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 120 may further include a memory controller to provide the processor 180 and the input unit 130 with access to the memory 120.

The input unit 130 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control. In particular, the input unit 130 may include a touch-sensitive surface 131 as well as other input devices 132. The touch-sensitive surface 131, also referred to as a touch display screen or a touch pad, may collect touch operations by a user on or near the touch-sensitive surface 131 (e.g., operations by a user on or near the touch-sensitive surface 131 using a finger, a stylus, or any other suitable object or attachment), and drive the corresponding connection device according to a predetermined program. Alternatively, the touch sensitive surface 131 may comprise two parts, a touch detection means and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 180, and can receive and execute commands sent by the processor 180. Additionally, the touch-sensitive surface 131 may be implemented using various types of resistive, capacitive, infrared, and surface acoustic waves. In addition to the touch-sensitive surface 131, the input unit 130 may also include other input devices 132. In particular, other input devices 132 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.

The display unit 140 may be used to display information input by or provided to a user and various graphical user interfaces of the terminal 500, which may be made up of graphics, text, icons, video, and any combination thereof. The Display unit 140 may include a Display panel 141, and optionally, the Display panel 141 may be configured in the form of an LCD (Liquid Crystal Display), an OLED (Organic Light-Emitting Diode), or the like. Further, the touch-sensitive surface 131 may cover the display panel 141, and when a touch operation is detected on or near the touch-sensitive surface 131, the touch operation is transmitted to the processor 180 to determine the type of the touch event, and then the processor 180 provides a corresponding visual output on the display panel 141 according to the type of the touch event. Although in FIG. 5, touch-sensitive surface 131 and display panel 141 are shown as two separate components to implement input and output functions, in some embodiments, touch-sensitive surface 131 may be integrated with display panel 141 to implement input and output functions.

The terminal 500 can also include at least one sensor 150, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor that may adjust the brightness of the display panel 141 according to the brightness of ambient light, and a proximity sensor that may turn off the display panel 141 and/or a backlight when the terminal 500 is moved to the ear. As one of the motion sensors, the gravity acceleration sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when the mobile phone is stationary, and can be used for applications of recognizing the posture of the mobile phone (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured in the terminal 500, detailed descriptions thereof are omitted.

Audio circuitry 160, speaker 161, and microphone 162 may provide an audio interface between a user and terminal 500. The audio circuit 160 may transmit the electrical signal converted from the received audio data to the speaker 161, and convert the electrical signal into a sound signal for output by the speaker 161; on the other hand, the microphone 162 converts the collected sound signal into an electric signal, converts the electric signal into audio data after being received by the audio circuit 160, and then outputs the audio data to the processor 180 for processing, and then to the RF circuit 110 to be transmitted to, for example, another terminal, or outputs the audio data to the memory 120 for further processing. The audio circuit 160 may also include an earbud jack to provide communication of peripheral headphones with the terminal 500.

WiFi belongs to a short-distance wireless transmission technology, and the terminal 500 can help a user send and receive e-mails, browse web pages, access streaming media, and the like through the WiFi module 170, and it provides wireless broadband internet access for the user. Although fig. 5 shows the WiFi module 170, it is understood that it does not belong to the essential constitution of the terminal 500 and can be omitted entirely as needed within the scope not changing the essence of the invention.

The processor 180 is a control center of the terminal 500, connects various parts of the entire handset using various interfaces and lines, and performs various functions of the terminal 500 and processes data by operating or executing software programs and/or modules stored in the memory 120 and calling data stored in the memory 120, thereby performing overall monitoring of the handset. Optionally, processor 180 may include one or more processing cores; preferably, the processor 180 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 180.

The terminal 500 further includes a power supply 190 (e.g., a battery) for supplying power to the various components, which may preferably be logically connected to the processor 180 via a power management system, such that functions of managing charging, discharging, and power consumption are performed via the power management system. The power supply 190 may also include any component including one or more of a dc or ac power source, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.

Although not shown, the terminal 500 may further include a camera, a bluetooth module, etc., which will not be described herein. In this embodiment, the display unit of the terminal is a touch screen display, and the terminal further includes a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the one or more processors. The one or more programs include instructions for:

acquiring a first text prompt file and at least one second text prompt file, wherein the first text prompt file and the second text prompt file both correspond to the same multimedia file; for each line in a plurality of lines of prompt messages in a first text prompt file, if the similarity between the line and at least one line of any second text prompt file is greater than a first numerical value, determining at least one line of the second text prompt file as a first target line, and determining a second text prompt file in which the first target line is positioned as a first target text prompt file; if the ratio of the number of the first target text prompt files to the number of the at least one second text prompt file is larger than a second numerical value, determining a first target row with the highest similarity with the row as a row to be synthesized; and synthesizing the character prompt file according to the lines to be synthesized corresponding to the lines of prompt information in the first character prompt file.

Fig. 6 is a schematic structural diagram of a server according to an embodiment of the present invention. Referring to fig. 6, the apparatus 600 includes a processing component 622 that further includes one or more processors and memory resources, represented by memory 632, for storing instructions, such as applications, that are executable by the processing component 622. The application programs stored in memory 632 may include one or more modules that each correspond to a set of instructions. Further, the processing component 622 is configured to execute instructions to perform the above-described method of obtaining a text prompt file.

The apparatus 600 may also include a power component 626 configured to perform power management of the apparatus 600, a wired or wireless network interface 650 configured to connect the apparatus 600 to a network, and an input/output (I/O) interface 658. The apparatus 600 may operate based on an operating system stored in the memory 632, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A method for obtaining a text prompt file is characterized by comprising the following steps:

2. The method of claim 1, further comprising:

3. The method of claim 1, wherein obtaining the first text prompt file and the at least one second text prompt file comprises:

4. The method of claim 3, wherein obtaining multiple versions of the text prompt file comprises:

5. The method of claim 1, further comprising:

6. The method of claim 1,

the at least one line of any second text prompt file is: a first line in the second text prompt file that has not been subjected to similarity calculation; the first line and the line preceding it; or the first line and the line following it; or,

if it is determined that the similarities between a third numerical value of lines of the first text prompt file and the corresponding at least one line of the second text prompt file are not greater than the first numerical value, the at least one line of the second text prompt file corresponding to the line following those lines is: a second line in the second text prompt file that has not been subjected to similarity calculation; the second line and the line preceding it; or the second line and the line following it.

7. An apparatus for obtaining a text prompt file, the apparatus comprising:

8. The apparatus of claim 7, further comprising:

9. The apparatus of claim 7, wherein the obtaining module is configured to:

10. The apparatus of claim 9, wherein the obtaining module is configured to:

11. The apparatus of claim 7, wherein the determining module is further configured to:

12. The apparatus of claim 7,