CN112652295A - Language model training method, device, equipment and medium, and video subtitle checking method, device and medium - Google Patents

Language model training method, device, equipment and medium, and video subtitle checking method, device and medium Download PDF

Info

Publication number
CN112652295A
Authority
CN
China
Prior art keywords
character
model
splitting
sentence
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011529805.7A
Other languages
Chinese (zh)
Inventor
李恬静
朱威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Ping An Smart Healthcare Technology Co ltd
Original Assignee
Ping An International Smart City Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An International Smart City Technology Co Ltd
Priority to CN202011529805.7A
Publication of CN112652295A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention relates to the technical field of artificial intelligence, and provides a language model training method, device, equipment and medium, and a video subtitle verification method, device and medium. The language model training method includes: inputting a sample sentence containing only Chinese characters from a text sample set into an initial character-splitting pre-training model containing initial parameters, and sequentially performing word segmentation, radical splitting, granularity splitting and decoding recognition on the sample sentence to obtain a sample decoded sentence; determining a text loss value from the sample decoded sentence and the sample sentence containing only Chinese characters; and when the text loss value does not reach a preset convergence condition, iteratively updating the initial parameters until the text loss value reaches the preset convergence condition, then recording the converged initial character-splitting pre-training model as a character-splitting-based Chinese pre-training language model. The invention also relates to blockchain technology: the character-splitting-based Chinese pre-training language model is stored in a blockchain. The invention can improve the accuracy of character or text preprocessing.

Description

Language model training method, device, equipment and medium, and video subtitle checking method, device and medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method, a device, equipment and a medium for language model training and video subtitle verification.
Background
With the development of science and technology, the field of artificial intelligence is advancing rapidly. In scenarios such as character recognition and text verification, characters or text are often preprocessed using a word-based pre-training language model.
In the prior art, the word-based pre-training language models used in scenarios such as character recognition and text verification have large vocabularies (usually more than twenty thousand entries). Although such a vocabulary covers a large number of words, the resulting model is large and slow at inference, and is therefore unsuitable for training small models. If a prior-art model is used for training, the trained model has too many parameters, the amount of computation during recognition is large, and recognition is slow. Moreover, in application scenarios where word usage is less strict and wrong characters are common, the existing word-based pre-training language models are highly sensitive to words but not robust, so the accuracy of character or text preprocessing is low.
Disclosure of Invention
The embodiment of the invention provides a method, a device, equipment and a medium for language model training and video subtitle verification, which are used for improving the accuracy of language model identification.
A method of language model training comprising:
acquiring a text sample set and an initial character splitting pre-training model containing initial parameters, wherein the text sample set comprises at least one sample sentence, and one sample sentence comprises at least one Chinese character; the initial character splitting pre-training model comprises a character coding model and a character decoding model;
when the sample sentence only contains Chinese characters, inputting the sample sentence into the initial character-splitting pre-training model, and performing word segmentation processing on the sample sentence through the character coding model to obtain each Chinese sample word in the sample sentence;
performing radical splitting on all the Chinese characters in each Chinese sample word through the character coding model to obtain a radical decomposition result of each Chinese character;
carrying out granularity splitting on all the radical decomposition results through the character coding model to obtain splitting results;
decoding and identifying the splitting result through the character decoding model to obtain a sample decoding sentence;
determining a text loss value according to the sample decoded sentence and the sample sentence containing only Chinese characters;
and updating and iterating the initial parameters of the initial character-splitting pre-training model when the text loss value does not reach a preset convergence condition, and recording the converged initial character-splitting pre-training model as a Chinese pre-training language model based on character splitting when the text loss value reaches the preset convergence condition.
A language model training device comprising:
a data acquisition module, used for acquiring a text sample set and an initial character-splitting pre-training model containing initial parameters, wherein the text sample set comprises at least one sample sentence, and one sample sentence comprises at least one Chinese character; the initial character-splitting pre-training model comprises a character coding model and a character decoding model;
the word segmentation processing module is used for inputting the sample sentence into the initial character splitting pre-training model when the sample sentence only contains Chinese characters, and performing word segmentation processing on the sample sentence through the character coding model to obtain each Chinese sample word in the sample sentence;
the radical splitting module is used for splitting radicals of all Chinese characters in each Chinese sample word through the character coding model to obtain a radical decomposition result of each Chinese character;
the granularity splitting module is used for carrying out granularity splitting on all the radical decomposition results through the character coding model to obtain splitting results;
the decoding and identifying module is used for decoding and identifying the splitting result through the character decoding model to obtain a sample decoding sentence;
a text loss value determining module for determining a text loss value according to the sample decoded sentence and the sample sentence containing only Chinese characters;
and the convergence judging module is used for updating and iterating the initial parameters of the initial character-splitting pre-training model when the text loss value does not reach a preset convergence condition, and recording the converged initial character-splitting pre-training model as a Chinese pre-training language model based on character splitting when the text loss value reaches the preset convergence condition.
A video subtitle checking method includes:
acquiring a video subtitle checking model and a video to be checked; the video subtitle checking model comprises a voice recognition model and a subtitle recognition model; the subtitle recognition model is obtained by training based on a character-splitting-based Chinese pre-training language model; the character-splitting-based Chinese pre-training language model is obtained according to the language model training method;
acquiring voice data in the video to be verified, and performing voice recognition on the voice data through the voice recognition model to obtain a voice sentence corresponding to the voice data;
acquiring a subtitle sentence corresponding to the voice data in the video to be checked, and performing splitting recognition on the subtitle sentence through the subtitle recognition model to obtain a split sentence;
acquiring the similarity between the voice sentence and the split sentence to obtain sentence similarity;
and when the sentence similarity is greater than a preset similarity threshold value, confirming that the video to be verified is qualified in verification.
A video subtitle verifying apparatus, comprising:
the model acquisition module is used for acquiring a video subtitle verification model and a video to be verified; the video subtitle verification model comprises a voice recognition model and a subtitle recognition model; the subtitle recognition model is obtained by training based on a character-splitting-based Chinese pre-training language model; the character-splitting-based Chinese pre-training language model is obtained according to the language model training method;
the voice recognition module is used for acquiring voice data in the video to be verified and carrying out voice recognition on the voice data through the voice recognition model to obtain a voice sentence corresponding to the voice data;
the splitting and identifying module is used for acquiring a subtitle sentence corresponding to the voice data in the video to be verified, and splitting and identifying the subtitle sentence through the subtitle identifying model to obtain a split sentence;
a similarity obtaining module, configured to obtain a similarity between the voice sentence and the split sentence to obtain a sentence similarity;
and the video verification module is used for confirming that the video to be verified is verified to be qualified when the sentence similarity is greater than a preset similarity threshold.
A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the above language model training method when executing the computer program, or the processor implements the above video subtitle verification method when executing the computer program.
A computer-readable storage medium storing a computer program which, when executed by a processor, implements the above language model training method, or which, when executed by a processor, implements the above video subtitle verification method.
According to the language model training and video subtitle verification method, device, equipment and medium, a text sample set and an initial character-splitting pre-training model containing initial parameters are obtained, wherein the text sample set comprises at least one sample sentence, and one sample sentence comprises at least one Chinese character; the initial character splitting pre-training model comprises a character coding model and a character decoding model; when the sample sentence only contains Chinese characters, inputting the sample sentence into the initial character-splitting pre-training model, and performing word segmentation processing on the sample sentence through the character coding model to obtain each Chinese sample word in the sample sentence; performing radical splitting on all the Chinese characters in each Chinese sample word through the character coding model to obtain a radical decomposition result of each Chinese character; carrying out granularity splitting on all the radical decomposition results through the character coding model to obtain splitting results; decoding and identifying the splitting result through the character decoding model to obtain a sample decoding sentence; determining a text loss value according to the sample decoded sentence and the sample sentence containing only Chinese characters; and updating and iterating the initial parameters of the initial character-splitting pre-training model when the text loss value does not reach a preset convergence condition, and recording the converged initial character-splitting pre-training model as a Chinese pre-training language model based on character splitting when the text loss value reaches the preset convergence condition.
According to the invention, by splitting Chinese characters into their radical structures, a character-splitting-based Chinese pre-training language model is obtained by training, so that the model can capture the internal features of characters, improving the representation capability of the Chinese pre-training language model. Moreover, the vocabulary used for granularity splitting in the model's character coding model can better restore the structure type of each character, and its size (500 to 2500 entries) is much smaller than prior-art vocabularies (usually more than twenty thousand entries), so the model recognizes quickly and is well suited for quickly training other models.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; those skilled in the art can derive other drawings from them without inventive labor.
FIG. 1 is a schematic diagram of an application environment of a language model training method and a video subtitle verification method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method of training a language model according to an embodiment of the present invention;
FIG. 3 is a flowchart of step S13 in the language model training method according to an embodiment of the present invention;
FIG. 4 is another flowchart of step S13 of the language model training method according to an embodiment of the present invention;
FIG. 5 is a flowchart of a video subtitle checking method according to an embodiment of the present invention;
FIG. 6 is a schematic block diagram of a language model training apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic block diagram of the first radical splitting module in the language model training device according to an embodiment of the present invention;
FIG. 8 is another schematic block diagram of the first radical splitting module in the language model training device according to an embodiment of the present invention;
FIG. 9 is a schematic block diagram of a video caption checking device according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of a computer device according to an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The language model training method provided by the embodiment of the invention can be applied to the application environment shown in fig. 1. Specifically, the language model training method is applied to a language model training system, which includes a client and a server as shown in fig. 1; the client and the server communicate through a network to improve the accuracy of language model recognition. The client, also called the user side, refers to a program that corresponds to the server and provides local services to the user. The client may be installed on, but is not limited to, personal computers, laptops, smartphones, tablets, and portable wearable devices. The server may be implemented as a stand-alone server or as a server cluster consisting of multiple servers.
In an embodiment, as shown in fig. 2, a language model training method is provided, which is described by taking the server in fig. 1 as an example, and includes the following steps:
s11: acquiring a text sample set and an initial character splitting pre-training model containing initial parameters, wherein the text sample set comprises at least one sample sentence, and one sample sentence comprises at least one Chinese character; the initial character splitting pre-training model comprises a character coding model and a character decoding model.
The text sample set comprises at least one sample sentence, and the sample sentence can be any sentence comprising at least one Chinese character. The character coding model is used for coding Chinese characters in the sample sentences and comprises a Chinese word segmentation module, a character splitting module and a preset BPE vocabulary. The character decoding model is used for decoding and identifying the result output by the character coding model.
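For orientation, the following is a minimal structural sketch of how the character coding model's three components could be composed; all interface names here are illustrative assumptions, since the patent does not disclose a concrete implementation.

```python
# Sketch of the encoder side of the initial character-splitting pre-training model.
from typing import Callable, List

class CharacterEncodingModel:
    """Character coding model: segmentation -> radical splitting -> granularity splitting."""

    def __init__(self,
                 segment: Callable[[str], List[str]],          # Chinese word segmentation module
                 split_radicals: Callable[[str], List[str]],   # character splitting module
                 bpe_split: Callable[[str], List[str]]):       # preset BPE vocabulary lookup
        self.segment = segment
        self.split_radicals = split_radicals
        self.bpe_split = bpe_split

    def encode(self, sentence: str) -> List[str]:
        pieces: List[str] = []
        for word in self.segment(sentence):                    # step S12: word segmentation
            for char in word:
                for radical in self.split_radicals(char):      # step S13: radical splitting
                    pieces.extend(self.bpe_split(radical))     # step S14: granularity splitting
        return pieces
```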
S12: when only Chinese characters are contained in the sample sentence, the sample sentence is input into an initial character-splitting pre-training model, and the sample sentence is subjected to word segmentation processing through a character coding model to obtain each Chinese sample word in the sample sentence.
The Chinese sample words are word-level units obtained from the sample sentence after the word segmentation processing.
Specifically, after the text sample set and the initial character-splitting pre-training model containing initial parameters are obtained, if a sample sentence in the text sample set contains only Chinese characters (that is, the sample sentence contains no English letters, Arabic numerals, or other non-Chinese characters), the sample sentence is input into the initial character-splitting pre-training model, and the Chinese word segmentation module in the character coding model performs word segmentation on the sample sentence to obtain each Chinese sample word in the sample sentence. The Chinese sample words are produced by the Chinese word segmentation module according to conventional word combinations.
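As an illustration of the word segmentation step, the sketch below uses the open-source jieba segmenter purely as a stand-in; the patent does not name a concrete Chinese word segmentation module.

```python
# Hypothetical stand-in for the Chinese word segmentation module (step S12).
import jieba

sample_sentence = "安全最重要"        # "safety is most important" (the example used in step S13)
words = jieba.lcut(sample_sentence)   # typically ['安全', '最', '重要']
print(words)
```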
S13: and splitting radicals of all Chinese characters in each Chinese sample word through a character coding model to obtain a radical decomposition result of each Chinese character.
Specifically, after the sample sentence is input into the initial character-splitting pre-training model and word segmentation is performed through the character coding model to obtain each Chinese sample word in the sample sentence, the character splitting module in the character coding model splits the radicals of all Chinese characters in each Chinese sample word to obtain the radical decomposition result of each Chinese character. Illustratively, after radical decomposition is performed on "安全" ("safety") in "安全最重要" ("safety is most important"), the radical decomposition results are "宀", "女", "人" and "王".
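The radical splitting can be pictured as a table lookup, as sketched below; the two-entry table mirrors the "安全" example and is purely illustrative, a real character splitting module covering far more characters.

```python
# Illustrative radical lookup: 安 -> 宀 + 女, 全 -> 人 + 王.
RADICAL_TABLE = {
    "安": ["宀", "女"],
    "全": ["人", "王"],
}

def split_radicals(char: str):
    # A character with no detachable radical structure is kept whole
    # (see the later embodiment on non-detachable characters).
    return RADICAL_TABLE.get(char, [char])

assert split_radicals("安") == ["宀", "女"]
assert split_radicals("一") == ["一"]      # no detachable radical structure
```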
S14: and carrying out granularity splitting on all the radical decomposition results through a character coding model to obtain splitting results.
Specifically, after the character coding model performs radical splitting on all Chinese characters in each Chinese sample word to obtain the radical decomposition result of each Chinese character, the character dimension has been reduced; however, from the raw radical decomposition results alone, the character decoding model has no way, in subsequent steps, to identify how each result corresponds to its original Chinese character. That is, it cannot recombine the radical decomposition results back into the original characters, and instead combines them randomly into characters, which reduces recognition accuracy.
Further, in this embodiment, all radical decomposition results are granularity-split through the preset BPE vocabulary in the character coding model to obtain splitting results. Splitting the radical decomposition result of each character at this granularity allows the character decoding model in subsequent step S15 to identify the splitting result and determine how to recombine it into the original Chinese character, improving recognition accuracy. The preset BPE vocabulary is generated with the open-source package sentencepiece and covers 99.95% of the characters in the existing corpus, so unidentifiable radical decomposition results essentially do not occur during granularity splitting, which further ensures recognition accuracy and efficiency. Meanwhile, the BPE vocabulary holds 500 to 2500 entries; compared with prior-art Chinese vocabularies (usually more than twenty thousand entries), it is much smaller, so the model has fewer parameters, and character splitting and recognition are both fast and accurate.
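A sketch of generating such a BPE vocabulary with the sentencepiece package named above; the corpus file name and the exact vocab_size are assumptions within the stated 500 to 2500 range.

```python
# Train a small BPE vocabulary over the radical-decomposed corpus with sentencepiece.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="radical_corpus.txt",    # radical decomposition results, one sentence per line (assumed path)
    model_prefix="radical_bpe",
    model_type="bpe",
    vocab_size=2000,               # within the 500-2500 range stated above
    character_coverage=0.9995,     # matches the 99.95% coverage figure
)

sp = spm.SentencePieceProcessor(model_file="radical_bpe.model")
pieces = sp.encode("宀女人王", out_type=str)  # granularity splitting of step S14
```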
S15: and decoding and identifying the splitting result through the character decoding model to obtain a sample decoded sentence.
Specifically, after the character coding model performs granularity splitting on all radical decomposition results to obtain splitting results, the character decoding model in the initial character-splitting pre-training model decodes and identifies the splitting results, that is, combines them to restore the corresponding Chinese characters; after all splitting results have been decoded and restored to Chinese characters, the sample decoded sentence is obtained from the combination of all the Chinese characters.
S16: a text loss value is determined based on the sample decoded sentence and the sample sentence containing only Chinese characters.
The text loss value refers to the difference between the sample decoded sentence and the sample sentence containing only Chinese characters, that is, the ratio of the number of differing characters between the two sentences to the total number of characters.
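Read literally, this definition can be computed as below; the handling of sentences of unequal length is an assumption, since the text only discusses one-to-one character correspondence.

```python
# Text loss value (step S16): ratio of differing characters to total characters.
def text_loss(decoded: str, original: str) -> float:
    total = max(len(decoded), len(original))
    if total == 0:
        return 0.0
    diff = sum(1 for a, b in zip(decoded, original) if a != b)
    diff += abs(len(decoded) - len(original))   # missing/extra characters count as differing
    return diff / total

assert text_loss("安全最重要", "安全最重要") == 0.0   # identical sentences: loss 0
assert text_loss("安全最重叶", "安全最重要") == 0.2   # 1 differing character out of 5
```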
Specifically, after the splitting result is decoded and identified by the character decoding model to obtain the sample decoded sentence, it is necessary to verify whether the sample decoded sentence is identical to the corresponding sample sentence containing only Chinese characters. Whether each Chinese character corresponds one to one is therefore determined from the sample decoded sentence and the sample sentence containing only Chinese characters, and the text loss value is determined as the ratio of the number of differing characters between the two sentences to the total number of characters.
S17: when the text loss value does not reach the preset convergence condition, iteratively updating the initial parameters of the initial character-splitting pre-training model until the text loss value reaches the preset convergence condition, and recording the converged initial character-splitting pre-training model as the character-splitting-based Chinese pre-training language model.
It can be understood that the convergence condition may be that the text loss value is smaller than a set threshold, i.e. training stops when the text loss value falls below the set threshold; the convergence condition may also be that the text loss value is small and no longer decreases after 10000 further iterations, i.e. training stops at that point, and the converged initial character-splitting pre-training model is recorded as the character-splitting-based Chinese pre-training language model.
Further, after the text loss value is determined from the sample decoded sentence and the sample sentence containing only Chinese characters, if the text loss value does not reach the preset convergence condition, the initial parameters of the initial character-splitting pre-training model are adjusted according to the text loss value, and the sample sentence is input into the adjusted model again. When the text loss value corresponding to that sample sentence reaches the preset convergence condition, another sample sentence containing only Chinese characters is selected from the text sample set, and steps S12-S16 are performed to obtain its text loss value; if that loss value does not reach the preset convergence condition, the initial parameters are adjusted again according to it, until the text loss value corresponding to the sample sentence reaches the preset convergence condition.
In this way, after the initial character-splitting pre-training model is trained on all sample sentences containing only Chinese characters in the text sample set, its output is drawn ever closer to the accurate result and recognition accuracy keeps increasing; once the text loss values corresponding to all sample sentences containing only Chinese characters reach the preset convergence condition, the converged initial character-splitting pre-training model is recorded as the character-splitting-based Chinese pre-training language model.
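Putting steps S12-S17 together, a training-loop sketch consistent with the convergence conditions above might look as follows; `forward`, `update_parameters`, and the threshold and patience values are assumptions, and `text_loss` is the helper sketched under step S16.

```python
THRESHOLD = 0.01   # assumed convergence threshold for the text loss value
PATIENCE = 10000   # "does not decrease after 10000 times of calculation"

def train(model, sample_sentences):
    stale, best = 0, float("inf")
    while stale < PATIENCE:
        worst = 0.0
        for sentence in sample_sentences:        # sentences containing only Chinese characters
            decoded = model.forward(sentence)    # steps S12-S15 (assumed interface)
            loss = text_loss(decoded, sentence)  # step S16
            worst = max(worst, loss)
            if loss >= THRESHOLD:
                model.update_parameters(loss)    # step S17: iterate the initial parameters
        if worst < THRESHOLD:                    # every sample sentence has converged
            break
        if worst < best:
            best, stale = worst, 0
        else:
            stale += 1                           # loss no longer decreasing
    return model  # recorded as the character-splitting-based Chinese pre-training language model
```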
In this embodiment, by splitting Chinese characters into their radical structures, a character-splitting-based Chinese pre-training language model is obtained by training, so that the model can capture the internal features of characters, improving the representation capability of the Chinese pre-training language model. Moreover, the vocabulary used for granularity splitting in the model's character coding model can better restore the structure type of each character, and its size (500 to 2500 entries) is much smaller than prior-art vocabularies (usually more than twenty thousand entries), so the model recognizes quickly and is well suited for quickly training other models.
In another embodiment, to ensure the privacy and security of the character-splitting-based Chinese pre-training language model of the above embodiments, the model may be stored in a blockchain. A blockchain is an encrypted, chained transaction storage structure formed of blocks.
For example, the header of each block may include the hash values of all transactions in the block as well as the hash values of all transactions in the previous block, achieving tamper resistance and forgery resistance of the transactions in the block based on the hash values; newly generated transactions, after being filled into blocks and passing the consensus of nodes in the blockchain network, are appended to the end of the blockchain to grow the chain.
In one embodiment, the language model training method further includes the following steps:
when the sample sentence contains non-Chinese characters, acquiring the position information of all non-Chinese characters in the sample sentence, intercepting all non-Chinese characters according to the position information, and inputting the sample sentence with the non-Chinese characters intercepted into the initial character-splitting pre-training model.
Illustratively, assume that the position information of each character is encoded starting from the first character in the sample sentence: the position information of the first character is V1, that of the second character is V2, and if the third character is a non-Chinese character, the position information of that non-Chinese character is V3.
Specifically, after the text sample set is obtained, if a sample sentence includes non-Chinese characters, those characters cannot be split in the initial character-splitting pre-training model, so the non-Chinese characters in the sample sentence need to be intercepted: the position information of all non-Chinese characters in the sample sentence is acquired, all non-Chinese characters are intercepted according to the position information, the sample sentence with the non-Chinese characters intercepted is input into the initial character-splitting pre-training model, and steps S12-S16 of the above embodiment are performed to obtain the text loss value corresponding to the sample sentence.
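A sketch of the interception step; treating the CJK Unified Ideographs block as the test for "Chinese character" is an assumption, and the V1, V2, ... labels follow the example above.

```python
# Locate and intercept non-Chinese characters before feeding the sentence to
# the initial character-splitting pre-training model.
def intercept_non_chinese(sentence: str):
    positions, kept = [], []
    for i, ch in enumerate(sentence):
        if "\u4e00" <= ch <= "\u9fff":           # CJK Unified Ideographs (assumption)
            kept.append(ch)
        else:
            positions.append((f"V{i + 1}", ch))  # record position information
    return "".join(kept), positions

sentence, info = intercept_non_chinese("安全No1最重要")
# sentence == "安全最重要"; info == [('V3', 'N'), ('V4', 'o'), ('V5', '1')]
```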
In one embodiment, as shown in fig. 3, step S13, i.e. the step of performing radical splitting on all Chinese characters in each Chinese sample word through the character coding model to obtain the radical decomposition result of each Chinese character, includes:
S131: when a Chinese character contains a detachable radical structure, performing the first radical splitting on each such Chinese character to obtain first decomposed characters.
A detachable radical structure means that the Chinese character contains a known radical (such as "口", "日", and the like) and that the radical can be detached, i.e. the Chinese character contains other parts besides the radical (for example, "如" contains "女" and "口"). The first decomposed characters comprise both the radical structure of the Chinese character (e.g., the "女" in "如") and the non-radical structure (e.g., the "口" in "如").
Specifically, after the sample sentence is segmented through the character coding model to obtain each Chinese sample word, radical structure detection is performed on all Chinese characters in each Chinese sample word; when a Chinese character contains a detachable radical structure, the first radical splitting is performed on all Chinese characters containing a detachable radical structure through the character splitting module in the character coding model, obtaining the first decomposed characters corresponding to each Chinese character.
For example, if a Chinese sample word is "新鲜" ("fresh") and it contains detachable radical structures, the first radical splitting is performed on it: the first decomposed characters corresponding to "新" are "亲" and "斤", and the first decomposed characters corresponding to "鲜" are "鱼" and "羊". It can be understood that when the first radical splitting is performed on a Chinese sample word, splitting recognition can follow the existing radical structures in dictionaries such as the Xinhua dictionary; that is, all the existing radical structures can be imported into the character splitting module of the character coding model.
Further, when the first radical splitting is performed on a Chinese sample word containing a detachable radical structure, the corresponding structure is recorded at the same time, such as a top-bottom structure, a left-right structure, or a semi-enclosed structure. Illustratively, while "新" is labeled with its first decomposed characters "亲" and "斤", a left-right structure classification can be generated in the character coding model to indicate that this split is a left-right split.
S132: it is detected whether the first decomposed character is a minimum character unit.
The minimum character unit is a component of the first decomposed character that cannot be split further (e.g., the "土" in "在" is a minimum character unit).
S133: if the first decomposed character is the minimum character unit, recording the first decomposed character corresponding to each minimum character unit as the radical decomposition result of the Chinese character corresponding to the first decomposed character.
Specifically, when a Chinese character contains a detachable radical structure, the first radical splitting is performed on each such Chinese character to obtain the first decomposed characters, and it is then detected whether each first decomposed character is a minimum character unit; if a first decomposed character is a minimum character unit, the first decomposed character corresponding to each minimum character unit is recorded as the radical decomposition result of its corresponding Chinese character.
Illustratively, if the Chinese sample word is "新鲜", the first radical splitting yields the first decomposed characters "亲" and "斤" for "新" and "鱼" and "羊" for "鲜". It is then detected whether each first decomposed character is a minimum character unit. "亲" can still be split, so it is not a minimum character unit; "斤" cannot be split further, so "斤" is a minimum character unit and is recorded as the radical decomposition result of the first decomposed character "斤" in the Chinese character "新". The complete radical decomposition result of "新" is obtained by combining the fully split components of "亲" with "斤".
In an embodiment, as shown in fig. 4, after the step S132, that is, after detecting whether the first decomposed character is the minimum character unit, the method further includes:
s134: and if the first decomposed character is not the minimum character unit, performing structural analysis on the first decomposed character to obtain a first character structure of the first decomposed character, and performing secondary radical splitting on the first decomposed character according to the first character structure to obtain a second decomposed character.
The first character structure characterizes the structural classification of the first decomposed character, and includes, but is not limited to, a top-bottom structure, a left-right structure, a semi-enclosed structure, and the like.
Specifically, after detecting whether the first decomposed character is the minimum character unit, if the first decomposed character is not the minimum character unit, that is, the first decomposed character still includes the detachable radical structure, performing structural analysis on the first decomposed character to obtain a first character structure of the first decomposed character; and carrying out secondary radical splitting on the first decomposed character corresponding to the first character structure according to the first character structure to obtain a second decomposed character.
Illustratively, suppose one of the Chinese characters is "萌" ("budding"). After the first radical splitting of this character, the first decomposed characters obtained are "艹" and "明". Each is checked for being a minimum character unit: "艹" is found to be a minimum character unit, so "艹" forms part of the radical decomposition result of "萌"; "明" is not a minimum character unit and can be further split into the two second decomposed characters "日" and "月". While the character is split, the corresponding structural classification is recorded; this split is a left-right structure.
S135: if the second decomposed characters are all minimum character units, recording the first decomposed characters and the second decomposed characters that are minimum character units as the radical decomposition result.
Specifically, after the structural analysis of the first decomposed character yields its first character structure and the secondary radical splitting according to that structure yields the second decomposed characters, it is detected whether all second decomposed characters are minimum character units; if so, the first decomposed characters and second decomposed characters that are minimum character units are recorded as the radical decomposition result of the corresponding Chinese character.
For example, for the Chinese character "萌" above, the secondary radical splitting yields the second decomposed characters "日" and "月", and neither can be split further, i.e. both second decomposed characters are minimum character units. Therefore the first decomposed character "艹" and the second decomposed characters "日" and "月", all minimum character units, are recorded as the radical decomposition result of "萌".
It should be noted that during the first radical splitting, the secondary radical splitting, and any further splitting when a second decomposed character can still be split, the character structure classification corresponding to each split (top-bottom structure, left-right structure, semi-enclosed structure, and the like) is recorded at the time of splitting.
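The recursion of steps S131-S135 can be sketched as follows, following the "新鲜" and "萌" walk-throughs above; the split table is an illustrative fragment (the decomposition of "亲" into "立" and "木" is an assumption), a real character splitting module importing the full radical structures from dictionaries such as the Xinhua dictionary.

```python
SPLIT_TABLE = {
    # character: (recorded structure classification, decomposed parts)
    "新": ("left-right", ["亲", "斤"]),
    "鲜": ("left-right", ["鱼", "羊"]),
    "亲": ("top-bottom", ["立", "木"]),   # assumed decomposition of 亲
    "萌": ("top-bottom", ["艹", "明"]),
    "明": ("left-right", ["日", "月"]),
}

def decompose(char, structures=None):
    """Split a character into minimum character units, recording each split's structure."""
    if structures is None:
        structures = []
    if char not in SPLIT_TABLE:            # already a minimum character unit
        return [char], structures
    structure, parts = SPLIT_TABLE[char]
    structures.append((char, structure))   # record the structure classification
    units = []
    for part in parts:
        sub_units, _ = decompose(part, structures)
        units.extend(sub_units)
    return units, structures

units, structs = decompose("萌")
assert units == ["艹", "日", "月"]
assert structs == [("萌", "top-bottom"), ("明", "left-right")]
```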
In an embodiment, after step S131, that is, after the first radical splitting is performed on each Chinese character containing a detachable radical structure to obtain the first decomposed characters, the method further includes:
detecting whether the first decomposed character is an existing character; and if the first decomposed character is not an existing character, encoding the first decomposed character to obtain an encoded character corresponding to the first decomposed character.
An existing character refers to a character that exists independently in existing data and can be queried, such as "口", "女", and the like. An example of a character that is not an existing character: after the first radical splitting of "在" splits out "土", the remaining part is not an independently existing, queryable character in existing data or corpora. An encoded character refers to a character obtained by specially encoding a first decomposed character that is not an existing character.
Specifically, when the Chinese character contains a detachable radical structure, the first radical splitting is performed on each such Chinese character to obtain the first decomposed characters, and it is then detected whether each first decomposed character is an existing character; if a first decomposed character is not an existing character, it is encoded to obtain its corresponding encoded character.
For example, after the first radical splitting of "在" splits out "土", the remaining part is not an independently existing, queryable character in existing data or corpora, so that remaining first decomposed character is encoded. A part that is not an existing character can be encoded in any form; for example, CDP8861 can represent the part, and CDP8861 is then the encoded character corresponding to that first decomposed character.
The method for encoding the first decomposed characters can be set according to user preference, but each first decomposed character that is not an existing character must be associated with a unique encoded character; that is, the same encoded character must not be used to encode different first decomposed characters that are not existing characters. Otherwise the subsequent character decoding module cannot identify characters accurately and recognition accuracy drops.
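A sketch of this uniqueness requirement; the counter-based scheme continuing after the CDP8861 example is an assumption, the text only requiring that distinct non-existing components never share a code.

```python
from itertools import count

_codes = count(8862)   # continue after the CDP8861 example above (assumption)
ENCODED = {}

def encode_component(component: str) -> str:
    """Assign each non-existing first decomposed character a unique encoded character."""
    if component not in ENCODED:
        ENCODED[component] = f"CDP{next(_codes)}"
    return ENCODED[component]   # the same component always maps to the same unique code
```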
In an embodiment, step S13 further includes:
and when the Chinese characters in the Chinese sample words do not contain the detachable radical structures, directly recording the Chinese characters as corresponding radical decomposition results.
Specifically, after the sample sentence is segmented through the character coding model to obtain each Chinese sample word, radical structure detection is performed on all Chinese characters in each Chinese sample word; when a Chinese character does not contain a detachable radical structure, the Chinese character is directly recorded as the corresponding radical decomposition result. Illustratively, if a Chinese character in a Chinese sample word contains no detachable radical structure, i.e. it cannot be split into smaller character units, that character is directly recorded as part of the radical decomposition result corresponding to the Chinese sample word.
In an embodiment, as shown in fig. 5, a video subtitle verification method is provided, which is described by taking the server in fig. 1 as an example, and includes the following steps:
S31: acquiring a video subtitle verification model and a video to be verified; the video subtitle verification model comprises a voice recognition model and a subtitle recognition model; the subtitle recognition model is obtained by training based on the character-splitting-based Chinese pre-training language model; the character-splitting-based Chinese pre-training language model is obtained according to the language model training method of the above embodiments.
The video subtitle verification model refers to a model for verifying whether the voice and subtitles in any video segment match, and comprises a voice recognition model and a subtitle recognition model. It should be emphasized that the subtitle recognition model is obtained by training based on the character-splitting-based Chinese pre-training language model, which in turn is obtained according to the language model training method of the above embodiments. In a video subtitle verification scenario, subtitles use the conventional wording of common corpora, and verification does not require a model with many parameters; therefore, to verify with a small-parameter model, the character-splitting-based Chinese pre-training language model obtained by the above language model training method is used, so that the trained subtitle recognition model meets the verification requirement while keeping few parameters and a high recognition speed. The video to be verified refers to the video whose subtitles need to be checked against its voice.
S32: and acquiring voice data in the video to be verified, and performing voice recognition on the voice data through a voice recognition model to obtain a voice sentence corresponding to the voice data.
The voice data refers to audio data in the video to be verified. The speech sentence refers to a speech text corresponding to the speech data.
Specifically, after a video subtitle verification model and a video to be verified are obtained, voice data in the video to be verified are obtained, and voice recognition is performed on the voice data through a voice recognition model in the video subtitle verification model to obtain a voice text corresponding to the voice data, namely a voice sentence.
S33: and acquiring caption sentences corresponding to the voice data in the video to be checked, and splitting and identifying the caption sentences through a caption identification model to obtain split sentences.
Specifically, after the video subtitle verification model and the video to be verified are obtained, the subtitle sentence corresponding to the voice data in the video to be verified is acquired, and the subtitle sentence is split and recognized through the subtitle recognition model in the video subtitle verification model to obtain the split sentence. Because the subtitle recognition model is trained based on the character-splitting-based Chinese pre-training language model, the characters in the subtitle sentences can be better restored when the subtitle sentences are split and recognized.
S34: and acquiring the similarity between the voice sentence and the split sentence to obtain the sentence similarity.
Specifically, after the voice sentence and the split sentence are obtained, their similarity is compared to obtain the sentence similarity; that is, whether the characters in the voice sentence correspond one to one with the characters in the split sentence is determined, in order to decide whether the voice data matches the subtitle sentence.
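The patent does not fix a similarity metric; as one simple character-level stand-in, difflib's ratio can be used, as sketched below together with the threshold check of step S35.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.9   # preset similarity threshold from the example below

def sentence_similarity(voice_sentence: str, split_sentence: str) -> float:
    # Character-level similarity in [0, 1]; a stand-in metric, not the patent's.
    return SequenceMatcher(None, voice_sentence, split_sentence).ratio()

def verify(voice_sentence: str, split_sentence: str) -> bool:
    return sentence_similarity(voice_sentence, split_sentence) >= SIMILARITY_THRESHOLD
```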
S35: and when the sentence similarity is greater than or equal to a preset similarity threshold value, confirming that the video to be verified is verified to be qualified.
The preset similarity threshold may be set by the user according to matching requirements, for example 0.9 or 0.95.
Specifically, after the similarity between the voice sentence and the split sentence is obtained as the sentence similarity, the sentence similarity is compared with the preset similarity threshold; when the sentence similarity is greater than or equal to the preset similarity threshold, the video to be verified is confirmed as qualified. When the sentence similarity is smaller than the preset similarity threshold, the similarity between the voice sentence and the split sentence does not meet the standard, i.e. the voice data does not match the subtitle sentence; the subtitle sentence then needs to be readjusted so that the voice data matches the subtitle sentence before the video to be verified can pass verification.
Illustratively, suppose the voice sentence of the video to be verified is "where shall we go to eat at noon", and the subtitle sentence corresponding to the voice data is "what shall we eat at noon". If the split sentence obtained after recognition by the subtitle recognition model (trained from the character-splitting-based Chinese pre-training language model) is "what shall we eat at noon", similarity is calculated between the voice sentence and the split sentence. Assuming the resulting sentence similarity is 0.47 and the preset similarity threshold is 0.9, the voice sentence and the split sentence do not match, so the subtitle sentence corresponding to the voice data is adjusted and the similarity between the voice sentence and the adjusted subtitle sentence is judged again, until the sentence similarity between the voice sentence and the subtitle sentence is greater than the preset similarity threshold and the video to be verified is confirmed as qualified.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In an embodiment, a language model training device is provided, and the language model training device corresponds to the language model training method in the above embodiments one to one. As shown in fig. 6, the language model training device includes a data obtaining module 11, a first segmentation processing module 12, a first radical splitting module 13, a first granularity splitting module 14, a first decoding and identifying module 15, a text loss value determining module 16, and a first convergence judging module 17. The functional modules are explained in detail as follows:
a data obtaining module 11, configured to obtain a text sample set and an initial character-splitting pre-training model containing initial parameters, where the text sample set includes at least one sample sentence, and one sample sentence includes at least one chinese character; the initial character splitting pre-training model comprises a character coding model and a character decoding model.
And the word segmentation processing module 12 is configured to, when the sample sentence only includes chinese characters, input the sample sentence into the initial character-splitting pre-training model, and perform word segmentation processing on the sample sentence through the character coding model to obtain each chinese sample word in the sample sentence.
And the radical splitting module 13 is configured to split radicals of all the chinese characters in each chinese sample word by using the character coding model, so as to obtain a radical decomposition result of each chinese character.
And the granularity splitting module 14 is configured to perform granularity splitting on all the radical decomposition results through the character coding model to obtain splitting results.
And the decoding and identifying module 15 is configured to perform decoding and identifying on the splitting result through the character decoding model to obtain a sample decoded sentence.
A text loss value determining module 16, configured to determine a text loss value according to the sample decoded sentence and the sample sentence containing only chinese characters.
And the convergence judging module 17 is configured to update and iterate the initial parameters of the initial character-splitting pre-training model when the text loss value does not reach a preset convergence condition, and record the converged initial character-splitting pre-training model as a Chinese pre-training language model based on character splitting when the text loss value reaches the preset convergence condition.
Preferably, the language model training device further comprises the following modules:
a position information obtaining module 21, configured to, when the sample sentence includes a non-chinese character, obtain position information of all the non-chinese characters in the sample sentence, and intercept all the non-chinese characters according to the position information, and input the intercepted sample sentence into the initial character-splitting pre-training model.
Preferably, as shown in fig. 7, the first radical splitting module 13 includes the following units:
the primary radical splitting unit 131 is configured to, when the chinese character includes a detachable radical structure, perform primary radical splitting on each of the chinese characters to obtain a first split character.
A character detection unit 132, configured to detect whether the first decomposed character is a minimum character unit.
A first recording unit 133, configured to record, when the first decomposed character is a minimum character unit, the first decomposed character corresponding to each minimum character unit as a result of the radical decomposition of the chinese character corresponding thereto.
Preferably, as shown in fig. 8, the first radical splitting module 13 further includes the following units:
and the secondary radical splitting unit 134 is configured to, when the first decomposed character is not the minimum character unit, perform structural analysis on the first decomposed character to obtain a first character structure of the first decomposed character, and perform secondary radical splitting on the first decomposed character according to the first character structure to obtain a second decomposed character.
A second recording unit 135, configured to record, as the radical decomposition result, the first decomposed character and the second decomposed character of the minimum character unit when the second decomposed characters are all the minimum character unit.
Preferably, the first radical splitting module 13 further includes the following units:
an existing character detection unit, configured to detect whether the first decomposed character is an existing character;
and the character encoding unit is used for encoding the first decomposed character to obtain an encoded character corresponding to the first decomposed character when the first decomposed character is not the existing character.
For the specific definition of the language model training device, reference may be made to the above definition of the language model training method, which is not described herein again. The modules in the language model training device can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In an embodiment, a video subtitle verification apparatus is provided, which corresponds one-to-one to the video subtitle verification method in the above embodiment. As shown in fig. 9, the video subtitle verification apparatus includes a model obtaining module 31, a voice recognition module 32, a split recognition module 33, a similarity obtaining module 34, and a video verification module 35. The functional modules are described in detail as follows:
A model obtaining module 31, configured to obtain a video subtitle verification model and a video to be verified; the video subtitle verification model comprises a voice recognition model and a subtitle recognition model; the subtitle recognition model is obtained by training based on the Chinese pre-training language model based on character splitting; the Chinese pre-training language model based on character splitting is obtained according to the language model training method described above;
A voice recognition module 32, configured to obtain the voice data in the video to be verified, and perform voice recognition on the voice data through the voice recognition model to obtain a voice sentence corresponding to the voice data;
A split recognition module 33, configured to obtain the subtitle sentence corresponding to the voice data in the video to be verified, and perform split recognition on the subtitle sentence through the subtitle recognition model to obtain a split sentence;
A similarity obtaining module 34, configured to obtain the similarity between the voice sentence and the split sentence to obtain a sentence similarity;
A video verification module 35, configured to confirm that the video to be verified passes verification when the sentence similarity is greater than a preset similarity threshold.
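End to end, modules 31 through 35 reduce to the short pipeline below. The model interfaces (transcribe, split_recognize) and the character-level similarity metric are assumptions; the patent fixes neither:

    from difflib import SequenceMatcher

    def sentence_similarity(a, b):
        # One plausible metric: character-level similarity ratio in [0, 1].
        return SequenceMatcher(None, a, b).ratio()

    def verify_video_subtitles(audio, subtitle, asr_model, subtitle_model,
                               similarity_threshold=0.9):
        # `transcribe` and `split_recognize` are assumed interfaces on the
        # two models; the patent does not define them.
        voice_sentence = asr_model.transcribe(audio)                  # module 32
        split_sentence = subtitle_model.split_recognize(subtitle)     # module 33
        score = sentence_similarity(voice_sentence, split_sentence)   # module 34
        return score > similarity_threshold                           # module 35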
For specific limitations of the video subtitle verification apparatus, reference may be made to the above limitations of the video subtitle verification method, which are not repeated here. Each module in the video subtitle verification apparatus may be implemented wholly or partially by software, hardware, or a combination of the two. Each module may be embedded in hardware form in, or independent of, a processor in the computer device, or stored in software form in a memory of the computer device, so that the processor can invoke it and execute the operations corresponding to the module.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in fig. 10. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store the data involved in the above methods. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements a language model training method or a video subtitle verification method.
In one embodiment, a computer device is provided, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the language model training method or the video subtitle verification method in the above embodiments is implemented.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements the language model training method or the video subtitle verification method in the above embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods in the above embodiments can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the above method embodiments. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the division of the functional units and modules described above is illustrated by example. In practical applications, the above functions may be distributed among different functional units and modules as needed; that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above.
The above embodiments are merely illustrative of the technical solutions of the present invention and are not intended to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications and substitutions do not depart in substance from the spirit and scope of the embodiments of the present invention and are intended to be included within the scope of the present invention.

Claims (10)

1. A method for training a language model, comprising:
acquiring a text sample set and an initial character-splitting pre-training model containing initial parameters, wherein the text sample set comprises at least one sample sentence, and one sample sentence comprises at least one Chinese character; the initial character-splitting pre-training model comprises a character coding model and a character decoding model;
when the sample sentence only contains Chinese characters, inputting the sample sentence into the initial character-splitting pre-training model, and performing word segmentation processing on the sample sentence through the character coding model to obtain each Chinese sample word in the sample sentence;
performing radical splitting on all the Chinese characters in each Chinese sample word through the character coding model to obtain a radical decomposition result of each Chinese character;
carrying out granularity splitting on all the radical decomposition results through the character coding model to obtain splitting results;
decoding and identifying the splitting result through the character decoding model to obtain a sample decoded sentence;
determining a text loss value according to the sample decoded sentence and the sample sentence containing only Chinese characters;
and updating and iterating the initial parameters of the initial character-splitting pre-training model when the text loss value does not reach a preset convergence condition, and recording the converged initial character-splitting pre-training model as a Chinese pre-training language model based on character splitting when the text loss value reaches the preset convergence condition.
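Read as a pipeline, the four processing steps of claim 1 can be sketched as follows. The coder and decoder objects and their method names stand in for the character coding model and the character decoding model, whose interfaces the claim does not define:

    def encode_and_decode(sentence, coder, decoder):
        # Step 1: word segmentation through the character coding model.
        words = coder.segment(sentence)
        # Step 2: radical splitting of every Chinese character in each word.
        radicals = [coder.split_radicals(ch) for word in words for ch in word]
        # Step 3: granularity splitting of all radical decomposition results.
        pieces = coder.split_granularity(radicals)
        # Step 4: decoding and identification through the character decoding
        # model, yielding the sample decoded sentence.
        return decoder.decode(pieces)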
2. The language model training method of claim 1, wherein said inputting the sample sentence into the initial character-splitting pre-training model comprises:
when the sample sentence contains non-Chinese characters, acquiring the position information of all the non-Chinese characters in the sample sentence, intercepting all the non-Chinese characters according to the position information, and inputting the sample sentence with the non-Chinese characters intercepted into the initial character-splitting pre-training model.
3. The language model training method according to claim 1, wherein said performing radical splitting on all the Chinese characters in each Chinese sample word through the character coding model to obtain the radical decomposition result of each Chinese character comprises:
when the Chinese character contains a detachable radical structure, performing primary radical splitting on each Chinese character to obtain a first decomposed character;
detecting whether the first decomposed character is a minimum character unit;
and if the first decomposed character is a minimum character unit, recording the first decomposed character corresponding to each minimum character unit as the radical decomposition result of the corresponding Chinese character.
4. The language model training method according to claim 3, wherein after said detecting whether the first decomposed character is a minimum character unit, the method further comprises:
if the first decomposed character is not the minimum character unit, performing structural analysis on the first decomposed character to obtain a first character structure of the first decomposed character, and performing secondary radical splitting on the first decomposed character according to the first character structure to obtain a second decomposed character;
and if all the second decomposed characters are minimum character units, recording the first decomposed characters and second decomposed characters that are minimum character units as the radical decomposition result.
5. The language model training method according to claim 3, wherein after said performing primary radical splitting on each Chinese character to obtain a first decomposed character, the method further comprises:
detecting whether the first decomposed character is an existing character;
and if the first decomposed character is not an existing character, encoding the first decomposed character to obtain an encoded character corresponding to the first decomposed character.
6. A method for video subtitle verification, comprising:
acquiring a video subtitle verification model and a video to be verified; the video subtitle verification model comprises a voice recognition model and a subtitle recognition model; the subtitle recognition model is obtained by training based on a Chinese pre-training language model based on character splitting; the Chinese pre-training language model based on character splitting is obtained according to the language model training method of any one of claims 1 to 5;
acquiring voice data in the video to be verified, and performing voice recognition on the voice data through the voice recognition model to obtain a voice sentence corresponding to the voice data;
acquiring a subtitle sentence corresponding to the voice data in the video to be verified, and performing split recognition on the subtitle sentence through the subtitle recognition model to obtain a split sentence;
acquiring the similarity between the voice sentence and the split sentence to obtain a sentence similarity;
and when the sentence similarity is greater than a preset similarity threshold, confirming that the video to be verified passes verification.
7. A language model training device, comprising:
the system comprises a data acquisition module, a word pre-training module and a word analysis module, wherein the data acquisition module is used for acquiring a word sample set and an initial word-splitting pre-training model containing initial parameters, the word sample set comprises at least one sample sentence, and one sample sentence comprises at least one Chinese character; the initial character splitting pre-training model comprises a character coding model and a character decoding model;
the first word segmentation processing module is used for inputting the sample sentence into the initial character splitting pre-training model when the sample sentence only contains Chinese characters, and carrying out word segmentation processing on the sample sentence through the character coding model to obtain each Chinese sample word in the sample sentence;
the first radical splitting module is used for splitting radicals of all Chinese characters in each Chinese sample word through the character coding model to obtain a radical decomposition result of each Chinese character;
the first granularity splitting module is used for carrying out granularity splitting on all the radical decomposition results through the character coding model to obtain splitting results;
the first decoding and identifying module is used for decoding and identifying the splitting result through the character decoding model to obtain a sample decoding sentence;
a text loss value determining module for determining a text loss value according to the sample decoded sentence and the sample sentence containing only Chinese characters;
and the first convergence judgment module is used for updating and iterating the initial parameters of the initial character-splitting pre-training model when the text loss value does not reach a preset convergence condition, and recording the converged initial character-splitting pre-training model as a Chinese pre-training language model based on character splitting when the text loss value reaches the preset convergence condition.
8. A video subtitle verification apparatus, comprising:
a model obtaining module, configured to obtain a video subtitle verification model and a video to be verified, wherein the video subtitle verification model comprises a voice recognition model and a subtitle recognition model, the subtitle recognition model is obtained by training based on a Chinese pre-training language model based on character splitting, and the Chinese pre-training language model based on character splitting is obtained according to the language model training method;
a voice recognition module, configured to obtain the voice data in the video to be verified, and perform voice recognition on the voice data through the voice recognition model to obtain a voice sentence corresponding to the voice data;
a split recognition module, configured to obtain the subtitle sentence corresponding to the voice data in the video to be verified, and perform split recognition on the subtitle sentence through the subtitle recognition model to obtain a split sentence;
a similarity obtaining module, configured to obtain the similarity between the voice sentence and the split sentence to obtain a sentence similarity;
and a video verification module, configured to confirm that the video to be verified passes verification when the sentence similarity is greater than a preset similarity threshold.
9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the language model training method according to any one of claims 1 to 5 or the video subtitle verification method according to claim 6.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the language model training method according to any one of claims 1 to 5 or the video subtitle verification method according to claim 6.
CN202011529805.7A 2020-12-22 2020-12-22 Language model training method, device, equipment and medium, and video subtitle checking method, device and medium Pending CN112652295A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011529805.7A CN112652295A (en) 2020-12-22 2020-12-22 Language model training method, device, equipment and medium, and video subtitle checking method, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011529805.7A CN112652295A (en) 2020-12-22 2020-12-22 Language model training method, device, equipment and medium, and video subtitle checking method, device and medium

Publications (1)

Publication Number Publication Date
CN112652295A true CN112652295A (en) 2021-04-13

Family

ID=75359056

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011529805.7A Pending CN112652295A (en) 2020-12-22 2020-12-22 Language model training method, device, equipment and medium, and video subtitle checking method, device and medium

Country Status (1)

Country Link
CN (1) CN112652295A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114398952A (en) * 2021-12-14 2022-04-26 北京百度网讯科技有限公司 Training text generation method and device, electronic equipment and storage medium
CN116052648A (en) * 2022-08-03 2023-05-02 荣耀终端有限公司 Training method, using method and training system of voice recognition model
CN116052648B (en) * 2022-08-03 2023-10-20 荣耀终端有限公司 Training method, using method and training system of voice recognition model
CN115440225A (en) * 2022-10-25 2022-12-06 仿脑科技(深圳)有限公司 Intelligent voice processing method and system
CN115440225B (en) * 2022-10-25 2023-01-24 仿脑科技(深圳)有限公司 Intelligent voice processing method and system
CN116631379A (en) * 2023-07-20 2023-08-22 中邮消费金融有限公司 Speech recognition method, device, equipment and storage medium
CN116631379B (en) * 2023-07-20 2023-09-26 中邮消费金融有限公司 Speech recognition method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
WO2021042503A1 (en) Information classification extraction method, apparatus, computer device and storage medium
CN112652295A (en) Language model training method, device, equipment and medium, and video subtitle checking method, device and medium
CN111859960B (en) Semantic matching method, device, computer equipment and medium based on knowledge distillation
WO2022142613A1 (en) Training corpus expansion method and apparatus, and intent recognition model training method and apparatus
CN108595695B (en) Data processing method, data processing device, computer equipment and storage medium
WO2020244153A1 (en) Conference voice data processing method and apparatus, computer device and storage medium
WO2022142041A1 (en) Training method and apparatus for intent recognition model, computer device, and storage medium
CN111444349B (en) Information extraction method, information extraction device, computer equipment and storage medium
CN110428820B (en) Chinese and English mixed speech recognition method and device
CN112380837B (en) Similar sentence matching method, device, equipment and medium based on translation model
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
WO2022116436A1 (en) Text semantic matching method and apparatus for long and short sentences, computer device and storage medium
CN111223476B (en) Method and device for extracting voice feature vector, computer equipment and storage medium
CN111178064B (en) Information pushing method and device based on field word segmentation processing and computer equipment
CN112257437A (en) Voice recognition error correction method and device, electronic equipment and storage medium
WO2022142108A1 (en) Method and apparatus for training interview entity recognition model, and method and apparatus for extracting interview information entity
CN112766319A (en) Dialogue intention recognition model training method and device, computer equipment and medium
CN111859916A (en) Ancient poetry keyword extraction and poetry sentence generation method, device, equipment and medium
CN112580363A (en) Requirement document processing method and device, computer equipment and storage medium
CN112052675A (en) Method and device for detecting sensitive information of unstructured text
CN115858776B (en) Variant text classification recognition method, system, storage medium and electronic equipment
WO2020057023A1 (en) Natural-language semantic parsing method, apparatus, computer device, and storage medium
CN113298160B (en) Triple verification method, apparatus, device and medium
CN114048753A (en) Method, device, equipment and medium for training word sense recognition model and judging word sense
CN114038451A (en) Quality inspection method and device for dialogue data, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220927

Address after: Room 2601 (Unit 07), Qianhai Free Trade Building, No. 3048, Xinghai Avenue, Nanshan Street, Qianhai Shenzhen-Hong Kong Cooperation Zone, Shenzhen, Guangdong 518000

Applicant after: Shenzhen Ping An Smart Healthcare Technology Co.,Ltd.

Address before: 1-34 / F, Qianhai free trade building, 3048 Xinghai Avenue, Mawan, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong 518000

Applicant before: Ping An International Smart City Technology Co.,Ltd.

TA01 Transfer of patent application right