CN112417867B - Method and system for correcting video title error after voice recognition

Method and system for correcting video title error after voice recognition

Info

Publication number
CN112417867B
CN112417867B (granted publication of application CN202011418650.XA)
Authority
CN
China
Prior art keywords
video title
error correction
video
text
voice recognition
Prior art date
Legal status
Active
Application number
CN202011418650.XA
Other languages
Chinese (zh)
Other versions
CN112417867A (en)
Inventor
周兴发
方凡
饶璐
谭斌
杨兰
孙锐
展华益
Current Assignee
Sichuan Changhong Electric Co Ltd
Original Assignee
Sichuan Changhong Electric Co Ltd
Priority date
Filing date
Publication date
Application filed by Sichuan Changhong Electric Co Ltd filed Critical Sichuan Changhong Electric Co Ltd
Priority to CN202011418650.XA priority Critical patent/CN112417867B/en
Publication of CN112417867A publication Critical patent/CN112417867A/en
Application granted granted Critical
Publication of CN112417867B publication Critical patent/CN112417867B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for correcting video title errors after speech recognition, comprising the following steps: preprocessing the speech-recognized text; extracting the video title contained in the preprocessed text; returning the video title in a video title library that is most similar to the extracted title, based on a similarity algorithm; using the speech-recognized text and the user's history information as input to a language model to obtain an error-corrected video title; and obtaining the final corrected title from the error-corrected title produced by the language model and the most similar title returned by the similarity algorithm. The invention also discloses a corresponding video title error correction system. The method and system overcome word segmentation errors, the limited error correction range of rule-based methods, and the limitations of end-to-end methods, and exploit the viewing habits of a specific user to improve the accuracy of speech recognition for video titles.

Description

Method and system for correcting video title error after voice recognition
Technical Field
The invention relates to the technical fields of natural language processing and deep learning, and in particular to a method and system for correcting video title errors after speech recognition.
Background
In human-computer interaction, voice interaction matches people's daily habits better than other modalities. With the spread of deep learning and advances in speech recognition technology, voice interaction has therefore been widely applied in fields such as smart homes, industrial production, medical care, and autonomous driving. In particular, with the rapid iteration of smart televisions in recent years, voice interaction has become an important interaction mode on smart TVs of many brands. People can bring up the series or movie they want to watch by saying a single sentence, instead of pressing the remote control many times as in the traditional way. In practice, however, dialects and non-standard Mandarin cause speech recognition errors, especially errors in recognizing video titles, which seriously degrade the user experience. Correcting video titles after speech recognition is therefore significant.
Most current text error correction after speech recognition follows one of two approaches. The first is rule-based: first locate the error, then correct it. The error detection part segments the text with a Chinese word segmentation tool to form a candidate set of suspected error positions; the error correction part traverses all suspected positions, replaces the words at those positions with pre-constructed candidate corrections, computes sentence perplexity with a language model, and compares and ranks the results over all candidates to obtain the best correction. The accuracy of this approach is limited by the correctness of the word segmentation, the quality of the pre-constructed lexicon, and the quality of the language model. The second is end-to-end: models such as RNNs (recurrent neural networks) correct the text directly, avoiding manual feature engineering and reducing manual workload. The accuracy of such methods is limited by the size and quality of the training corpus.
In summary, the two approaches above have the following problems. First, speech recognition errors can cause word segmentation errors, which in turn lead to the wrong correction being selected. Second, end-to-end methods are usually trained on a corpus, so they can only correct common, generic errors. Third, existing end-to-end methods are trained on statistical data and do not consider the viewing habits of a specific user, so even after correction the returned movie or series may not be one the user would actually watch.
Disclosure of Invention
The invention aims to provide a method and system for correcting video title errors after speech recognition, to solve the error correction failures caused in existing methods by word segmentation quality, pre-constructed lexicon quality, language model quality, and training corpus quality, as well as the drawback that only generic errors can be corrected without considering the user's viewing habits.
In order to achieve the above object, the technical solution adopted by the present invention is a method for correcting video title errors after speech recognition, comprising:
Step A, preprocessing the speech-recognized text;
Step B, extracting the video title contained in the preprocessed text;
Step C, returning the video title in a video title library that is most similar to the extracted title, based on a similarity algorithm;
Step D, using the speech-recognized text and the user history information as input to a language model to obtain an error-corrected video title;
and Step E, obtaining the final corrected title from the error-corrected video title produced by the language model and the most similar video title returned by the similarity algorithm.
Further, the method of step A at least comprises:
removing characters in the speech-recognized text that could affect video title extraction or error correction;
and converting the format of some characters in the speech-recognized text to achieve a uniform format.
Further, the method of step B at least comprises:
directly extracting the video title with hand-written regular expressions, based on a rule method;
and extracting the video title with a model trained on data, based on an entity recognition method.
Further, the data training model is a CRF, LSTM + CRF or BERT model.
Further, the method of step C comprises: performing similarity calculations based on pinyin, characters, and deep-learning-model vectors to obtain the corresponding video titles and similarity values, and obtaining the final most similar video title with a decision algorithm.
Further, the method of step D comprises: obtaining user history information; encoding the speech-recognized text with the encoder of a language model to obtain an encoding vector of the text; and using the encoding vector of the speech-recognized text together with the user history information as input to the language model's decoder to obtain the error-corrected video title from the language model.
Further, the language model is based on a seq2seq architecture, the encoder adopts an LSTM, GRU or BERT model, and the decoder adopts an LSTM or GRU model.
Further, in the method in step E, if the similarity value between the most similar video title and the video title after error correction is greater than the set threshold, the most similar video title is directly returned; otherwise, returning the video title after error correction.
In addition, another technical scheme adopted by the invention is a video title error correction system after voice recognition, which comprises:
the text preprocessing module is used for preprocessing the text after the voice recognition;
the video title extraction module is used for extracting video titles contained in the preprocessed text;
the similarity algorithm module is used for returning the video titles which are most similar to the extracted video titles in the video title library based on a similarity algorithm;
the language model module is used for using the text after the voice recognition and the user history information as the input of a language model to obtain the video title after error correction;
and the error correction title determining module is used for obtaining the final error correction title according to the video title obtained by the language model and the most similar video title returned by the similarity algorithm.
The invention has the beneficial effects that:
by the method for correcting the video title after voice recognition, the voice recognition text is processed in advance, and the video title is extracted by using rule or entity recognition and the most similar video title is obtained; then, the video title is corrected by utilizing the historical watching habit of the user, the corrected video title is obtained, and finally the final corrected video title is obtained by comparing the similarity of the two video titles. The method and the device avoid the problems that the wrong corrected word is extracted due to word segmentation errors, the error correction range of a rule method is limited by a word bank constructed in advance, and the end-to-end method is limited by a training corpus and can only correct general errors, and utilize the watching habit of a specific user, thereby improving the accuracy of voice recognition of video titles and improving the user experience.
Drawings
Fig. 1 is a flowchart of a video title error correction method after speech recognition according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a video title error correction system after speech recognition according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
Example one
Referring to fig. 1, the method for correcting video title errors after speech recognition in this embodiment includes:
and step A, preprocessing the text after voice recognition.
The method for preprocessing the text after voice recognition comprises the following steps:
and A01, removing characters which can influence video title extraction and error correction in the text after voice recognition. For example, in the specific embodiment, the words that affect the video title extraction and error correction, such as the "set", "season", and "xth set" included in the text, are removed.
The method for preprocessing the speech-recognized text further includes, but is not limited to:
Step A02, converting the format of some characters in the speech-recognized text to unify the format. For example, in this embodiment, upper-case letters are converted to lower case and Arabic numerals are converted to Chinese characters.
Step B, extracting the video title contained in the preprocessed text.
The method for extracting the video title contained in the speech-recognized text comprises:
Step B01, a rule-based method. For example, in this embodiment the video title "XXX" can be extracted directly from sentences such as "I want to see XXX" and "play movie XXX" with hand-written regular expressions.
The method for extracting the video title contained in the speech-recognized text further includes, but is not limited to:
Step B02, extracting the video title with a trained model, based on an entity recognition method. For example, if the title cannot be extracted with the rule-based method, it is extracted by entity recognition based on, but not limited to, models such as CRF, LSTM+CRF, and BERT.
Step C, returning the video title in the video title library that is most similar to the extracted title, based on a similarity algorithm.
The method for returning the video title in the library that is most similar to the extracted title, based on a similarity algorithm, comprises:
Step C01, a pinyin-based similarity calculation;
Step C02, a character-based similarity calculation;
and further includes, but is not limited to:
Step C03, a vector similarity calculation based on a deep learning model.
In this embodiment, the pinyin-based similarity calculation first converts the extracted title and the titles in the video title library into pinyin, then applies a similarity algorithm such as edit distance to the pinyin to obtain the library title most similar to the extracted title (title one) and its similarity value. The character-based similarity calculation likewise yields the most similar title two and its similarity value, and the vector similarity calculation based on the deep learning model yields the most similar title three and its similarity value. Finally, a decision algorithm determines the overall most similar title; the decision algorithm includes, but is not limited to, selecting the title corresponding to the maximum of the three similarity values or to the maximum weighted sum, where the weighting can follow existing weighting methods.
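A sketch of step C under these assumptions follows: pinyin conversion is delegated to a library such as pypinyin, both the pinyin and character comparisons use a normalised edit distance, the deep-learning vector similarity is left as a stub, and the weights of the decision algorithm are arbitrary example values.

```python
from pypinyin import lazy_pinyin  # one possible pinyin conversion library

def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance, shared by the pinyin and character comparisons."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]

def similarity(a: str, b: str) -> float:
    """Turn edit distance into a similarity value in [0, 1]."""
    return 1.0 - edit_distance(a, b) / (max(len(a), len(b)) or 1)

def most_similar_title(extracted, title_library, weights=(0.4, 0.4, 0.2)):
    """Step C sketch: C01 pinyin, C02 character and C03 vector similarity, combined by weighted sum."""
    extracted_py = " ".join(lazy_pinyin(extracted))
    best_title, best_score = None, -1.0
    for title in title_library:
        s_pinyin = similarity(extracted_py, " ".join(lazy_pinyin(title)))  # C01
        s_char = similarity(extracted, title)                              # C02
        s_vector = 0.0  # C03 placeholder: e.g. cosine similarity of model embeddings
        score = sum(w * s for w, s in zip(weights, (s_pinyin, s_char, s_vector)))
        if score > best_score:
            best_title, best_score = title, score
    return best_title, best_score
```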
Step D, using the speech-recognized text and the user history information as input to the language model to obtain the error-corrected video title.
In this embodiment, before the speech-recognized text and the user history information are used as input to the language model to obtain the error-corrected title, the method further comprises:
Step D01, obtaining the user history information. The user history information includes, but is not limited to, the titles of videos the user watched and searched for during the previous N days.
The method for obtaining the error-corrected video title from the speech-recognized text and the user history information comprises:
Step D02, encoding the speech-recognized text with the encoder of the language model to obtain an encoding vector of the text, and then using that encoding vector together with the user history information as input to the language model's decoder to obtain the error-corrected video title from the language model. The language model can adopt a seq2seq architecture; the encoder can use models such as LSTM, GRU, or BERT, and the decoder can likewise use models such as LSTM or GRU.
In step D, the preprocessed text and the user history information can also be used directly as input to the language model to obtain the error-corrected video title.
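A minimal PyTorch sketch of the seq2seq model in step D02 follows, with a GRU encoder over the recognised text, a second GRU over the user history, and a GRU decoder initialised from the fused states; the layer sizes, the fusion scheme, and the absence of attention are assumptions for illustration, not details fixed by the embodiment.

```python
import torch
import torch.nn as nn

class TitleCorrector(nn.Module):
    """Sketch of step D: encode the speech-recognised text and the user's viewing
    history, fuse both states, and decode the corrected title token by token.
    Vocabulary handling and training code are omitted; all sizes are illustrative."""

    def __init__(self, vocab_size: int, emb_dim: int = 128, hid_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        # The user history (titles watched / searched in the last N days) gets its
        # own encoder; its state is fused into the decoder's initial state.
        self.history_encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.fuse = nn.Linear(2 * hid_dim, hid_dim)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, asr_ids, history_ids, target_ids):
        # D02: encode the speech-recognised text.
        _, text_state = self.encoder(self.embed(asr_ids))
        # Encode the user history and fuse both states.
        _, hist_state = self.history_encoder(self.embed(history_ids))
        dec_state = torch.tanh(self.fuse(torch.cat([text_state, hist_state], dim=-1)))
        # Decode the corrected title (teacher forcing with the target tokens).
        dec_out, _ = self.decoder(self.embed(target_ids), dec_state)
        return self.out(dec_out)  # logits over the vocabulary at each step

# Shape check with dummy token ids (batch of 2).
model = TitleCorrector(vocab_size=5000)
logits = model(torch.randint(0, 5000, (2, 12)),   # recognised text
               torch.randint(0, 5000, (2, 30)),   # flattened user history
               torch.randint(0, 5000, (2, 8)))    # corrected title tokens
print(logits.shape)  # torch.Size([2, 8, 5000])
```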
Step E, obtaining the final corrected title from the error-corrected video title produced by the language model and the most similar video title returned by the similarity algorithm.
The method for obtaining the final corrected title from the language model's error-corrected title and the most similar title returned by the similarity algorithm comprises:
Step E01, if the similarity value between the most similar title returned by the similarity algorithm and the title corrected by the language model is greater than a set threshold, the most similar title returned by the similarity algorithm is returned directly; otherwise, the title corrected by the language model is returned.
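Step E01 reduces to a simple threshold comparison, sketched below; the threshold value is an assumption and `similarity` stands for whatever similarity measure the system already computes (for example, the one sketched for step C).

```python
def final_title(most_similar: str, lm_corrected: str, similarity, threshold: float = 0.8) -> str:
    """Step E01: prefer the library title when it is close enough to the
    language-model output; otherwise keep the language-model correction."""
    if similarity(most_similar, lm_corrected) > threshold:
        return most_similar   # a title known to exist in the video title library
    return lm_corrected       # fall back to the title corrected by the language model
```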
The method for correcting video titles after speech recognition provided by this embodiment avoids the error correction failures of existing methods, in which the wrong correction is selected because of word segmentation quality, pre-constructed lexicon quality, or language model quality. By introducing the user's historical viewing habits, the end-to-end component can correct errors in a way tailored to that user. In addition, by combining rule-based correction with end-to-end correction, the method ensures that the corrected title can be matched to an actual title in the video title library.
Example two
Referring to fig. 2, the video title error correction system after speech recognition in this embodiment includes:
the text preprocessing module, used for preprocessing the speech-recognized text, mainly by removing some characters from the text and converting the format of other characters;
the video title extraction module, used for extracting the video title contained in the preprocessed text, using a rule-based method and an entity recognition method with a trained model;
the similarity algorithm module, used for returning the video title in the video title library that is most similar to the extracted title, using character- and pinyin-based similarity calculations such as edit distance and a vector similarity calculation based on a deep learning model;
the language model module, used for obtaining the error-corrected video title from the speech-recognized text and the user history information, mainly with a seq2seq framework whose encoder can use models such as LSTM, GRU, or BERT and whose decoder can likewise use models such as LSTM or GRU;
and the error correction title determining module, used for obtaining the final corrected title from the title produced by the language model and the most similar title returned by the similarity algorithm: if the similarity of the title returned by the similarity algorithm is greater than the threshold, that title is returned directly; otherwise, the title obtained from the language model is used.
With the video title error correction system after speech recognition provided by this embodiment, given a speech-recognized text to be corrected, video titles that were recognized incorrectly can be corrected automatically. This addresses the error correction failures of existing rule-based methods, which are limited by word segmentation quality, the quality of the pre-constructed lexicon, and the quality of the language model, as well as the failure of end-to-end methods to consider the user's historical viewing habits.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (5)

1. A video title error correction method after voice recognition is characterized by comprising the following steps:
a, preprocessing a text after voice recognition; the method of the step A comprises the following steps: removing characters which can influence video title extraction or error correction in the text after voice recognition; converting the format of part of characters in the text after voice recognition to realize uniform format;
b, extracting the video title contained in the preprocessed text; the method of the step B comprises the following steps: directly extracting the video title by using a written regular expression based on a rule method; extracting a video title by using a data training model based on an entity identification method; the data training model is a CRF model, an LSTM + CRF model or a BERT model;
c, returning the video titles most similar to the extracted video titles in the video title library based on a similarity algorithm;
step D, using the text after the voice recognition and the user history information as the input of a language model to obtain the video title after error correction;
step E, obtaining the final corrected title from the error-corrected video title produced by the language model and the most similar video title returned by the similarity algorithm; if the similarity value between the most similar video title and the error-corrected video title is greater than a set threshold, the most similar video title is returned directly; otherwise, the error-corrected video title is returned.
2. The video title error correction method according to claim 1, wherein the method of step C comprises: performing similarity calculations based on pinyin, characters, and deep-learning-model vectors to obtain the corresponding video titles and similarity values, and obtaining the final most similar video title with a decision algorithm.
3. The method according to claim 1, wherein the method of step D comprises obtaining user history information, encoding the speech-recognized text using an encoder of a language model to obtain encoded vectors of the speech-recognized text, and obtaining the error-corrected video title from the language model using the encoded vectors of the speech-recognized text and the user history information as input to a language model decoder.
4. The video title correction method of claim 3, wherein the language model is based on a seq2seq architecture, the encoder employs an LSTM, GRU or BERT model, and the decoder employs an LSTM or GRU model.
5. A voice-recognized video title error correction system for implementing the video title error correction method according to any one of claims 1-4, comprising:
the text preprocessing module is used for preprocessing the text after the voice recognition;
the video title extraction module is used for extracting video titles contained in the preprocessed text;
the similarity algorithm module is used for returning the video titles which are most similar to the extracted video titles in the video title library based on a similarity algorithm;
the language model module is used for using the text after the voice recognition and the user history information as the input of a language model to obtain the video title after error correction;
and the error correction title determining module is used for obtaining the final error correction title according to the video title obtained by the language model and the most similar video title returned by the similarity algorithm.
CN202011418650.XA 2020-12-07 2020-12-07 Method and system for correcting video title error after voice recognition Active CN112417867B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011418650.XA CN112417867B (en) 2020-12-07 2020-12-07 Method and system for correcting video title error after voice recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011418650.XA CN112417867B (en) 2020-12-07 2020-12-07 Method and system for correcting video title error after voice recognition

Publications (2)

Publication Number Publication Date
CN112417867A CN112417867A (en) 2021-02-26
CN112417867B true CN112417867B (en) 2022-10-18

Family

ID=74775260

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011418650.XA Active CN112417867B (en) 2020-12-07 2020-12-07 Method and system for correcting video title error after voice recognition

Country Status (1)

Country Link
CN (1) CN112417867B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114241471B (en) * 2022-02-23 2022-06-21 阿里巴巴达摩院(杭州)科技有限公司 Video text recognition method and device, electronic equipment and readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106598939A (en) * 2016-10-21 2017-04-26 北京三快在线科技有限公司 Method and device for text error correction, server and storage medium
CN107741928A (en) * 2017-10-13 2018-02-27 四川长虹电器股份有限公司 A kind of method to text error correction after speech recognition based on field identification
CN110782881A (en) * 2019-10-25 2020-02-11 四川长虹电器股份有限公司 Video entity error correction method after speech recognition and entity recognition
CN110782892A (en) * 2019-10-25 2020-02-11 四川长虹电器股份有限公司 Voice text error correction method
CN111540356A (en) * 2020-04-20 2020-08-14 苏州思必驰信息科技有限公司 Correction method and system for voice conversation

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10586556B2 (en) * 2013-06-28 2020-03-10 International Business Machines Corporation Real-time speech analysis and method using speech recognition and comparison with standard pronunciation
CN106486126B (en) * 2016-12-19 2019-11-19 北京云知声信息技术有限公司 Speech recognition error correction method and device
CN109918485B (en) * 2019-01-07 2020-11-27 口碑(上海)信息技术有限公司 Method and device for identifying dishes by voice, storage medium and electronic device
CN112016275A (en) * 2020-10-30 2020-12-01 北京淇瑀信息科技有限公司 Intelligent error correction method and system for voice recognition text and electronic equipment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106598939A (en) * 2016-10-21 2017-04-26 北京三快在线科技有限公司 Method and device for text error correction, server and storage medium
CN107741928A (en) * 2017-10-13 2018-02-27 四川长虹电器股份有限公司 A kind of method to text error correction after speech recognition based on field identification
CN110782881A (en) * 2019-10-25 2020-02-11 四川长虹电器股份有限公司 Video entity error correction method after speech recognition and entity recognition
CN110782892A (en) * 2019-10-25 2020-02-11 四川长虹电器股份有限公司 Voice text error correction method
CN111540356A (en) * 2020-04-20 2020-08-14 苏州思必驰信息科技有限公司 Correction method and system for voice conversation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ASR Error Correction with Augmented Transformer for Entity Retrieval; Haoyu Wang et al.; INTERSPEECH; 2020-10-29; 1550-1554 *
面向领域的语音转换后文本纠错研究 (Domain-oriented text error correction after speech-to-text conversion); 张俊祺; 中国优秀硕士学位论文全文数据库 信息科技辑 (China Masters' Theses Full-text Database, Information Science and Technology); 2020-11-15; I136-362 *

Also Published As

Publication number Publication date
CN112417867A (en) 2021-02-26

Similar Documents

Publication Publication Date Title
US10380996B2 (en) Method and apparatus for correcting speech recognition result, device and computer-readable storage medium
CN110489555B (en) Language model pre-training method combined with similar word information
CN111968649B (en) Subtitle correction method, subtitle display method, device, equipment and medium
CN107305541B (en) Method and device for segmenting speech recognition text
CN111723791A (en) Character error correction method, device, equipment and storage medium
CN110717031A (en) Intelligent conference summary generation method and system
CN114444479A (en) End-to-end Chinese speech text error correction method, device and storage medium
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN113255652B (en) Text correction method, device, equipment and medium
CN114022882B (en) Text recognition model training method, text recognition device, text recognition equipment and medium
CN113948066B (en) Error correction method, system, storage medium and device for real-time translation text
CN111415537A (en) Symbol-labeling-based word listening system for primary and secondary school students
CN112382295B (en) Speech recognition method, device, equipment and readable storage medium
CN112417867B (en) Method and system for correcting video title error after voice recognition
CN113327621A (en) Model training method, user identification method, system, device and medium
CN115730585A (en) Text error correction and model training method and device, storage medium and equipment
CN113535896B (en) Search method, search device, electronic equipment and storage medium
CN114973229A (en) Text recognition model training method, text recognition device, text recognition equipment and medium
CN114581926A (en) Multi-line text recognition method, device, equipment and medium
CN111477212B (en) Content identification, model training and data processing method, system and equipment
CN112163434A (en) Text translation method, device, medium and electronic equipment based on artificial intelligence
CN111462734A (en) Semantic slot filling model training method and system
CN115396690A (en) Audio and text combination method and device, electronic equipment and storage medium
CN115019319A (en) Structured picture content identification method based on dynamic feature extraction
CN110888976B (en) Text abstract generation method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant