CN110782881A - Video entity error correction method after speech recognition and entity recognition - Google Patents

Video entity error correction method after speech recognition and entity recognition

Info

Publication number
CN110782881A
CN110782881A (application number CN201911023854.0A)
Authority
CN
China
Prior art keywords
entity
recognition
error correction
data
entities
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911023854.0A
Other languages
Chinese (zh)
Inventor
孙云云
刘楚雄
唐军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Changhong Electric Co Ltd
Original Assignee
Sichuan Changhong Electric Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Changhong Electric Co Ltd filed Critical Sichuan Changhong Electric Co Ltd
Priority to CN201911023854.0A priority Critical patent/CN110782881A/en
Publication of CN110782881A publication Critical patent/CN110782881A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/26: Speech to text systems
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training

Abstract

The invention relates to speech-text processing and discloses a video entity error correction method applied after speech recognition and entity recognition. It solves the problem that, during human-computer voice interaction, factors such as non-standard Mandarin, incomplete search sentence patterns, and noise leave the recognized film and television entities incomplete or wrong, degrading the user experience. The method comprises the following steps: A. analyze and preprocess the text data produced by speech-to-text conversion to obtain a sample data set; B. train a BiLSTM+CRF named entity recognition model on the sample data; C. process the film and television entity data recently requested by users at high frequency to construct an entity correction data set; D. during actual voice interaction, run the trained BiLSTM+CRF named entity recognition model on the recognized text for prediction and entity verification; E. perform error correction on entities that fail verification; F. package the error correction result.

Description

Video entity error correction method after speech recognition and entity recognition
Technical Field
The invention relates to speech-text processing, and in particular to a video entity error correction method applied after speech recognition and entity recognition.
Background
With the spread of deep learning, great breakthroughs have been made in computer vision, speech recognition, natural language processing, and other fields. At present, the accuracy of speech recognition reaches 97%. Compared with other human-computer interaction modes, voice interaction better matches people's daily habits and is more efficient, so speech recognition technology is widely applied in fields such as smart homes, industrial production, communications, medical care, and autonomous driving. The television, present in nearly every household, increasingly needs to be intelligent: all the devices in a home can be operated through the television, and a desired film can be watched from the sofa simply by speaking. A smart television realizes two-way human-computer interaction and integrates audio and video, entertainment, data, and other functions into one product that meets users' diversified and personalized demands. Its purpose is to bring users a more convenient experience, and this is the current trend for televisions.
In actual smart-television voice interaction, most users are elderly people and children. Elderly users often speak non-standard Mandarin or dialect; children searching for videos use incomplete sentence patterns and may remember only a character's role in a cartoon; environmental noise adds further interference. As a result, the speech recognition error rate is relatively high. The prior art focuses on improving the accuracy of speech recognition itself but lacks further processing of the recognition results.
Owing to interference from environmental noise, equipment, accents, and other factors, the text converted by speech recognition often contains a large amount of noise, such as homophones, near-homophones, and wrongly written characters, and such text errors in turn cause word segmentation errors. Most existing error correction after speech recognition uses statistical methods based on the combination probability of word recognition results, which has two problems. First, speech recognition errors cause word segmentation errors, so wrong terms are often extracted during segmentation because of the wrong characters. Second, when the method is applied in a specific domain, the lack of a large-scale corpus means that the limited corpus samples can hardly reflect the true term probability distribution accurately, so probability-based calculation cannot reach its theoretical expectation. Therefore, in practical applications such as dialogue robots, the text correction achieved by statistical methods is unsatisfactory and creates great resistance to subsequent semantic analysis and intent recognition.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: provide a video entity error correction method applied after speech recognition and entity recognition, solving the problem that film and television entities recognized from speech are incomplete or wrong, owing to factors such as non-standard Mandarin, incomplete search sentence patterns, and noise during human-computer voice interaction, which degrades the user experience.
The technical scheme adopted by the invention for solving the technical problems is as follows:
A video entity error correction method applied after speech recognition and entity recognition comprises the following steps:
A. analyze and preprocess the text data produced by speech-to-text conversion to obtain a sample data set;
B. train a BiLSTM+CRF named entity recognition model on the sample data;
C. process the film and television entity data recently requested by users at high frequency to construct an entity correction data set;
D. during actual voice interaction, run the trained BiLSTM+CRF named entity recognition model on the recognized text for prediction and entity verification; if verification passes, return the output; if it fails, go to step E;
E. perform error correction on entities that fail verification;
F. package the error correction result.
As a further optimization, step a specifically includes:
Perform cluster analysis on the speech-converted user text data collected from television terminals; determine film and television entity categories, words, and tags; annotate entity labels on users' common search sentence patterns; and train 300-dimensional character vectors with word2vec as the sample data.
As a further optimization, step B specifically includes:
Divide the sample data into a training set, a test set, and a validation set in a certain proportion, and train the BiLSTM+CRF named entity recognition model.
As a further optimization, step C specifically includes:
Periodically fetch from the knowledge graph the film and television entities whose recent user request counts exceed a certain threshold, segment the entity names into 2-grams, and store the entities that share the same 2-gram string in a redis hash keyed by that string, thereby constructing the entity correction data set.
As a further optimization, in step D, before prediction with the trained BiLSTM+CRF named entity recognition model, special symbols are removed from the recognized text and 'episode/season' entities are stripped, and the processed data is used as the model input; after model prediction, the entities containing film and television titles in the prediction result are queried and verified against the knowledge graph; if an entity is found, the output is returned, and if not, the method proceeds to step E.
As a further optimization, step E specifically comprises: first segment the entity that failed verification into 2-grams, then convert each 2-gram to pinyin in turn and look up all entities containing that 2-gram in the correction data set, and finally obtain the correct entity according to the error correction algorithm.
As a further optimization, the error correction algorithm comprises: if the pinyin similarity or the initial-letter similarity between an entity in the correction data set and the failed entity exceeds a preset threshold, take that entity as the correction result; if neither reaches the preset threshold, compute a weighted score from the request count, pinyin string similarity, Chinese character similarity, and initial-letter similarity, take the top n entities whose scores exceed the threshold for inspection, and, if an animation title entity is among them, output it preferentially.
As a further optimization, in step F, the error-corrected entity and the other entities predicted by the model are packaged together.
The beneficial effects of the invention are as follows: an entity correction data set is constructed from the film and television entities recently requested by users at high frequency and is refreshed periodically; after model prediction, if entity verification fails, the failed entity is corrected against this data set; during correction, weighted scores over characters, pinyin, initial-letter similarity, and request counts are considered, so different entity error types can be corrected.
Drawings
FIG. 1 is a flow chart of a video entity error correction method after speech recognition and entity recognition according to the present invention;
fig. 2 is a flowchart illustrating error correction processing performed on an entity that fails to verify in the embodiment.
Detailed Description
The invention aims to provide a video entity error correction method applied after speech recognition and entity recognition, solving the problem that film and television entities recognized from speech are incomplete or wrong, owing to factors such as non-standard Mandarin, incomplete search sentence patterns, and noise during human-computer voice interaction, which degrades the user experience. Starting from real user data, the method mines user requirements and accounts for the fact that most users are elderly people and children, whose voice interaction exhibits non-standard Mandarin, incomplete sentence patterns, noise interference, and similar characteristics. It strengthens the recognition of out-of-vocabulary entities and adds prediction and error correction of the user's probable intent after entity recognition, with the aim of improving the user experience.
Fig. 1 shows the video entity error correction method after speech recognition and entity recognition of the present invention, which comprises the following steps:
(1) analyze and preprocess the text data produced by speech-to-text conversion to obtain a sample data set;
(2) train a BiLSTM+CRF named entity recognition model on the sample data;
(3) process the film and television entity data recently requested by users at high frequency to construct an entity correction data set;
(4) during actual voice interaction, run the trained BiLSTM+CRF named entity recognition model on the recognized text for prediction and entity verification; if verification passes, return the output; if it fails, go to step (5);
(5) perform error correction on entities that fail verification;
(6) package the error correction result.
The scheme of the invention is further described below with reference to the drawings and an embodiment.
Embodiment:
First, analyze and preprocess the text data after speech recognition:
User data collected from television terminals is analyzed through K-means clustering, frequency statistics, user behavior data, and the like to determine the basic requirements of film and television search, such as common search sentence patterns and the conditions by which videos are searched, and the entity categories and names are determined in combination with the service requirements. The training data is then annotated manually according to the BIO scheme; since no ready-made annotated data is available, 300-dimensional character vectors are trained on real user data with the word2vec language model and used as the bottom-layer input of the BiLSTM.
Second, train the BiLSTM+CRF entity recognition model:
All annotated training data is divided into a training set, a test set, and a validation set in the proportion 0.7 : 0.2 : 0.1.
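The 0.7 : 0.2 : 0.1 split described above can be sketched as follows (a minimal illustration; the shuffling step and fixed seed are assumptions, not specified in the patent):

```python
import random

def split_dataset(samples, train_frac=0.7, test_frac=0.2, seed=42):
    """Shuffle labelled samples and split into training, test, and
    validation sets; the remainder (~0.1) becomes the validation set."""
    data = list(samples)
    random.Random(seed).shuffle(data)
    n_train = int(len(data) * train_frac)
    n_test = int(len(data) * test_frac)
    return (data[:n_train],
            data[n_train:n_train + n_test],
            data[n_train + n_test:])

train_set, test_set, val_set = split_dataset(range(100))
```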
Taking a sentence as the unit, a sentence (a sequence of characters) containing n characters is written as:
x = (x_1, x_2, ..., x_n)
where x_i is the id in the dictionary of the i-th character of the sentence; from this the word2Id vector of each character is obtained, whose dimension is the size of the dictionary.
The dictionary is built by counting the frequency of every character over all the training data; after sorting by frequency in descending order, each character receives a unique id, and a marker '<UNK>' is reserved for unknown characters.
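The dictionary construction described above can be sketched as follows (a minimal illustration; assigning '<UNK>' the last id is an assumption about the convention, not stated in the patent):

```python
from collections import Counter

def build_char_dict(corpus):
    """Count character frequency over the training corpus, assign ids in
    descending-frequency order, and reserve an entry for unknown chars."""
    freq = Counter(ch for sentence in corpus for ch in sentence)
    char2id = {ch: i for i, (ch, _) in enumerate(freq.most_common())}
    char2id["<UNK>"] = len(char2id)   # marker for out-of-vocabulary characters
    return char2id

def sentence_to_ids(sentence, char2id):
    """Map a sentence to the id sequence x = (x_1, ..., x_n)."""
    unk = char2id["<UNK>"]
    return [char2id.get(ch, unk) for ch in sentence]

char2id = build_char_dict(["我想看西游记", "我想看芈月传"])
```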
The invention uses a linear CRF to score entity labels. The softmax approach makes a local choice at each position and does not use the surrounding labels to help the decision. For example, in the actress name '杨幂' (Yang Mi), once '幂' is given the label 'I-actor', that should help us decide that the preceding '杨' marks the beginning of the actor entity. A linear CRF defines a global score over the whole label sequence.
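The difference between local softmax decisions and the CRF's global score can be sketched as follows; the toy emission and transition scores are invented for illustration and are not from the patent:

```python
def crf_score(emissions, transitions, tags):
    """Global score of a tag sequence under a linear-chain CRF:
    per-position emission scores plus transition scores between
    adjacent tags."""
    score = sum(emissions[i][t] for i, t in enumerate(tags))
    score += sum(transitions[a][b] for a, b in zip(tags, tags[1:]))
    return score

# Toy scores for a two-character name: position 0 locally prefers "O",
# but the B-actor -> I-actor transition makes the entity reading win globally.
emissions = [{"B-actor": 0.5, "I-actor": -1.0, "O": 0.6},
             {"B-actor": -1.0, "I-actor": 1.0, "O": 0.0}]
transitions = {"B-actor": {"B-actor": -1.0, "I-actor": 2.0, "O": -1.0},
               "I-actor": {"B-actor": -1.0, "I-actor": 0.0, "O": 0.0},
               "O":       {"B-actor": 0.0, "I-actor": -2.0, "O": 0.0}}
```

Per-position argmax picks "O" at the first character, while the globally best sequence labels both characters as part of the actor entity.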
Third, process the film and television entity data recently requested by users at high frequency to construct the entity correction data set:
Because film and television data is updated quickly, a hit show cools down over time. If all film and television data were used as the data set to be corrected against, error correction would take too long and the user experience would suffer. The invention therefore corrects only against recent hot data, which covers most users' needs. For example: the film and television entities with more than 500 requests are fetched from the knowledge graph and their names are segmented into 2-grams. A title beginning with '玩具' ('toy', pinyin wan-ju) yields the overlapping 2-grams of its name, and all entities sharing the 2-gram '玩具', such as other 'toy ...' titles, are stored in a redis hash keyed by 'wan-ju', each record in the form 'entity name # request count # is-animation flag'. This storage layout reduces the computational complexity of the later pinyin and character similarity calculations. Common error correction methods are based on fuzzy matching after word segmentation, but segmentation accuracy is very low for entities with missing, extra, or wrong characters, so 2-gram segmentation is used here instead.
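The 2-gram indexing of hot entities can be sketched as follows; a plain Python dict stands in for the redis hash, and the sample titles and request counts are invented for illustration:

```python
def bigrams(title):
    """Overlapping 2-gram segmentation of an entity name."""
    return [title[i:i + 2] for i in range(len(title) - 1)]

def build_correction_index(entities):
    """Index hot entities under every 2-gram they contain.

    `entities` maps title -> (request_count, is_animation); each stored
    tuple corresponds to the 'entity name # request count # is-animation'
    record described in the text."""
    index = {}
    for title, (count, is_anim) in entities.items():
        for gram in set(bigrams(title)):
            index.setdefault(gram, []).append((title, count, is_anim))
    return index

# Illustrative hot titles, not from the patent:
hot = {"玩具总动员": (1450, True), "玩具屋": (820, True), "西游记": (1500, False)}
index = build_correction_index(hot)
```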
Fourth, perform prediction and entity verification with the trained BiLSTM+CRF named entity recognition model:
The text data is preprocessed before being fed to the model: entities such as 'episode' and 'season' are identified by rules and removed. For example, if the raw text is '我想看芈月传第二集' ('I want to watch The Legend of Mi Yue, episode 2'), the processed model input is '我想看芈月传' ('I want to watch The Legend of Mi Yue'). This processing improves the entity recognition rate for queries containing episode/season expressions;
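The rule-based removal of episode/season expressions might look like the following sketch; the exact rule set is not given in the patent, so this pattern is an assumption:

```python
import re

# Episode/season expressions such as 第二集 or 第3季; the real rule set is
# not specified in the patent, so this pattern is illustrative only.
EPISODE_PATTERN = re.compile(r"第[0-9一二三四五六七八九十百]+[集季期部]")

def strip_episode_markers(text):
    """Remove episode/season entities before feeding text to the NER model."""
    return EPISODE_PATTERN.sub("", text)
```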
After model prediction, the predicted entity is checked against the knowledge graph to see whether it exists; if it is found, it is returned directly. If it cannot be found, the entity needs correction. For example, for the recognized text '我想看半月传', the model predicts the entity '半月传'; verification shows it is not in the knowledge graph, indicating that the user utterance probably contains an error (for '芈月传', The Legend of Mi Yue). At this point step five is performed to further correct the predicted entity.
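The verification step can be sketched as follows, with a plain set standing in for the knowledge-graph query; the sample entries are illustrative, not from the patent:

```python
def verify_entity(entity, knowledge_graph):
    """Look the predicted entity up in the knowledge graph; return it when
    found, or None to signal that error correction (step five) is needed.
    A plain set stands in here for the real knowledge-graph query."""
    return entity if entity in knowledge_graph else None

kg = {"芈月传", "玩具总动员"}   # illustrative entries
```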
Fifth, perform error correction on the entities that fail verification:
The error correction flow is shown in fig. 2. First, the entity that failed verification is segmented into overlapping 2-grams; each 2-gram is then converted to pinyin in turn and used to retrieve all correction-set entities that contain it, after which the correct entity is obtained by the error correction algorithm. The basic idea of the algorithm is as follows: if the pinyin similarity is greater than 80, or the initial-letter similarity is greater than 88, return the matched film or television entity; if neither threshold is reached, compute a weighted score from the request count, pinyin string similarity, Chinese character similarity, and initial-letter similarity, and take the top three entities whose scores exceed the threshold for inspection. This scoring mechanism accounts for the error types of different entities. The request counts are min-max normalized so that their values fall within [0, 100]. Let x_1, x_2, ..., x_n be the request counts of the entities to be scored; each count is normalized as:

x_i' = (x_i - min(x)) / (max(x) - min(x)) × 100
When only one candidate entity exists, its normalized request count is set to 90, which compensates candidates that have high pinyin similarity but a low request count.
When more than one entity's score exceeds the threshold (for example, 65 points), the top three are inspected; if an animation title among them has a score within 5 points of the first-ranked entity, that animation title entity is taken as the correction result. This rule comes from analysis of user data, in which most errors concentrate on animation film entities.
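The weighted scoring described above can be sketched as follows; the patent does not give the weights, so the values here are assumptions, and difflib's ratio stands in for the pinyin, character, and initial-letter similarity measures:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """String similarity scaled to [0, 100]."""
    return SequenceMatcher(None, a, b).ratio() * 100

def normalize_requests(counts):
    """Min-max normalize request counts into [0, 100]; a lone candidate
    is fixed at 90, as described in the text."""
    if len(counts) == 1 or min(counts) == max(counts):
        return [90.0] * len(counts)
    lo, hi = min(counts), max(counts)
    return [(c - lo) / (hi - lo) * 100 for c in counts]

def rank_candidates(query, candidates, weights=(0.2, 0.4, 0.2, 0.2), top_n=3):
    """Weighted score over request count, pinyin similarity, Chinese
    character similarity, and initial-letter similarity; the weight values
    are assumptions.  `query` and each candidate are dicts with 'hanzi',
    'pinyin', 'initials', and (for candidates) 'requests' keys."""
    w_req, w_py, w_hz, w_init = weights
    norm = normalize_requests([c["requests"] for c in candidates])
    scored = []
    for cand, req in zip(candidates, norm):
        score = (w_req * req
                 + w_py * similarity(query["pinyin"], cand["pinyin"])
                 + w_hz * similarity(query["hanzi"], cand["hanzi"])
                 + w_init * similarity(query["initials"], cand["initials"]))
        scored.append((score, cand["hanzi"]))
    scored.sort(reverse=True)
    return scored[:top_n]

# "半月传" (a likely mis-recognition of 芈月传) against two illustrative titles.
query = {"hanzi": "半月传", "pinyin": "banyuezhuan", "initials": "byz"}
candidates = [
    {"hanzi": "芈月传", "pinyin": "miyuezhuan", "initials": "myz", "requests": 900},
    {"hanzi": "西游记", "pinyin": "xiyouji", "initials": "xyj", "requests": 1500},
]
ranking = rank_candidates(query, candidates)
```

Even with fewer requests, 芈月传 outranks 西游记 because three of the four signals favor it and the pinyin similarity carries the largest weight.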
The problems that this error correction matching algorithm can solve are as follows:
sixthly, packaging and outputting the entity result:
and packaging the video entity obtained by error correction and other entities obtained by model prediction and outputting.

Claims (8)

1. A video entity error correction method after voice recognition and entity recognition is characterized by comprising the following steps:
A. analyzing and preprocessing the text data after voice conversion to obtain a sample data set;
B. training a named entity recognition model based on Bilstm + crf by using sample data;
C. processing the movie entity data of the recent high-frequency request of the user to construct an entity correction data set;
D. in the actual voice interaction process, according to text data after voice recognition, a trained named entity recognition model based on Bilstm + crf is used for prediction and entity verification, if the entity verification is passed, output is returned, and if the entity verification fails, the step E is carried out;
E. error correction processing is carried out on the entity which fails in verification;
F. and packaging the error correction result.
2. The method as claimed in claim 1, wherein the video entity error correction after speech recognition and entity recognition,
the step A specifically comprises the following steps:
carrying out cluster analysis on user text data after voice conversion acquired from a television terminal, determining the entity type, word and tag of a movie and television, marking an entity label on a commonly used search sentence pattern of a user, and training a 300-dimensional character vector by using word2vec as sample data.
3. The method as claimed in claim 1, wherein the video entity error correction after speech recognition and entity recognition,
the step B specifically comprises the following steps:
dividing the sample data into a training data set, a testing data set and a verification data set according to a certain proportion, and training a named entity recognition model based on Bilstm + crf.
4. The method as claimed in claim 1, wherein the video entity error correction after speech recognition and entity recognition,
the step C specifically comprises the following steps:
the method comprises the steps of regularly obtaining a film and television entity with the recent user request times exceeding a certain threshold value from a knowledge graph, then carrying out 2-gram segmentation on film and television entity data, storing entities containing the same character strings obtained through segmentation in a redis hash structure with the character strings as keys, and constructing an entity correction data set.
5. The method as claimed in claim 1, wherein the video entity error correction after speech recognition and entity recognition,
in the step D, before the prediction is carried out by utilizing a trained named entity recognition model based on Bilstm + crf, firstly removing special symbols from the text data after voice recognition, removing entities of 'set/season', and inputting the processed data as a model; and after model prediction, carrying out entity knowledge graph query and verification on the entity containing the film and television name in the prediction result, if the entity can be found, returning to output, and if the entity cannot be found, entering the step E.
6. The method as claimed in claim 1, wherein the video entity error correction after speech recognition and entity recognition comprises:
the step E specifically comprises the following steps: firstly, 2-gram segmentation is carried out on entities which fail to be verified, then the word segmentation result is converted into pinyin in a circulating mode, all entities containing the word segmentation are searched in a correction data set, and then correct entities are obtained according to an error correction algorithm.
7. The method as claimed in claim 6, wherein the video entity error correction after speech recognition and entity recognition,
the error correction algorithm comprises: if the pinyin similarity or the initial similarity between an entity in the correction data set and an entity failed in verification is greater than a preset threshold, taking the entity in the correction data set as an error correction result; if the pinyin similarity and the first letter similarity do not reach the preset threshold, performing weighted calculation on scores based on the request times, the pinyin character similarity, the Chinese character similarity and the first letter similarity, taking the first n entities with the scores exceeding the threshold for observation, and preferentially outputting the entity of the animation titles if the entity of the animation titles exists in the entities.
8. The method according to any one of claims 1-7, wherein the video entity error correction after speech recognition and entity recognition,
and in step F, packaging the entity subjected to error correction processing and other entities predicted by the model.
CN201911023854.0A 2019-10-25 2019-10-25 Video entity error correction method after speech recognition and entity recognition Pending CN110782881A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911023854.0A CN110782881A (en) 2019-10-25 2019-10-25 Video entity error correction method after speech recognition and entity recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911023854.0A CN110782881A (en) 2019-10-25 2019-10-25 Video entity error correction method after speech recognition and entity recognition

Publications (1)

Publication Number Publication Date
CN110782881A true CN110782881A (en) 2020-02-11

Family

ID=69386648

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911023854.0A Pending CN110782881A (en) 2019-10-25 2019-10-25 Video entity error correction method after speech recognition and entity recognition

Country Status (1)

Country Link
CN (1) CN110782881A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105528339A (en) * 2014-10-20 2016-04-27 联想(新加坡)私人有限公司 Text correction based on context
US20160357730A1 (en) * 2015-06-08 2016-12-08 International Business Machines Corporation Contextual auto-correct dictionary
CN105550171A (en) * 2015-12-31 2016-05-04 北京奇艺世纪科技有限公司 Error correction method and system for query information of vertical search engine
CN108304385A (en) * 2018-02-09 2018-07-20 叶伟 A kind of speech recognition text error correction method and device
CN109145287A (en) * 2018-07-05 2019-01-04 广东外语外贸大学 Indonesian word error-detection error-correction method and system
CN109766556A (en) * 2019-01-18 2019-05-17 广东小天才科技有限公司 A kind of method and apparatus of corpus reparation
CN110210029A (en) * 2019-05-30 2019-09-06 浙江远传信息技术股份有限公司 Speech text error correction method, system, equipment and medium based on vertical field
CN110298042A (en) * 2019-06-26 2019-10-01 四川长虹电器股份有限公司 Based on Bilstm-crf and knowledge mapping video display entity recognition method

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401012A (en) * 2020-03-09 2020-07-10 北京声智科技有限公司 Text error correction method, electronic device and computer readable storage medium
CN111401012B (en) * 2020-03-09 2023-11-21 北京声智科技有限公司 Text error correction method, electronic device and computer readable storage medium
CN111460827A (en) * 2020-04-01 2020-07-28 北京爱咔咔信息技术有限公司 Text information processing method, system, equipment and computer readable storage medium
CN111554295A (en) * 2020-04-24 2020-08-18 科大讯飞(苏州)科技有限公司 Text error correction method, related device and readable storage medium
CN111554295B (en) * 2020-04-24 2021-06-22 科大讯飞(苏州)科技有限公司 Text error correction method, related device and readable storage medium
CN112199953A (en) * 2020-08-24 2021-01-08 广州九四智能科技有限公司 Method and device for extracting information in telephone conversation and computer equipment
CN112101032A (en) * 2020-08-31 2020-12-18 广州探迹科技有限公司 Named entity identification and error correction method based on self-distillation
CN112084775A (en) * 2020-09-10 2020-12-15 中航华东光电(上海)有限公司 Text error correction method after voice conversion
CN112084775B (en) * 2020-09-10 2021-09-07 中航华东光电(上海)有限公司 Text error correction method after voice conversion
CN112488013A (en) * 2020-12-04 2021-03-12 重庆邮电大学 Depth-forged video detection method and system based on time sequence inconsistency
CN112488013B (en) * 2020-12-04 2022-09-02 重庆邮电大学 Depth-forged video detection method and system based on time sequence inconsistency
CN112417867A (en) * 2020-12-07 2021-02-26 四川长虹电器股份有限公司 Method and system for correcting video title error after voice recognition
CN112417867B (en) * 2020-12-07 2022-10-18 四川长虹电器股份有限公司 Method and system for correcting video title error after voice recognition
CN112966116A (en) * 2021-05-19 2021-06-15 南京视察者智能科技有限公司 Method and device for intelligently generating reception record and terminal equipment
CN112966116B (en) * 2021-05-19 2021-07-23 南京视察者智能科技有限公司 Method and device for intelligently generating reception record and terminal equipment
CN114386422A (en) * 2022-01-14 2022-04-22 淮安市创新创业科技服务中心 Intelligent aid decision-making method and device based on enterprise pollution public opinion extraction
CN114386422B (en) * 2022-01-14 2023-09-15 淮安市创新创业科技服务中心 Intelligent auxiliary decision-making method and device based on enterprise pollution public opinion extraction
CN115827815A (en) * 2022-11-17 2023-03-21 西安电子科技大学广州研究院 Keyword extraction method and device based on small sample learning
CN115827815B (en) * 2022-11-17 2023-12-29 西安电子科技大学广州研究院 Keyword extraction method and device based on small sample learning

Similar Documents

Publication Publication Date Title
CN110782881A (en) Video entity error correction method after speech recognition and entity recognition
WO2023065544A1 (en) Intention classification method and apparatus, electronic device, and computer-readable storage medium
CN108287858B (en) Semantic extraction method and device for natural language
CN108304372B (en) Entity extraction method and device, computer equipment and storage medium
CN102682763B (en) Method, device and terminal for correcting named entity vocabularies in voice input text
CN109637537B (en) Method for automatically acquiring annotated data to optimize user-defined awakening model
CN111291195B (en) Data processing method, device, terminal and readable storage medium
WO2021139107A1 (en) Intelligent emotion recognition method and apparatus, electronic device, and storage medium
CN111079418B (en) Named entity recognition method, device, electronic equipment and storage medium
CN113672708A (en) Language model training method, question and answer pair generation method, device and equipment
CN113268576B (en) Deep learning-based department semantic information extraction method and device
CN113326702B (en) Semantic recognition method, semantic recognition device, electronic equipment and storage medium
CN111709242A (en) Chinese punctuation mark adding method based on named entity recognition
CN111368542A (en) Text language association extraction method and system based on recurrent neural network
CN114676255A (en) Text processing method, device, equipment, storage medium and computer program product
CN113704416A (en) Word sense disambiguation method and device, electronic equipment and computer-readable storage medium
CN113535897A (en) Fine-grained emotion analysis method based on syntactic relation and opinion word distribution
CN115064154A (en) Method and device for generating mixed language voice recognition model
CN111666374A (en) Method for integrating additional knowledge information into deep language model
Ondel et al. Bayesian phonotactic language model for acoustic unit discovery
CN117454898A (en) Method and device for realizing legal entity standardized output according to input text
CN112989839A (en) Keyword feature-based intent recognition method and system embedded in language model
CN116595975A (en) Aspect-level emotion analysis method for word information enhancement based on sentence information
CN116029300A (en) Language model training method and system for strengthening semantic features of Chinese entities
CN111858860B (en) Search information processing method and system, server and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200211