CN114581813B - Visual language identification method and related equipment - Google Patents

Visual language identification method and related equipment

Info

Publication number
CN114581813B
CN114581813B CN202210046616.7A
Authority
CN
China
Prior art keywords
sequence
candidate
visual
lip
viseme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210046616.7A
Other languages
Chinese (zh)
Other versions
CN114581813A (en)
Inventor
谢东亮
孙保胜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yunchen Shuzhi Technology Co.,Ltd.
Original Assignee
Beijing Yunchen Xintong Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yunchen Xintong Technology Co ltd filed Critical Beijing Yunchen Xintong Technology Co ltd
Priority to CN202210046616.7A priority Critical patent/CN114581813B/en
Publication of CN114581813A publication Critical patent/CN114581813A/en
Application granted granted Critical
Publication of CN114581813B publication Critical patent/CN114581813B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The application provides a visual language identification method and related equipment. The method comprises the following steps: performing lip feature extraction on a target video to obtain a viseme sequence containing lip features; removing redundancy from the viseme sequence to obtain a candidate viseme sequence; mapping the candidate viseme sequence to a phoneme sequence to obtain a candidate phoneme sequence; and mapping the candidate phoneme sequence to a candidate text to obtain a visual language identification result. In this visual language identification method, the viseme sequence obtained after extracting the lip features of the target video is mapped to a phoneme sequence, and the phoneme sequence is then mapped to candidate text to obtain the visual language identification result. This avoids the difficulty of modeling large-scale Chinese character vocabularies in visual language identification research and the difficulty of advancing research for lack of large-scale data sets.

Description

Visual language identification method and related equipment
Technical Field
The application relates to the technical field of industrial internet intelligent monitoring, in particular to a visual language identification method and related equipment.
Background
With the rapid development of artificial intelligence, visual language identification has become a popular research direction. Early visual language identification used template matching: a correspondence between words and feature vector sequences was first established, the corresponding feature vector sequence was then extracted from the input lip image sequence, a similarity distance measure was computed between that sequence and each word template in the lexicon, and the word corresponding to the most similar feature vector sequence was taken as the final output. Early end-to-end visual language identification research has poor interpretability: only the inputs and outputs are given, the model learns the intermediate features and parameters on its own, researchers can hardly know what the model has actually learned, and when the model does not perform well enough it is difficult to analyse which link is at fault. Deep learning techniques have since been introduced into visual language recognition research; however, an end-to-end deep learning visual language recognition model depends strongly on its data set and must learn the spatio-temporal characteristics of lip movements through many iterations of training. This requires the data set to be large enough, the lip forms it covers to be rich enough, and the lip corpus to cover tens of thousands of words, so that the model can fully learn the lip characteristics of human speech. Such large-scale visual language recognition data sets are exactly the resource that current research lacks, so end-to-end visual language recognition research is limited.
Disclosure of Invention
In view of the above, an object of the present application is to provide a visual language identification method and a related device.
In view of the above, in a first aspect, the present application provides a visual language identification method, comprising:
performing lip feature extraction on the target video to obtain a viseme sequence containing lip features;
removing redundancy of the viseme sequence to obtain a candidate viseme sequence;
mapping the candidate viseme sequence to a phoneme sequence to obtain a candidate phoneme sequence;
and mapping the candidate phoneme sequence to a candidate text to obtain a visual language identification result.
In a second aspect, the present application provides a visual language recognition apparatus comprising:
a feature extraction module configured to perform lip feature extraction on a target video to obtain a viseme sequence containing lip features;
a screening module configured to remove redundancy from the viseme sequence to obtain a candidate viseme sequence;
a viseme-to-phoneme mapping module configured to map the candidate viseme sequence to a phoneme sequence to obtain a candidate phoneme sequence;
and a phoneme-to-text mapping module configured to map the candidate phoneme sequence to a candidate text to obtain a visual language identification result.
In a third aspect, the present application also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the visual language identification method as described in any one of the above when executing the program.
In a fourth aspect, the present application further provides a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the visual language identification method of any one of the above.
From the foregoing, in the visual language identification process, lip feature extraction is performed on a target video to obtain a viseme sequence containing lip features; the viseme is the basic unit for identifying lip language. The obtained viseme sequence contains redundant parts, and removing this redundancy yields a candidate viseme sequence. The candidate viseme sequence is mapped to a phoneme sequence to obtain a candidate phoneme sequence; the phoneme is the basic unit of speech. The candidate phoneme sequence is then mapped to a candidate text to obtain the visual language identification result. By working on visemes and phonemes, this visual language identification method avoids the difficulty of modeling large-scale Chinese character vocabularies in visual language identification research and the difficulty of advancing research for lack of large-scale data sets.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is an exemplary flowchart of a visual language identification method according to an embodiment of the present application.
Fig. 2 is a schematic view of a lip positioning scenario provided in an embodiment of the present application.
Fig. 3 is a viseme-to-phoneme mapping table provided in an embodiment of the present application.
Fig. 4 is an exemplary flowchart of viseme-to-phoneme mapping provided in an embodiment of the present application.
Fig. 5 is an exemplary flowchart of phoneme-to-text mapping provided in an embodiment of the present application.
Fig. 6 is a schematic diagram of a visual language identification device according to an embodiment of the present application.
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described in detail below with reference to the accompanying drawings in combination with specific embodiments.
It should be noted that, unless otherwise defined, technical or scientific terms used in the embodiments of the present application have the ordinary meaning understood by those skilled in the art to which the present application belongs. The use of "first," "second," and similar terms in the embodiments of the present application does not denote any order, quantity, or importance, but merely distinguishes one element from another. The word "comprising" or "comprises" and the like means that the element or item preceding the word covers the elements or items listed after the word and their equivalents, without excluding other elements or items. Terms such as "connected" or "coupled" are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "Upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described changes, the relative positional relationships may also change accordingly.
As described in the Background section, deep learning methods, which have flourished in computer vision and related fields, have broad application prospects in visual language recognition. Deep learning learns the intrinsic laws and representation levels of sample data, and the information obtained during learning is of great help in interpreting data such as text, images and sound. Its ultimate goal is to give machines a human-like ability to analyse and learn, able to recognize data such as text, images and sound. However, an end-to-end deep learning visual language recognition model depends strongly on its data set and must learn the spatio-temporal characteristics of lip movements through many iterations of training, which requires the data set to be large, the lip forms it covers to be rich, and the lip corpus to cover tens of thousands of words, so that the model can fully learn the lip characteristics of human speech.
In view of the above, the visual language identification method provided by the present application obtains a viseme sequence containing lip features by performing lip feature extraction on a target video, where the viseme is the basic unit for identifying lip language. The obtained viseme sequence contains redundant parts, and removing this redundancy yields a candidate viseme sequence. The candidate viseme sequence is mapped to a phoneme sequence to obtain a candidate phoneme sequence, where the phoneme is the basic unit of speech. The candidate phoneme sequence is then mapped to a candidate text to obtain the visual language identification result. In this visual language identification method, the viseme sequence obtained after extracting the lip features of the target video is mapped to a phoneme sequence, and the phoneme sequence is mapped to candidate text, yielding the identification result. This avoids the difficulty of modeling large-scale Chinese character vocabularies in visual language identification research and the difficulty of advancing research for lack of large-scale data sets.
The following describes the visual language identification method provided in the embodiments of the present application with specific examples.
Referring to fig. 1, an exemplary flowchart of a visual language identification method provided in an embodiment of the present application is schematically illustrated.
Step S101, lip feature extraction is carried out on the target video, and a viseme sequence containing the lip features is obtained.
In a specific implementation, a video containing a user's face is randomly selected as the target video, and lip feature positioning is performed on it to obtain primary lip key features. As shown in fig. 2, a schematic diagram of a lip positioning scene provided in an embodiment of the present application, 20 primary lip key feature points and their horizontal and vertical coordinates can be obtained from the positioning. The primary lip key feature points are: a left lip corner endpoint 1, a left outer lip corner inner point 2, a first left outer lip upper point 3, a first left outer lip lower point 4, a second left outer lip upper point 5, a second left outer lip lower point 6, an upper outer lip middle point 7, a lower outer lip middle point 8, a right lip corner endpoint 9, a right outer lip corner inner point 10, a first right outer lip upper point 11, a first right outer lip lower point 12, a second right outer lip upper point 13, a second right outer lip lower point 14, a left inner lip upper point 15, a left inner lip lower point 16, a right inner lip upper point 17, a right inner lip lower point 18, an inner lip upper middle point 19, and an inner lip lower middle point 20.
From the above 20 primary lip key feature points and their horizontal and vertical coordinates, an advanced lip state feature vector is calculated; the advanced lip state features are combined with the primary lip key features, specifically:
the upper outer lip height is the distance from the upper outer lip middle point 7 to the line connecting the left and right lip corner endpoints; the lower outer lip height is the distance from the lower outer lip middle point to that line; the upper inner lip height is the distance from the inner lip upper middle point to that line; the lower inner lip height is the distance from the inner lip lower middle point to that line; the outer lip width is the distance between the left and right lip corner endpoints; the inner lip width is the distance between the left and right outer lip corner inner points; the outer lip height is the upper outer lip height plus the lower outer lip height; the inner lip height is the upper inner lip height plus the lower inner lip height; the outer lip roundness is the ratio of the outer lip height to the outer lip width; and the inner lip roundness is the ratio of the inner lip height to the inner lip width.
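The geometry above can be written down directly. The following is a minimal sketch, assuming the 20 key points of Fig. 2 are available as (x, y) coordinates, and assuming that the 1 × 50 feature vector mentioned below is formed by concatenating the 40 raw coordinates with the 10 derived values; all function and variable names are illustrative only.

```python
import numpy as np

def point_line_distance(p, a, b):
    """Perpendicular distance from point p to the line through a and b (2-D points)."""
    cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    return abs(cross) / (np.hypot(b[0] - a[0], b[1] - a[1]) + 1e-8)

def lip_state_feature_vector(pts):
    """pts: (20, 2) array holding lip key points 1-20 of Fig. 2 (0-indexed here)."""
    left_corner, right_corner = pts[0], pts[8]            # points 1 and 9
    left_inner, right_inner = pts[1], pts[9]              # points 2 and 10
    upper_outer_mid, lower_outer_mid = pts[6], pts[7]     # points 7 and 8
    upper_inner_mid, lower_inner_mid = pts[18], pts[19]   # points 19 and 20

    upper_outer_h = point_line_distance(upper_outer_mid, left_corner, right_corner)
    lower_outer_h = point_line_distance(lower_outer_mid, left_corner, right_corner)
    upper_inner_h = point_line_distance(upper_inner_mid, left_corner, right_corner)
    lower_inner_h = point_line_distance(lower_inner_mid, left_corner, right_corner)
    outer_w = np.linalg.norm(right_corner - left_corner)
    inner_w = np.linalg.norm(right_inner - left_inner)
    outer_h = upper_outer_h + lower_outer_h
    inner_h = upper_inner_h + lower_inner_h

    derived = np.array([upper_outer_h, lower_outer_h, upper_inner_h, lower_inner_h,
                        outer_w, inner_w, outer_h, inner_h,
                        outer_h / (outer_w + 1e-8),    # outer lip roundness
                        inner_h / (inner_w + 1e-8)])   # inner lip roundness
    # 40 raw coordinates + 10 derived values -> 50-dimensional vector (an assumption
    # consistent with the 1 x 50 input size described below)
    return np.concatenate([pts.reshape(-1), derived])
```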
The advanced lip state feature vector is input into a pre-constructed visual language recognition model for labeling, yielding the viseme sequence containing lip features.
In a specific implementation, a Long Short-Term Memory neural network (LSTM) is used as the basis of the visual language recognition model. The network structure is a Bidirectional Long Short-Term Memory neural network (Bi-LSTM) followed by a fully connected layer, and the number of hidden-layer nodes of the bidirectional LSTM is 128.
In the embodiment of the application, the advanced lip state feature vector of size 1 × 50 extracted from the target video at each moment is used as the input of the visual language identification model, and the output is the corresponding viseme sequence containing lip features. Because the output viseme sequence follows the frames of the target video, the viseme elements in the sequence may contain redundant parts.
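A minimal sketch of such a labeling network is given below, written in PyTorch for illustration; the single Bi-LSTM layer, the size of the viseme inventory and all names are assumptions, with only the 1 × 50 input size and the 128 hidden nodes taken from the description above.

```python
import torch
import torch.nn as nn

class VisemeLabeler(nn.Module):
    """Bi-LSTM (128 hidden nodes per direction) followed by a fully connected layer."""
    def __init__(self, feature_dim=50, hidden_dim=128, num_visemes=30):
        super().__init__()
        self.bilstm = nn.LSTM(feature_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, num_visemes)

    def forward(self, x):
        # x: (batch, frames, 50) lip state feature vectors, one per video frame
        h, _ = self.bilstm(x)          # (batch, frames, 256)
        return self.fc(h)              # per-frame viseme logits

# usage sketch: one 75-frame clip of 50-dimensional features
logits = VisemeLabeler()(torch.randn(1, 75, 50))
viseme_sequence = logits.argmax(dim=-1)   # frame-level viseme labels
```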
Step S102, removing redundancy from the viseme sequence to obtain a candidate viseme sequence.
In a specific implementation, the viseme sequence is refined according to Chinese pinyin combination rules to obtain a simplified candidate viseme sequence.
In the embodiment of the application, for non-single-vowel viseme elements that appear repeatedly in succession in the viseme sequence, one non-single-vowel viseme element is retained to obtain a first candidate viseme sequence;
for single-vowel viseme elements that appear repeatedly in succession in the first candidate viseme sequence, in response to the number of occurrences being higher than a preset threshold, two single-vowel viseme elements are retained to obtain a second candidate viseme sequence;
and for single-vowel viseme elements that appear repeatedly in succession in the second candidate viseme sequence, in response to the number of occurrences being lower than the preset threshold, one single-vowel viseme element is retained to obtain the candidate viseme sequence. The candidate viseme sequences are divided into several categories according to the differences among their viseme elements.
It should be noted that the redundancy removing step simplifies and refines the viseme elements, the basic units for identifying lip language, which makes the visual language identification method provided by the application more accurate and comprehensive.
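A minimal sketch of these collapsing rules follows; the single-pass run-length formulation, the set of single-vowel labels and the threshold value are assumptions for illustration, not the patent's exact procedure.

```python
from itertools import groupby

SINGLE_VOWELS = {"a", "o", "e", "i", "u", "v"}   # assumed single-vowel viseme labels

def remove_redundancy(visemes, threshold=4):
    """Collapse consecutive repeats: non-single-vowel runs keep one element;
    single-vowel runs keep two elements when the run length exceeds `threshold`
    (a placeholder for the preset threshold), otherwise one."""
    candidate = []
    for elem, run in groupby(visemes):
        count = sum(1 for _ in run)
        if elem in SINGLE_VOWELS and count > threshold:
            candidate.extend([elem, elem])    # long vowel run, e.g. a drawn-out "a"
        else:
            candidate.append(elem)
    return candidate

# e.g. remove_redundancy(["b", "b", "a", "a", "a", "a", "a", "o"]) -> ["b", "a", "a", "o"]
```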
Step S103, mapping the candidate viseme sequence to the phoneme sequence to obtain a candidate phoneme sequence.
In a specific implementation, fig. 4 shows an exemplary flowchart of the viseme-to-phoneme mapping provided in an embodiment of the present application.
Step S401, for the candidate viseme sequences, mapping each viseme element in the candidate viseme sequences to the corresponding phoneme element according to the viseme-phoneme mapping table to obtain a plurality of phoneme sequences.
Step S402, traversing the plurality of phoneme sequences with substrings of length 1 to 4, and judging whether each phoneme sequence is reasonable according to Chinese pinyin syllable combination rules.
Step S4021, retaining reasonable phoneme sequences that conform to the Chinese pinyin rules;
Step S4022, discarding unreasonable phoneme sequences;
Step S403, obtaining the candidate phoneme sequences.
Fig. 3 shows a viseme-to-phoneme mapping table provided in an embodiment of the present application.
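The expansion and filtering in steps S401 and S402 can be sketched as follows; the mapping table here is a tiny illustrative stand-in for the table of Fig. 3, and the segmentation-based reasonableness check and all names are assumptions.

```python
from itertools import product

# Tiny illustrative stand-in for the viseme-to-phoneme mapping table of Fig. 3
VISEME_TO_PHONEMES = {
    "B": ["b", "p", "m"],
    "D": ["d", "t", "n", "l"],
    "A": ["a"],
    "I": ["i"],
}

def is_reasonable(phonemes, syllable_table):
    """True if the phoneme list can be split into pinyin syllables made of
    1 to 4 consecutive phoneme elements (as in step S402)."""
    if not phonemes:
        return True
    return any("".join(phonemes[:k]) in syllable_table
               and is_reasonable(phonemes[k:], syllable_table)
               for k in range(1, 5))

def map_visemes_to_phonemes(candidate_visemes, syllable_table):
    """Expand every viseme into its candidate phonemes (step S401) and keep
    only the combinations that pass the pinyin check (steps S4021/S4022)."""
    candidates = []
    for combo in product(*(VISEME_TO_PHONEMES[v] for v in candidate_visemes)):
        if is_reasonable(list(combo), syllable_table):
            candidates.append(list(combo))
    return candidates
```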
And step S104, mapping the candidate phoneme sequence to a candidate text to obtain a visual language identification result.
Fig. 5 is an exemplary flowchart of phoneme-to-text mapping provided in an embodiment of the present application. In a specific implementation:
Step S501, segmenting the candidate phoneme sequence according to Chinese pinyin combination rules to obtain different syllable combinations;
Step S502, comparing all syllable combinations with the syllables in the Chinese pinyin syllable table;
Step S5021, retaining reasonable segmentation results;
Step S5022, discarding unreasonable segmentation results;
Step S503, for the reasonable segmentation results, comparing each subsequence (i.e., syllable) with the pinyin of Chinese characters in the corpus to obtain the corresponding candidate text sequences.
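A minimal sketch of steps S501–S503 follows; the syllable table, the pinyin-to-character dictionary and the function names are assumptions used only to illustrate the flow.

```python
from itertools import product

def segment_into_syllables(phonemes, syllable_table):
    """Return every way of splitting the phoneme sequence into pinyin syllables
    (step S501); splits containing a non-syllable are discarded (S5021/S5022)."""
    if not phonemes:
        return [[]]
    splits = []
    for k in range(1, 5):                        # a syllable spans 1-4 phoneme elements
        head = "".join(phonemes[:k])
        if head in syllable_table:
            for rest in segment_into_syllables(phonemes[k:], syllable_table):
                splits.append([head] + rest)
    return splits

def syllables_to_texts(syllables, pinyin_to_chars):
    """Map each syllable to the Chinese characters sharing that pinyin in the
    corpus (step S503); pinyin_to_chars is an assumed dict such as {"ni": ["你", "泥"]}."""
    choices = [pinyin_to_chars.get(s, []) for s in syllables]
    return ["".join(chars) for chars in product(*choices)]
```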
In the embodiment of the present application, the reasonableness of the candidate text sequence also needs to be calculated; specifically, it can be calculated with an idea similar to an N-Gram language model:
As an optional embodiment, taking the Bigram model as an example, the reasonableness probability that any i-th text element in the candidate text sequence appears after the (i-1)-th text element is recorded as P(w_i | w_{i-1}), and the reasonableness probabilities of all text elements in the candidate text sequence are multiplied to obtain the text sequence reasonableness:
P(W) = P(w_1) · ∏_{i=2}^{l} P(w_i | w_{i-1})
wherein P(w_1) is the probability of the first text element, P(w_i | w_{i-1}) is the reasonableness probability of the i-th text element appearing after the (i-1)-th text element, and l is the length of the candidate text sequence.
As an optional embodiment, taking the Trigram model as an example and considering the probability that the i-th text element follows the (i-1)-th and (i-2)-th text elements, the text sequence reasonableness is:
P(W) = P(w_1) · P(w_2 | w_1) · ∏_{i=3}^{l} P(w_i | w_{i-1}, w_{i-2})
wherein P(w_1) is the probability of the first text element and P(w_i | w_{i-1}, w_{i-2}) is the reasonableness probability of the i-th text element appearing after the (i-1)-th and (i-2)-th text elements.
It should be noted that the reasonableness probabilities P(w_i | w_{i-1}) or P(w_i | w_{i-1}, w_{i-2}) between text elements in the candidate text sequence can be derived statistically from the corpus. When i = 1, P(w_1) indicates the probability that the first text element appears at the start of a sentence in the corpus, and this value may be stored in the corpus for subsequent reference. If a text element not registered in the corpus is encountered, a small reasonableness probability is assigned directly.
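The Bigram variant can be sketched as follows; the probability tables are assumed to have been counted from the corpus beforehand, and the floor value for unregistered text elements is a placeholder.

```python
import math

def bigram_reasonableness(text, start_prob, bigram_prob, floor=1e-6):
    """Product of reasonableness probabilities for a candidate text sequence.
    start_prob:  {element: P(element starts a sentence)}  -- counted from the corpus
    bigram_prob: {(prev, cur): P(cur | prev)}             -- counted from the corpus
    Unregistered elements fall back to the small `floor` probability."""
    score = start_prob.get(text[0], floor)
    for prev, cur in zip(text, text[1:]):
        score *= bigram_prob.get((prev, cur), floor)
    return score

# For long sequences, summing log-probabilities avoids numerical underflow:
def bigram_log_reasonableness(text, start_prob, bigram_prob, floor=1e-6):
    return math.log(start_prob.get(text[0], floor)) + sum(
        math.log(bigram_prob.get((prev, cur), floor))
        for prev, cur in zip(text, text[1:]))
```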
In response to the candidate text sequence reaching the required language reasonableness, the visual language identification result is obtained.
The visual language identification method of the present application takes the basic units of lip language, namely the viseme and the phoneme, into account, and identifies the final text sequence by recognizing these intermediate units, which gives the method higher robustness.
It should be noted that the method of the embodiment of the present application may be executed by a single device, such as a computer or a server. The method of the embodiment can also be applied to a distributed scene and is completed by the mutual cooperation of a plurality of devices. In this distributed scenario, one device of the multiple devices may only perform one or more steps of the method of the embodiment of the present application, and the multiple devices interact with each other to complete the method.
It should be noted that the above describes some embodiments of the present application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Based on the same inventive concept, corresponding to the method of any embodiment, the application also provides a visual language recognition device.
Referring to fig. 6, the visual language recognition apparatus includes a feature extraction module 601, a screening module 602, a viseme-to-phoneme mapping module 603 and a phoneme-to-text mapping module 604, wherein:
the feature extraction module 601 is configured to perform lip feature extraction on a target video to obtain a viseme sequence containing lip features;
the screening module 602 is configured to remove redundancy from the viseme sequence to obtain a candidate viseme sequence;
the viseme-to-phoneme mapping module 603 is configured to map the candidate viseme sequence to a phoneme sequence, resulting in a candidate phoneme sequence;
the phoneme-to-text mapping module 604 is configured to map the candidate phoneme sequence to a candidate text, resulting in a visual language identification result.
For convenience of description, the above devices are described as being divided into various modules by functions, and are described separately. Of course, the functionality of the various modules may be implemented in the same one or more pieces of software and/or hardware in the practice of the present application.
The apparatus of the foregoing embodiment is used to implement the corresponding visual language identification method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same inventive concept, corresponding to any of the above-mentioned embodiments, the present application further provides an electronic device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the computer program to implement the visual language identification method according to any of the above-mentioned embodiments. Fig. 7 is a schematic diagram illustrating a more specific hardware structure of an electronic device according to this embodiment, where the electronic device may include: processor 710, memory 720, input/output interface 730, communication interface 740, and bus 750. Wherein processor 710, memory 720, input/output interface 730, and communication interface 740 are communicatively coupled to each other within the device via bus 750.
The processor 710 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present specification.
The Memory 720 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 720 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 720 and called by the processor 710 for execution.
The input/output interface 730 is used for connecting an input/output module to realize information input and output. The input/output module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The communication interface 740 is used for connecting a communication module (not shown in the figure) to implement communication interaction between the present device and other devices. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, bluetooth and the like).
Bus 750 includes a path that transfers information between various components of the device, such as processor 710, memory 720, input/output interface 730, and communication interface 740.
It should be noted that although the above-described device only shows the processor 710, the memory 720, the input/output interface 730, the communication interface 740 and the bus 750, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.
The electronic device of the above embodiment is used to implement the corresponding visual language identification method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same inventive concept, corresponding to any of the above-described embodiment methods, the present application also provides a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the visual language identification method according to any of the above-described embodiments.
Computer-readable media of the present embodiments, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
The computer instructions stored in the storage medium of the above embodiment are used to enable the computer to execute the visual language identification method according to any of the above embodiments, and have the beneficial effects of the corresponding method embodiments, which are not described herein again.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples; within the context of the present application, features from the above embodiments or from different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of the different aspects of the embodiments of the present application as described above, which are not provided in detail for the sake of brevity.
In addition, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown in the provided figures for simplicity of illustration and discussion, and so as not to obscure the embodiments of the application. Furthermore, devices may be shown in block diagram form in order to avoid obscuring embodiments of the application, and this also takes into account the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the embodiments of the application are to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the application, it should be apparent to one skilled in the art that the embodiments of the application can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the present application has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures, such as Dynamic RAM (DRAM), may use the discussed embodiments.
The present embodiments are intended to embrace all such alternatives, modifications and variances which fall within the broad scope of the appended claims. Therefore, any omissions, modifications, equivalents, improvements, and the like that may be made without departing from the spirit or scope of the embodiments of the present application are intended to be included within the scope of the claims.

Claims (8)

1. A visual language identification method, comprising:
performing lip feature extraction on the target video to obtain a viseme sequence containing lip features;
removing redundancy of the viseme sequence to obtain a candidate viseme sequence;
mapping the candidate viseme sequence to a phoneme sequence to obtain a candidate phoneme sequence;
segmenting the candidate phoneme sequence according to a Chinese pinyin combination rule to obtain a plurality of groups of syllable combinations;
in response to the groups of syllable combinations conforming to a Chinese pinyin syllable table, obtaining a candidate syllable sequence;
comparing the candidate syllable sequence with the pinyin of Chinese characters in a corpus to obtain a candidate text sequence; calculating the reasonableness of the candidate text sequence;
and in response to the candidate text sequence reaching the language reasonableness, obtaining a visual language recognition result.
2. The visual language identification method of claim 1, wherein the lip feature extraction of the target video to obtain the viseme sequence containing the lip feature comprises:
performing lip feature positioning on a target video to obtain primary lip key features;
calculating advanced lip state features according to the primary lip key features;
and organizing the advanced lip state features into an advanced lip state feature vector, and inputting the advanced lip state feature vector into a pre-constructed visual language recognition model for labeling, to obtain the viseme sequence containing the lip features.
3. The visual language identification method of claim 1, wherein the removing redundancy from the viseme sequence containing lip features to obtain a candidate viseme sequence comprises:
for a plurality of non-single-vowel viseme elements appearing repeatedly in succession in the viseme sequence, retaining one non-single-vowel viseme element to obtain a first candidate viseme sequence;
for a plurality of single-vowel viseme elements appearing repeatedly in succession in the first candidate viseme sequence, in response to the number of occurrences of the single-vowel viseme elements being higher than a preset threshold, retaining two single-vowel viseme elements to obtain a second candidate viseme sequence;
and for a plurality of single-vowel viseme elements appearing repeatedly in succession in the second candidate viseme sequence, in response to the number of occurrences of the single-vowel viseme elements being lower than the preset threshold, retaining one single-vowel viseme element to obtain the candidate viseme sequence.
4. The visual language identification method of claim 1, wherein mapping the viseme sequences to phoneme sequences to obtain candidate phoneme sequences comprises:
mapping the viseme sequence to a phoneme sequence according to a viseme-phoneme mapping table;
and responding to the phoneme sequence to accord with the Chinese pinyin combination rule to obtain the candidate phoneme sequence.
5. The visual language identification method of claim 1, wherein calculating the reasonableness of the candidate text sequence comprises:
the second arbitrary text in the candidate text sequence
Figure 369870DEST_PATH_IMAGE001
The text element appearing at
Figure 38748DEST_PATH_IMAGE002
The probability of the degree of reasonableness after each text element is recorded
Figure 32112DEST_PATH_IMAGE003
Multiplying the reasonable degree probabilities of all the text elements in the candidate text sequence to obtain the text sequence reasonable degree:
Figure 724125DEST_PATH_IMAGE004
wherein, the first and the second end of the pipe are connected with each other,
Figure 461137DEST_PATH_IMAGE005
for the (i) th text element(s),
Figure 984522DEST_PATH_IMAGE006
is as follows
Figure 414366DEST_PATH_IMAGE007
The text element appearing at
Figure 655992DEST_PATH_IMAGE008
The likelihood of reasonableness behind individual text elements.
6. A visual language recognition apparatus, comprising:
a feature extraction module configured to perform lip feature extraction on a target video to obtain a viseme sequence containing lip features;
a screening module configured to remove redundancy from the viseme sequence to obtain a candidate viseme sequence;
a viseme-to-phoneme mapping module configured to map the candidate viseme sequence to a phoneme sequence to obtain a candidate phoneme sequence;
and a phoneme-to-text mapping module configured to segment the candidate phoneme sequence according to Chinese pinyin combination rules to obtain a plurality of groups of syllable combinations;
obtain a candidate syllable sequence in response to the groups of syllable combinations conforming to a Chinese pinyin syllable table;
compare the candidate syllable sequence with the pinyin of Chinese characters in a corpus to obtain a candidate text sequence;
calculate the reasonableness of the candidate text sequence;
and obtain a visual language identification result in response to the candidate text sequence reaching the language reasonableness.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 5 when executing the program.
8. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 5.
CN202210046616.7A 2022-01-12 2022-01-12 Visual language identification method and related equipment Active CN114581813B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210046616.7A CN114581813B (en) 2022-01-12 2022-01-12 Visual language identification method and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210046616.7A CN114581813B (en) 2022-01-12 2022-01-12 Visual language identification method and related equipment

Publications (2)

Publication Number Publication Date
CN114581813A CN114581813A (en) 2022-06-03
CN114581813B true CN114581813B (en) 2023-04-07

Family

ID=81769086

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210046616.7A Active CN114581813B (en) 2022-01-12 2022-01-12 Visual language identification method and related equipment

Country Status (1)

Country Link
CN (1) CN114581813B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1379348A (en) * 2002-05-17 2002-11-13 清华大学 Method and system for computer conversion between Chinese audio and video parameters

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100476877C (en) * 2006-11-10 2009-04-08 中国科学院计算技术研究所 Generating method of cartoon face driven by voice and text together
US20100332229A1 (en) * 2009-06-30 2010-12-30 Sony Corporation Apparatus control based on visual lip share recognition
CN105390133A (en) * 2015-10-09 2016-03-09 西北师范大学 Tibetan TTVS system realization method
US10460732B2 (en) * 2016-03-31 2019-10-29 Tata Consultancy Services Limited System and method to insert visual subtitles in videos
CN109830236A (en) * 2019-03-27 2019-05-31 广东工业大学 A kind of double vision position shape of the mouth as one speaks synthetic method
CN113851131A (en) * 2021-08-17 2021-12-28 西安电子科技大学广州研究院 Cross-modal lip language identification method

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1379348A (en) * 2002-05-17 2002-11-13 清华大学 Method and system for computer conversion between Chinese audio and video parameters

Also Published As

Publication number Publication date
CN114581813A (en) 2022-06-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231120

Address after: 119-98, 1st Floor, Block B, Building 1, No. 9 Dazhongsi East Road, Haidian District, Beijing, 100098

Patentee after: Beijing Yunchen Shuzhi Technology Co.,Ltd.

Address before: 119-152, floor 1, block B, building 1, No. 9, Dazhongsi East Road, Haidian District, Beijing 100098

Patentee before: Beijing Yunchen Xintong Technology Co.,Ltd.
