CN112669851A - Voice recognition method and device, electronic equipment and readable storage medium - Google Patents

Voice recognition method and device, electronic equipment and readable storage medium

Info

Publication number: CN112669851A
Authority: CN (China)
Prior art keywords: text, instruction, voice, language model, word segmentation
Application number: CN202110283891.6A
Other languages: Chinese (zh)
Other versions: CN112669851B (granted)
Inventors: 胡广宇, 邓菁, 吴富章
Current and original assignee: Beijing Yuanjian Information Technology Co Ltd
Filing: application CN202110283891.6A filed by Beijing Yuanjian Information Technology Co Ltd
Publications: CN112669851A (application), CN112669851B (grant)
Legal status: Active (granted)

Abstract

The application provides a voice recognition method and device, an electronic device and a readable storage medium. The method inputs an acquired voice signal to be recognized into a pre-trained voice recognition model to obtain a recognition text matched with the voice signal. The voice recognition model comprises an acoustic model and a language model; the language model is generated by interpolating a basic language model with a specialized language model. The specialized language model is trained on a word-segmented text obtained by segmenting the normalized instruction texts, on an extended text corresponding to the word-segmented text, and on an extended pronunciation dictionary corresponding to the word-segmented text, where the extended pronunciation dictionary is obtained by extending an initial pronunciation dictionary. Finally, the voice instruction corresponding to the recognition text is determined. The method improves recognition accuracy for mixed Chinese and English letters, for specialized-domain vocabulary, and for Chinese mixed with a small number of English words, and effectively improves matching accuracy when an intelligent system, intelligent device or inspection robot interfaces with a power system.

Description

Voice recognition method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of intelligent speech technologies, and in particular to a speech recognition method and apparatus, an electronic device, and a readable storage medium.
Background
The normal operation of a transformer substation directly affects the stability and safety of the whole power system. With the development of science and technology, more and more intelligent systems, intelligent devices and inspection robots are used in substation management to improve the safety and accuracy of inspection, and most of these intelligent devices are controlled by voice instructions.
The voice instructions applied in a transformer substation mainly comprise general instructions, system control instructions and equipment control instructions. General instructions and system control instructions can be recognized with existing Mandarin recognition methods, but equipment control instructions are complex, with rich pronunciation types and writing formats, so existing voice recognition methods recognize them poorly. This leads to low matching accuracy when an intelligent system, intelligent device or inspection robot interfaces with the power system.
Disclosure of Invention
In view of the above, an object of the present application is to provide a speech recognition method and apparatus, an electronic device and a readable storage medium that improve recognition accuracy for mixed Chinese and English letters, specialized-domain vocabulary, and Chinese mixed with a small number of English words, raising speech recognition accuracy in the specialized domain while maintaining it in the general domain.
In a first aspect, the present application provides a speech recognition method, including:
acquiring a voice signal to be recognized;
inputting the acquired voice signal to be recognized into a pre-trained voice recognition model to obtain a recognition text matched with the voice signal to be recognized; the voice recognition model comprises an acoustic model and a language model, the language model is generated by interpolating a basic language model with a specialized language model, the specialized language model is trained on a word-segmented text obtained by segmenting the normalized instruction text, an extended text corresponding to the word-segmented text, and an extended pronunciation dictionary corresponding to the word-segmented text, and the extended pronunciation dictionary is obtained by extending an initial pronunciation dictionary;
and determining a voice instruction corresponding to the recognition text.
Preferably, the acquiring the voice signal to be recognized includes:
collecting voice signals;
and carrying out voice endpoint detection and noise detection on the collected voice signals to obtain the voice signals to be recognized.
Preferably, the instruction text is normalized by the following steps:
acquiring part-of-speech rules and business rules corresponding to text normalization;
normalizing the instruction text based on the part-of-speech rules and the business rules respectively, to obtain a part-of-speech normalization result and a business normalization result;
performing cross validation between the part-of-speech normalization result and the business normalization result;
and determining the normalized instruction text based on the cross-validation result of the part-of-speech normalization result and the business normalization result.
Preferably, the word-segmented text is obtained by segmenting the normalized instruction text through the following steps:
performing word segmentation on the normalized instruction text to obtain a Chinese word segmentation result and an English word segmentation result, wherein the English word segmentation result comprises at least one combination of different English letters;
counting the number of times each combination of different English letters appears in all the normalized instruction texts;
updating the English word segmentation result according to a comparison of that number against a set threshold;
and determining the word-segmented text based on the Chinese word segmentation result and the updated English word segmentation result.
Preferably, the extended text corresponding to the word-segmented text is determined by the following steps:
acquiring all Chinese characters adjacent to English letters in the normalized instruction text to obtain an extended Chinese character set;
and randomly arranging and combining the English letters with the Chinese characters in the extended Chinese character set to generate corresponding two-, three- and four-element tuples, so as to obtain the extended text corresponding to the word-segmented text.
Preferably, the extended pronunciation dictionary corresponding to the word-segmented text is determined by the following steps:
labeling English letters or English words with Chinese phonemes to obtain at least one pronunciation for each English letter or English word;
and adding the at least one pronunciation to an initial pronunciation dictionary to obtain the extended pronunciation dictionary corresponding to the word-segmented text.
Preferably, the language model corresponds to the formula:
$$P(W) = \lambda \, P(W_1) + (1 - \lambda) \, P(W_2)$$

wherein $P(W)$ represents the probability given by the interpolated language model, $P(W_1)$ the probability given by the basic language model, $P(W_2)$ the probability given by the specialized language model, and $\lambda$ the weighting coefficient of the basic language model.
Preferably, the determining the voice instruction corresponding to the recognition text includes:
performing an inverse normalization operation on the recognition text to obtain an original instruction corresponding to the recognition text; wherein the inverse normalization operation is performed based on the part-of-speech rules and the business rules;
and if the original instruction is detected in a preset instruction list, determining that the original instruction is the voice instruction.
In a second aspect, the present application further provides a speech recognition apparatus, comprising:
the signal acquisition module is used for acquiring a voice signal to be recognized;
the voice recognition module is used for inputting the acquired voice signal to be recognized into a pre-trained voice recognition model to obtain a recognition text matched with the voice signal to be recognized; the voice recognition model comprises an acoustic model and a language model, the language model is generated by interpolating a basic language model with a specialized language model, the specialized language model is trained on a word-segmented text obtained by segmenting the normalized instruction text, an extended text corresponding to the word-segmented text, and an extended pronunciation dictionary corresponding to the word-segmented text, and the extended pronunciation dictionary is obtained by extending an initial pronunciation dictionary;
and the instruction determining module is used for determining the voice instruction corresponding to the recognition text.
In a third aspect, the present application further provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the electronic device is operating, the machine-readable instructions when executed by the processor performing the steps of the speech recognition method as described above.
In a fourth aspect, the present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the speech recognition method as described above.
The application provides a voice recognition method and device, an electronic device and a readable storage medium. The voice recognition method comprises: acquiring a voice signal to be recognized, and inputting it into a pre-trained voice recognition model to obtain a recognition text matched with the voice signal; the voice recognition model comprises an acoustic model and a language model, the language model is generated by interpolating a basic language model with a specialized language model, the specialized language model is trained on a word-segmented text obtained by segmenting the normalized instruction text, an extended text corresponding to the word-segmented text, and an extended pronunciation dictionary corresponding to the word-segmented text, the extended pronunciation dictionary being obtained by extending an initial pronunciation dictionary; and finally determining the voice instruction corresponding to the recognition text. In this way, recognition accuracy for mixed Chinese and English letters, specialized-domain vocabulary, and Chinese mixed with a small number of English words can be improved, and speech recognition accuracy in the general domain is maintained while that in the specialized domain is improved, thereby improving matching accuracy when an intelligent system, intelligent device or inspection robot interfaces with a power system.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be considered as limiting its scope; those skilled in the art can derive other related drawings from these drawings without inventive effort.
Fig. 1 is a flowchart of a speech recognition method according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for training a specialized language model according to an embodiment of the present disclosure;
FIG. 3 is a flow chart of another speech recognition method provided by an embodiment of the present application;
fig. 4 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the drawings. Obviously, the described embodiments are only a part of the embodiments of the present application, not all of them. The components of the embodiments, as generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments is not intended to limit the scope of the claimed application but merely represents selected embodiments. Every other embodiment obtained by a person skilled in the art without creative effort based on these embodiments falls within the protection scope of the present application.
First, an application scenario to which the present application is applicable is described. The application can be applied to the intelligent voice field of transformer substations. The safety of the power system plays an irreplaceable role in the development of the national economy: an accident in any link can trigger a chain reaction and may cause large-area power failure, casualties, damage to main equipment, or even a catastrophic grid-wide collapse. As the key nodes connecting the backbone network with the power distribution network, transformer substations must operate normally for the whole power system to remain stable and safe. To ensure real-time monitoring of the operating states of primary equipment such as main transformers, buses and switches, substations need to be inspected and maintained. Traditional substation monitoring and inspection is mainly manual, which suffers from high labor intensity, low working efficiency, uneven detection quality and limited means, and manually collected data cannot be fed into a management information system accurately and in time. With the development of science and technology, more and more intelligent devices and inspection robots are applied to substation management, improving the safety and accuracy of inspection while raising working efficiency. Most of these intelligent devices require voice instructions for control.
The voice instructions applied in a transformer substation mainly comprise general instructions, system control instructions and equipment control instructions. General instructions are domain-independent instructions or questions and answers, such as 'confirm modification', 'close window', 'set whether to alarm', 'quit the lifting dialog box', 'log system is normal', and so on. System control instructions mainly relate to the system, pages and interfaces, such as opening the card-removal interface or calling up the main transformer page. Equipment control instructions target specific equipment models, such as '#2 main transformer upshift', '9004 I-section PT cabinet handcart disconnecting link allows remote control', '653 switch front upper cabinet door / 653-1W net door unset', 'CosPhi automatic summation', and so on. Equipment control instructions are complex and contain many professional vocabulary items, English letters and English words, which greatly challenges ordinary Mandarin recognition; they also contain many Roman numerals, Arabic numerals, punctuation marks and English letters, with rich pronunciation types and writing formats, which greatly challenges interfacing the voice transcription result with the power system.
At present, existing voice recognition mainly comprises cloud-service voice recognition and offline voice recognition. However, according to the network security differentiation requirements in the security protection regulations for secondary power systems, some functions cannot meet the power system's network security requirements once connected to an external network, so only an offline voice recognition scheme can be adopted in the substation field.
Voice instruction recognition in the substation field is basically similar to standard Mandarin recognition, but substation equipment control instructions contain many English letters and professional vocabulary items and a small number of English words, so the recognition method differs to some extent. Meanwhile, the system names, equipment names and the like of the power system have rich pronunciation types and writing formats, including many Roman numerals, Arabic numerals, punctuation marks and English letters, so the inverse normalization of the recognition result differs from that of general recognition results.
Existing voice instruction recognition engines for interfacing with the intelligent systems, intelligent devices and inspection robots used in substation management mainly fall into two types. The first type directly uses a general voice recognition engine, or a voice recognition engine whose acoustic model has been optimized for the substation field; it generally recognizes general instructions and system instructions well but performs poorly on equipment control instructions. The second type is command word recognition, which recognizes specific instructions and can hardly recognize words outside the instruction set; however, currently implemented command word recognition mainly targets general instructions and system instructions and cannot recognize equipment control instructions well. Post-processing of the recognition result usually applies no text inverse normalization, or only a general-purpose method, so equipment control instructions cannot be inverse-normalized. When interfacing with a power system, exact matching yields a low matching success rate while fuzzy matching increases the mismatch rate, so neither the recognition effect nor the matching effect is good.
Based on the above, the present application provides a voice recognition method and device, an electronic device and a readable storage medium that are applied to the substation field and can automatically update and fuse language models from the provided instruction texts, thereby improving recognition accuracy for mixed Chinese and English letters, substation-field professional vocabulary, and Chinese mixed with a small number of English words. At the same time, the inverse normalization of recognition results receives dedicated processing, which improves matching accuracy with the power system; speech recognition accuracy in the substation field is improved while that in the general domain is maintained.
Referring to fig. 1, fig. 1 is a flowchart of a speech recognition method according to an embodiment of the present disclosure. As shown in fig. 1, a speech recognition method provided in an embodiment of the present application includes:
and S110, acquiring a voice signal to be recognized.
Here, the voice signal to be recognized is obtained by preprocessing the acquired voice signal, where the preprocessing may include voice endpoint detection and/or noise detection.
S120, inputting the acquired voice signal to be recognized into a pre-trained voice recognition model to obtain a recognition text matched with the voice signal to be recognized; the voice recognition model comprises an acoustic model and a language model, the language model is generated by interpolating a basic language model with a specialized language model, and the specialized language model is trained on a word-segmented text obtained by segmenting the normalized instruction text, an extended text corresponding to the word-segmented text, and an extended pronunciation dictionary corresponding to the word-segmented text, where the extended pronunciation dictionary is obtained by extending an initial pronunciation dictionary.
In this step, the speech recognition model is used to convert the speech signal to be recognized into a corresponding recognized text. In a specific field, the voice control signal can be converted into corresponding execution control information through the voice recognition model, so that the intelligent device executes corresponding actions according to the execution control information.
In the embodiment of the present application, the voice recognition model comprises an acoustic model and a language model. The language model is an important component of the voice recognition model and is generated by interpolating the basic language model with the specialized language model; the generated language model is suitable for both the specialized domain and the general domain, so speech recognition accuracy in the specialized domain can be improved while that in the general domain is maintained.
The basic language model is a common language model such as an N-gram model or an RNNLM; the basic language model of the embodiment of the present application is a 4-gram model trained on general-domain Chinese text data. Specifically, by the chain rule, the probability of the word sequence $w_1, w_2, \ldots, w_n$ occurring is:

$$P(w_1 w_2 \cdots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 \cdots w_{i-1})$$

where $w_1, w_2, \ldots, w_n$ are the output words. Under the 4-gram assumption each conditional probability depends only on the preceding three words, so it is reduced to:

$$P(w_i \mid w_1 \cdots w_{i-1}) \approx P(w_i \mid w_{i-3} w_{i-2} w_{i-1})$$
specifically, the model structure of the dedicated language model is the same as that of the basic language model, but there is a difference in the way of training. When the basic language model is trained, because the weight of the data samples of the low-frequency tuple combination is low and a large number of data samples exist, the data samples of the low-frequency tuple combination need to be filtered out when the basic language model is trained, but the data samples of all the tuple combinations need to be reserved when the special language model is trained, and the data samples of the low-frequency tuple combination cannot be filtered out when the basic language model is trained because the data samples are small in amount for a specific field.
Further, the training data of the specialized language model in the embodiment of the present application comprise the word-segmented text obtained by segmenting the normalized instruction text, the extended text corresponding to the word-segmented text, and the extended pronunciation dictionary corresponding to the word-segmented text.
The instruction text is processed by a text normalization tool to obtain the normalized instruction text, which contains only Chinese and English. Word segmentation is then performed on the normalized instruction text to obtain a word-segmented text containing Chinese and English word segmentation results. The text can be expanded based on the word-segmented text, which speeds up text processing. Finally, the English word segmentation results in the word-segmented text are labeled with Chinese phonemes to obtain the extended pronunciation dictionary.
The extended pronunciation dictionary is obtained by extending an initial pronunciation dictionary, which records the correspondence between characters or words and phonemes. A general-purpose initial pronunciation dictionary contains the correspondence between pinyin and Chinese characters and between phonetic symbols and English words. After pronunciation expansion, the extended pronunciation dictionary additionally contains the correspondence between English words or English letters and pinyin.
It should be added that the speech recognition method can rapidly iterate the language model for a specific scenario and is easy to migrate to other specific fields, such as power plants, metallurgy, automobile manufacturing, and so on.
In this way, the specialized language model can be updated automatically. Compared with improving recognition accuracy in a specific field by optimizing the acoustic model, the embodiment of the present application avoids the situation in which, because labeled training data are scarce in practice, recognition accuracy ultimately improves little; meanwhile, the training period of the specialized language model is shorter, which shortens the model update cycle in the specific field.
It should be noted that the acoustic model describes the acoustic realization of speech; common acoustic models include Hidden Markov Models (HMMs), Deep Neural Networks (DNNs), Recurrent Neural Networks (RNNs), and so on. The embodiment of the present application uses Mandarin data to build a triphone-based deep neural network acoustic model.
And S130, determining a voice instruction corresponding to the recognition text.
Here, the voice instruction is a valid instruction, i.e., an instruction present in the instruction list. When applied in a specific field, the voice instruction has a concrete physical meaning.
For example, in the substation field, voice instructions may include '#2 main transformer upshift', '9004 I-section PT cabinet handcart disconnecting link allows remote control', '653 switch front upper cabinet door open/close enabled', '653-1W net door unset', and so on; such voice instructions control the intelligent device to perform the corresponding operations.
The voice recognition method provided by the embodiment of the present application comprises: acquiring a voice signal to be recognized, and inputting it into a pre-trained voice recognition model to obtain a recognition text matched with the voice signal; the voice recognition model comprises an acoustic model and a language model, the language model is generated by interpolating a basic language model with a specialized language model, and the specialized language model is trained on a word-segmented text obtained by segmenting the normalized instruction text, an extended text corresponding to the word-segmented text, and an extended pronunciation dictionary corresponding to the word-segmented text, where the extended pronunciation dictionary is obtained by extending an initial pronunciation dictionary; finally, the voice instruction corresponding to the recognition text is determined. In this way, recognition accuracy for mixed Chinese and English letters, specialized-domain vocabulary, and Chinese mixed with a small number of English words can be improved, and speech recognition accuracy in the general domain is maintained while that in the specialized domain is improved, thereby improving matching accuracy when an intelligent system, intelligent device or inspection robot interfaces with a power system.
The embodiment of the present application takes the intelligent voice field of transformer substations as an example to describe the provided voice recognition method in detail, but the method is not limited to this field.
In the embodiment of the present application, as a preferred embodiment, the step S110 includes:
collecting voice signals; and carrying out voice endpoint detection and noise detection on the collected voice signals to obtain the voice signals to be recognized.
Here, Voice Activity Detection (VAD) accurately locates the start and end points of speech within a noisy signal, removes the silent and noise parts, and finds the truly valid speech content. Noise detection further removes noise from the speech signal after voice endpoint detection.
Furthermore, by performing voice endpoint detection and noise detection on the collected voice signals, the voice recognition system not only reduces the amount of computation and shortens processing time but also eliminates noise interference from silent segments, improving the accuracy of voice recognition.
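As an illustration of this preprocessing stage, the sketch below implements a crude energy-threshold endpoint detector in Python; the patent does not specify a VAD algorithm, so the frame size, threshold and float-signal assumption here are all hypothetical:

```python
import numpy as np

def energy_vad(signal, sample_rate=16000, frame_ms=25, threshold_db=-35.0):
    """Crude energy-based voice endpoint detection (illustrative only).

    `signal` is assumed to be a float array scaled to [-1, 1]. Returns
    (start, end) sample indices of the detected speech region, or None
    when no frame rises above the energy threshold.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    # Log energy per frame; the small constant avoids log(0) on silence.
    energy_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-10)
    active = np.where(energy_db > threshold_db)[0]
    if active.size == 0:
        return None
    return active[0] * frame_len, (active[-1] + 1) * frame_len
```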
Specifically, the formula corresponding to the language model is as follows:
$$P(W) = \lambda \, P(W_1) + (1 - \lambda) \, P(W_2)$$

wherein $P(W)$ represents the probability given by the interpolated language model, $P(W_1)$ the probability given by the basic language model, $P(W_2)$ the probability given by the specialized language model, and $\lambda$ the weighting coefficient of the basic language model.
Here, the language model is generated by interpolating the basic language model with the specialized language model, where λ is the weighting coefficient of the basic language model. The value of λ varies with the specific application field: when the field contains many specialized terms, device names, operation instruction texts and the like, the weight of the specialized language model can be raised (i.e., λ lowered). In any case, λ ranges between 0 and 1 and defaults to 0.9.
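The interpolation itself is a weighted sum of the two models' probabilities; a minimal sketch with the default λ = 0.9 (the function name and example numbers are illustrative, not from the patent):

```python
def interpolate_lm(p_base, p_special, lam=0.9):
    """P(W) = lam * P_base(W) + (1 - lam) * P_special(W), where lam is
    the weighting coefficient of the basic (general-domain) model."""
    assert 0.0 <= lam <= 1.0
    return lam * p_base + (1.0 - lam) * p_special

# A word sequence that is rare in general text but common in substation
# instructions still receives usable probability mass from the
# specialized model even at the default lam = 0.9.
print(interpolate_lm(p_base=1e-7, p_special=1e-3))  # ~1.0e-4
```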
Referring to fig. 2, fig. 2 is a flowchart of a method for training the specialized language model according to an embodiment of the present application. As shown in fig. 2, when training the specialized language model, an instruction text is input and normalized; the normalized instruction text is segmented to obtain the word-segmented text; the extended text and the extended pronunciation dictionary are determined from the word-segmented text; and finally the specialized language model is trained on the word-segmented text, the extended text and the extended pronunciation dictionary.
Specifically, the instruction text is normalized by the following steps:
acquiring part-of-speech rules and business rules corresponding to text normalization; normalizing the instruction text based on the part-of-speech rules and the business rules respectively, to obtain a part-of-speech normalization result and a business normalization result; performing cross validation between the part-of-speech normalization result and the business normalization result; and determining the normalized instruction text based on the cross-validation result of the two.
Here, the part-of-speech rules classify words into nouns, verbs, adjectives and other parts of speech, while the business rules classify business-related operation specifications, management rules, regulations, industry standards and the like. Both are rules configured in the text normalization tool.
Specifically, a text normalization tool performs the normalization operation on the specialized text within the instruction text. The tool normalizes and cleans the text with two methods: one classifies all texts based on the part-of-speech rules, the other classifies them based on the business rules. After classification, the corresponding rules generate the corresponding normalization results, i.e., the part-of-speech normalization result and the business normalization result. Cross-validating the two results then screens out all usable data, which are determined to be the normalized instruction text. Cross validation ensures that the classification of the instruction text is more accurate, so text normalization is highly accurate and data cleaning is more thorough.
When normalizing instruction texts in the substation field, the specialized text comprises substation terminology, equipment names, operation instruction texts and the like.
In this way, the normalization tool can remove illegal characters, convert between full-width and half-width forms, standardize English letters, convert Arabic numerals and Roman numerals to Chinese characters, handle mixed English and Roman characters and mixed punctuation, and cover the multiple pronunciations of Arabic and Roman numerals. For example, the normalization result of '#2 main transformer upshift' is 'number two main transformer upshift' or 'well two main transformer upshift' (the '#' sign may be read as 井, 'well'), and the normalization result of '9004 I-section PT cabinet handcart disconnecting link allows remote control' is 'nine zero zero four one-section PT cabinet handcart disconnecting link allows remote control', and so on.
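A toy normalizer in Python illustrates two of the listed conversions (digit-by-digit reading of Arabic numerals, Roman numerals to Chinese, and '#' read as 井/'well'); the mapping tables are illustrative assumptions, not the patent's rule engine:

```python
DIGIT_TO_CN = {"0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
               "5": "五", "6": "六", "7": "七", "8": "八", "9": "九"}
ROMAN_TO_CN = {"Ⅰ": "一", "Ⅱ": "二", "Ⅲ": "三", "Ⅳ": "四"}

def normalize_instruction(text: str) -> str:
    """Digit-by-digit reading of Arabic numerals, Roman numerals to
    Chinese, and '#' read as 井 ('well') -- a simplified stand-in for
    the part-of-speech / business rule engine described above."""
    out = []
    for ch in text:
        if ch in DIGIT_TO_CN:
            out.append(DIGIT_TO_CN[ch])
        elif ch in ROMAN_TO_CN:
            out.append(ROMAN_TO_CN[ch])
        elif ch == "#":
            out.append("井")
        else:
            out.append(ch)
    return "".join(out)

print(normalize_instruction("#2主变升档"))    # 井二主变升档 ('well two main transformer upshift')
print(normalize_instruction("9004Ⅰ段PT柜"))  # 九零零四一段PT柜
```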
Specifically, the word-segmented text is obtained by segmenting the normalized instruction text through the following steps:
performing word segmentation on the normalized instruction text to obtain a Chinese word segmentation result and an English word segmentation result, wherein the English word segmentation result comprises at least one combination of different English letters; counting the number of times each combination of different English letters appears in all the normalized instruction texts; updating the English word segmentation result according to a comparison of that number against a set threshold; and determining the word-segmented text based on the Chinese word segmentation result and the updated English word segmentation result.
Here, when word segmentation is performed for the first time, the English letters appearing between two Chinese characters are grouped together as one combination; that is, the English word segmentation result contains at least one combination of different English letters. It must then be decided, according to the number of occurrences of each combination, whether to segment the English word segmentation result further.
A word segmentation tool performs word segmentation on the normalized instruction text to obtain the Chinese word segmentation result and the English word segmentation result, the latter comprising at least one combination of different English letters, and the number of times each combination appears in all the normalized instruction texts is counted. If the count is below the set threshold, the combination is split further; for each combination produced by the first split, its count over all the normalized instruction texts is judged again, and if it is still below the threshold, it is split a second time. This process repeats until, after n splits (n being a positive integer), every combination's count is greater than or equal to the set threshold, or every combination has become a single English letter, at which point segmentation stops and the English word segmentation result is obtained. If a count is greater than or equal to the set threshold, the corresponding combination is kept unchanged. The English word segmentation result is thus updated according to the comparison between the counts and the set threshold, and the final word-segmented text is determined from the Chinese word segmentation result and the updated English word segmentation result.
It should be noted that, in the substation field, all the normalized instruction texts are all the instruction texts related to the power station, from which the instruction texts used to train the basic language model have been removed. Specifically, all power-station-related texts are normalized to obtain all the normalized instruction texts, so that the number of occurrences of each combination of different English letters can be counted over them, and the English word segmentation result can then be updated against the set threshold.
For example, for the normalized instruction text 'BRCS nine three one AM closed-repeat three-jump platen', the first word segmentation result is 'B RCS nine three one AM closed-repeat three-jump platen'; since the count of 'RCS' is below the set threshold while that of 'AM' is above it, the final splitting result is 'B R C S nine three one AM closed-repeat three-jump platen'.
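The iterative threshold-based splitting can be sketched as follows; the binary split point is one plausible strategy, since the patent does not specify how a combination is divided:

```python
from collections import Counter

def segment_letter_runs(runs, corpus_counts, threshold):
    """Iteratively split English-letter combinations whose frequency in
    all normalized instruction texts falls below `threshold`, until every
    surviving piece is frequent enough or is a single letter."""
    result, queue = [], list(runs)
    while queue:
        run = queue.pop(0)
        if len(run) == 1 or corpus_counts[run] >= threshold:
            result.append(run)
        else:
            mid = len(run) // 2               # one plausible split point
            queue[:0] = [run[:mid], run[mid:]]
    return result

counts = Counter({"AM": 120, "BRCS": 2, "RCS": 1, "CS": 3,
                  "B": 300, "R": 250, "C": 400, "S": 380})
print(segment_letter_runs(["BRCS", "AM"], counts, threshold=10))
# ['B', 'R', 'C', 'S', 'AM'] -- matching the example above
```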
Specifically, the extended text corresponding to the word-segmented text is determined through the following steps:
acquiring all Chinese characters adjacent to English letters in the normalized instruction text to obtain an extended Chinese character set; and randomly arranging and combining the English letters with the Chinese characters in the extended Chinese character set to generate corresponding two-, three- and four-element tuples, so as to obtain the extended text corresponding to the word-segmented text.
The extended text corresponding to the normalized instruction text is determined from the resulting two-, three- and four-element tuples.
For example, for 'B R C S nine three one AM closed-repeat three-jump platen', the extended Chinese character set of characters adjacent to English letters is {nine, one, closed}, and tuples such as the pairs 'A nine' and 'nine A', the triples 'A B nine' and 'nine A B', and the quadruples 'nine A A A' and 'A A A nine' can be constructed.
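A sketch of the tuple expansion; for clarity it enumerates combinations exhaustively rather than randomly, and the mixed-tuple filter (each tuple must contain both a letter and a Chinese character) is an assumption:

```python
from itertools import product

def expand_text(letters, adjacent_hanzi):
    """Enumerate 2-, 3- and 4-element tuples mixing English letters with
    the Chinese characters adjacent to letters in the instruction texts."""
    symbols = sorted(letters) + sorted(adjacent_hanzi)
    tuples = []
    for n in (2, 3, 4):
        for combo in product(symbols, repeat=n):
            # Keep tuples that mix at least one letter with at least one
            # Chinese character (an assumed filter).
            if any(s in letters for s in combo) and any(s in adjacent_hanzi for s in combo):
                tuples.append(" ".join(combo))
    return tuples

expanded = expand_text(letters={"A", "B"}, adjacent_hanzi={"九", "一", "合"})
print(expanded[:3], "...", len(expanded), "tuples generated")
```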
Specifically, the extended pronunciation dictionary corresponding to the word-segmented text is determined through the following steps:
labeling English letters or English words with Chinese phonemes to obtain at least one pronunciation for each English letter or English word; and adding the at least one pronunciation to an initial pronunciation dictionary to obtain the extended pronunciation dictionary corresponding to the word-segmented text.
Here, labeling English letters or English words with Chinese phonemes means annotating them with the pinyin of similar-sounding Chinese, thereby obtaining a correspondence between the English letters or words and pinyin.
Furthermore, since English letters and English words are labeled with Chinese phonemes, each letter or word can be labeled with several pronunciations, and adding these labeled pronunciations to the initial pronunciation dictionary yields the extended pronunciation dictionary. For example, the readings of 'Z' include 'z EI1', 'z EI4', 'z I1', 'z I4', and so on, and the readings of 'ERROR' include 'IE4 r ER2', 'IE1 r ER2', and so on.
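A sketch of the dictionary expansion, reusing the pronunciations quoted above; the dictionary layout (word mapped to a list of phoneme strings) is an assumed data structure:

```python
def expand_pronunciation_dict(initial_dict, extra_prons):
    """Add Chinese-phoneme pronunciations for English letters/words to a
    copy of the initial pronunciation dictionary, keeping multiple
    pronunciations per entry as alternatives."""
    expanded = {w: list(p) for w, p in initial_dict.items()}
    for word, prons in extra_prons.items():
        slot = expanded.setdefault(word, [])
        for p in prons:
            if p not in slot:
                slot.append(p)
    return expanded

initial = {"中国": ["zh ong1 g uo2"]}                  # toy initial dictionary
letters = {"Z": ["z EI1", "z EI4", "z I1", "z I4"],    # readings quoted above
           "ERROR": ["IE4 r ER2", "IE1 r ER2"]}
print(expand_pronunciation_dict(initial, letters)["Z"])
```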
In the embodiment of the present application, as a preferred embodiment, the step S130 includes:
performing an inverse normalization operation on the recognition text to obtain the original instruction corresponding to the recognition text, wherein the inverse normalization operation is performed based on the part-of-speech rules and the business rules; and if the original instruction is detected in a preset instruction list, determining that the original instruction is the voice instruction.
Here, the inverse normalization operation and the normalization operation are mutually inverse processes: the operating principle is the same, and the applied rules are the same part-of-speech rules and business rules on which the normalization operation is implemented.
For example, if the voice signal to be recognized is '#2 main transformer upshift', the recognition text obtained by the voice recognition model is 'number two main transformer upshift' or 'well two main transformer upshift'. The inverse normalization operation restores the recognition text to the original instruction '#2 main transformer upshift'; whether this original instruction is in the instruction list is then judged, and if so, it is determined to be the voice instruction, through which the intelligent device is controlled to execute the corresponding action.
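The post-processing can be sketched as an inverse mapping followed by an exact check against the preset instruction list; this toy inverse simply undoes the toy normalizer above, whereas the real inverse normalization mirrors the full part-of-speech and business rules:

```python
CN_TO_DIGIT = {"零": "0", "一": "1", "二": "2", "三": "3", "四": "4",
               "五": "5", "六": "6", "七": "7", "八": "8", "九": "9"}

def inverse_normalize(text: str) -> str:
    """Undo the toy normalizer above: Chinese digits back to Arabic
    digits, 井 back to '#'."""
    return "".join(CN_TO_DIGIT.get(ch, "#" if ch == "井" else ch) for ch in text)

def match_instruction(recognized: str, instruction_list):
    """Return the restored original instruction when it is on the preset
    list, otherwise None (i.e. not a valid voice instruction)."""
    original = inverse_normalize(recognized)
    return original if original in instruction_list else None

print(match_instruction("井二主变升档", {"#2主变升档"}))  # #2主变升档
```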
As shown in fig. 3, a speech recognition method provided in an embodiment of the present application proceeds as follows: first, voice endpoint detection is performed on the collected voice signal, and noise detection is then performed on the result to obtain the voice signal to be recognized; the voice signal to be recognized is recognized with the voice recognition model, and the recognition text corresponding to the voice signal is decoded; the text inverse normalization operation is performed on the recognition text to obtain the original instruction; whether the original instruction is a valid instruction is judged, and if so, the original instruction is determined to be the voice instruction and is returned to the voice interaction system to execute the related operation.
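Putting the stages of Fig. 3 together, the runtime flow looks roughly like this; `energy_vad` and `match_instruction` are the illustrative sketches from earlier sections, and `asr_model.decode` is an assumed interface standing in for the acoustic-plus-language-model decoder:

```python
def recognize_command(audio, asr_model, instruction_list):
    """End-to-end flow of Fig. 3: VAD -> decode -> inverse normalization
    -> instruction-list check."""
    span = energy_vad(audio)
    if span is None:
        return None                                # silence / noise only
    start, end = span
    text = asr_model.decode(audio[start:end])      # recognition text
    return match_instruction(text, instruction_list)
```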
The speech recognition method provided by the embodiment of the present application builds on standard Mandarin speech recognition technology and trains the specialized language model on the word-segmented text obtained by segmenting the normalized instruction text, the extended text corresponding to the word-segmented text, and the extended dictionary corresponding to the word-segmented text, so the specialized language model can be updated automatically. Compared with improving recognition accuracy in a specific field by optimizing the acoustic model, this avoids the situation in which, because labeled training data are scarce in practice, recognition accuracy ultimately improves little; meanwhile, the training period of the language model is short, shortening the model update cycle in the specific field.
Specifically, taking application in the intelligent voice field of transformer substations as an example: compared with the poor recognition of equipment control instructions by existing voice recognition methods, the present application changes the target language model while leaving the target acoustic model unchanged, so mixed Chinese and English letter speech in the substation field can be recognized accurately, with accuracy above 98%. Recognition of substation professional vocabulary and of Chinese mixed with a small number of English words is also markedly improved, effectively solving the sudden performance drop that mixed Chinese/English letters, Chinese mixed with a few English words, and substation-field vocabulary cause on a standard Mandarin model. The recognition result in the substation field is also inverse-normalized for the specific scenario, improving matching accuracy when an intelligent system, intelligent device or inspection robot interfaces with the power system.
Based on the same inventive concept, an embodiment of the present application further provides a speech recognition apparatus corresponding to the speech recognition method. Since the principle by which the apparatus solves the problem is similar to that of the speech recognition method, the implementation of the apparatus can refer to the implementation of the method, and repeated details are omitted.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application, as shown in fig. 4, the speech recognition apparatus 400 includes:
a signal obtaining module 410, configured to obtain a speech signal to be recognized;
the speech recognition module 430 is configured to input the acquired speech signal to be recognized into a pre-trained speech recognition model, so as to obtain a recognition text matched with the speech signal to be recognized; the voice recognition model comprises an acoustic model and a language model, the language model is generated by interpolating a basic language model with a specialized language model, the specialized language model is trained on a word-segmented text obtained by segmenting the normalized instruction text, an extended text corresponding to the word-segmented text, and an extended pronunciation dictionary corresponding to the word-segmented text, and the extended pronunciation dictionary is obtained by extending an initial pronunciation dictionary;
and an instruction determining module 440, configured to determine a voice instruction corresponding to the recognized text.
Preferably, when the signal obtaining module 410 is configured to obtain a speech signal to be recognized, the signal obtaining module 410 is configured to:
collecting voice signals;
and carrying out voice endpoint detection and noise detection on the collected voice signals to obtain the voice signals to be recognized.
Preferably, the speech recognition apparatus 400 further comprises a training module 420, and the training module 420 is configured to normalize the instruction text by:
acquiring part-of-speech rules and business rules corresponding to text normalization;
normalizing the instruction text based on the part-of-speech rules and the business rules respectively, to obtain a part-of-speech normalization result and a business normalization result;
performing cross validation between the part-of-speech normalization result and the business normalization result;
and determining the normalized instruction text based on the cross-validation result of the part-of-speech normalization result and the business normalization result.
Preferably, the training module 420 is configured to segment the normalized instruction text into the word-segmented text by:
performing word segmentation on the normalized instruction text to obtain a Chinese word segmentation result and an English word segmentation result, wherein the English word segmentation result comprises at least one combination of different English letters;
counting the number of times each combination of different English letters appears in all the normalized instruction texts;
updating the English word segmentation result according to a comparison of that number against a set threshold;
and determining the word-segmented text based on the Chinese word segmentation result and the updated English word segmentation result.
Preferably, the training module 420 is configured to determine the extended text corresponding to the word-segmented text by:
acquiring all Chinese characters adjacent to English letters in the normalized instruction text to obtain an extended Chinese character set;
and randomly arranging and combining the English letters with the Chinese characters in the extended Chinese character set to generate corresponding two-, three- and four-element tuples, so as to obtain the extended text corresponding to the word-segmented text.
Preferably, the training module 420 is configured to determine the extended pronunciation dictionary corresponding to the word-segmented text by:
labeling English letters or English words with Chinese phonemes to obtain at least one pronunciation for each English letter or English word;
and adding the at least one pronunciation to an initial pronunciation dictionary to obtain the extended pronunciation dictionary corresponding to the word-segmented text.
Preferably, the language model corresponds to the formula:
$$P(W) = \lambda \, P(W_1) + (1 - \lambda) \, P(W_2)$$

wherein $P(W)$ represents the probability given by the interpolated language model, $P(W_1)$ the probability given by the basic language model, $P(W_2)$ the probability given by the specialized language model, and $\lambda$ the weighting coefficient of the basic language model.
Preferably, when the instruction determining module 440 is configured to determine the voice instruction corresponding to the recognized text, the instruction determining module 440 is configured to:
performing an inverse normalization operation on the recognition text to obtain an original instruction corresponding to the recognition text; wherein the inverse normalization operation is performed based on the part-of-speech rules and the business rules;
and if the original instruction is detected in a preset instruction list, determining that the original instruction is the voice instruction.
The voice recognition apparatus provided by the embodiment of the present application comprises a signal acquisition module, a voice recognition module and an instruction determination module. The signal acquisition module acquires the voice signal to be recognized; the voice recognition module inputs it into a pre-trained voice recognition model to obtain a recognition text matched with the voice signal, where the voice recognition model comprises an acoustic model and a language model, the language model is generated by interpolating a basic language model with a specialized language model, the specialized language model is trained on a word-segmented text obtained by segmenting the normalized instruction text, an extended text corresponding to the word-segmented text, and an extended pronunciation dictionary corresponding to the word-segmented text, and the extended pronunciation dictionary is obtained by extending an initial pronunciation dictionary; the instruction determination module determines the voice instruction corresponding to the recognition text. In this way, recognition accuracy for mixed Chinese and English letters, specialized-domain vocabulary, and Chinese mixed with a small number of English words can be improved, and speech recognition accuracy in the general domain is maintained while that in the specialized domain is improved, thereby improving matching accuracy when an intelligent system, intelligent device or inspection robot interfaces with a power system.
Referring to fig. 5, fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 5, the electronic device 500 includes a processor 510, a memory 520, and a bus 530.
The memory 520 stores machine-readable instructions executable by the processor 510, when the electronic device 500 runs, the processor 510 communicates with the memory 520 through the bus 530, and when the machine-readable instructions are executed by the processor 510, the steps of the voice recognition method in the method embodiments shown in fig. 1 and fig. 3 may be performed.
An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the speech recognition method in the method embodiments shown in fig. 1 and fig. 3 may be executed.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such an understanding, the technical solutions of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the above-mentioned embodiments are only specific embodiments of the present application, used to illustrate rather than limit its technical solutions, and the protection scope of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone familiar with this technical field may still modify the technical solutions described in the foregoing embodiments, readily conceive of changes, or make equivalent substitutions for some of their technical features within the technical scope disclosed in the present application; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present application and shall all be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A speech recognition method, characterized in that the speech recognition method comprises:
acquiring a voice signal to be recognized;
inputting the acquired voice signal to be recognized into a pre-trained voice recognition model to obtain a recognition text matched with the voice signal to be recognized; wherein the voice recognition model comprises an acoustic model and a language model, the language model is generated by interpolation of a basic language model and a special language model, the special language model is obtained by training on a word segmentation text obtained by segmenting a normalized instruction text, an extended text corresponding to the word segmentation text, and an extended pronunciation dictionary corresponding to the word segmentation text, and the extended pronunciation dictionary is obtained by extending an initial pronunciation dictionary;
and determining a voice instruction corresponding to the recognition text.
2. The speech recognition method of claim 1, wherein the acquiring of the voice signal to be recognized comprises:
collecting a voice signal;
and performing voice endpoint detection and noise detection on the collected voice signal to obtain the voice signal to be recognized.
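Claim 2 does not prescribe any particular endpoint-detection or noise-detection algorithm. A minimal energy-based sketch is given below; the frame length, energy threshold and SNR floor are arbitrary illustrative values, not figures from the patent.

```python
import numpy as np

def detect_endpoints(samples: np.ndarray, frame_len: int = 400,
                     energy_threshold: float = 1e-3) -> np.ndarray:
    """Trim leading and trailing low-energy frames (a crude endpoint detector).

    `samples` is a mono waveform scaled to [-1, 1]; all thresholds here are
    illustrative assumptions."""
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    energies = (frames ** 2).mean(axis=1)
    voiced = np.where(energies > energy_threshold)[0]
    if voiced.size == 0:
        return samples[:0]  # nothing above the noise floor
    start = voiced[0] * frame_len
    end = (voiced[-1] + 1) * frame_len
    return samples[start:end]

def passes_noise_check(samples: np.ndarray, min_snr_db: float = 10.0) -> bool:
    """Reject a signal whose rough SNR estimate falls below a set floor."""
    noise = np.percentile(np.abs(samples), 10) + 1e-12
    signal = np.percentile(np.abs(samples), 90) + 1e-12
    return 20 * np.log10(signal / noise) >= min_snr_db
```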
3. The speech recognition method of claim 1, wherein the instruction text is normalized by:
acquiring part-of-speech rules and business rules corresponding to text normalization;
normalizing the instruction text based on the part-of-speech rules and the business rules respectively, to obtain a part-of-speech normalization result and a business normalization result;
performing cross validation on the part-of-speech normalization result and the business normalization result;
and determining the normalized instruction text based on the result of the cross validation of the part-of-speech normalization result and the business normalization result.
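One possible reading of claim 3, sketched in Python below, models the two rule sets as independent string rewrites and cross-validates by checking that applying them in either order yields the same text. The two rules shown are invented examples, not rules from the patent.

```python
import re

# Invented example rules; real part-of-speech and business rules would come
# from the deployment's normalization tables.
POS_RULES = [(re.compile(r"(\d+)号"), r"No.\1")]             # "3号" -> "No.3"
BUSINESS_RULES = [(re.compile(r"(\d+)kV"), r"\1 kilovolt")]  # "10kV" -> "10 kilovolt"

def apply_rules(text: str, rules) -> str:
    for pattern, repl in rules:
        text = pattern.sub(repl, text)
    return text

def normalize_instruction(text: str) -> str:
    pos_result = apply_rules(text, POS_RULES)
    biz_result = apply_rules(text, BUSINESS_RULES)
    # Cross validation: finishing each result with the other rule set must
    # produce the same normalized text; otherwise the instruction is flagged.
    a = apply_rules(pos_result, BUSINESS_RULES)
    b = apply_rules(biz_result, POS_RULES)
    if a != b:
        raise ValueError(f"normalization conflict: {a!r} vs {b!r}")
    return a
```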
4. The speech recognition method of claim 3, wherein the word segmentation text is obtained by segmenting the normalized instruction text as follows:
performing word segmentation processing on the normalized instruction text to obtain a Chinese word segmentation result and an English word segmentation result, wherein the English word segmentation result comprises at least one combination of different English letters;
counting the number of times each combination of different English letters appears in all the normalized instruction texts;
updating the English word segmentation result according to a comparison of the counted number of times with a set threshold;
and determining the word segmentation text based on the Chinese word segmentation result and the updated English word segmentation result.
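The sketch below illustrates one way to implement the English half of claim 4: runs of English letters are counted across the whole instruction corpus, and only combinations that reach a frequency threshold are kept as single tokens, with rarer runs falling back to single letters. The threshold value and the regex tokenizer are assumptions.

```python
import re
from collections import Counter

LETTER_RUN = re.compile(r"[A-Za-z]+")

def build_english_segmenter(normalized_texts, threshold: int = 5):
    """Keep an English letter combination as one token only if it appears at
    least `threshold` times in all normalized instruction texts; the
    threshold of 5 is an illustrative assumption."""
    counts = Counter()
    for text in normalized_texts:
        counts.update(run.upper() for run in LETTER_RUN.findall(text))
    kept = {combo for combo, n in counts.items() if n >= threshold}

    def segment_english(text: str):
        tokens = []
        for run in LETTER_RUN.findall(text):
            run = run.upper()
            # Frequent combinations stay whole; rare runs split into letters.
            tokens.extend([run] if run in kept else list(run))
        return tokens

    return segment_english
```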
5. The speech recognition method of claim 4, wherein the extended text corresponding to the word segmentation text is determined by:
acquiring all Chinese characters adjacent to English letters in the normalized instruction text to obtain an extended Chinese character set;
and randomly arranging and combining all English letters with the Chinese characters in the extended Chinese character set to generate corresponding 2-grams, 3-grams and 4-grams, so as to obtain the extended text corresponding to the word segmentation text.
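A hedged sketch of this construction follows. Because exhaustively enumerating all 3- and 4-character combinations is combinatorially large, the sketch enumerates 2-grams fully and randomly samples the longer groups; the sample size and the restriction to uppercase letters are assumptions.

```python
import itertools
import random
import string

def is_chinese(ch: str) -> bool:
    return "\u4e00" <= ch <= "\u9fff"

def build_extended_text(normalized_texts, max_samples: int = 10_000):
    """Collect Chinese characters adjacent to English letters, then combine
    them with all English letters into 2-, 3- and 4-character groups."""
    adjacent = set()
    for text in normalized_texts:
        for prev, cur in zip(text, text[1:]):
            if prev.isascii() and prev.isalpha() and is_chinese(cur):
                adjacent.add(cur)
            if cur.isascii() and cur.isalpha() and is_chinese(prev):
                adjacent.add(prev)
    alphabet = sorted(set(string.ascii_uppercase) | adjacent)

    rng = random.Random(0)
    bigrams = ["".join(p) for p in itertools.product(alphabet, repeat=2)]
    trigrams = ["".join(rng.choices(alphabet, k=3)) for _ in range(max_samples)]
    quadgrams = ["".join(rng.choices(alphabet, k=4)) for _ in range(max_samples)]
    return bigrams + trigrams + quadgrams
```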
6. The speech recognition method of claim 4, wherein the extended pronunciation dictionary corresponding to the word segmentation text is determined by:
annotating English letters or English words with Chinese phonemes to obtain at least one pronunciation for each English letter or English word;
and adding the at least one pronunciation to the initial pronunciation dictionary to obtain the extended pronunciation dictionary corresponding to the word segmentation text.
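As a sketch of claim 6, the mapping below annotates a few English letters with invented Chinese-phoneme sequences; the actual phoneme labels would depend on the acoustic model's phone set and are not specified by the patent.

```python
# Invented Chinese-phoneme annotations for a few English letters; a letter
# may carry more than one pronunciation.
ENGLISH_TO_CHINESE_PHONEMES = {
    "A": [["ei1"]],
    "B": [["b", "i4"]],
    "C": [["x", "i1"]],
    "K": [["k", "ei4"]],
    "V": [["w", "ei1"]],
}

def extend_pronunciation_dictionary(initial_dict: dict) -> dict:
    """Return a copy of the initial pronunciation dictionary with the
    Chinese-phoneme pronunciations of English letters added."""
    extended = {word: list(prons) for word, prons in initial_dict.items()}
    for letter, prons in ENGLISH_TO_CHINESE_PHONEMES.items():
        extended.setdefault(letter, [])
        for pron in prons:
            if pron not in extended[letter]:
                extended[letter].append(pron)
    return extended
```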
7. The speech recognition method of claim 1, wherein the language model corresponds to a formula:
P(W) = λ·P(W1) + (1 − λ)·P(W2)
wherein P(W) represents the probability of the interpolated language model, P(W1) represents the probability of the basic language model, P(W2) represents the probability of the special language model, and λ represents the weighting coefficient of the basic language model.
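Applied entrywise to the probabilities of two models, the interpolation is a one-line mixture. The sketch below assumes the models are represented as plain probability tables and uses λ = 0.7 purely as an illustration; production toolkits such as SRILM perform this interpolation with back-off handling instead.

```python
def interpolate_language_models(p_basic: dict, p_special: dict,
                                lam: float = 0.7) -> dict:
    """P(W) = lam * P_basic(W) + (1 - lam) * P_special(W), applied entrywise.

    `p_basic` and `p_special` map a word sequence W to its probability;
    missing entries are treated as probability 0 for simplicity."""
    words = set(p_basic) | set(p_special)
    return {w: lam * p_basic.get(w, 0.0) + (1 - lam) * p_special.get(w, 0.0)
            for w in words}
```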
8. The speech recognition method of claim 1, wherein the determining of the voice instruction corresponding to the recognition text comprises:
performing an inverse normalization operation on the recognition text to obtain an original instruction corresponding to the recognition text, wherein the inverse normalization operation is performed based on the part-of-speech rules and the business rules;
and if the original instruction is detected to be in a preset instruction list, determining that the original instruction is the voice instruction.
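A sketch of claim 8 follows; the inverse rules are invented mirrors of the example normalization rules shown after claim 3, and the instruction list is whatever set of commands the deployment defines.

```python
import re
from typing import Optional

# Invented inverse rules mirroring the earlier example normalization rules.
INVERSE_RULES = [
    (re.compile(r"No\.(\d+)"), r"\1号"),
    (re.compile(r"(\d+) kilovolt"), r"\1kV"),
]

def determine_voice_instruction(recognition_text: str,
                                instruction_list) -> Optional[str]:
    """Map the recognition text back to its raw form, then accept it only if
    it appears in the preset instruction list."""
    original = recognition_text
    for pattern, repl in INVERSE_RULES:
        original = pattern.sub(repl, original)
    return original if original in set(instruction_list) else None
```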
9. A speech recognition apparatus, characterized in that the speech recognition apparatus comprises:
the signal acquisition module is used for acquiring a voice signal to be recognized;
the voice recognition module is used for inputting the acquired voice signal to be recognized into a pre-trained voice recognition model to obtain a recognition text matched with the voice signal to be recognized; wherein the voice recognition model comprises an acoustic model and a language model, the language model is generated by interpolation of a basic language model and a special language model, the special language model is obtained by training on a word segmentation text obtained by segmenting a normalized instruction text, an extended text corresponding to the word segmentation text, and an extended pronunciation dictionary corresponding to the word segmentation text, and the extended pronunciation dictionary is obtained by extending an initial pronunciation dictionary;
and the instruction determining module is used for determining the voice instruction corresponding to the recognition text.
10. A computer-readable storage medium, characterized in that the readable storage medium has stored thereon a computer program which, when executed by a processor, carries out the steps of the speech recognition method according to any one of claims 1 to 8.
CN202110283891.6A 2021-03-17 2021-03-17 Voice recognition method and device, electronic equipment and readable storage medium Active CN112669851B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110283891.6A CN112669851B (en) 2021-03-17 2021-03-17 Voice recognition method and device, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN112669851A (en) 2021-04-16
CN112669851B (en) 2021-06-08

Family

ID=75399617

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110283891.6A Active CN112669851B (en) 2021-03-17 2021-03-17 Voice recognition method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN112669851B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6088669A (en) * 1997-01-28 2000-07-11 International Business Machines, Corporation Speech recognition with attempted speaker recognition for speaker model prefetching or alternative speech modeling
JP2001236088A (en) * 2000-02-22 2001-08-31 Mitsubishi Electric Corp Device and method for generating statistical language model, and recording medium describing statistical language model generation program
CN103680498A (en) * 2012-09-26 2014-03-26 华为技术有限公司 Speech recognition method and speech recognition equipment
CN106935239A (en) * 2015-12-29 2017-07-07 阿里巴巴集团控股有限公司 The construction method and device of a kind of pronunciation dictionary
CN110941965A (en) * 2018-09-06 2020-03-31 重庆好德译信息技术有限公司 Instant translation system based on professional language
CN111402864A (en) * 2020-03-19 2020-07-10 北京声智科技有限公司 Voice processing method and electronic equipment
CN111667833A (en) * 2019-03-07 2020-09-15 国际商业机器公司 Speech recognition based on conversation
CN111739535A (en) * 2019-03-21 2020-10-02 北京搜狗科技发展有限公司 Voice recognition method and device and electronic equipment
CN111798838A (en) * 2020-07-16 2020-10-20 上海茂声智能科技有限公司 Method, system, equipment and storage medium for improving speech recognition accuracy

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113407771A (en) * 2021-05-14 2021-09-17 深圳市广电信义科技有限公司 Monitoring scheduling method, system, device and storage medium
CN113407771B (en) * 2021-05-14 2024-05-17 深圳市广电信义科技有限公司 Monitoring scheduling method, system, device and storage medium
CN113327587A (en) * 2021-06-02 2021-08-31 云知声(上海)智能科技有限公司 Method and device for voice recognition in specific scene, electronic equipment and storage medium
CN113345429B (en) * 2021-06-18 2022-03-29 图观(天津)数字科技有限公司 Semantic analysis method and system based on complex scene
CN113345429A (en) * 2021-06-18 2021-09-03 图观(天津)数字科技有限公司 Semantic analysis method and system based on complex scene
CN113470652A (en) * 2021-06-30 2021-10-01 山东恒远智能科技有限公司 Voice recognition and processing method based on industrial Internet
CN113763949B (en) * 2021-07-22 2024-05-14 南方电网数字平台科技(广东)有限公司 Speech recognition correction method, electronic device, and computer-readable storage medium
CN113763949A (en) * 2021-07-22 2021-12-07 南方电网深圳数字电网研究院有限公司 Speech recognition correction method, electronic device, and computer-readable storage medium
WO2023029220A1 (en) * 2021-08-30 2023-03-09 科大讯飞股份有限公司 Speech recognition method, apparatus and device, and storage medium
CN113707131B (en) * 2021-08-30 2024-04-16 中国科学技术大学 Speech recognition method, device, equipment and storage medium
CN113707131A (en) * 2021-08-30 2021-11-26 科大讯飞股份有限公司 Voice recognition method, device, equipment and storage medium
CN113793597A (en) * 2021-09-15 2021-12-14 云知声智能科技股份有限公司 Voice recognition method and device, electronic equipment and storage medium
CN113766171A (en) * 2021-09-22 2021-12-07 广东电网有限责任公司 Power transformation and defect elimination remote video consultation system and method based on AI voice control
CN113724710A (en) * 2021-10-19 2021-11-30 广东优碧胜科技有限公司 Voice recognition method and device, electronic equipment and computer readable storage medium
CN114141236A (en) * 2021-10-28 2022-03-04 北京百度网讯科技有限公司 Language model updating method and device, electronic equipment and storage medium
CN114078475A (en) * 2021-11-08 2022-02-22 北京百度网讯科技有限公司 Speech recognition and updating method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN112669851B (en) 2021-06-08

Similar Documents

Publication Publication Date Title
CN112669851B (en) Voice recognition method and device, electronic equipment and readable storage medium
CN111369996B (en) Speech recognition text error correction method in specific field
CN108376151B (en) Question classification method and device, computer equipment and storage medium
DE19721198C2 (en) Statistical language model for inflected languages
KR101498331B1 (en) System for extracting term from document containing text segment
CN110347787B (en) Interview method and device based on AI auxiliary interview scene and terminal equipment
WO2022095353A1 (en) Speech recognition result evaluation method, apparatus and device, and storage medium
CN103761975A (en) Method and device for oral evaluation
US20230069935A1 (en) Dialog system answering method based on sentence paraphrase recognition
CN110853628A (en) Model training method and device, electronic equipment and storage medium
DE602004004310T2 (en) System with combined statistical and rule-based grammar model for speech recognition and understanding
Ali et al. Advances in dialectal arabic speech recognition: A study using twitter to improve egyptian asr
CN112307741B (en) Insurance industry document intelligent analysis method and device
CN110750978A (en) Emotional tendency analysis method and device, electronic equipment and storage medium
KR20200119410A (en) System and Method for Recognizing Emotions from Korean Dialogues based on Global and Local Contextual Information
CN111339772B (en) Russian text emotion analysis method, electronic device and storage medium
US8219386B2 (en) Arabic poetry meter identification system and method
US20080120108A1 (en) Multi-space distribution for pattern recognition based on mixed continuous and discrete observations
Khuman et al. Grey relational analysis and natural language processing to: grey language processing
Nikulasdóttir et al. Open ASR for Icelandic: Resources and a baseline system
EP2034472B1 (en) Speech recognition method and device
CN112036572B (en) Text list-based user feature extraction method and device
US20220207239A1 (en) Utterance pair acquisition apparatus, utterance pair acquisition method, and program
CN112948585A (en) Natural language processing method, device, equipment and storage medium based on classification
CN113641778A (en) Topic identification method for dialog text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20210416

Assignee: CHINA TECHNOLOGY EXCHANGE Co.,Ltd.

Assignor: Beijing Yuanjian Information Technology Co.,Ltd.

Contract record no.: X2023110000142

Denomination of invention: A speech recognition method, device, electronic device, and readable storage medium

Granted publication date: 20210608

License type: Exclusive License

Record date: 20231204

PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A speech recognition method, device, electronic device, and readable storage medium

Effective date of registration: 20231206

Granted publication date: 20210608

Pledgee: CHINA TECHNOLOGY EXCHANGE Co.,Ltd.

Pledgor: Beijing Yuanjian Information Technology Co.,Ltd.

Registration number: Y2023110000521