CN113744737B - Training of speech recognition model, man-machine interaction method, equipment and storage medium - Google Patents


Info

Publication number
CN113744737B
CN113744737B (application number CN202111054577.7A)
Authority
CN
China
Prior art keywords
text information
voice
voice data
power industry
term
Prior art date
Legal status
Active
Application number
CN202111054577.7A
Other languages
Chinese (zh)
Other versions
CN113744737A (en)
Inventor
钟业荣
叶万余
阮国恒
江嘉铭
阮伟聪
张名捷
黄一捷
杨毅
倪进超
Current Assignee
Guangdong Power Grid Co Ltd
Qingyuan Power Supply Bureau of Guangdong Power Grid Co Ltd
Original Assignee
Guangdong Power Grid Co Ltd
Qingyuan Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Power Grid Co Ltd and Qingyuan Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority claimed from application CN202111054577.7A
Publication of CN113744737A
Application granted
Publication of CN113744737B
Legal status: Active
Anticipated expiration


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/26 — Speech to text systems
    • G10L15/02 — Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 — Speech classification or search
    • G10L15/16 — Speech classification or search using artificial neural networks
    • G10L15/28 — Constructional details of speech recognition systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

An embodiment of the invention provides a method, equipment, and a storage medium for training a speech recognition model and for man-machine interaction. The method comprises: acquiring first voice data belonging to a non-power industry and first text information that is the content of the first voice data; acquiring a term belonging to the power industry; merging the term into the first text information to obtain second text information belonging to the power industry; checking the validity of the second text information for the power industry; if the second text information is legal for the power industry, merging the term into the first voice data to obtain second voice data belonging to the power industry; and training a speech recognition model with the second voice data as a sample and the second text information as a label, so that the model converts voice data belonging to the power industry into text information. This eliminates manual corpus labeling, greatly reduces time consumption and cost, improves the labeling accuracy of the power-industry corpus, and greatly improves efficiency.

Description

Training of speech recognition model, man-machine interaction method, equipment and storage medium
Technical Field
Embodiments of the invention relate to the field of electric power technology, and in particular to a method, equipment, and a storage medium for training a speech recognition model and for man-machine interaction.
Background
With the development of deep learning, artificial intelligence technologies such as speech recognition and man-machine interaction have matured, and a large number of new-generation service applications based on artificial intelligence are widely used in industries such as electric power, finance, and the internet, becoming infrastructure for intelligent development as a whole.
Speech recognition is the basis of man-machine interaction, and training a speech recognition model for the power industry requires a large corpus; however, the corpus available in the power industry cannot compare with general-purpose corpora. If a speech recognition model is trained directly, disregarding dataset size as is sometimes done in academic settings, the insufficient corpus leads to poor model performance.
If the corpus is labeled manually, the huge demand makes the work time-consuming and error-prone, so the cost is high and the efficiency is low.
Disclosure of Invention
Embodiments of the invention provide a method, equipment, and a storage medium for training a speech recognition model and for man-machine interaction, which address the poor performance of power-industry speech recognition models caused by insufficient corpora.
In a first aspect, an embodiment of the present invention provides a training method for a speech recognition model, including:
acquiring first voice data belonging to a non-power industry and first text information that is the content of the first voice data;
acquiring a term belonging to the power industry;
merging the term into the first text information to obtain second text information belonging to the power industry;
checking the validity of the second text information for the power industry;
if the second text information is legal for the power industry, merging the term into the first voice data to obtain second voice data belonging to the power industry;
and training a speech recognition model with the second voice data as a sample and the second text information as a label, so that the model converts voice data belonging to the power industry into text information.
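The steps of the first aspect can be sketched as a small pipeline. The following Python sketch is purely illustrative: every helper (the placeholder legality check, the function names, the toy data) is an assumption of this sketch rather than the claimed method, and the audio-fusion step is deliberately stubbed out.

```python
# Illustrative sketch of the first-aspect pipeline; all helper logic is a
# simplified stand-in for the patent's steps, not the claimed implementation.

def fuse_term_into_text(term, text, keyword):
    # merge the power-industry term into the non-power text (third step)
    return text.replace(keyword, term)

def is_legal(text, reference_corpus):
    # placeholder validity check (fourth step): the fused sentence must share
    # at least one word with a known power-industry sentence
    return any(set(text.split()) & set(ref.split()) for ref in reference_corpus)

def build_power_corpus(pairs, term, keyword, reference_corpus):
    samples = []
    for voice, text in pairs:                      # first two steps are the inputs
        new_text = fuse_term_into_text(term, text, keyword)
        if is_legal(new_text, reference_corpus):
            # the fifth step would splice the term's synthesized audio into
            # `voice`; the original audio handle is kept here as a stand-in
            samples.append((voice, new_text))      # (sample, label) for training
    return samples

pairs = [("audio_001", "the computer is broken")]
corpus = build_power_corpus(pairs, "transformer", "computer",
                            ["the transformer tripped"])
print(corpus)  # -> [('audio_001', 'the transformer is broken')]
```

The validity check here is a crude word-overlap proxy; the description below replaces it with a similarity measure plus a distribution probability over a real power-industry corpus.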
In a second aspect, an embodiment of the present invention further provides a human-computer interaction method, including:
receiving voice data representing a question in the power industry;
loading a speech recognition model trained by the method of the first aspect;
invoking the speech recognition model to convert the voice data into text information;
and querying, within the power industry, an answer to the question represented by the text information.
In a third aspect, an embodiment of the present invention further provides a training device for a speech recognition model, including:
a training-set acquisition module, configured to acquire first voice data belonging to a non-power industry and first text information that is the content of the first voice data;
a power-term acquisition module, configured to acquire a term belonging to the power industry;
a text-information fusion module, configured to merge the term into the first text information to obtain second text information belonging to the power industry;
a validity-checking module, configured to check the validity of the second text information for the power industry;
a voice-data fusion module, configured to merge the term into the first voice data, if the second text information is legal for the power industry, to obtain second voice data belonging to the power industry;
and a model training module, configured to train a speech recognition model with the second voice data as a sample and the second text information as a label, so that the model converts voice data belonging to the power industry into text information.
In a fourth aspect, an embodiment of the present invention further provides a human-computer interaction device, including:
a question receiving module, configured to receive voice data representing a question in the power industry;
a speech recognition model loading module, configured to load a speech recognition model trained by the method of the first aspect or the device of the third aspect;
a text conversion module, configured to invoke the speech recognition model to convert the voice data into text information;
and a question querying module, configured to query, within the power industry, an answer to the question represented by the text information.
In a fifth aspect, an embodiment of the present invention further provides a computer device, including:
one or more processors; and
a memory for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of training a speech recognition model according to the first aspect or the man-machine interaction method according to the second aspect.
In a sixth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of training a speech recognition model according to the first aspect or the man-machine interaction method according to the second aspect.
In this embodiment, first voice data belonging to a non-power industry and first text information that is the content of the first voice data are obtained, and a term belonging to the power industry is obtained. The term is merged into the first text information to obtain second text information belonging to the power industry, and the validity of the second text information for the power industry is checked. If the second text information is legal for the power industry, the term is merged into the first voice data to obtain second voice data belonging to the power industry. The second voice data is then taken as a sample and the second text information as a label to train a speech recognition model that converts voice data belonging to the power industry into text information. By constructing a corpus from power-industry terms on the basis of a non-power-industry corpus, the authenticity of the corpus can be guaranteed.
Further, a high-performance speech recognition model can be trained on a sufficiently large and accurate power-industry corpus, improving the accuracy of speech recognition in the power industry; subsequent man-machine interaction is then performed on these high-accuracy recognition results, so the quality of the interaction can be ensured.
Drawings
Fig. 1 is a flowchart of a training method of a speech recognition model according to a first embodiment of the present invention;
Fig. 2 is a flowchart of a man-machine interaction method according to a second embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a training device for a speech recognition model according to a third embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a man-machine interaction device according to a fourth embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a computer device according to a fifth embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
Example 1
Fig. 1 is a flowchart of a method for training a speech recognition model according to a first embodiment of the present invention. The method is applicable to generating a power-industry corpus from a general-purpose corpus in order to train a speech recognition model. It may be performed by a training device for the speech recognition model, which may be implemented in software and/or hardware and configured in a computer device such as a personal computer, server, or workstation. The method specifically includes the following steps:
Step 101, first voice data belonging to a non-power industry and first text information serving as content of the first voice data are acquired.
To facilitate collecting a sufficiently large dataset, in this embodiment first voice data belonging to non-power industries, together with first text information that is the content of that voice data (i.e., the audio recorded while a speaker reads the text aloud), may be collected from general channels such as open-source databases and/or open-source projects.
Here, non-power industries are industries other than the power industry, for example education, sports, and media. Power-industry terms rarely appear in these industries, but their data are easy to collect and plentiful.
In addition, the first voice data is usually sentence-level voice data; it may be voice data in the same language (e.g., Chinese) or in different languages (e.g., Chinese and English), which is not limited in this embodiment.
Step 102, obtaining terminology belonging to the power industry.
In this embodiment, terms belonging to the power industry may be collected from general channels such as open-source databases and/or open-source projects, or from websites provided by the power industry.
Terms belonging to the power industry are terms of the electric power profession, which may be words or phrases, for example reactive voltage control (AVC), the power distribution management system (DMIS), the electric energy collection and billing system (TMR), the distributed control system (DCS), and the electric control system (ECS) of a generating unit.
Step 103, integrating the term into the first text information to obtain second text information belonging to the power industry.
In this embodiment, a term of the power industry is merged into the first text information of a non-power industry to form new text information. Because the new text information contains a power-industry term, it can be regarded as a sentence describing the power industry, and is therefore recorded as second text information belonging to the power industry.
In one embodiment of the present invention, step 103 may include the steps of:
Step 1031, dividing the first text information into a plurality of first keywords according to the grammar structure.
In this embodiment, natural language processing may be performed on the first text information using syntax trees, dependency trees with part-of-speech labels, syntactic edges, tree structures, and the like, to identify the grammatical structure of the first text information (for example, subject-predicate-object); each component of the grammatical structure (such as the subject, predicate, and object) is then mapped to a first keyword in the first text information.
Step 1032, determining a first length of the first keyword.
The minimum units (such as Chinese characters or letters) in the first keyword are counted, and their number is set as the first length of the first keyword.
Step 1033, replacing, in the first text information, a first keyword with a term that meets the target conditions, to obtain second text information belonging to the power industry.
In this embodiment, the first keywords in the first text information are traversed to find those that meet the target conditions.
The target conditions comprise the following two items:
1. The difference between the second length of the term and the first length is less than or equal to a first threshold.
The minimum units (e.g., Chinese characters or letters) in the term are counted, and their number is set as the second length of the term.
The difference (expressed as an absolute value) between the second length of the term and the first length of the first keyword is calculated and compared with a preset first threshold.
If the difference is less than or equal to the first threshold, the term and the first keyword differ little in length; replacing the first keyword with the term then has little influence on the first voice data, preserving the authenticity of the second voice data generated from it.
If the difference is greater than the first threshold, the term and the first keyword differ greatly in length; replacing the first keyword with the term then strongly influences the first voice data, and the second voice data subsequently generated from it is easily distorted.
2. The term fits the grammatical structure.
If the part of speech of the term is consistent with that of the first keyword (for example, both are nouns, so both can serve as a subject or an object), then the term can replace the first keyword: the resulting second text information has a reasonable grammatical structure, which ensures its authenticity.
If the part of speech of the term is inconsistent with that of the first keyword (for example, the term is a verb and the first keyword a noun, so the term can only serve as a predicate while the keyword serves as a subject or object), then using the term to replace the first keyword would yield second text information with an unreasonable grammatical structure that is easily distorted.
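Steps 1031 to 1033 can be illustrated with a short Python sketch. The toy part-of-speech dictionary and the threshold value below are assumptions of this sketch; a real implementation would obtain parts of speech from the dependency parsing described above, and the threshold would be tuned.

```python
# Sketch of the two target conditions of step 1033; the toy POS dictionary
# and the threshold are assumptions, not values from the patent.

POS = {"computer": "noun", "transformer": "noun", "repair": "verb"}

def replace_keyword(text, keyword, term, first_threshold=4):
    """Replace `keyword` with `term` only when both target conditions hold."""
    # condition 1: |second length - first length| <= first threshold
    if abs(len(term) - len(keyword)) > first_threshold:
        return None
    # condition 2: the term fits the grammatical slot (same part of speech)
    if POS.get(term) != POS.get(keyword):
        return None
    return text.replace(keyword, term)

print(replace_keyword("the computer is broken", "computer", "transformer"))
# -> 'the transformer is broken'
print(replace_keyword("the computer is broken", "computer", "repair"))
# -> None (a verb cannot fill a noun slot)
```

Returning `None` on failure lets a caller simply skip keyword-term pairs that do not satisfy the target conditions.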
Step 104, verifying the validity of the second text information for the power industry.
In this embodiment, second text information generated by fusing a term may or may not be a sentence that would actually appear in the power industry. To ensure the accuracy of the corpus, the validity of the second text information with respect to the power industry may be checked, that is, whether the second text information is a legal sentence in the power industry.
For example, let the first text information be "the computer is broken" and "Xiao Ming's computer is broken", and let the power-industry term be "transformer". Substituting the term "transformer" for the first keyword "computer" yields the second text information "the transformer is broken" and "Xiao Ming's transformer is broken". Since a transformer is not installed for an individual person, "the transformer is broken" is a legal sentence in the power industry, while "Xiao Ming's transformer is broken" is illegal.
In one embodiment of the present invention, step 104 includes the steps of:
step 1041, obtaining third text information belonging to the power industry.
In this embodiment, the third text information belonging to the power industry may be collected from general channels such as open-source databases and/or open-source projects, or from websites provided by the power industry.
The third text information belonging to the power industry is usually a sentence; it may be text information in the same language (e.g., Chinese) or in different languages (e.g., Chinese and English), which is not limited in this embodiment.
Step 1042, calculating the similarity between the second text information and the third text information.
In this embodiment, the similarity between the second text information and the third text information may be calculated by algorithms such as edit distance, TF-IDF (term frequency-inverse document frequency), SimHash, LDA topic modeling, doc2vec, or word2vec; this similarity reflects, to some extent, whether a user would say the second text information in the power industry.
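As one concrete choice among the similarity measures listed above, a normalized edit (Levenshtein) distance can be computed in a few lines. Normalizing by the longer string's length is an assumption of this sketch; the patent does not fix a particular normalization.

```python
# Normalized edit-distance similarity, one of the options mentioned above.

def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """1.0 for identical strings, approaching 0.0 for very different ones."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

print(edit_distance("kitten", "sitting"))  # -> 3
print(similarity("the transformer is broken", "the transformer tripped"))
```

The resulting score in [0, 1] can be compared directly against the second threshold of step 1044.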
Step 1043, calculating a distribution probability of the grammar structure containing the term in the second text information in the third text information.
Taking the grammatical structure containing the term in the second text information as a whole, the distribution of that whole in the third text information is considered, and its distribution probability in the third text information is calculated; this probability reflects, to some extent, whether a user would say the second text information in the power industry.
In one way of calculating the distribution probability, the third text information may be segmented into words, and second keywords in the third text information may be queried by methods such as string matching.
The dependency probability of each second keyword in the third text information is then counted. The dependency probability is the ratio of a first word frequency to a second word frequency, where the first word frequency is the number of times the current second keyword appears after other second keywords in the third text information, and the second word frequency is the total frequency of those other second keywords in the third text information.
It is assumed that the occurrence of a second keyword depends only on the limited one or several second keywords that precede it. For a sentence T composed of the word sequence W1, W2, W3, …, Wn, the probability of T is P(T) = P(W1 W2 W3 … Wn) = P(W1) P(W2|W1) P(W3|W1 W2) … P(Wn|W1 W2 … Wn-1).
If the occurrence of a second keyword depends only on the single second keyword immediately before it, this simplifies to P(T) = P(W1 W2 W3 … Wn) ≈ P(W1) P(W2|W1) P(W3|W2) … P(Wn|Wn-1).
As a representation of the grammatical structure, the term is compared with the current second keyword, and the first keyword with the other second keywords on which it depends.
When the term matches the current second keyword and the first keyword matches the other second keywords, the dependency probability is taken as the distribution probability, in the third text information, of the grammatical structure containing the term in the second text information; here the first keyword is the keyword replaced by the term in the first text information.
Step 1044, if the similarity is greater than or equal to the second threshold and the distribution probability is greater than or equal to the third threshold, determining that the second text information is legal for the power industry.
If the similarity is greater than or equal to the second threshold and the distribution probability is greater than or equal to the third threshold, the second text information has a high probability of appearing in the power industry, and it may be determined that the second text information is legal for the power industry.
Conversely, if the similarity is smaller than the second threshold or the distribution probability is smaller than the third threshold, the second text information has a low probability of appearing in the power industry, and it may be determined to be illegal for the power industry.
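Under the bigram assumption above, the dependency probability reduces to counting adjacent word pairs in the power-industry corpus. The toy corpus and function names below are assumptions of this sketch.

```python
# Bigram dependency probability of steps 1043-1044, on a toy corpus.
from collections import Counter

def bigram_model(sentences):
    """Count unigrams and adjacent word pairs in the third text information."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        words = s.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def dependency_probability(prev_word, word, unigrams, bigrams):
    """P(word | prev_word) = count(prev_word word) / count(prev_word)."""
    if unigrams[prev_word] == 0:
        return 0.0
    return bigrams[(prev_word, word)] / unigrams[prev_word]

third_text = ["the transformer is broken", "the transformer tripped"]
uni, bi = bigram_model(third_text)
print(dependency_probability("the", "transformer", uni, bi))  # -> 1.0
print(dependency_probability("transformer", "is", uni, bi))   # -> 0.5
```

The resulting probability is what gets compared against the third threshold; multiplying the dependency probabilities along a sentence gives the chain probability P(T) of the formulas above.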
Step 105, if the second text information is legal for the power industry, the term is fused into the first voice data, and the second voice data belonging to the power industry is obtained.
If the second text information is legal for the power industry, the pronunciation of the term can be used to form new voice data on the basis of the first voice data, recorded as second voice data belonging to the power industry, so that the content of the second voice data is the second text information.
In one embodiment of the present invention, step 105 includes the steps of:
step 1051, query the first voice data for a first voice signal with a first keyword.
In this embodiment, the first keyword is the keyword replaced by the term in the first text information. The first voice data comprises multiple frames of voice signals, each frame of which has been annotated in advance with its associated word; the voice signals corresponding to the words constituting the first keyword can therefore be located and recorded as the first voice signal.
Step 1052, determine a speech conversion model.
In this embodiment, a voice conversion model for converting text information into a voice signal, that is, for performing text-to-speech (TTS), may be trained in advance; its input is text information and its output is a voice signal.
Step 1053, invoking the speech conversion model to convert the term to a second speech signal.
The term (as text information) of the power industry is input into the voice conversion model, which processes it according to its own logic and outputs a voice signal, recorded as the second voice signal; the content of the second voice signal is thus the power-industry term.
Further, some voice conversion models can learn the characteristics of a speaker from a small number of utterances (e.g., sentences a few seconds long), i.e., they have a voice cloning function; examples include DurIAN and Deep Voice.
In these voice conversion models, two basic approaches are generally used to solve the voice cloning problem: speaker adaptation and speaker encoding. Both can be applied to a multi-speaker voice conversion model through speaker embedding vectors without degrading speech quality, and both achieve good naturalness and similarity to the original speaker even with a small amount of cloned audio.
Then, multiple pieces of third voice data, together with fourth text information that is the content of each piece of third voice data, may be obtained from general channels such as open-source databases and/or open-source projects, where the timbre of the third voice data is the same as that of the first voice data, that is, the third voice data and the first voice data are uttered by the same speaker.
Taking the first voice data and the third voice data as samples, and the first text information and the fourth text information as labels, the parameters of the voice conversion model are updated in a supervised manner so that it synthesizes voice signals in this timbre; the voice conversion model then has the function of cloning the speaker of the first and third voice data.
With the speaker's timbre thus constrained, the term is input into the updated voice conversion model to obtain a second voice signal with the same timbre as the first voice data, which ensures the naturalness and authenticity of the second voice data generated subsequently.
Step 1054, replacing the first voice signal with the second voice signal in the first voice data to obtain second voice data belonging to the power industry.
In the first voice data, the first voice signal representing a non-power-industry first keyword is replaced with the second voice signal whose content is the power-industry term, forming new voice data. Because the new voice data contains the power-industry term, it can be regarded as a sentence describing the power industry and is recorded as second voice data belonging to the power industry.
Further, a first target signal and a second target signal may be determined in the second voice signal, where the first target signal is the multi-frame voice signal at the beginning of the second voice signal and the second target signal is the multi-frame voice signal at the end of the second voice signal.
A first reference signal and a second reference signal are determined in the first voice data, where the first reference signal is the multi-frame voice signal adjacent to the first target signal, and the second reference signal is the multi-frame voice signal adjacent to the second target signal.
The first reference signal and the first target signal, and likewise the second reference signal and the second target signal, are smoothed with respect to audio parameters such as volume.
Once the smoothing is completed, the first voice signal is replaced with the second voice signal, yielding second voice data belonging to the power industry in which the second voice signal transitions smoothly into the other voice signals of the first voice data, ensuring the naturalness and authenticity of the second voice data.
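The splicing and smoothing of steps 1051 to 1054 can be sketched with plain lists of samples and a linear crossfade at each boundary. The fade length and the toy signals are assumptions of this sketch; a real system would operate on framed audio and smooth the reference/target signal pairs as described above.

```python
# Sketch of splicing the second voice signal into the first voice data with
# linear crossfades at both boundaries (the reference/target smoothing above).

def crossfade(a, b, fade):
    """Join sample lists a and b, blending a's last `fade` samples with b's first."""
    out = a[:-fade]
    for i in range(fade):
        w = (i + 1) / (fade + 1)              # ramp weight, 0 < w < 1
        out.append((1 - w) * a[len(a) - fade + i] + w * b[i])
    out.extend(b[fade:])
    return out

def splice(original, start, end, replacement, fade=2):
    """Replace original[start:end] with `replacement`, smoothing both joins."""
    head, tail = original[:start], original[end:]
    return crossfade(crossfade(head, replacement, fade), tail, fade)

# silence with a constant-amplitude "term" signal spliced into the middle
signal = splice([0.0] * 10, 3, 6, [1.0] * 6, fade=2)
print(signal)  # amplitude ramps up into the term segment and back down
```

The crossfade avoids the audible click that a hard cut between the first voice data and the synthesized term would produce.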
And 106, training a voice recognition model by taking the second voice data as a sample and the second text information as a label so as to convert the voice data belonging to the power industry into the text information.
In this embodiment, the second text information may be labeled as a tag of the second voice data, which indicates the real content of the second voice data.
And taking the second text information as a sample, performing supervised training on the voice recognition model under the supervision of the label, improving the learning ability of the voice recognition model on the second voice data, and improving the generalization ability of the voice recognition model in the power industry, thereby improving the accuracy of voice recognition in the power industry and ensuring the accuracy of subsequent man-machine interaction.
The structure of the speech recognition model is not limited to an artificially designed neural network, such as a CNN (Convolutional Neural Network) or an RNN (Recurrent Neural Network); it may also be a neural network optimized by a model quantization method, a neural network searched for problems in the power industry by a NAS (Neural Architecture Search) method, or the like, which is not limited in this embodiment.
Further, the speech recognition model may be trained from scratch, obtained by fine-tuning an existing third-party model, or improved by continuous learning, which is not limited in this embodiment.
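As a rough illustration of the supervised setup (sample = second voice data, label = second text information), the training loop might be organized as below; `model_step` is a hypothetical callable that consumes one (sample, label) pair, updates the model's parameters, and returns the loss — the real model, features, and loss function are not specified by the embodiment:

```python
def train(model_step, samples, labels, epochs=2):
    """Iterate over (voice-data sample, text-information label) pairs
    for several epochs and record the mean loss per epoch.

    model_step: hypothetical function (sample, label) -> loss that also
    performs the parameter update internally.
    """
    history = []
    for _ in range(epochs):
        losses = [model_step(x, y) for x, y in zip(samples, labels)]
        history.append(sum(losses) / len(losses))
    return history
```

A decreasing loss history would indicate that the model is fitting the power-industry corpus.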
In this embodiment, first voice data belonging to a non-power industry and first text information serving as the content of the first voice data are obtained, and terms belonging to the power industry are obtained. The terms are fused into the first text information to obtain second text information belonging to the power industry, and the validity of the second text information for the power industry is checked. If the second text information is legal for the power industry, the terms are fused into the first voice data to obtain second voice data belonging to the power industry. The second voice data is then taken as a sample and the second text information as a label to train a voice recognition model to convert voice data belonging to the power industry into text information. Because the corpus of the power industry is constructed on the basis of the non-power-industry corpus by referring to the terms of the power industry, the authenticity of the corpus can be guaranteed.
Further, a high-performance speech recognition model can be trained with the sufficient and accurate power-industry corpus, improving the accuracy of speech recognition in the power industry; subsequent man-machine interaction performed on the basis of the high-accuracy speech recognition result ensures the quality of the man-machine interaction.
Example two
Fig. 2 is a flowchart of a man-machine interaction method provided in a second embodiment of the present invention, where the method may be applicable to a situation that a person performs man-machine interaction by recognizing speech based on a speech recognition model in the power industry, and the method may be performed by a man-machine interaction device, where the man-machine interaction device may be implemented by software and/or hardware, and may be configured in a computer device, where the computer device may be a device on a user side, for example, a personal computer, a mobile terminal, a service terminal or an intelligent robot deployed in a power location, and the computer device may also be a device on a service side, for example, a server, a workstation, or the like, and specifically includes the following steps:
Step 201, voice data representing a problem in the power industry is received.
In this embodiment, the computer device may provide a User Interface (UI) at a local or remote client, where the UI provides a function of consulting a problem in the power industry, such as an interactive Interface of an intelligent robot, an interactive Interface for searching for a fault solution, etc., and a User may operate on the UI and speak during inputting the problem, and at this time, the local or remote client may call a microphone to record voice data, so as to obtain voice data representing the problem in the power industry.
Step 202, loading a voice recognition model.
In this embodiment, the voice recognition model may be trained in advance by the method of the first embodiment, and the structure and parameters thereof may be stored, and when the voice recognition model is applied, the voice recognition model and the parameters thereof may be loaded into the memory for operation.
Step 203, calling a voice recognition model to convert the voice data into text information.
The voice data is input into a voice recognition model, the voice recognition model carries out voice recognition on the voice data according to logic of the voice recognition model, and text information is output.
Since the voice recognition model is custom-trained for the power industry, text information with high accuracy can be obtained when voice recognition is performed on voice data representing a problem in the power industry.
Step 204, querying an answer for solving the problem represented by the text information in the power industry.
After speech recognition yields text information as the question asked by the user, the meaning of the question is understood and the user's intention is recognized. At this time, the question can be normalized and spell-corrected, and semantic recognition such as word segmentation, entity recognition, and syntactic analysis can be invoked to facilitate further processing of the query sentence.
In this process, dialogue state tracking (DST) is performed on the man-machine dialogue and the dialogue state is updated: the DST records all of the user's chat history so far and updates the dialogue state with the system behaviors that have occurred; the next policy is then determined based on the updated dialogue state, which decides which service the information is addressed to. After a certain service is activated, that service also has its own dialogue strategy or flow to finally decide what information to return as a reply.
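A minimal sketch of the dialogue-state update described above; the state representation and the field names (`history`, `last_action`) are assumptions for illustration, not the embodiment's actual data structure:

```python
def update_dialog_state(state, user_turn, system_action):
    """Return a new dialogue state: the user's turn is appended to the
    chat history and the most recent system behavior is recorded, so a
    downstream policy can decide the next action from the state alone."""
    new_state = dict(state)
    new_state["history"] = state.get("history", []) + [user_turn]
    new_state["last_action"] = system_action
    return new_state
```

Returning a fresh dict (rather than mutating in place) keeps each turn's state available for logging and replay.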
In the power industry, the following three methods may be adopted for searching and feeding back answers to questions posed by users:
(1) FAQ (Frequently Asked Questions)
The FAQ saves questions and related answers that the user frequently asks. For questions entered by the user, answers may be looked up in the FAQ library. If the corresponding question can be found, the answer corresponding to the question can be directly returned to the user without a plurality of complex processing procedures such as question understanding, information retrieval, answer extraction and the like, so that the efficiency is improved.
On the basis of the candidate question set in the FAQ library, an inverted index of the frequently asked questions is built to improve the system's retrieval efficiency; at the same time, similarity can be calculated by a semantic-based method to improve question-matching precision. That is, from user input to answer output, the FAQ system mainly goes through answer retrieval, semantic matching, answer re-ranking, and similar processes.
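The inverted-index retrieval step can be sketched in a few lines; token-overlap scoring stands in for the semantic matching mentioned above, which in practice would use embeddings or another similarity model:

```python
def build_inverted_index(faq):
    """Map each token to the set of ids of FAQ questions containing it."""
    index = {}
    for qid, question in enumerate(faq):
        for token in set(question.split()):
            index.setdefault(token, set()).add(qid)
    return index


def retrieve(index, faq, query):
    """Return candidate FAQ questions ranked by token overlap with the
    query (a crude stand-in for semantic similarity)."""
    scores = {}
    for token in set(query.split()):
        for qid in index.get(token, ()):
            scores[qid] = scores.get(qid, 0) + 1
    ranked = sorted(scores, key=lambda qid: -scores[qid])
    return [faq[qid] for qid in ranked]
```

Only questions sharing at least one token with the query are scored, which is what makes the inverted index cheaper than scanning the whole FAQ library.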
(2) DocQA (extractive question answering)
Extractive question answering may also be called a machine reading comprehension task: given a passage of reference text and a corresponding question, the machine reads the reference text and extracts the answer to the question from it. In specific cases, the extracted answers are integrated and refined to form a faithful, fluent, and accurate final answer.
(3) KBQA (Knowledge-Based Question Answering, knowledge graph question answering)
Question answering based on the knowledge graph uses the knowledge graph constructed in the marketing and distribution network field to capture the associations between the knowledge ontology and items of knowledge. This gives the question-answering system an inference capability, so it can answer some more complex questions.
The knowledge-graph question-answering process comprises systematic analysis of the question, obtaining the entities to be matched through entity linking (Entity Linking), relation extraction (Relation Detection), and the like, and then searching the knowledge base for the answer to the question.
The method analyzes query sentences by named entity recognition and relation extraction, and combines the analysis results with the idea of state transition to generate a query subgraph with a model. Finally, the query subgraph is used to obtain the answer from the knowledge base. This query-subgraph generation approach enables the KBQA system to achieve state-of-the-art results on both single-hop and multi-hop queries.
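To make single-hop versus multi-hop queries concrete, here is a toy knowledge base represented as a Python dict together with a relation-chain lookup; the entities, relation names, and dict representation are invented for illustration and say nothing about the actual graph store or query-subgraph model:

```python
def query(kb, start_entity, relations):
    """Follow a chain of relations from start_entity through a toy
    knowledge base (a dict of (entity, relation) -> entity/value).

    A single relation in the chain is a single-hop query; multiple
    relations form a multi-hop query. Returns None if any hop fails.
    """
    node = start_entity
    for rel in relations:
        node = kb.get((node, rel))
        if node is None:
            return None
    return node
```

A two-relation chain such as "which bureau operates the substation where transformer T1 is located" is answered by hopping through two edges in sequence.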
In this embodiment, voice data representing a problem in the power industry is received, a voice recognition model is loaded and called to convert the voice data into text information, and an answer for solving the problem represented by the text information is queried in the power industry. Because the corpus of the power industry is constructed by fusing terms of the power industry into a non-power-industry corpus, the authenticity of the corpus can be ensured; and because non-power-industry corpora are numerous, the resulting power-industry corpus is sufficient in quantity for training the voice recognition model. Manual labeling of the corpus is eliminated, greatly reducing time consumption and cost, while the prior labels of the non-power-industry corpus keep the labeling of the power-industry corpus accurate, so efficiency is greatly improved.
The sufficient and accurate power-industry corpus can train a high-performance voice recognition model, improving the accuracy of voice recognition in the power industry; subsequent man-machine interaction performed on the basis of the high-accuracy voice recognition result ensures the quality of the man-machine interaction, so an answer can be accurately found for the problem posed by the user.
It should be noted that, for simplicity of description, the method embodiments are shown as a series of acts, but it should be understood by those skilled in the art that the embodiments are not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments, and that the acts are not necessarily required by the embodiments of the invention.
Example III
Fig. 3 is a block diagram of a training device for a speech recognition model according to a third embodiment of the present invention, which may specifically include the following modules:
the training set acquisition module 301 is configured to acquire first voice data belonging to a non-power industry, and first text information serving as content of the first voice data;
a power term acquisition module 302, configured to acquire terms belonging to the power industry;
A text information fusion module 303, configured to fuse the term into the first text information to obtain second text information belonging to the power industry;
a validity checking module 304, configured to check validity of the second text information for the power industry;
A voice data fusion module 305, configured to fuse the term into the first voice data if the second text information is legal for the power industry, so as to obtain second voice data belonging to the power industry;
the model training module 306 is configured to train a speech recognition model by using the second speech data as a sample and the second text information as a tag, so as to convert the speech data belonging to the power industry into text information.
In one embodiment of the present invention, the text information fusion module 303 is further configured to:
dividing the first text information into a plurality of first keywords according to a grammar structure;
Determining a first length of the first keyword;
replacing the first keyword with the term meeting a target condition in the first text information to obtain second text information belonging to the power industry;
wherein the target condition includes a difference between a second length of the term and the first length being less than or equal to a first threshold, the term being applicable to the syntax structure.
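The length part of the target condition can be sketched as follows; the keyword list is assumed to come from the grammar-based segmentation, the threshold value is an invented example, and grammar-structure compatibility is assumed to have been checked separately:

```python
def fuse_term(keywords, term, max_len_diff=2):
    """Replace the first keyword whose length is within max_len_diff of
    the term's length, mirroring the target condition that the
    difference between the term's length and the keyword's length be
    at most a first threshold. Returns the keyword list unchanged if
    no keyword qualifies."""
    out = list(keywords)
    for i, kw in enumerate(out):
        if abs(len(term) - len(kw)) <= max_len_diff:
            out[i] = term
            return out
    return out
```

Constraining the length difference keeps the fused sentence's rhythm close to the original, which matters later when the term's audio is spliced into the first voice data.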
In one embodiment of the present invention, the validity checking module 304 is further configured to:
acquiring third text information belonging to the power industry;
calculating the similarity between the second text information and the third text information;
calculating a distribution probability of a grammar structure containing the term in the second text information in the third text information;
And if the similarity is greater than or equal to a second threshold value and the distribution probability is greater than or equal to a third threshold value, determining that the second text information is legal for the electric power industry.
In one embodiment of the present invention, the validity checking module 304 is further configured to:
inquiring a second keyword in the third text information;
counting the dependency probability of each second keyword in the third text information, wherein the dependency probability is the ratio of a first word frequency to a second word frequency, the first word frequency is the word frequency of the second keyword in the third text information after other second keywords appear, and the second word frequency is the total word frequency of other second keywords in the third text information;
When the term is the same as the current second keyword and the first keyword is the same as other second keywords, setting the dependency probability as the distribution probability of the grammar structure containing the term in the second text information in the third text information, wherein the first keyword is a keyword replaced by the term in the first text information.
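The dependency probability defined above — the ratio of the first word frequency (occurrences of the current second keyword after other second keywords) to the second word frequency (total occurrences of the other second keywords) — can be sketched directly; the token-sequence representation is an assumption for illustration:

```python
def dependency_probability(keyword_seq, current, others):
    """Ratio of how often `current` immediately follows a keyword in
    `others` to the total occurrences of `others` in the sequence,
    following the word-frequency definition in the text."""
    follows = sum(
        1
        for prev, nxt in zip(keyword_seq, keyword_seq[1:])
        if prev in others and nxt == current
    )
    total_others = sum(1 for kw in keyword_seq if kw in others)
    return follows / total_others if total_others else 0.0
```

A high ratio indicates that, in the power-industry reference corpus, the term plausibly occurs in the grammatical position where it was substituted.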
In one embodiment of the present invention, the voice data fusion module 305 is further configured to:
querying a first voice signal with content being a first keyword in the first voice data, wherein the first keyword is a keyword replaced by the term in the first text information;
Determining a voice conversion model, wherein the voice conversion model is used for converting text information into voice signals;
Invoking the speech conversion model to convert the term to a second speech signal;
And replacing the first voice signal with the second voice signal in the first voice data to obtain second voice data belonging to the power industry.
In one embodiment of the present invention, the voice data fusion module 305 is further configured to:
acquiring third voice data and fourth text information serving as the content of the third voice data, wherein the tone of the third voice data is the same as that of the first voice data;
Updating the voice conversion model by taking the first voice data and the third voice data as samples and the first text information and the fourth text information as labels, so that the voice conversion model is used for synthesizing voice signals of the tone;
processing the term input updated in the voice conversion model under the condition of limiting the tone color to obtain a second voice signal with the same tone color as the first voice data.
In one embodiment of the present invention, the voice data fusion module 305 is further configured to:
Determining a first target signal and a second target signal in the second voice signals, wherein the first target signal is a multi-frame voice signal of which the sequence starts in the second voice signals, and the second target signal is a multi-frame voice signal of which the sequence ends in the second voice signals;
Determining a first reference signal and a second reference signal in the first voice data, wherein the first reference signal is a multi-frame voice signal adjacent to the first target signal in the first voice data, and the second reference signal is a multi-frame voice signal adjacent to the second target signal in the first voice data;
smoothing the first reference signal and the first target signal, and smoothing the second reference signal and the second target signal respectively;
And if the smoothing processing is completed, replacing the first voice signal with the second voice signal to obtain second voice data belonging to the power industry.
The training device for the speech recognition model provided by the embodiment of the invention can execute the training method for the speech recognition model provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example IV
Fig. 4 is a structural block diagram of a man-machine interaction device according to a fourth embodiment of the present invention, which may specifically include the following modules:
A problem receiving module 401, configured to receive voice data representing a problem in the power industry;
A speech recognition model loading module 402 for loading a speech recognition model;
the speech recognition model may be trained by the method of embodiment one or the apparatus of embodiment three.
A text conversion module 403, configured to invoke the speech recognition model to convert the speech data into text information;
a question querying module 404, configured to query the power industry for an answer for solving the question represented by the text information.
The man-machine interaction device provided by the embodiment of the invention can execute the man-machine interaction method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example five
Fig. 5 is a schematic structural diagram of a computer device according to a fifth embodiment of the present invention. Fig. 5 illustrates a block diagram of an exemplary computer device 12 suitable for use in implementing embodiments of the present invention. The computer device 12 shown in fig. 5 is merely an example and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.
As shown in FIG. 5, the computer device 12 is in the form of a general purpose computing device. Components of computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, a bus 18 that connects the various system components, including the system memory 28 and the processing units 16.
Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Computer device 12 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. The computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in FIG. 5, commonly referred to as a "hard disk drive"). Although not shown in fig. 5, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In such cases, each drive may be coupled to bus 18 through one or more data medium interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored in, for example, memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 42 generally perform the functions and/or methods of the embodiments described herein.
The computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), one or more devices that enable a user to interact with the computer device 12, and/or any devices (e.g., network card, modem, etc.) that enable the computer device 12 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. Moreover, computer device 12 may also communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through network adapter 20. As shown, network adapter 20 communicates with other modules of computer device 12 via bus 18. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with computer device 12, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, for example, a training method or a man-machine interaction method for implementing a speech recognition model provided by an embodiment of the present invention.
Example six
The sixth embodiment of the present invention further provides a computer readable storage medium, on which a computer program is stored, where the computer program when executed by a processor implements each process of the foregoing training method or man-machine interaction method for a speech recognition model, and the same technical effects can be achieved, so that repetition is avoided, and no further description is given here.
The computer readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims (9)

1. A method for training a speech recognition model, comprising:
acquiring first voice data belonging to a non-power industry and first text information serving as the content of the first voice data;
acquiring terms belonging to the power industry;
Merging the term into the first text information to obtain second text information belonging to the power industry;
verifying the validity of the second text information for the power industry;
if the second text information is legal for the power industry, the term is fused into the first voice data, and second voice data belonging to the power industry is obtained;
training a voice recognition model by taking the second voice data as a sample and the second text information as a label so as to convert the voice data belonging to the electric power industry into text information;
wherein the verifying the validity of the second text information for the power industry includes:
acquiring third text information belonging to the power industry;
calculating the similarity between the second text information and the third text information;
calculating a distribution probability of a grammar structure containing the term in the second text information in the third text information;
And if the similarity is greater than or equal to a second threshold value and the distribution probability is greater than or equal to a third threshold value, determining that the second text information is legal for the electric power industry.
2. The method of claim 1, wherein said merging the term into the first text message to obtain a second text message belonging to the power industry comprises:
dividing the first text information into a plurality of first keywords according to a grammar structure;
Determining a first length of the first keyword;
replacing the first keyword with the term meeting a target condition in the first text information to obtain second text information belonging to the power industry;
wherein the target condition includes a difference between a second length of the term and the first length being less than or equal to a first threshold, the term being applicable to the syntax structure.
3. The method of claim 1, wherein the calculating the probability of the distribution of the grammar structure containing the term in the second text information in the third text information comprises:
inquiring a second keyword in the third text information;
counting the dependency probability of each second keyword in the third text information, wherein the dependency probability is the ratio of a first word frequency to a second word frequency, the first word frequency is the word frequency of the second keyword in the third text information after other second keywords appear, and the second word frequency is the total word frequency of other second keywords in the third text information;
When the term is the same as the current second keyword and the first keyword is the same as other second keywords, setting the dependency probability as the distribution probability of the grammar structure containing the term in the second text information in the third text information, wherein the first keyword is a keyword replaced by the term in the first text information.
4. A method according to any one of claims 1-3, wherein said merging said term into said first voice data to obtain second voice data belonging to said power industry comprises:
querying a first voice signal with content being a first keyword in the first voice data, wherein the first keyword is a keyword replaced by the term in the first text information;
Determining a voice conversion model, wherein the voice conversion model is used for converting text information into voice signals;
Invoking the speech conversion model to convert the term to a second speech signal;
And replacing the first voice signal with the second voice signal in the first voice data to obtain second voice data belonging to the power industry.
5. The method of claim 4, wherein said invoking the speech conversion model to convert the term to a second speech signal comprises:
acquiring third voice data and fourth text information serving as the content of the third voice data, wherein the tone of the third voice data is the same as that of the first voice data;
Updating the voice conversion model by taking the first voice data and the third voice data as samples and the first text information and the fourth text information as labels, so that the voice conversion model is used for synthesizing voice signals of the tone;
processing the term input updated in the voice conversion model under the condition of limiting the tone color to obtain a second voice signal with the same tone color as the first voice data.
6. The method of claim 5, wherein replacing the first voice signal with the second voice signal in the first voice data to obtain second voice data belonging to the power industry comprises:
Determining a first target signal and a second target signal in the second voice signals, wherein the first target signal is a multi-frame voice signal of which the sequence starts in the second voice signals, and the second target signal is a multi-frame voice signal of which the sequence ends in the second voice signals;
Determining a first reference signal and a second reference signal in the first voice data, wherein the first reference signal is a multi-frame voice signal adjacent to the first target signal in the first voice data, and the second reference signal is a multi-frame voice signal adjacent to the second target signal in the first voice data;
smoothing the first reference signal and the first target signal, and smoothing the second reference signal and the second target signal respectively;
And if the smoothing processing is completed, replacing the first voice signal with the second voice signal to obtain second voice data belonging to the power industry.
7. A human-computer interaction method, comprising:
receiving voice data representing a problem in the power industry;
loading a speech recognition model trained by the method of any one of claims 1-6;
invoking the voice recognition model to convert the voice data into text information;
and querying an answer for solving the problem represented by the text information in the power industry.
8. A computer device, the computer device comprising:
One or more processors;
A memory for storing one or more programs,
The one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of training a speech recognition model as claimed in any one of claims 1-6 or the method of human interaction as claimed in claim 7.
9. A computer readable storage medium, characterized in that the computer readable storage medium stores thereon a computer program, which when executed by a processor implements the training method of a speech recognition model according to any of claims 1-6 or the human-computer interaction method according to claim 7.
CN202111054577.7A 2021-09-09 2021-09-09 Training of speech recognition model, man-machine interaction method, equipment and storage medium Active CN113744737B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111054577.7A CN113744737B (en) 2021-09-09 2021-09-09 Training of speech recognition model, man-machine interaction method, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113744737A CN113744737A (en) 2021-12-03
CN113744737B true CN113744737B (en) 2024-06-11

Family

ID=78737459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111054577.7A Active CN113744737B (en) 2021-09-09 2021-09-09 Training of speech recognition model, man-machine interaction method, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113744737B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110930993A (en) * 2018-09-20 2020-03-27 蔚来汽车有限公司 Specific field language model generation method and voice data labeling system
CN111554273A (en) * 2020-04-28 2020-08-18 华南理工大学 Method for selecting amplified corpora in voice keyword recognition
CN111968624A (en) * 2020-08-24 2020-11-20 平安科技(深圳)有限公司 Data construction method and device, electronic equipment and storage medium
CN112397054A (en) * 2020-12-17 2021-02-23 北京中电飞华通信有限公司 Power dispatching voice recognition method
CN112651238A (en) * 2020-12-28 2021-04-13 深圳壹账通智能科技有限公司 Training corpus expansion method and device and intention recognition model training method and device
CN112885352A (en) * 2021-01-26 2021-06-01 广东电网有限责任公司 Corpus construction method and device, computer equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10032448B1 (en) * 2017-01-06 2018-07-24 International Business Machines Corporation Domain terminology expansion by sensitivity


Similar Documents

Publication Publication Date Title
US10176804B2 (en) Analyzing textual data
Wang et al. An overview of image caption generation methods
US20210210076A1 (en) Facilitating end-to-end communications with automated assistants in multiple languages
CN109146610B (en) Intelligent insurance recommendation method and device and intelligent insurance robot equipment
WO2022095380A1 (en) Ai-based virtual interaction model generation method and apparatus, computer device and storage medium
CN111708869B (en) Processing method and device for man-machine conversation
CN108304372A (en) Entity extraction method and apparatus, computer equipment and storage medium
CN114580382A (en) Text error correction method and device
WO2021218029A1 (en) Artificial intelligence-based interview method and apparatus, computer device, and storage medium
KR102041621B1 (en) System for providing artificial intelligence based dialogue type corpus analyze service, and building method therefor
Banerjee et al. A dataset for building code-mixed goal oriented conversation systems
CN106570180A (en) Artificial intelligence based voice searching method and device
US11907665B2 (en) Method and system for processing user inputs using natural language processing
CN111177350A (en) Method, device and system for forming dialect of intelligent voice robot
CN113672708A (en) Language model training method, question and answer pair generation method, device and equipment
CN111522924A (en) Emotional chat type reply generation method with theme perception
CN113392265A (en) Multimedia processing method, device and equipment
CN109992651B (en) Automatic identification and extraction method for problem target features
CN117332789A (en) Semantic analysis method and system for dialogue scene
Banerjee et al. Generating abstractive summaries from meeting transcripts
CN111968646A (en) Voice recognition method and device
CN116432653A (en) Method, device, storage medium and equipment for constructing multilingual database
CN115132182B (en) Data identification method, device, equipment and readable storage medium
Zahariev et al. Semantic analysis of voice messages based on a formalized context
CN117493548A (en) Text classification method, training method and training device for model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant