CN113744737A - Training of speech recognition model, man-machine interaction method, equipment and storage medium


Info

Publication number
CN113744737A
CN113744737A
Authority
CN
China
Prior art keywords
text information
voice
power industry
voice data
signal
Prior art date
Legal status
Granted
Application number
CN202111054577.7A
Other languages
Chinese (zh)
Other versions
CN113744737B (en)
Inventor
钟业荣
叶万余
阮国恒
江嘉铭
阮伟聪
张名捷
黄一捷
杨毅
倪进超
Current Assignee
Guangdong Power Grid Co Ltd
Qingyuan Power Supply Bureau of Guangdong Power Grid Co Ltd
Original Assignee
Guangdong Power Grid Co Ltd
Qingyuan Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Power Grid Co Ltd and Qingyuan Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority to CN202111054577.7A
Publication of CN113744737A
Application granted
Publication of CN113744737B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 15/26 Speech to text systems
    • G10L 15/28 Constructional details of speech recognition systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

An embodiment of the invention provides a method for training a speech recognition model, a human-computer interaction method, a device, and a storage medium. The method includes: acquiring first voice data belonging to a non-power industry and first text information that is the content of the first voice data; acquiring terms belonging to the power industry; merging the terms into the first text information to obtain second text information belonging to the power industry; checking the legality of the second text information for the power industry; if the second text information is legal for the power industry, merging the terms into the first voice data to obtain second voice data belonging to the power industry; and training a speech recognition model with the second voice data as samples and the second text information as labels, so as to convert voice data belonging to the power industry into text information. This does away with manual corpus annotation, greatly reducing time consumption; the power-industry corpus is annotated with high accuracy, cost is greatly reduced, and efficiency is greatly improved.

Description

Training of speech recognition model, man-machine interaction method, equipment and storage medium
Technical Field
Embodiments of the invention relate to the technical field of electric power, and in particular to a method for training a speech recognition model, a human-computer interaction method, a device, and a storage medium.
Background
With the development of deep learning and the maturity of artificial intelligence technologies such as speech recognition and human-computer interaction, a large number of new-generation service applications based on artificial intelligence are widely used in industries such as electric power, finance, and the internet, and have become infrastructure for intelligent development as a whole.
Speech recognition is the basis of human-computer interaction. Training a model for speech recognition in the power industry requires a large amount of corpus data, yet the volume of power-industry corpora cannot compare with that of general corpora; if the model is trained directly, without regard to the size of the dataset, its performance will be low because the corpus is insufficient.
If corpora are annotated manually, the huge demand for corpus data makes the work time-consuming and error-prone, so the cost is high and the efficiency low.
Disclosure of Invention
Embodiments of the invention provide a method for training a speech recognition model, a human-computer interaction method, a device, and a storage medium, aiming to solve the problem that models for speech recognition in the power industry perform poorly because of insufficient corpus data.
In a first aspect, an embodiment of the present invention provides a method for training a speech recognition model, including:
acquiring first voice data belonging to a non-electric power industry and first text information serving as the content of the first voice data;
acquiring terms belonging to the power industry;
integrating the terms into the first text information to obtain second text information belonging to the power industry;
verifying the legality of the second text information for the power industry;
if the second text information is legal for the power industry, the term is merged into the first voice data to obtain second voice data belonging to the power industry;
and training a voice recognition model by taking the second voice data as a sample and the second text information as a label so as to convert the voice data belonging to the power industry into text information.
In a second aspect, an embodiment of the present invention further provides a human-computer interaction method, including:
receiving voice data representing a problem in the power industry;
loading a speech recognition model trained by the method of the first aspect;
calling the voice recognition model to convert the voice data into text information;
querying, in the power industry, answers for solving the questions represented by the text information.
In a third aspect, an embodiment of the present invention further provides a training apparatus for a speech recognition model, including:
the training set acquisition module is used for acquiring first voice data belonging to the non-electric power industry and first text information serving as the content of the first voice data;
the electric power term acquisition module is used for acquiring terms belonging to the electric power industry;
the text information fusion module is used for fusing the terms into the first text information to obtain second text information belonging to the power industry;
the validity checking module is used for checking the validity of the second text information for the power industry;
the voice data fusion module is used for fusing the terms into the first voice data to obtain second voice data belonging to the power industry if the second text information is legal for the power industry;
and the model training module is used for training a voice recognition model by taking the second voice data as a sample and the second text information as a label so as to convert the voice data belonging to the power industry into text information.
In a fourth aspect, an embodiment of the present invention further provides a human-computer interaction device, including:
the problem receiving module is used for receiving voice data representing problems in the power industry;
a speech recognition model loading module for loading a speech recognition model trained by the method according to the first aspect or the apparatus according to the third aspect;
the text conversion module is used for calling the voice recognition model to convert the voice data into text information;
and the question query module is used for querying answers for solving the questions represented by the text information in the power industry.
In a fifth aspect, an embodiment of the present invention further provides a computer device, where the computer device includes:
one or more processors;
a memory for storing one or more programs which,
when executed by the one or more processors, cause the one or more processors to implement the method for training a speech recognition model according to the first aspect or the human-computer interaction method according to the second aspect.
In a sixth aspect, the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for training a speech recognition model according to the first aspect or the method for human-computer interaction according to the second aspect.
In this embodiment, first voice data belonging to a non-power industry and first text information that is its content are acquired, together with terms belonging to the power industry. The terms are merged into the first text information to obtain second text information belonging to the power industry, and the legality of the second text information for the power industry is checked. If it is legal, the terms are merged into the first voice data to obtain second voice data belonging to the power industry, and a speech recognition model is trained with the second voice data as samples and the second text information as labels, so as to convert voice data belonging to the power industry into text information. Because the power-industry corpus is constructed, by reference to power-industry terms, on top of the plentiful corpus of non-power industries, its authenticity can be guaranteed and its volume is sufficient for training a speech recognition model. Manual corpus annotation is avoided, greatly reducing time consumption; since the non-power-industry annotations are accurate a priori, the power-industry annotations are also highly accurate, so cost is greatly reduced and efficiency greatly improved.
Furthermore, a sufficient and accurate power-industry corpus can train a high-performance speech recognition model, improving the accuracy of speech recognition in the power industry; human-computer interaction subsequently performed on highly accurate recognition results is thus of assured quality.
Drawings
Fig. 1 is a flowchart of a method for training a speech recognition model according to an embodiment of the present invention;
fig. 2 is a flowchart of a human-computer interaction method according to a second embodiment of the present invention;
fig. 3 is a schematic structural diagram of a training apparatus for a speech recognition model according to a third embodiment of the present invention;
fig. 4 is a schematic structural diagram of a human-computer interaction device according to a fourth embodiment of the present invention;
fig. 5 is a schematic structural diagram of a computer device according to a fifth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a flowchart of a method for training a speech recognition model according to Embodiment 1 of the present invention. The embodiment is applicable to generating a power-industry corpus on the basis of a general corpus in order to train a speech recognition model. The method may be executed by a training apparatus for a speech recognition model, which may be implemented in software and/or hardware and configured in a computer device such as a personal computer, server, or workstation. The method specifically includes the following steps:
Step 101, acquiring first voice data belonging to a non-power industry and first text information that is the content of the first voice data.
To collect a sufficiently large dataset, in this embodiment the first voice data belonging to a non-power industry and the first text information that is its content may be gathered from common channels such as open-source databases and/or open-source projects; that is, the first voice data is a recording of a speaker reading out the first text information.
The non-power industry may be any industry other than the electric power industry, such as education, sports, or media; in such industries, power-industry terms rarely occur, and the data is easy to collect and large in volume.
In addition, the first voice data is usually sentence-level speech and may be in a single language (such as Chinese) or in different languages (such as Chinese and English), which is not limited in this embodiment.
Step 102, acquiring terms belonging to the power industry.
In this embodiment, the terms belonging to the power industry may be collected in a general channel such as an open-source database and/or an open-source project, or in a website provided by the power industry.
The term belonging to the power industry may refer to a term of power specialization, and may be a phrase or a short sentence, for example, the reactive voltage control information AVC, the power distribution management system DMIS, the electric energy collection and billing system TMR, the distributed control system DCS, the unit electrical control system ECS, and the like.
Step 103, integrating the terms into the first text information to obtain second text information belonging to the power industry.
In this embodiment, a power-industry term is merged into the first text information of the non-power industry to form new text information. Because the new text information contains the power-industry term, it can be regarded as a sentence describing the power industry and is therefore recorded as second text information belonging to the power industry.
In one embodiment of the present invention, step 103 may comprise the steps of:
step 1031, dividing the first text information into a plurality of first keywords according to a grammar structure.
In this embodiment, the first text information may be subjected to natural language processing using a syntax tree, part-of-speech tags of a dependency tree, syntax edges, a tree structure, and the like, a syntax structure in the first text information, such as a subject-predicate, and the like, is identified, and each part-of-speech (such as a subject, a predicate, an object, and the like) in the syntax structure is mapped to a first keyword in the first text information.
Step 1032, determining a first length of the first keyword.
Count the minimum units (such as Chinese characters or letters) in the first keyword and set their number as the first length of the first keyword.
Step 1033, in the first text information, replacing the first keyword with a term that meets the target conditions, to obtain second text information belonging to the power industry.
In this embodiment, the first keyword in the first text information is traversed to find the first keyword meeting the target condition.
Wherein, the target conditions comprise the following two conditions:
1. The difference between the term's second length and the first keyword's first length is less than or equal to a first threshold.
The smallest units (such as Chinese characters, letters and the like) in the terms are counted, and the number of the smallest units is set as the second length of the terms.
The difference (in absolute value) between the second length of the term and the first length of the first keyword is calculated and compared to a preset first threshold.
If the difference is smaller than or equal to the first threshold, it means that the difference between the term and the first keyword is small, and at this time, if the term is substituted for the first keyword, the influence on the first voice data is small, and the authenticity of the second voice data generated based on the first voice data is maintained.
If the difference is greater than the first threshold, it indicates that the difference between the term and the first keyword is large, and at this time, if the term is substituted for the first keyword, the influence on the first voice data is large, and it is easy to distort the second voice data generated based on the first voice data.
2. The term is applicable to the grammatical structure.
If the part of speech of the term is consistent with that of the first keyword, for example both are nouns, then both can fill the same grammatical roles, so the term is applicable: replacing the first keyword with it yields second text information with a reasonable grammatical structure, which preserves the authenticity of the second text information.
If the term is a verb while the first keyword is a noun, the term can only act as a predicate whereas the first keyword acted as a subject or object, so the term is not applicable: replacing the first keyword with it would give the second text information an unreasonable grammatical structure and easily distort it.
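The two target conditions above can be sketched in code. This is an illustrative example, not the patent's implementation: the tokenization, the POS tags, and the threshold value are all assumptions made for the sketch.

```python
# Sketch of step 103's target conditions: replace a first keyword with a
# power-industry term only if (1) the length difference is within a first
# threshold and (2) the term fits the keyword's grammatical slot (same POS).
# Tokens, POS tags, and the threshold value are illustrative assumptions.

def substitute_term(tokens, pos_tags, term, term_pos, length_threshold=3):
    """Return the token list with the first suitable keyword replaced by
    the term, or None if no keyword satisfies the target conditions."""
    for i, (word, pos) in enumerate(zip(tokens, pos_tags)):
        # Condition 1: |second length - first length| <= first threshold.
        if abs(len(term) - len(word)) > length_threshold:
            continue
        # Condition 2: the term must suit the grammatical structure.
        if pos != term_pos:
            continue
        return tokens[:i] + [term] + tokens[i + 1:]
    return None

sentence = ["the", "computer", "is", "broken"]
tags = ["DET", "NOUN", "VERB", "ADJ"]
print(substitute_term(sentence, tags, "transformer", "NOUN"))
# -> ['the', 'transformer', 'is', 'broken']
```

For Chinese text the minimum units would be characters rather than letters, and the POS tags would come from the dependency parsing described in step 1031.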
Step 104, verifying the legality of the second text information for the power industry.
In this embodiment, the second text information generated by merging in a term may or may not be a sentence that actually occurs in the power industry. To ensure the accuracy of the corpus, the legality of the second text information for the power industry is checked, that is, whether the second text information is a legal sentence in the power industry.
For example, suppose the first text information is "the computer is broken" or "Xiao Ming's computer is broken" and the power-industry term is "transformer". Replacing the first keyword "computer" with the term "transformer" gives the second text information "the transformer is broken" and "Xiao Ming's transformer is broken". In the power industry a transformer does not belong to a particular person, so "the transformer is broken" is a legal sentence in the power industry while "Xiao Ming's transformer is broken" is not.
In one embodiment of the present invention, step 104 comprises the steps of:
and 1041, acquiring third text information belonging to the power industry.
In this embodiment, the third text information belonging to the power industry may be collected in a common channel such as an open-source database and/or an open-source project, or in a website provided by the power industry.
The third text information belonging to the power industry is usually a sentence and may be in a single language (such as Chinese) or in different languages (such as Chinese and English), which is not limited in this embodiment.
Step 1042, calculating the similarity between the second text information and the third text information.
In this embodiment, the similarity between the second text information and the third text information may be calculated by algorithms such as edit distance, TF-IDF (term frequency-inverse document frequency), simhash, LDA topic modeling, doc2vec, or word2vec; the similarity reflects, to some extent, whether the second text information would be spoken in the power industry.
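As one sketch of step 1042, the edit-distance option can be turned into a similarity score in a few lines. Normalizing the distance into [0, 1] is an assumption of this example; the patent equally allows TF-IDF, simhash, LDA, doc2vec, or word2vec.

```python
# Minimal sketch of step 1042: similarity between the second and third
# text information via normalized Levenshtein (edit) distance.

def edit_distance(a, b):
    """Classic Levenshtein distance with a single rolling DP row."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            # dp[j] (old row) = deletion, dp[j-1] (new row) = insertion,
            # prev = substitution (cost 0 if characters match).
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def similarity(second_text, third_text):
    """Map edit distance into [0, 1]: 1.0 means identical strings."""
    longest = max(len(second_text), len(third_text), 1)
    return 1.0 - edit_distance(second_text, third_text) / longest

print(similarity("the transformer is broken", "the transformer is broken"))
# -> 1.0
```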
Step 1043, calculating the distribution probability, in the third text information, of the grammatical structure containing the term in the second text information.
Take the grammatical structure containing the term in the second text information as a whole, consider how this whole is distributed in the third text information, and calculate its distribution probability there; the distribution probability reflects, to some extent, whether the second text information would be spoken in the power industry.
In one way of calculating the distribution probability, the third text information is segmented into words by string matching, so that second keywords can be queried in the third text information.
For each second keyword in the third text information, a dependence probability is counted: the ratio of a first word-frequency count to a second word-frequency count, where the first count is the number of times the current second keyword follows another second keyword in the third text information, and the second count is the total word-frequency count of that other second keyword in the third text information.
It is assumed that the occurrence of a second keyword depends only on the limited one or few second keywords that precede it. For a sentence T formed by the word sequence W1, W2, W3, ..., Wn, the probability that T is composed of this sequence is P(T) = P(W1 W2 W3 ... Wn) = P(W1) P(W2|W1) P(W3|W1 W2) ... P(Wn|W1 W2 ... Wn-1).
If the occurrence of a second keyword depends only on the single second keyword immediately preceding it, then P(T) = P(W1 W2 W3 ... Wn) ≈ P(W1) P(W2|W1) P(W3|W2) ... P(Wn|Wn-1).
The term and the first keyword are taken as representatives of the grammatical structure and compared against the pairs of second keywords that have a dependence relationship.
When the term is the same as the current second keyword and the first keyword is the same as the other second keyword, that dependence probability is set as the distribution probability, in the third text information, of the grammatical structure containing the term in the second text information; here the first keyword is the keyword that the term replaced in the first text information.
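The dependence probability above is exactly a bigram estimate P(Wn|Wn-1) counted from the third text information. A minimal sketch, assuming whitespace tokenization (Chinese text would instead need the string-matching word segmentation of the previous step):

```python
# Sketch of the bigram count behind step 1043: the dependence probability
# P(word | prev_word) is count(prev_word word) / count(prev_word ...).
from collections import Counter

def bigram_probability(corpus_sentences, prev_word, word):
    """Estimate P(word | prev_word) from a list of sentences."""
    pair_counts, prev_counts = Counter(), Counter()
    for sentence in corpus_sentences:
        tokens = sentence.split()
        for w1, w2 in zip(tokens, tokens[1:]):
            pair_counts[(w1, w2)] += 1   # first word-frequency count
            prev_counts[w1] += 1         # second word-frequency count
    if prev_counts[prev_word] == 0:
        return 0.0
    return pair_counts[(prev_word, word)] / prev_counts[prev_word]

corpus = ["the transformer is broken",
          "the transformer is overloaded",
          "the breaker is open"]
print(bigram_probability(corpus, "transformer", "is"))
# -> 1.0 (every occurrence of "transformer" is followed by "is")
```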
Step 1044, if the similarity is greater than or equal to a second threshold and the distribution probability is greater than or equal to a third threshold, determining that the second text information is legal for the power industry.
If the similarity is greater than or equal to the second threshold and the distribution probability is greater than or equal to the third threshold, the second text information has a high probability of appearing in the power industry, and it can be determined to be legal for the power industry.
Conversely, if the similarity is smaller than the second threshold and the distribution probability is smaller than the third threshold, the second text information has a low probability of appearing in the power industry, and it can be determined to be illegal for the power industry.
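The decision rule of step 1044 is then a conjunction of the two threshold tests. The concrete threshold values below are illustrative assumptions, not values taken from the patent:

```python
# Sketch of step 1044: second text information is accepted only when both
# the similarity and the distribution probability clear their thresholds.
# The second and third threshold values here are illustrative assumptions.

def is_legal_for_power_industry(similarity, distribution_probability,
                                second_threshold=0.6, third_threshold=0.05):
    """Return True when both measures meet their thresholds."""
    return (similarity >= second_threshold
            and distribution_probability >= third_threshold)

print(is_legal_for_power_industry(0.8, 0.2))   # -> True
print(is_legal_for_power_industry(0.4, 0.2))   # -> False
```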
Step 105, if the second text information is legal for the power industry, integrating the term into the first voice data to obtain second voice data belonging to the power industry.
If the second text information is legal for the power industry, new voice data can be formed on the basis of the first voice data by further referring to the pronunciation of the term; this new voice data is recorded as second voice data belonging to the power industry, and its content is the second text information.
In one embodiment of the present invention, step 105 comprises the steps of:
step 1051, inquiring a first voice signal with a first keyword as the content in the first voice data.
In this embodiment, the first keyword is a keyword replaced by a term in the first text information, the first voice data includes multiple frames of voice signals, each frame of voice signal is labeled with an associated word in advance, and at this time, a voice signal corresponding to a word forming the first keyword can be searched and recorded as the first voice signal.
Step 1052, determining a voice conversion model.
In this embodiment, a speech conversion model for converting text information into a voice signal, that is, for performing text-to-speech (TTS), may be trained in advance; its input is text information and its output is a voice signal.
Step 1053, call the speech conversion model to convert the term into a second speech signal.
And inputting the term (text information) of the power industry into the voice conversion model, processing the voice conversion model according to the logic of the voice conversion model, and outputting a voice signal which is recorded as a second voice signal, wherein the content of the second voice signal is the term of the power industry.
Further, some speech conversion models can learn the characteristics of a speaker from a small number of utterances (e.g., sentences a few seconds long); that is, they have a voice cloning capability, such as DurIAN and Deep Voice.
These speech conversion models usually take one of two basic approaches to voice cloning: speaker adaptation and speaker encoding. Through speaker embedding vectors, both approaches can be applied to a multi-speaker generative speech conversion model without degrading speech quality, and both achieve good naturalness and similarity to the original speaker even with a small amount of cloned audio.
Then, multiple pieces of third voice data with the same timbre as the first voice data, that is, speech uttered by the same speaker, together with fourth text information that is the content of each piece of third voice data, may be acquired from common channels such as open-source databases and/or open-source projects.
Using the first voice data and the third voice data as samples and the first text information and the fourth text information as labels, the parameters of the speech conversion model are updated in a supervised manner, so that the model synthesizes voice signals in that timbre; in other words, the speech conversion model gains the ability to clone the speaker of the first and third voice data.
With the speaker's timbre thus constrained, the term is input into the updated speech conversion model for processing, yielding a second voice signal with the same timbre as the first voice data, which helps ensure naturalness and authenticity when the second voice data is subsequently generated.
And 1054, replacing the first voice signal with the second voice signal in the first voice data to obtain second voice data belonging to the power industry.
In the first voice data, the second voice signal with the content of the term in the power industry replaces the first voice signal which represents the first keyword in the non-power industry to form new voice data, and the new voice data contains the term in the power industry and can be regarded as a sentence for describing the power industry, so the new voice data can be recorded as the second voice data belonging to the power industry.
Further, a first target signal and a second target signal may be determined in the second voice signal, where the first target signal is a multi-frame voice signal at the beginning of the sequence in the second voice signal, and the second target signal is a multi-frame voice signal at the end of the sequence in the second voice signal.
And determining a first reference signal and a second reference signal in the first voice data, wherein the first reference signal is a multi-frame voice signal adjacent to the first target signal in the first voice data, and the second reference signal is a multi-frame voice signal adjacent to the second target signal in the first voice data.
Smooth the first reference signal with the first target signal, and the second reference signal with the second target signal, with respect to audio parameters such as volume.
After the smoothing is finished, replace the first voice signal with the second voice signal to obtain second voice data belonging to the power industry; the second voice signal then transitions more smoothly into the other voice signals of the first voice data, which ensures the naturalness and authenticity of the second voice data.
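Steps 1051 to 1054 amount to splicing the synthesized term signal into the original utterance and smoothing the two boundaries. A minimal sketch over raw sample lists, where linear fade ramps stand in for the patent's volume smoothing and the sample indices and fade length are assumptions of the example:

```python
# Sketch of steps 1051-1054: splice the synthesized term signal into the
# original utterance, smoothing the boundaries. Linear fade-in/fade-out
# ramps here stand in for the patent's volume smoothing; indices and fade
# length are illustrative assumptions.

def splice_with_fades(original, start, end, replacement, fade=8):
    """Replace original[start:end] with `replacement`, applying linear
    ramps to the replacement's leading and trailing samples so the
    volume transition at each boundary is smooth."""
    rep = list(replacement)
    n = min(fade, len(rep) // 2)
    for k in range(n):
        w = (k + 1) / (n + 1)   # ramp rising from ~0 toward 1
        rep[k] *= w             # fade in at the leading boundary
        rep[-1 - k] *= w        # fade out at the trailing boundary
    return list(original[:start]) + rep + list(original[end:])

# Replace samples 5..10 of a 20-sample clip with a 10-sample term signal.
out = splice_with_fades([0.5] * 20, 5, 10, [1.0] * 10, fade=2)
print(len(out))  # -> 25 samples: 5 head + 10 replacement + 10 tail
```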
Step 106, training a speech recognition model with the second voice data as a sample and the second text information as a label, so as to convert voice data belonging to the power industry into text information.
In this embodiment, the second text information may be labeled as a label of the second voice data, which represents the actual content of the second voice data.
The second voice data is used as a sample, and the voice recognition model is trained under the supervision of the label, which improves the model's ability to learn from the second voice data and its generalization in the power industry, thereby improving the accuracy of voice recognition in the power industry and ensuring the accuracy of subsequent man-machine interaction.
The structure of the speech recognition model is not limited to a manually designed neural network, such as a CNN (Convolutional Neural Network) or an RNN (Recurrent Neural Network); it may also be a neural network optimized by model quantization, a neural network searched for the power industry by NAS (Neural Architecture Search), and the like, which is not limited in this embodiment.
Further, the speech recognition model may be trained from scratch, obtained by fine-tuning an existing third-party model, or updated by continual learning; the present embodiment is not limited thereto.
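The sample/label pairing in step 106 can be illustrated with a deliberately tiny stand-in model. The patent leaves the architecture open (CNN, RNN, NAS-searched, etc.), so the nearest-neighbour "model", the feature vectors, and all names below are illustrative assumptions, not the patented implementation.

```python
# Toy sketch of the supervised setup above: second voice data (here, feature
# vectors) are the samples, second text information the labels. A real system
# would train a neural network; this stub just memorises the pairs.
from math import dist

class NearestNeighbourASR:
    """Stand-in 'speech recognition model': memorises (features, text) pairs."""
    def __init__(self) -> None:
        self.samples: list[tuple[tuple[float, ...], str]] = []

    def train(self, second_voice_data: list[tuple[float, ...]],
              second_text_info: list[str]) -> None:
        # second voice data as samples, second text information as labels
        self.samples = list(zip(second_voice_data, second_text_info))

    def convert(self, features: tuple[float, ...]) -> str:
        # return the label of the closest memorised sample
        return min(self.samples, key=lambda s: dist(s[0], features))[1]
```

The interface mirrors the patent's usage: after training on (second voice data, second text information) pairs, `convert` maps new power-industry voice features to text.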
In this embodiment, first voice data belonging to a non-power industry and first text information serving as the content of the first voice data are acquired, and terms belonging to the power industry are acquired. The terms are merged into the first text information to obtain second text information belonging to the power industry, and the legality of the second text information for the power industry is checked. If the second text information is legal for the power industry, the terms are merged into the first voice data to obtain second voice data belonging to the power industry, and a voice recognition model is trained with the second voice data as samples and the second text information as labels, so as to convert voice data belonging to the power industry into text information. On the basis of corpora belonging to the non-power industry, corpora belonging to the power industry are constructed with reference to power industry terms, which guarantees the authenticity of the corpora. Since non-power-industry corpora are plentiful, a sufficient quantity of power industry corpora can be produced for training the speech recognition model. Manual corpus annotation is avoided, greatly reducing time consumption; and because the annotation of the non-power-industry corpora is accurate a priori, the annotation of the power industry corpora is also highly accurate, greatly reducing cost and improving efficiency.
Furthermore, the sufficient and accurate corpora in the power industry can train a high-performance voice recognition model, so that the accuracy of voice recognition in the power industry is improved, and the quality of man-machine interaction can be ensured by subsequently performing man-machine interaction based on a high-accuracy voice recognition result.
Example two
Fig. 2 is a flowchart of a human-computer interaction method according to a second embodiment of the present invention. The method is applicable to scenarios in which speech is recognized by a speech recognition model and human-computer interaction is performed in the power industry. The method may be executed by a human-computer interaction apparatus, which may be implemented in software and/or hardware and configured in a computer device. The computer device may be a user-side device, such as a personal computer, a mobile terminal, a service terminal deployed at a power site, or an intelligent robot; it may also be a server-side device, such as a server or a workstation. The method specifically includes the following steps:
step 201, receiving voice data representing problems in the power industry.
In this embodiment, the computer device may provide a user interface (UI) at a local or remote client, where the UI provides functions for consulting on problems in the power industry, such as an interaction interface of an intelligent robot or an interface for searching fault solutions. A user may operate the UI and speak when entering a question; the local or remote client then calls a microphone to record the speech, obtaining voice data representing a problem in the power industry.
Step 202, loading a speech recognition model.
In this embodiment, the speech recognition model may be trained in advance by the method of the first embodiment, and the structure and the parameters of the speech recognition model may be stored.
Step 203, calling a voice recognition model to convert the voice data into text information.
The voice data is input into the voice recognition model, and the voice recognition model performs voice recognition on the voice data according to its own logic and outputs text information.
Because the speech recognition model is customized and trained for the power industry, text information with high accuracy can be obtained when speech recognition is carried out on speech data representing problems in the power industry.
And step 204, inquiring answers for solving the problems represented by the text information in the power industry.
After the text information representing the user's question is obtained by speech recognition, the meaning of the question is understood and the user's intention is recognized. At this point, the question may be normalized and spell-checked for error correction, and then semantic analysis such as word segmentation, entity recognition, and syntactic analysis may be invoked to facilitate further processing of the query statement.
In this process, dialog state tracking (DST) is performed on the man-machine conversation to update the dialog state; DST records all of the user's chat history so far, together with the system actions that have occurred, and uses them to update the dialog state. The next policy, i.e., which service should handle this information, is determined according to the updated dialog state. After a service is activated, it also has its own dialog policy or flow to finally decide what information to return as a reply.
In the power industry, the answer search and feedback can be performed by the following three methods for questions posed by users:
(1) FAQ (frequenctly activated Questions, common problem)
The FAQ saves questions that users frequently ask and the related answers. For a question entered by the user, the answer may be looked up in the FAQ library. If a corresponding question can be found, the answer associated with it can be returned to the user directly, without complex processing such as question understanding, information retrieval, and answer extraction, thereby improving efficiency.
On the basis of the candidate question set in the FAQ library, an inverted index of the frequently-asked-question set is established to improve retrieval efficiency; meanwhile, similarity can be calculated by a semantics-based method to improve question-matching precision. That is, from user input to answer output, the FAQ system mainly performs answer retrieval, semantic matching, answer re-ranking, and similar processes.
(2) DocQA (extractive question answering)
Extractive question answering is also called the machine reading comprehension task: given a reference text and a corresponding question, the machine reads the reference text and extracts the answer to the question from it. Where appropriate, the extracted answers are integrated and condensed to form a concise, correct answer.
(3) KBQA (Knowledge Base Question Answering, i.e., question answering over a knowledge graph)
Question answering based on the knowledge graph uses the established domain knowledge graph of marketing and distribution networks, capturing the associations between knowledge entities. This gives the question-answering system reasoning capabilities, enabling it to answer more complex questions.
In the knowledge-graph-based question-answering process, the system parses the question, obtains the entities to be matched through methods such as Entity Linking and Relation Detection, and retrieves the corresponding answers from the knowledge base.
The method parses the query statement by named entity recognition and relation extraction and, following the state-transition idea, uses a model to generate a query subgraph from the parsing results. Finally, the query subgraph is used to obtain answers from the knowledge base. The query subgraph generation method used by the KBQA system can achieve state-of-the-art results on both single-hop and multi-hop queries.
In this embodiment, voice data representing a question in the power industry is received, a voice recognition model is loaded, the voice recognition model is called to convert the voice data into text information, and an answer for solving the question represented by the text information is queried in the power industry. This embodiment constructs power industry corpora from power industry terms on the basis of non-power-industry corpora, which guarantees the authenticity of the corpora. Since non-power-industry corpora are plentiful, a sufficient quantity of power industry corpora can be produced for training the voice recognition model; manual corpus annotation is avoided, greatly reducing time consumption; and because the annotation of the non-power-industry corpora is accurate a priori, the annotation of the power industry corpora is also highly accurate, greatly reducing cost and improving efficiency.
The sufficient and accurate power industry corpora can train a high-performance voice recognition model, improving the accuracy of voice recognition in the power industry; performing subsequent man-machine interaction based on high-accuracy voice recognition results ensures the quality of the interaction, so that answers to users' questions can be found accurately.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
EXAMPLE III
Fig. 3 is a block diagram of a structure of a training apparatus for a speech recognition model according to a third embodiment of the present invention, which may specifically include the following modules:
a training set acquisition module 301, configured to acquire first voice data belonging to a non-power industry and first text information serving as content of the first voice data;
a power term obtaining module 302 for obtaining terms belonging to the power industry;
a text information fusion module 303, configured to fuse the term into the first text information to obtain second text information belonging to the power industry;
a validity checking module 304, configured to check validity of the second text message for the power industry;
a voice data fusion module 305, configured to fuse the term into the first voice data to obtain second voice data belonging to the power industry if the second text information is legal for the power industry;
the model training module 306 is configured to train a speech recognition model by using the second speech data as a sample and the second text information as a tag, so as to convert the speech data belonging to the power industry into text information.
In an embodiment of the present invention, the text information fusion module 303 is further configured to:
dividing the first text information into a plurality of first keywords according to a syntactic structure;
determining a first length of the first keyword;
replacing the first keyword with the term meeting the target condition in the first text information to obtain second text information belonging to the power industry;
wherein the target condition comprises a difference between a second length of the term and the first length being less than or equal to a first threshold, the term being applicable to the syntax structure.
In an embodiment of the present invention, the validity checking module 304 is further configured to:
acquiring third text information belonging to the power industry;
calculating the similarity between the second text information and the third text information;
calculating the distribution probability of the grammar structure containing the terms in the second text information in the third text information;
and if the similarity is greater than or equal to a second threshold and the distribution probability is greater than or equal to a third threshold, determining that the second text information is legal for the power industry.
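The two-threshold legality check above can be sketched as follows. The patent does not fix the similarity measure, so word-level Jaccard similarity stands in for it here; the threshold values and inputs are illustrative assumptions.

```python
# Hedged sketch of the legality check: the second text information is legal for
# the power industry only if its similarity to reference power industry text
# (the third text information) reaches the second threshold AND the distribution
# probability of its term-bearing grammar structure reaches the third threshold.

SECOND_THRESHOLD = 0.2  # minimum similarity (illustrative)
THIRD_THRESHOLD = 0.1   # minimum distribution probability (illustrative)

def similarity(a: str, b: str) -> float:
    """Jaccard word overlap, a stand-in for the patent's similarity measure."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def is_legal(second_text: str, third_text: str, distribution_prob: float) -> bool:
    return (similarity(second_text, third_text) >= SECOND_THRESHOLD
            and distribution_prob >= THIRD_THRESHOLD)
```

Both conditions must hold; a sentence that merely shares vocabulary with power industry text but places the term in an implausible grammatical position is rejected via the distribution probability.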
In an embodiment of the present invention, the validity checking module 304 is further configured to:
querying a second keyword in the third text information;
counting the dependence probability of each second keyword in the third text information, wherein the dependence probability is a ratio of a first word frequency number to a second word frequency number, the first word frequency number is the word frequency number of the current second keyword appearing behind other second keywords in the third text information, and the second word frequency number is the total word frequency number of other second keywords in the third text information;
and when the term is the same as the current second keyword and the first keyword is the same as the other second keywords, setting the dependency probability as the distribution probability of the grammar structure containing the term in the second text information in the third text information, wherein the first keyword is the keyword replaced by the term in the first text information.
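The dependency-probability computation above can be sketched directly from its definition. The tokenization and corpus below are illustrative; the patent's "word frequency numbers" are realized here as adjacent-token counts.

```python
# Sketch of the distribution probability: the dependency probability of the
# current second keyword appearing after another second keyword is the count of
# that adjacent pair in the third text information (first word frequency number)
# divided by the total occurrences of the preceding keyword (second word
# frequency number).
from collections import Counter

def dependency_probability(third_text_tokens: list[str],
                           prev_kw: str, cur_kw: str) -> float:
    """Estimate P(cur_kw follows prev_kw) from adjacent-token counts."""
    pairs = Counter(zip(third_text_tokens, third_text_tokens[1:]))
    first_freq = pairs[(prev_kw, cur_kw)]            # cur_kw right after prev_kw
    second_freq = third_text_tokens.count(prev_kw)   # all occurrences of prev_kw
    return first_freq / second_freq if second_freq else 0.0
```

When the term matches the current second keyword and the replaced first keyword matches the preceding second keyword, this value is taken as the distribution probability used in the legality check.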
In an embodiment of the present invention, the voice data fusion module 305 is further configured to:
querying a first voice signal with first key words in the first voice data, wherein the first key words are key words replaced by the terms in the first text information;
determining a voice conversion model, wherein the voice conversion model is used for converting text information into a voice signal;
calling the voice conversion model to convert the term into a second voice signal;
and in the first voice data, replacing the first voice signal with the second voice signal to obtain second voice data belonging to the power industry.
In an embodiment of the present invention, the voice data fusion module 305 is further configured to:
acquiring third voice data and fourth text information serving as the content of the third voice data, wherein the tone of the third voice data is the same as that of the first voice data;
updating the voice conversion model by taking the first voice data and the third voice data as samples and the first text information and the fourth text information as labels, so that the voice conversion model is used for synthesizing the voice signals of the tone;
and under the condition of defining the tone color, inputting the terms into the voice conversion model after being updated for processing, and obtaining a second voice signal with the same tone color as the first voice data.
In an embodiment of the present invention, the voice data fusion module 305 is further configured to:
determining a first target signal and a second target signal in the second voice signal, wherein the first target signal is a multi-frame voice signal at the beginning of sequencing in the second voice signal, and the second target signal is a multi-frame voice signal at the end of sequencing in the second voice signal;
determining a first reference signal and a second reference signal in the first voice data, wherein the first reference signal is a multi-frame voice signal adjacent to the first target signal in the first voice data, and the second reference signal is a multi-frame voice signal adjacent to the second target signal in the first voice data;
smoothing the first reference signal and the first target signal, and the second reference signal and the second target signal, respectively;
and if the smoothing is finished, replacing the first voice signal with the second voice signal to obtain second voice data belonging to the power industry.
The training device of the speech recognition model provided by the embodiment of the invention can execute the training method of the speech recognition model provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
Example four
Fig. 4 is a block diagram of a human-computer interaction device according to a fourth embodiment of the present invention, which may specifically include the following modules:
a problem receiving module 401, configured to receive voice data representing a problem in the power industry;
a speech recognition model loading module 402 for loading a speech recognition model;
the speech recognition model can be trained by the method of the first embodiment or the device of the third embodiment.
A text conversion module 403, configured to invoke the speech recognition model to convert the speech data into text information;
a question query module 404, configured to query answers in the electric power industry for solving the questions represented by the text messages.
The man-machine interaction device provided by the embodiment of the invention can execute the man-machine interaction method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
EXAMPLE five
Fig. 5 is a schematic structural diagram of a computer device according to a fifth embodiment of the present invention. FIG. 5 illustrates a block diagram of an exemplary computer device 12 suitable for use in implementing embodiments of the present invention. The computer device 12 shown in FIG. 5 is only an example and should not bring any limitations to the functionality or scope of use of embodiments of the present invention.
As shown in FIG. 5, computer device 12 is in the form of a general purpose computing device. The components of computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
Computer device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. Computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 5, and commonly referred to as a "hard drive"). Although not shown in FIG. 5, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally carry out the functions and/or methodologies of the described embodiments of the invention.
Computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with computer device 12, and/or with any devices (e.g., network card, modem, etc.) that enable computer device 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Also, computer device 12 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via network adapter 20. As shown, network adapter 20 communicates with the other modules of computer device 12 via bus 18. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with computer device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processing unit 16 executes programs stored in the system memory 28 to execute various functional applications and data processing, such as a training method or a human-computer interaction method for a speech recognition model provided by an embodiment of the present invention.
EXAMPLE six
An embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements each process of the above-mentioned training method for a speech recognition model or the human-computer interaction method, and can achieve the same technical effect, and is not described herein again to avoid repetition.
A computer readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A method for training a speech recognition model, comprising:
acquiring first voice data belonging to a non-electric power industry and first text information serving as the content of the first voice data;
acquiring terms belonging to the power industry;
integrating the terms into the first text message to obtain a second text message belonging to the power industry;
verifying the legality of the second text information to the power industry;
if the second text information is legal for the power industry, the term is merged into the first voice data to obtain second voice data belonging to the power industry;
and training a voice recognition model by taking the second voice data as a sample and the second text information as a label so as to convert the voice data belonging to the power industry into text information.
2. The method of claim 1, wherein said merging said term into said first textual information to obtain a second textual information pertaining to said power industry comprises:
dividing the first text information into a plurality of first keywords according to a syntactic structure;
determining a first length of the first keyword;
replacing the first keyword with the term meeting the target condition in the first text message to obtain second text message belonging to the power industry;
wherein the target condition comprises a difference between a second length of the term and the first length being less than or equal to a first threshold, the term being applicable to the syntax structure.
3. The method of claim 1, wherein the verifying the legitimacy of the second textual information for the power industry comprises:
acquiring third text information belonging to the power industry;
calculating the similarity between the second text information and the third text information;
calculating the distribution probability of the grammar structure containing the terms in the second text information in the third text information;
and if the similarity is greater than or equal to a second threshold and the distribution probability is greater than or equal to a third threshold, determining that the second text information is legal for the power industry.
4. The method of claim 3, wherein the calculating the distribution probability of the grammar structure containing the terms in the second text information in the third text information comprises:
querying a second keyword in the third text information;
counting the dependence probability of each second keyword in the third text information, wherein the dependence probability is a ratio of a first word frequency number to a second word frequency number, the first word frequency number is the word frequency number of the current second keyword appearing behind other second keywords in the third text information, and the second word frequency number is the total word frequency number of other second keywords in the third text information;
and when the term is the same as the current second keyword and the first keyword is the same as the other second keywords, setting the dependency probability as the distribution probability of the grammar structure containing the term in the second text information in the third text information, wherein the first keyword is the keyword replaced by the term in the first text information.
5. The method according to any one of claims 1-4, wherein said merging said term into said first voice data, obtaining second voice data belonging to said electric power industry, comprises:
querying a first voice signal with first key words in the first voice data, wherein the first key words are key words replaced by the terms in the first text information;
determining a voice conversion model, wherein the voice conversion model is used for converting text information into a voice signal;
calling the voice conversion model to convert the term into a second voice signal;
and in the first voice data, replacing the first voice signal with the second voice signal to obtain second voice data belonging to the power industry.
6. The method of claim 5, wherein said invoking the speech conversion model to convert the term into a second speech signal comprises:
acquiring third voice data and fourth text information serving as the content of the third voice data, wherein the tone of the third voice data is the same as that of the first voice data;
updating the voice conversion model by taking the first voice data and the third voice data as samples and the first text information and the fourth text information as labels, so that the voice conversion model is used for synthesizing the voice signals of the tone;
and under the condition of defining the tone color, inputting the terms into the voice conversion model after being updated for processing, and obtaining a second voice signal with the same tone color as the first voice data.
7. The method according to claim 6, wherein the replacing the first voice signal with the second voice signal in the first voice data to obtain second voice data belonging to the power industry comprises:
determining a first target signal and a second target signal in the second voice signal, wherein the first target signal is a multi-frame voice signal at the beginning of sequencing in the second voice signal, and the second target signal is a multi-frame voice signal at the end of sequencing in the second voice signal;
determining a first reference signal and a second reference signal in the first voice data, wherein the first reference signal is a multi-frame voice signal adjacent to the first target signal in the first voice data, and the second reference signal is a multi-frame voice signal adjacent to the second target signal in the first voice data;
smoothing the first reference signal and the first target signal, and the second reference signal and the second target signal, respectively;
and if the smoothing is finished, replacing the first voice signal with the second voice signal to obtain second voice data belonging to the power industry.
8. A human-computer interaction method, comprising:
receiving voice data representing a problem in the power industry;
loading a speech recognition model trained by the method of any one of claims 1-7;
calling the voice recognition model to convert the voice data into text information;
querying, in the electric power industry, answers for solving the questions represented by the textual information.
9. A computer device, characterized in that the computer device comprises:
one or more processors;
a memory for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement a method of training a speech recognition model according to any one of claims 1-7 or a method of human-computer interaction according to claim 8.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method of training a speech recognition model according to any one of claims 1-7 or the human-computer interaction method according to claim 8.
CN202111054577.7A 2021-09-09 Training of speech recognition model, man-machine interaction method, equipment and storage medium Active CN113744737B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111054577.7A CN113744737B (en) 2021-09-09 Training of speech recognition model, man-machine interaction method, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113744737A 2021-12-03
CN113744737B 2024-06-11

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180197531A1 (en) * 2017-01-06 2018-07-12 International Business Machines Corporation Domain terminology expansion by sensitivity
CN110930993A (en) * 2018-09-20 2020-03-27 蔚来汽车有限公司 Specific field language model generation method and voice data labeling system
CN111554273A (en) * 2020-04-28 2020-08-18 华南理工大学 Method for selecting amplified corpora in voice keyword recognition
CN111968624A (en) * 2020-08-24 2020-11-20 平安科技(深圳)有限公司 Data construction method and device, electronic equipment and storage medium
CN112397054A (en) * 2020-12-17 2021-02-23 北京中电飞华通信有限公司 Power dispatching voice recognition method
CN112651238A (en) * 2020-12-28 2021-04-13 深圳壹账通智能科技有限公司 Training corpus expansion method and device and intention recognition model training method and device
CN112885352A (en) * 2021-01-26 2021-06-01 广东电网有限责任公司 Corpus construction method and device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
US10176804B2 (en) Analyzing textual data
CN114547329A (en) Method for establishing pre-training language model, semantic analysis method and device
CN112100354B (en) Man-machine conversation method, device, equipment and storage medium
KR102041621B1 (en) System for providing artificial intelligence based dialogue type corpus analyze service, and building method therefor
CN114580382A (en) Text error correction method and device
CN106570180A (en) Artificial intelligence based voice searching method and device
CN111177350A (en) Method, device and system for forming dialect of intelligent voice robot
CN113672708A (en) Language model training method, question and answer pair generation method, device and equipment
US11907665B2 (en) Method and system for processing user inputs using natural language processing
CN111930914A (en) Question generation method and device, electronic equipment and computer-readable storage medium
Dyriv et al. The user's psychological state identification based on Big Data analysis for person's electronic diary
US20220188525A1 (en) Dynamic, real-time collaboration enhancement
CN113535925A (en) Voice broadcasting method, device, equipment and storage medium
CN117332789A (en) Semantic analysis method and system for dialogue scene
WO2024077906A1 (en) Speech text generation method and apparatus, and training method and apparatus for speech text generation model
CN111968646A (en) Voice recognition method and device
CN117493548A (en) Text classification method, training method and training device for model
Zahariev et al. Semantic analysis of voice messages based on a formalized context
CN113744737B (en) Training of speech recognition model, man-machine interaction method, equipment and storage medium
CN110851572A (en) Session labeling method and device, storage medium and electronic equipment
CN112183114B (en) Model training and semantic integrity recognition method and device
CN115132182A (en) Data identification method, device and equipment and readable storage medium
CN115408500A (en) Question-answer consistency evaluation method and device, electronic equipment and medium
CN114492396A (en) Text error correction method for automobile proper nouns and readable storage medium
CN113744737A (en) Training of speech recognition model, man-machine interaction method, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant