CN113744737B - Training of speech recognition model, man-machine interaction method, equipment and storage medium - Google Patents


Info

Publication number
CN113744737B
CN113744737B (application number CN202111054577.7A)
Authority
CN
China
Prior art keywords
text information
voice
voice data
power industry
term
Prior art date
Legal status
Active
Application number
CN202111054577.7A
Other languages
Chinese (zh)
Other versions
CN113744737A (en)
Inventor
钟业荣
叶万余
阮国恒
江嘉铭
阮伟聪
张名捷
黄一捷
杨毅
倪进超
Current Assignee
Guangdong Power Grid Co Ltd
Qingyuan Power Supply Bureau of Guangdong Power Grid Co Ltd
Original Assignee
Guangdong Power Grid Co Ltd
Qingyuan Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Power Grid Co Ltd and Qingyuan Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority claimed from application CN202111054577.7A
Publication of CN113744737A
Application granted
Publication of CN113744737B
Legal status: Active
Anticipated expiration


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/26 — Speech to text systems
    • G10L15/02 — Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 — Speech classification or search
    • G10L15/16 — Speech classification or search using artificial neural networks
    • G10L15/28 — Constructional details of speech recognition systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

An embodiment of the invention provides a method, equipment, and a storage medium for training a speech recognition model and for man-machine interaction. The method comprises: acquiring first voice data belonging to a non-power industry and first text information that is the content of the first voice data; acquiring a term belonging to the power industry; merging the term into the first text information to obtain second text information belonging to the power industry; checking the validity of the second text information for the power industry; if the second text information is legal for the power industry, merging the term into the first voice data to obtain second voice data belonging to the power industry; and training a speech recognition model with the second voice data as a sample and the second text information as a label, so that the model converts voice data belonging to the power industry into text information. This eliminates manual corpus labeling, greatly reduces time consumption and cost, improves the labeling accuracy of the power-industry corpus, and greatly improves efficiency.

Description

Training of speech recognition model, man-machine interaction method, equipment and storage medium
Technical Field
Embodiments of the invention relate to the field of electric power technology, and in particular to a method, equipment, and a storage medium for training a speech recognition model and for man-machine interaction.
Background
With the development of deep learning, artificial intelligence technologies such as speech recognition and man-machine interaction have matured, and a large number of new-generation service applications based on artificial intelligence are widely used in industries such as electric power, finance, and the internet, becoming infrastructure for intelligent development as a whole.
Speech recognition is the basis of man-machine interaction, and training a speech recognition model for the power industry requires a large corpus; however, the corpus available in the power industry cannot compare with general-purpose corpora. If a speech recognition model is trained directly, disregarding dataset size as is sometimes done in academic settings, the insufficient corpus leads to poor model performance.
If the corpus is labeled manually, the huge demand makes the work time-consuming and error-prone, so the cost is high and the efficiency is low.
Disclosure of Invention
Embodiments of the invention provide a method, equipment, and a storage medium for training a speech recognition model and for man-machine interaction, which address the poor performance of power-industry speech recognition models caused by insufficient corpora.
In a first aspect, an embodiment of the present invention provides a training method for a speech recognition model, including:
acquiring first voice data belonging to a non-power industry and first text information that is the content of the first voice data;
acquiring a term belonging to the power industry;
merging the term into the first text information to obtain second text information belonging to the power industry;
checking the validity of the second text information for the power industry;
if the second text information is legal for the power industry, merging the term into the first voice data to obtain second voice data belonging to the power industry;
and training a speech recognition model with the second voice data as a sample and the second text information as a label, so that the model converts voice data belonging to the power industry into text information.
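The steps of the first aspect can be sketched as a small pipeline. The following Python sketch is purely illustrative: every helper (the placeholder legality check, the function names, the toy data) is an assumption of this sketch rather than the claimed method, and the audio-fusion step is deliberately stubbed out.

```python
# Illustrative sketch of the first-aspect pipeline; all helper logic is a
# simplified stand-in for the patent's steps, not the claimed implementation.

def fuse_term_into_text(term, text, keyword):
    # merge the power-industry term into the non-power text (third step)
    return text.replace(keyword, term)

def is_legal(text, reference_corpus):
    # placeholder validity check (fourth step): the fused sentence must share
    # at least one word with a known power-industry sentence
    return any(set(text.split()) & set(ref.split()) for ref in reference_corpus)

def build_power_corpus(pairs, term, keyword, reference_corpus):
    samples = []
    for voice, text in pairs:                      # first two steps are the inputs
        new_text = fuse_term_into_text(term, text, keyword)
        if is_legal(new_text, reference_corpus):
            # the fifth step would splice the term's synthesized audio into
            # `voice`; the original audio handle is kept here as a stand-in
            samples.append((voice, new_text))      # (sample, label) for training
    return samples

pairs = [("audio_001", "the computer is broken")]
corpus = build_power_corpus(pairs, "transformer", "computer",
                            ["the transformer tripped"])
print(corpus)  # -> [('audio_001', 'the transformer is broken')]
```

The validity check here is a crude word-overlap proxy; the description below replaces it with a similarity measure plus a distribution probability over a real power-industry corpus.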
In a second aspect, an embodiment of the present invention further provides a human-computer interaction method, including:
receiving voice data representing a question in the power industry;
loading a speech recognition model trained by the method of the first aspect;
invoking the speech recognition model to convert the voice data into text information;
and querying, within the power industry, an answer to the question represented by the text information.
In a third aspect, an embodiment of the present invention further provides a training device for a speech recognition model, including:
a training-set acquisition module, configured to acquire first voice data belonging to a non-power industry and first text information that is the content of the first voice data;
a power-term acquisition module, configured to acquire a term belonging to the power industry;
a text-information fusion module, configured to merge the term into the first text information to obtain second text information belonging to the power industry;
a validity-checking module, configured to check the validity of the second text information for the power industry;
a voice-data fusion module, configured to merge the term into the first voice data, if the second text information is legal for the power industry, to obtain second voice data belonging to the power industry;
and a model training module, configured to train a speech recognition model with the second voice data as a sample and the second text information as a label, so that the model converts voice data belonging to the power industry into text information.
In a fourth aspect, an embodiment of the present invention further provides a human-computer interaction device, including:
a question receiving module, configured to receive voice data representing a question in the power industry;
a speech recognition model loading module, configured to load a speech recognition model trained by the method of the first aspect or the device of the third aspect;
a text conversion module, configured to invoke the speech recognition model to convert the voice data into text information;
and a question querying module, configured to query, within the power industry, an answer to the question represented by the text information.
In a fifth aspect, an embodiment of the present invention further provides a computer device, including:
one or more processors; and
a memory for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of training a speech recognition model according to the first aspect or the man-machine interaction method according to the second aspect.
In a sixth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of training a speech recognition model according to the first aspect or the man-machine interaction method according to the second aspect.
In this embodiment, first voice data belonging to a non-power industry and first text information that is the content of the first voice data are obtained, and a term belonging to the power industry is obtained. The term is merged into the first text information to obtain second text information belonging to the power industry, and the validity of the second text information for the power industry is checked. If the second text information is legal for the power industry, the term is merged into the first voice data to obtain second voice data belonging to the power industry. The second voice data is then taken as a sample and the second text information as a label to train a speech recognition model that converts voice data belonging to the power industry into text information. By constructing a corpus from power-industry terms on the basis of a non-power-industry corpus, the authenticity of the corpus can be guaranteed.
Further, a high-performance speech recognition model can be trained on a sufficiently large and accurate power-industry corpus, improving the accuracy of speech recognition in the power industry; subsequent man-machine interaction is then performed on these high-accuracy recognition results, so the quality of the interaction can be ensured.
Drawings
Fig. 1 is a flowchart of a training method of a speech recognition model according to a first embodiment of the present invention;
Fig. 2 is a flowchart of a man-machine interaction method according to a second embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a training device for a speech recognition model according to a third embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a man-machine interaction device according to a fourth embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a computer device according to a fifth embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
Example 1
Fig. 1 is a flowchart of a method for training a speech recognition model according to a first embodiment of the present invention. The method is applicable to generating a power-industry corpus from a general-purpose corpus in order to train a speech recognition model. It may be performed by a training device for the speech recognition model, which may be implemented in software and/or hardware and configured in a computer device such as a personal computer, server, or workstation. The method specifically includes the following steps:
Step 101, first voice data belonging to a non-power industry and first text information serving as content of the first voice data are acquired.
To facilitate collecting a sufficiently large dataset, in this embodiment first voice data belonging to non-power industries, together with first text information that is the content of that voice data (i.e., the audio recorded while a speaker reads the text aloud), may be collected from general channels such as open-source databases and/or open-source projects.
Here, non-power industries are industries other than the power industry, for example education, sports, and media. Power-industry terms rarely appear in these industries, but their data are easy to collect and plentiful.
In addition, the first voice data is usually sentence-level voice data; it may be voice data in the same language (e.g., Chinese) or in different languages (e.g., Chinese and English), which is not limited in this embodiment.
Step 102, obtaining terminology belonging to the power industry.
In this embodiment, terms belonging to the power industry may be collected from general channels such as open-source databases and/or open-source projects, or from websites provided by the power industry.
Terms belonging to the power industry are terms of the electric power profession, which may be words or phrases, for example reactive voltage control (AVC), the power distribution management system (DMIS), the electric energy collection and billing system (TMR), the distributed control system (DCS), and the electric control system (ECS) of a generating unit.
Step 103, integrating the term into the first text information to obtain second text information belonging to the power industry.
In this embodiment, a term of the power industry is merged into the first text information of a non-power industry to form new text information. Because the new text information contains a power-industry term, it can be regarded as a sentence describing the power industry, and is therefore recorded as second text information belonging to the power industry.
In one embodiment of the present invention, step 103 may include the steps of:
Step 1031, dividing the first text information into a plurality of first keywords according to the grammar structure.
In this embodiment, natural language processing may be performed on the first text information using syntax trees, dependency trees with part-of-speech labels, syntactic edges, tree structures, and the like, to identify the grammatical structure of the first text information (for example, subject-predicate-object); each component of the grammatical structure (such as the subject, predicate, and object) is then mapped to a first keyword in the first text information.
Step 1032, determining a first length of the first keyword.
The minimum units (such as Chinese characters or letters) in the first keyword are counted, and their number is set as the first length of the first keyword.
Step 1033, replacing, in the first text information, a first keyword with a term that meets the target conditions, to obtain second text information belonging to the power industry.
In this embodiment, the first keywords in the first text information are traversed to find those that meet the target conditions.
The target conditions comprise the following two items:
1. The difference between the second length of the term and the first length is less than or equal to a first threshold.
The minimum units (e.g., Chinese characters or letters) in the term are counted, and their number is set as the second length of the term.
The difference (expressed as an absolute value) between the second length of the term and the first length of the first keyword is calculated and compared with a preset first threshold.
If the difference is less than or equal to the first threshold, the term and the first keyword differ little in length; replacing the first keyword with the term then has little influence on the first voice data, preserving the authenticity of the second voice data generated from it.
If the difference is greater than the first threshold, the term and the first keyword differ greatly in length; replacing the first keyword with the term then strongly influences the first voice data, and the second voice data subsequently generated from it is easily distorted.
2. The term fits the grammatical structure.
If the part of speech of the term is consistent with that of the first keyword (for example, both are nouns, so both can serve as a subject or an object), then the term can replace the first keyword: the resulting second text information has a reasonable grammatical structure, which ensures its authenticity.
If the part of speech of the term is inconsistent with that of the first keyword (for example, the term is a verb and the first keyword a noun, so the term can only serve as a predicate while the keyword serves as a subject or object), then using the term to replace the first keyword would yield second text information with an unreasonable grammatical structure that is easily distorted.
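Steps 1031 to 1033 can be illustrated with a short Python sketch. The toy part-of-speech dictionary and the threshold value below are assumptions of this sketch; a real implementation would obtain parts of speech from the dependency parsing described above, and the threshold would be tuned.

```python
# Sketch of the two target conditions of step 1033; the toy POS dictionary
# and the threshold are assumptions, not values from the patent.

POS = {"computer": "noun", "transformer": "noun", "repair": "verb"}

def replace_keyword(text, keyword, term, first_threshold=4):
    """Replace `keyword` with `term` only when both target conditions hold."""
    # condition 1: |second length - first length| <= first threshold
    if abs(len(term) - len(keyword)) > first_threshold:
        return None
    # condition 2: the term fits the grammatical slot (same part of speech)
    if POS.get(term) != POS.get(keyword):
        return None
    return text.replace(keyword, term)

print(replace_keyword("the computer is broken", "computer", "transformer"))
# -> 'the transformer is broken'
print(replace_keyword("the computer is broken", "computer", "repair"))
# -> None (a verb cannot fill a noun slot)
```

Returning `None` on failure lets a caller simply skip keyword-term pairs that do not satisfy the target conditions.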
Step 104, verifying the validity of the second text information for the power industry.
In this embodiment, second text information generated by fusing a term may or may not be a sentence that would actually appear in the power industry. To ensure the accuracy of the corpus, the validity of the second text information with respect to the power industry may be checked, that is, whether the second text information is a legal sentence in the power industry.
For example, let the first text information be "the computer is broken" and "Xiao Ming's computer is broken", and let the power-industry term be "transformer". Substituting the term "transformer" for the first keyword "computer" yields the second text information "the transformer is broken" and "Xiao Ming's transformer is broken". Since a transformer is not installed for an individual person, "the transformer is broken" is a legal sentence in the power industry, while "Xiao Ming's transformer is broken" is illegal.
In one embodiment of the present invention, step 104 includes the steps of:
step 1041, obtaining third text information belonging to the power industry.
In this embodiment, the third text information belonging to the power industry may be collected from general channels such as open-source databases and/or open-source projects, or from websites provided by the power industry.
The third text information belonging to the power industry is usually a sentence; it may be text information in the same language (e.g., Chinese) or in different languages (e.g., Chinese and English), which is not limited in this embodiment.
Step 1042, calculating the similarity between the second text information and the third text information.
In this embodiment, the similarity between the second text information and the third text information may be calculated by algorithms such as edit distance, TF-IDF (term frequency-inverse document frequency), SimHash, LDA topic modeling, doc2vec, or word2vec; this similarity reflects, to some extent, whether a user would say the second text information in the power industry.
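As one concrete choice among the similarity measures listed above, a normalized edit (Levenshtein) distance can be computed in a few lines. Normalizing by the longer string's length is an assumption of this sketch; the patent does not fix a particular normalization.

```python
# Normalized edit-distance similarity, one of the options mentioned above.

def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """1.0 for identical strings, approaching 0.0 for very different ones."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

print(edit_distance("kitten", "sitting"))  # -> 3
print(similarity("the transformer is broken", "the transformer tripped"))
```

The resulting score in [0, 1] can be compared directly against the second threshold of step 1044.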
Step 1043, calculating a distribution probability of the grammar structure containing the term in the second text information in the third text information.
Taking the grammatical structure containing the term in the second text information as a whole, the distribution of that whole in the third text information is considered, and its distribution probability in the third text information is calculated; this probability reflects, to some extent, whether a user would say the second text information in the power industry.
In one way of calculating the distribution probability, the third text information may be segmented into words, and second keywords in the third text information may be queried by methods such as string matching.
The dependency probability of each second keyword in the third text information is then counted. The dependency probability is the ratio of a first word frequency to a second word frequency, where the first word frequency is the number of times the current second keyword appears after other second keywords in the third text information, and the second word frequency is the total frequency of those other second keywords in the third text information.
It is assumed that the occurrence of a second keyword depends only on the limited one or several second keywords that precede it. For a sentence T composed of the word sequence W1, W2, W3, …, Wn, the probability of T is P(T) = P(W1 W2 W3 … Wn) = P(W1) P(W2|W1) P(W3|W1 W2) … P(Wn|W1 W2 … Wn-1).
If the occurrence of a second keyword depends only on the single second keyword immediately before it, this simplifies to P(T) = P(W1 W2 W3 … Wn) ≈ P(W1) P(W2|W1) P(W3|W2) … P(Wn|Wn-1).
As a representation of the grammatical structure, the term is compared with the current second keyword, and the first keyword with the other second keywords on which it depends.
When the term matches the current second keyword and the first keyword matches the other second keywords, the dependency probability is taken as the distribution probability, in the third text information, of the grammatical structure containing the term in the second text information; here the first keyword is the keyword replaced by the term in the first text information.
Step 1044, if the similarity is greater than or equal to the second threshold and the distribution probability is greater than or equal to the third threshold, determining that the second text information is legal for the power industry.
If the similarity is greater than or equal to the second threshold and the distribution probability is greater than or equal to the third threshold, the second text information has a high probability of appearing in the power industry, and it may be determined that the second text information is legal for the power industry.
Conversely, if the similarity is smaller than the second threshold or the distribution probability is smaller than the third threshold, the second text information has a low probability of appearing in the power industry, and it may be determined to be illegal for the power industry.
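Under the bigram assumption above, the dependency probability reduces to counting adjacent word pairs in the power-industry corpus. The toy corpus and function names below are assumptions of this sketch.

```python
# Bigram dependency probability of steps 1043-1044, on a toy corpus.
from collections import Counter

def bigram_model(sentences):
    """Count unigrams and adjacent word pairs in the third text information."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        words = s.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def dependency_probability(prev_word, word, unigrams, bigrams):
    """P(word | prev_word) = count(prev_word word) / count(prev_word)."""
    if unigrams[prev_word] == 0:
        return 0.0
    return bigrams[(prev_word, word)] / unigrams[prev_word]

third_text = ["the transformer is broken", "the transformer tripped"]
uni, bi = bigram_model(third_text)
print(dependency_probability("the", "transformer", uni, bi))  # -> 1.0
print(dependency_probability("transformer", "is", uni, bi))   # -> 0.5
```

The resulting probability is what gets compared against the third threshold; multiplying the dependency probabilities along a sentence gives the chain probability P(T) of the formulas above.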
Step 105, if the second text information is legal for the power industry, the term is fused into the first voice data, and the second voice data belonging to the power industry is obtained.
If the second text information is legal for the power industry, the pronunciation of the term can be used to form new voice data on the basis of the first voice data, recorded as second voice data belonging to the power industry, so that the content of the second voice data is the second text information.
In one embodiment of the present invention, step 105 includes the steps of:
step 1051, query the first voice data for a first voice signal with a first keyword.
In this embodiment, the first keyword is the keyword replaced by the term in the first text information. The first voice data comprises multiple frames of voice signals, each frame of which has been annotated in advance with its associated word; the voice signals corresponding to the words constituting the first keyword can therefore be located and recorded as the first voice signal.
Step 1052, determine a speech conversion model.
In this embodiment, a voice conversion model for converting text information into a voice signal, that is, for performing text-to-speech (TTS), may be trained in advance; its input is text information and its output is a voice signal.
Step 1053, invoking the speech conversion model to convert the term to a second speech signal.
The term (as text information) of the power industry is input into the voice conversion model, which processes it according to its own logic and outputs a voice signal, recorded as the second voice signal; the content of the second voice signal is thus the power-industry term.
Further, some voice conversion models can learn the characteristics of a speaker from a small number of utterances (e.g., sentences a few seconds long), i.e., they have a voice cloning function; examples include DurIAN and Deep Voice.
In these voice conversion models, two basic approaches are generally used to solve the voice cloning problem: speaker adaptation and speaker encoding. Both can be applied to a multi-speaker voice conversion model through speaker embedding vectors without degrading speech quality, and both achieve good naturalness and similarity to the original speaker even with a small amount of cloned audio.
Then, multiple pieces of third voice data, together with fourth text information that is the content of each piece of third voice data, may be obtained from general channels such as open-source databases and/or open-source projects, where the timbre of the third voice data is the same as that of the first voice data, that is, the third voice data and the first voice data are uttered by the same speaker.
Taking the first voice data and the third voice data as samples, and the first text information and the fourth text information as labels, the parameters of the voice conversion model are updated in a supervised manner so that it synthesizes voice signals in this timbre; the voice conversion model then has the function of cloning the speaker of the first and third voice data.
With the speaker's timbre thus constrained, the term is input into the updated voice conversion model to obtain a second voice signal with the same timbre as the first voice data, which ensures the naturalness and authenticity of the second voice data generated subsequently.
Step 1054, replacing the first voice signal with the second voice signal in the first voice data to obtain second voice data belonging to the power industry.
In the first voice data, the first voice signal representing a non-power-industry first keyword is replaced with the second voice signal whose content is the power-industry term, forming new voice data. Because the new voice data contains the power-industry term, it can be regarded as a sentence describing the power industry and is recorded as second voice data belonging to the power industry.
Further, a first target signal and a second target signal may be determined in the second voice signal, where the first target signal is the multi-frame voice signal at the beginning of the second voice signal and the second target signal is the multi-frame voice signal at the end of the second voice signal.
A first reference signal and a second reference signal are determined in the first voice data, where the first reference signal is the multi-frame voice signal adjacent to the first target signal, and the second reference signal is the multi-frame voice signal adjacent to the second target signal.
The first reference signal and the first target signal, and likewise the second reference signal and the second target signal, are smoothed with respect to audio parameters such as volume.
Once the smoothing is completed, the first voice signal is replaced with the second voice signal, yielding second voice data belonging to the power industry in which the second voice signal transitions smoothly into the other voice signals of the first voice data, ensuring the naturalness and authenticity of the second voice data.
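The splicing and smoothing of steps 1051 to 1054 can be sketched with plain lists of samples and a linear crossfade at each boundary. The fade length and the toy signals are assumptions of this sketch; a real system would operate on framed audio and smooth the reference/target signal pairs as described above.

```python
# Sketch of splicing the second voice signal into the first voice data with
# linear crossfades at both boundaries (the reference/target smoothing above).

def crossfade(a, b, fade):
    """Join sample lists a and b, blending a's last `fade` samples with b's first."""
    out = a[:-fade]
    for i in range(fade):
        w = (i + 1) / (fade + 1)              # ramp weight, 0 < w < 1
        out.append((1 - w) * a[len(a) - fade + i] + w * b[i])
    out.extend(b[fade:])
    return out

def splice(original, start, end, replacement, fade=2):
    """Replace original[start:end] with `replacement`, smoothing both joins."""
    head, tail = original[:start], original[end:]
    return crossfade(crossfade(head, replacement, fade), tail, fade)

# silence with a constant-amplitude "term" signal spliced into the middle
signal = splice([0.0] * 10, 3, 6, [1.0] * 6, fade=2)
print(signal)  # amplitude ramps up into the term segment and back down
```

The crossfade avoids the audible click that a hard cut between the first voice data and the synthesized term would produce.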
And 106, training a voice recognition model by taking the second voice data as a sample and the second text information as a label so as to convert the voice data belonging to the power industry into the text information.
In this embodiment, the second text information may be labeled as a tag of the second voice data, which indicates the real content of the second voice data.
And taking the second text information as a sample, performing supervised training on the voice recognition model under the supervision of the label, improving the learning ability of the voice recognition model on the second voice data, and improving the generalization ability of the voice recognition model in the power industry, thereby improving the accuracy of voice recognition in the power industry and ensuring the accuracy of subsequent man-machine interaction.
The structure of the speech recognition model is not limited to an artificially designed neural network, such as a CNN (Convolutional Neural Network) or an RNN (Recurrent Neural Network); it may also be a neural network optimized by a model quantization method, a neural network searched for problems in the power industry by a NAS (Neural Architecture Search) method, or the like, which is not limited in this embodiment.
Further, the speech recognition model may be trained from scratch, obtained by fine-tuning an existing third-party model, or improved by continuous learning, which is not limited in this embodiment.
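As a rough illustration of the supervised setup (sample = second voice data, label = second text information), the training loop might be organized as below; `model_step` is a hypothetical callable that consumes one (sample, label) pair, updates the model's parameters, and returns the loss — the real model, features, and loss function are not specified by the embodiment:

```python
def train(model_step, samples, labels, epochs=2):
    """Iterate over (voice-data sample, text-information label) pairs
    for several epochs and record the mean loss per epoch.

    model_step: hypothetical function (sample, label) -> loss that also
    performs the parameter update internally.
    """
    history = []
    for _ in range(epochs):
        losses = [model_step(x, y) for x, y in zip(samples, labels)]
        history.append(sum(losses) / len(losses))
    return history
```

A decreasing loss history would indicate that the model is fitting the power-industry corpus.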
In this embodiment, first voice data belonging to a non-power industry and first text information serving as the content of the first voice data are obtained, and terms belonging to the power industry are obtained. The terms are fused into the first text information to obtain second text information belonging to the power industry, and the validity of the second text information for the power industry is checked. If the second text information is legal for the power industry, the terms are fused into the first voice data to obtain second voice data belonging to the power industry. The second voice data is then taken as a sample and the second text information as a label to train a voice recognition model to convert voice data belonging to the power industry into text information. Because the corpus of the power industry is constructed on the basis of the non-power-industry corpus by referring to the terms of the power industry, the authenticity of the corpus can be guaranteed.
Further, a high-performance speech recognition model can be trained with the sufficient and accurate power-industry corpus, improving the accuracy of speech recognition in the power industry; subsequent man-machine interaction performed on the basis of the high-accuracy speech recognition result ensures the quality of the man-machine interaction.
Example two
Fig. 2 is a flowchart of a man-machine interaction method provided in a second embodiment of the present invention, where the method may be applicable to a situation that a person performs man-machine interaction by recognizing speech based on a speech recognition model in the power industry, and the method may be performed by a man-machine interaction device, where the man-machine interaction device may be implemented by software and/or hardware, and may be configured in a computer device, where the computer device may be a device on a user side, for example, a personal computer, a mobile terminal, a service terminal or an intelligent robot deployed in a power location, and the computer device may also be a device on a service side, for example, a server, a workstation, or the like, and specifically includes the following steps:
Step 201, voice data representing a problem in the power industry is received.
In this embodiment, the computer device may provide a User Interface (UI) at a local or remote client, where the UI provides a function of consulting a problem in the power industry, such as an interactive Interface of an intelligent robot, an interactive Interface for searching for a fault solution, etc., and a User may operate on the UI and speak during inputting the problem, and at this time, the local or remote client may call a microphone to record voice data, so as to obtain voice data representing the problem in the power industry.
Step 202, loading a voice recognition model.
In this embodiment, the voice recognition model may be trained in advance by the method of the first embodiment, and the structure and parameters thereof may be stored, and when the voice recognition model is applied, the voice recognition model and the parameters thereof may be loaded into the memory for operation.
Step 203, calling a voice recognition model to convert the voice data into text information.
The voice data is input into a voice recognition model, the voice recognition model carries out voice recognition on the voice data according to logic of the voice recognition model, and text information is output.
Since the voice recognition model is custom-trained for the power industry, text information with high accuracy can be obtained when voice recognition is performed on voice data representing a problem in the power industry.
Step 204, querying an answer for solving the problem represented by the text information in the power industry.
After speech recognition yields text information as the question asked by the user, the meaning of the question is understood and the user's intention is recognized. At this time, the question can be normalized and spell-corrected, and semantic recognition such as word segmentation, entity recognition, and syntactic analysis can be invoked to facilitate further processing of the query sentence.
In this process, dialogue state tracking (DST) is performed on the man-machine dialogue and the dialogue state is updated: the DST records all of the user's chat history so far and updates the dialogue state with the system behaviors that have occurred; the next policy is then determined based on the updated dialogue state, which decides which service the information is addressed to. After a certain service is activated, that service also has its own dialogue strategy or flow to finally decide what information to return as a reply.
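A minimal sketch of the dialogue-state update described above; the state representation and the field names (`history`, `last_action`) are assumptions for illustration, not the embodiment's actual data structure:

```python
def update_dialog_state(state, user_turn, system_action):
    """Return a new dialogue state: the user's turn is appended to the
    chat history and the most recent system behavior is recorded, so a
    downstream policy can decide the next action from the state alone."""
    new_state = dict(state)
    new_state["history"] = state.get("history", []) + [user_turn]
    new_state["last_action"] = system_action
    return new_state
```

Returning a fresh dict (rather than mutating in place) keeps each turn's state available for logging and replay.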
In the power industry, the following three methods may be adopted for searching and feeding back answers to questions posed by users:
(1) FAQ (Frequently Asked Questions)
The FAQ saves questions and related answers that the user frequently asks. For questions entered by the user, answers may be looked up in the FAQ library. If the corresponding question can be found, the answer corresponding to the question can be directly returned to the user without a plurality of complex processing procedures such as question understanding, information retrieval, answer extraction and the like, so that the efficiency is improved.
On the basis of the candidate question set in the FAQ library, an inverted index of the frequently asked questions is built to improve the system's retrieval efficiency; at the same time, similarity can be calculated by a semantic-based method to improve question-matching precision. That is, from user input to answer output, the FAQ system mainly goes through answer retrieval, semantic matching, answer re-ranking, and similar processes.
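The inverted-index retrieval step can be sketched in a few lines; token-overlap scoring stands in for the semantic matching mentioned above, which in practice would use embeddings or another similarity model:

```python
def build_inverted_index(faq):
    """Map each token to the set of ids of FAQ questions containing it."""
    index = {}
    for qid, question in enumerate(faq):
        for token in set(question.split()):
            index.setdefault(token, set()).add(qid)
    return index


def retrieve(index, faq, query):
    """Return candidate FAQ questions ranked by token overlap with the
    query (a crude stand-in for semantic similarity)."""
    scores = {}
    for token in set(query.split()):
        for qid in index.get(token, ()):
            scores[qid] = scores.get(qid, 0) + 1
    ranked = sorted(scores, key=lambda qid: -scores[qid])
    return [faq[qid] for qid in ranked]
```

Only questions sharing at least one token with the query are scored, which is what makes the inverted index cheaper than scanning the whole FAQ library.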
(2) DocQA (extractive question answering)
Extractive question answering may also be called a machine reading comprehension task: given a passage of reference text and a corresponding question, the machine reads the reference text and extracts the answer to the question from it. In specific cases, the extracted answers are integrated and refined to form a faithful, fluent, and accurate final answer.
(3) KBQA (Knowledge-Based Question Answering, knowledge graph question answering)
Question answering based on the knowledge graph uses the knowledge graph constructed in the marketing and distribution network field to capture the associations between the knowledge ontology and items of knowledge. This gives the question-answering system an inference capability, so it can answer some more complex questions.
The knowledge-graph question-answering process comprises systematic analysis of the question, obtaining the entities to be matched through entity linking (Entity Linking), relation extraction (Relation Detection), and the like, and then searching the knowledge base for the answer to the question.
The method analyzes query sentences by named entity recognition and relation extraction, and combines the analysis results with the idea of state transition to generate a query subgraph with a model. Finally, the query subgraph is used to obtain the answer from the knowledge base. This query-subgraph generation approach enables the KBQA system to achieve state-of-the-art results on both single-hop and multi-hop queries.
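To make single-hop versus multi-hop queries concrete, here is a toy knowledge base represented as a Python dict together with a relation-chain lookup; the entities, relation names, and dict representation are invented for illustration and say nothing about the actual graph store or query-subgraph model:

```python
def query(kb, start_entity, relations):
    """Follow a chain of relations from start_entity through a toy
    knowledge base (a dict of (entity, relation) -> entity/value).

    A single relation in the chain is a single-hop query; multiple
    relations form a multi-hop query. Returns None if any hop fails.
    """
    node = start_entity
    for rel in relations:
        node = kb.get((node, rel))
        if node is None:
            return None
    return node
```

A two-relation chain such as "which bureau operates the substation where transformer T1 is located" is answered by hopping through two edges in sequence.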
In this embodiment, voice data representing a problem in the power industry is received, a voice recognition model is loaded and called to convert the voice data into text information, and an answer for solving the problem represented by the text information is queried in the power industry. Because the corpus of the power industry is constructed by fusing terms of the power industry into a non-power-industry corpus, the authenticity of the corpus can be ensured; and because non-power-industry corpora are numerous, the resulting power-industry corpus is sufficient in quantity for training the voice recognition model. Manual labeling of the corpus is eliminated, greatly reducing time consumption and cost, while the prior labels of the non-power-industry corpus keep the labeling of the power-industry corpus accurate, so efficiency is greatly improved.
The sufficient and accurate power-industry corpus can train a high-performance voice recognition model, improving the accuracy of voice recognition in the power industry; subsequent man-machine interaction performed on the basis of the high-accuracy voice recognition result ensures the quality of the man-machine interaction, so an answer can be accurately found for the problem posed by the user.
It should be noted that, for simplicity of description, the method embodiments are shown as a series of acts, but it should be understood by those skilled in the art that the embodiments are not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments, and that the acts are not necessarily required by the embodiments of the invention.
Example III
Fig. 3 is a block diagram of a training device for a speech recognition model according to a third embodiment of the present invention, which may specifically include the following modules:
the training set acquisition module 301 is configured to acquire first voice data belonging to a non-power industry, and first text information serving as content of the first voice data;
a power term acquisition module 302, configured to acquire terms belonging to the power industry;
A text information fusion module 303, configured to fuse the term into the first text information to obtain second text information belonging to the power industry;
a validity checking module 304, configured to check validity of the second text information for the power industry;
A voice data fusion module 305, configured to fuse the term into the first voice data if the second text information is legal for the power industry, so as to obtain second voice data belonging to the power industry;
the model training module 306 is configured to train a speech recognition model by using the second speech data as a sample and the second text information as a tag, so as to convert the speech data belonging to the power industry into text information.
In one embodiment of the present invention, the text information fusion module 303 is further configured to:
dividing the first text information into a plurality of first keywords according to a grammar structure;
Determining a first length of the first keyword;
replacing the first keyword with the term meeting a target condition in the first text information to obtain second text information belonging to the power industry;
wherein the target condition includes a difference between a second length of the term and the first length being less than or equal to a first threshold, the term being applicable to the syntax structure.
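The length part of the target condition can be sketched as follows; the keyword list is assumed to come from the grammar-based segmentation, the threshold value is an invented example, and grammar-structure compatibility is assumed to have been checked separately:

```python
def fuse_term(keywords, term, max_len_diff=2):
    """Replace the first keyword whose length is within max_len_diff of
    the term's length, mirroring the target condition that the
    difference between the term's length and the keyword's length be
    at most a first threshold. Returns the keyword list unchanged if
    no keyword qualifies."""
    out = list(keywords)
    for i, kw in enumerate(out):
        if abs(len(term) - len(kw)) <= max_len_diff:
            out[i] = term
            return out
    return out
```

Constraining the length difference keeps the fused sentence's rhythm close to the original, which matters later when the term's audio is spliced into the first voice data.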
In one embodiment of the present invention, the validity checking module 304 is further configured to:
acquiring third text information belonging to the power industry;
calculating the similarity between the second text information and the third text information;
calculating a distribution probability of a grammar structure containing the term in the second text information in the third text information;
And if the similarity is greater than or equal to a second threshold value and the distribution probability is greater than or equal to a third threshold value, determining that the second text information is legal for the electric power industry.
In one embodiment of the present invention, the validity checking module 304 is further configured to:
inquiring a second keyword in the third text information;
counting the dependency probability of each second keyword in the third text information, wherein the dependency probability is the ratio of a first word frequency to a second word frequency, the first word frequency is the word frequency of the second keyword in the third text information after other second keywords appear, and the second word frequency is the total word frequency of other second keywords in the third text information;
When the term is the same as the current second keyword and the first keyword is the same as other second keywords, setting the dependency probability as the distribution probability of the grammar structure containing the term in the second text information in the third text information, wherein the first keyword is a keyword replaced by the term in the first text information.
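The dependency probability defined above — the ratio of the first word frequency (occurrences of the current second keyword after other second keywords) to the second word frequency (total occurrences of the other second keywords) — can be sketched directly; the token-sequence representation is an assumption for illustration:

```python
def dependency_probability(keyword_seq, current, others):
    """Ratio of how often `current` immediately follows a keyword in
    `others` to the total occurrences of `others` in the sequence,
    following the word-frequency definition in the text."""
    follows = sum(
        1
        for prev, nxt in zip(keyword_seq, keyword_seq[1:])
        if prev in others and nxt == current
    )
    total_others = sum(1 for kw in keyword_seq if kw in others)
    return follows / total_others if total_others else 0.0
```

A high ratio indicates that, in the power-industry reference corpus, the term plausibly occurs in the grammatical position where it was substituted.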
In one embodiment of the present invention, the voice data fusion module 305 is further configured to:
querying a first voice signal with content being a first keyword in the first voice data, wherein the first keyword is a keyword replaced by the term in the first text information;
Determining a voice conversion model, wherein the voice conversion model is used for converting text information into voice signals;
Invoking the speech conversion model to convert the term to a second speech signal;
And replacing the first voice signal with the second voice signal in the first voice data to obtain second voice data belonging to the power industry.
In one embodiment of the present invention, the voice data fusion module 305 is further configured to:
acquiring third voice data and fourth text information serving as the content of the third voice data, wherein the tone of the third voice data is the same as that of the first voice data;
Updating the voice conversion model by taking the first voice data and the third voice data as samples and the first text information and the fourth text information as labels, so that the voice conversion model is used for synthesizing voice signals of the tone;
processing the term input updated in the voice conversion model under the condition of limiting the tone color to obtain a second voice signal with the same tone color as the first voice data.
In one embodiment of the present invention, the voice data fusion module 305 is further configured to:
Determining a first target signal and a second target signal in the second voice signals, wherein the first target signal is a multi-frame voice signal of which the sequence starts in the second voice signals, and the second target signal is a multi-frame voice signal of which the sequence ends in the second voice signals;
Determining a first reference signal and a second reference signal in the first voice data, wherein the first reference signal is a multi-frame voice signal adjacent to the first target signal in the first voice data, and the second reference signal is a multi-frame voice signal adjacent to the second target signal in the first voice data;
smoothing the first reference signal and the first target signal, and smoothing the second reference signal and the second target signal respectively;
And if the smoothing processing is completed, replacing the first voice signal with the second voice signal to obtain second voice data belonging to the power industry.
The training device for the speech recognition model provided by the embodiment of the invention can execute the training method for the speech recognition model provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example IV
Fig. 4 is a structural block diagram of a man-machine interaction device according to a fourth embodiment of the present invention, which may specifically include the following modules:
A problem receiving module 401, configured to receive voice data representing a problem in the power industry;
A speech recognition model loading module 402 for loading a speech recognition model;
the speech recognition model may be trained by the method of embodiment one or the apparatus of embodiment three.
A text conversion module 403, configured to invoke the speech recognition model to convert the speech data into text information;
a question querying module 404, configured to query the power industry for an answer for solving the question represented by the text information.
The man-machine interaction device provided by the embodiment of the invention can execute the man-machine interaction method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example five
Fig. 5 is a schematic structural diagram of a computer device according to a fifth embodiment of the present invention. Fig. 5 illustrates a block diagram of an exemplary computer device 12 suitable for use in implementing embodiments of the present invention. The computer device 12 shown in fig. 5 is merely an example and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.
As shown in FIG. 5, the computer device 12 is in the form of a general purpose computing device. Components of computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, a bus 18 that connects the various system components, including the system memory 28 and the processing units 16.
Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Computer device 12 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. The computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in FIG. 5, commonly referred to as a "hard disk drive"). Although not shown in fig. 5, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In such cases, each drive may be coupled to bus 18 through one or more data medium interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored in, for example, memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 42 generally perform the functions and/or methods of the embodiments described herein.
The computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), one or more devices that enable a user to interact with the computer device 12, and/or any devices (e.g., network card, modem, etc.) that enable the computer device 12 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. Moreover, computer device 12 may also communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through network adapter 20. As shown, network adapter 20 communicates with other modules of computer device 12 via bus 18. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with computer device 12, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, for example, a training method or a man-machine interaction method for implementing a speech recognition model provided by an embodiment of the present invention.
Example six
The sixth embodiment of the present invention further provides a computer readable storage medium, on which a computer program is stored, where the computer program when executed by a processor implements each process of the foregoing training method or man-machine interaction method for a speech recognition model, and the same technical effects can be achieved, so that repetition is avoided, and no further description is given here.
The computer readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims (9)

1. A method for training a speech recognition model, comprising:
acquiring first voice data belonging to a non-power industry and first text information serving as the content of the first voice data;
acquiring terms belonging to the power industry;
Merging the term into the first text information to obtain second text information belonging to the power industry;
verifying the validity of the second text information for the power industry;
if the second text information is legal for the power industry, the term is fused into the first voice data, and second voice data belonging to the power industry is obtained;
training a voice recognition model by taking the second voice data as a sample and the second text information as a label so as to convert the voice data belonging to the electric power industry into text information;
wherein the verifying the validity of the second text information for the power industry includes:
acquiring third text information belonging to the power industry;
calculating the similarity between the second text information and the third text information;
calculating a distribution probability of a grammar structure containing the term in the second text information in the third text information;
And if the similarity is greater than or equal to a second threshold value and the distribution probability is greater than or equal to a third threshold value, determining that the second text information is legal for the electric power industry.
2. The method of claim 1, wherein said merging the term into the first text message to obtain a second text message belonging to the power industry comprises:
dividing the first text information into a plurality of first keywords according to a grammar structure;
Determining a first length of the first keyword;
replacing the first keyword with the term meeting a target condition in the first text information to obtain second text information belonging to the power industry;
wherein the target condition includes a difference between a second length of the term and the first length being less than or equal to a first threshold, the term being applicable to the syntax structure.
3. The method of claim 1, wherein the calculating the probability of the distribution of the grammar structure containing the term in the second text information in the third text information comprises:
inquiring a second keyword in the third text information;
counting the dependency probability of each second keyword in the third text information, wherein the dependency probability is the ratio of a first word frequency to a second word frequency, the first word frequency is the word frequency of the second keyword in the third text information after other second keywords appear, and the second word frequency is the total word frequency of other second keywords in the third text information;
When the term is the same as the current second keyword and the first keyword is the same as other second keywords, setting the dependency probability as the distribution probability of the grammar structure containing the term in the second text information in the third text information, wherein the first keyword is a keyword replaced by the term in the first text information.
4. A method according to any one of claims 1-3, wherein said merging said term into said first voice data to obtain second voice data belonging to said power industry comprises:
querying a first voice signal with content being a first keyword in the first voice data, wherein the first keyword is a keyword replaced by the term in the first text information;
Determining a voice conversion model, wherein the voice conversion model is used for converting text information into voice signals;
Invoking the speech conversion model to convert the term to a second speech signal;
And replacing the first voice signal with the second voice signal in the first voice data to obtain second voice data belonging to the power industry.
5. The method of claim 4, wherein said invoking the speech conversion model to convert the term to a second speech signal comprises:
acquiring third voice data and fourth text information serving as the content of the third voice data, wherein the tone of the third voice data is the same as that of the first voice data;
Updating the voice conversion model by taking the first voice data and the third voice data as samples and the first text information and the fourth text information as labels, so that the voice conversion model is used for synthesizing voice signals of the tone;
processing the term input updated in the voice conversion model under the condition of limiting the tone color to obtain a second voice signal with the same tone color as the first voice data.
6. The method of claim 5, wherein replacing the first voice signal with the second voice signal in the first voice data to obtain second voice data belonging to the power industry comprises:
Determining a first target signal and a second target signal in the second voice signals, wherein the first target signal is a multi-frame voice signal of which the sequence starts in the second voice signals, and the second target signal is a multi-frame voice signal of which the sequence ends in the second voice signals;
Determining a first reference signal and a second reference signal in the first voice data, wherein the first reference signal is a multi-frame voice signal adjacent to the first target signal in the first voice data, and the second reference signal is a multi-frame voice signal adjacent to the second target signal in the first voice data;
smoothing the first reference signal and the first target signal, and smoothing the second reference signal and the second target signal respectively;
And if the smoothing processing is completed, replacing the first voice signal with the second voice signal to obtain second voice data belonging to the power industry.
7. A human-computer interaction method, comprising:
receiving voice data representing a problem in the power industry;
loading a speech recognition model trained by the method of any one of claims 1-6;
invoking the voice recognition model to convert the voice data into text information;
and querying an answer for solving the problem represented by the text information in the power industry.
8. A computer device, the computer device comprising:
One or more processors;
A memory for storing one or more programs,
The one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of training a speech recognition model as claimed in any one of claims 1-6 or the method of human interaction as claimed in claim 7.
9. A computer readable storage medium, characterized in that the computer readable storage medium stores thereon a computer program, which when executed by a processor implements the training method of a speech recognition model according to any of claims 1-6 or the human-computer interaction method according to claim 7.
CN202111054577.7A 2021-09-09 2021-09-09 Training of speech recognition model, man-machine interaction method, equipment and storage medium Active CN113744737B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111054577.7A CN113744737B (en) 2021-09-09 2021-09-09 Training of speech recognition model, man-machine interaction method, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113744737A CN113744737A (en) 2021-12-03
CN113744737B true CN113744737B (en) 2024-06-11

Family

ID=78737459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111054577.7A Active CN113744737B (en) 2021-09-09 2021-09-09 Training of speech recognition model, man-machine interaction method, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113744737B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110930993A (en) * 2018-09-20 2020-03-27 蔚来汽车有限公司 Specific field language model generation method and voice data labeling system
CN111554273A (en) * 2020-04-28 2020-08-18 华南理工大学 Method for selecting amplified corpora in voice keyword recognition
CN111968624A (en) * 2020-08-24 2020-11-20 平安科技(深圳)有限公司 Data construction method and device, electronic equipment and storage medium
CN112397054A (en) * 2020-12-17 2021-02-23 北京中电飞华通信有限公司 Power dispatching voice recognition method
CN112651238A (en) * 2020-12-28 2021-04-13 深圳壹账通智能科技有限公司 Training corpus expansion method and device and intention recognition model training method and device
CN112885352A (en) * 2021-01-26 2021-06-01 广东电网有限责任公司 Corpus construction method and device, computer equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10032448B1 (en) * 2017-01-06 2018-07-24 International Business Machines Corporation Domain terminology expansion by sensitivity


Similar Documents

Publication Publication Date Title
US10176804B2 (en) Analyzing textual data
Wang et al. An overview of image caption generation methods
US20210210076A1 (en) Facilitating end-to-end communications with automated assistants in multiple languages
CN109146610B (en) Intelligent insurance recommendation method and device and intelligent insurance robot equipment
WO2022095380A1 (en) Ai-based virtual interaction model generation method and apparatus, computer device and storage medium
CN111708869B (en) Processing method and device for man-machine conversation
CN108304372A (en) Entity extraction method and apparatus, computer equipment and storage medium
CN114580382A (en) Text error correction method and device
WO2021218029A1 (en) Artificial intelligence-based interview method and apparatus, computer device, and storage medium
KR102041621B1 (en) System for providing artificial intelligence based dialogue type corpus analyze service, and building method therefor
Banerjee et al. A dataset for building code-mixed goal oriented conversation systems
CN106570180A (en) Artificial intelligence based voice searching method and device
US11907665B2 (en) Method and system for processing user inputs using natural language processing
CN111177350A (en) Method, device and system for forming dialect of intelligent voice robot
CN113672708A (en) Language model training method, question and answer pair generation method, device and equipment
CN111522924A (en) Emotional chat type reply generation method with theme perception
CN113392265A (en) Multimedia processing method, device and equipment
CN109992651B (en) Automatic identification and extraction method for problem target features
CN117332789A (en) Semantic analysis method and system for dialogue scene
Banerjee et al. Generating abstractive summaries from meeting transcripts
CN111968646A (en) Voice recognition method and device
CN116432653A (en) Method, device, storage medium and equipment for constructing multilingual database
CN115132182B (en) Data identification method, device, equipment and readable storage medium
Zahariev et al. Semantic analysis of voice messages based on a formalized context
CN117493548A (en) Text classification method, training method and training device for model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant