CN117877468A - Multi-modal voice rejection method and system for electric power human-machine interaction scenarios - Google Patents

Multi-modal voice rejection method and system for electric power human-machine interaction scenarios

Publication number: CN117877468A
Authority: CN (China)
Prior art keywords: voice, human, acquired, text, electric power
Legal status: Pending (assumed by Google; not a legal conclusion)
Application number: CN202311851194.1A
Other languages: Chinese (zh)
Inventors: 周逸聪 (Zhou Yicong), 龚梁 (Gong Liang), 钟刚 (Zhong Gang), 郭鹏程 (Guo Pengcheng), 胡华 (Hu Hua)
Current/Original Assignee: Wuhan Firehome Putian Information Technology Co., Ltd. (assignee listing assumed by Google; not verified)
Application filed by Wuhan Firehome Putian Information Technology Co., Ltd.
Priority: CN202311851194.1A
Publication: CN117877468A (legal status: Pending)

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The application provides a multi-modal voice rejection method and system for electric power human-machine interaction scenarios. The method comprises the following steps: collecting voice signals from the electric power human-machine interaction scenario; performing human-voice discrimination on the acquired voice signal; when the acquired voice signal is human voice, converting it into text in real time; and performing multi-modal fusion processing on the acquired text and the original voice to obtain multi-modal fusion features, according to which different multi-modal voice rejection strategies are executed. The method achieves higher rejection accuracy and stronger practicability; it effectively enhances the speech recognition performance of the voice assistant, improves voice interaction efficiency, facilitates continuous dialogue, and improves the user experience.

Description

Multi-modal voice rejection method and system for electric power human-machine interaction scenarios
Technical Field
The application relates to the field of information technology, and in particular to a multi-modal voice rejection method and system for electric power human-machine interaction scenarios.
Background
With the development of core human-machine interaction technologies such as speech recognition and natural language processing, and the continuous expansion of human-machine interaction application scenarios, rejection of irrelevant voice has become a key component of continuous dialogue in voice interaction. It distinguishes whether a voice signal or voice instruction is directed at the voice assistant and, by filtering out environmental noise, interfering audio, and any other human voice outside the designated instruction set, lets the user issue consecutive instructions without repeating the wake word, thereby improving the continuous-dialogue experience between the user and the voice assistant.
Accurately identifying valid voice signals and voice commands from the designated user instruction set is essential during intelligent interaction. In the intelligent human-machine interaction process, besides the user's directed voice instruction set, there are a large number of invalid human-voice signals, environmental noise, interfering audio, and other undirected voice instructions. If these irrelevant sounds cannot be identified and filtered effectively, the voice assistant will misrecognize and misoperate during interaction, harming the user's voice interaction experience. In short, effective rejection of invalid voice is essential for improving voice interaction efficiency and achieving the goal of continuous dialogue. By recognizing and filtering out background noise and other non-directed sounds, the speech recognition performance of the voice assistant can be enhanced and the user experience improved.
The prior art lacks an effective method for rejecting invalid voice in electric power human-machine interaction scenarios.
Disclosure of Invention
The application provides a multi-modal voice rejection method and system for electric power human-machine interaction scenarios, which solve the technical problem in the prior art of how to effectively reject invalid voice during voice interaction in such scenarios.
In a first aspect, the present application provides a multi-modal voice rejection method for an electric power human-machine interaction scenario, comprising the following steps:
collecting voice signals from the electric power human-machine interaction scenario;
performing human-voice discrimination on the acquired voice signal;
when the acquired voice signal is human voice, converting it into text in real time and acquiring the text;
and performing multi-modal fusion processing on the acquired text and the original voice to obtain multi-modal fusion features, according to which different multi-modal voice rejection strategies are executed.
With reference to the first aspect, in one implementation, the step of performing human-voice discrimination on the acquired voice signal specifically comprises:
constructing a non-human-voice classification model for the power scenario, comprising an energy-band-based non-human-voice discrimination model and a deep-learning-based non-human-voice discrimination model;
inputting the voice signal into the energy-band-based model to obtain a first non-human-voice probability;
inputting the voice signal into the deep-learning-based model to obtain a second non-human-voice probability;
calculating the non-human-voice probability of the acquired voice signal from the first and second non-human-voice probabilities;
comparing the non-human-voice probability with a non-human-voice probability threshold to obtain a comparison result;
and obtaining the human-voice discrimination result of the voice signal from the comparison result.
With reference to the first aspect, in one implementation, when the acquired voice signal is human voice, the step of converting it into text in real time and acquiring the text specifically comprises:
constructing a speech recognition model for converting power-domain speech into text;
when the acquired voice signal is human voice, inputting it into the constructed speech recognition model for real-time text conversion, and acquiring the text.
With reference to the first aspect, in one implementation, the step of constructing a speech recognition model for converting power-domain speech into text specifically comprises:
constructing a power voice instruction set;
and constructing the speech recognition model based on a power-domain speech corpus and the constructed power voice instruction set.
With reference to the first aspect, in one implementation, the power voice instruction set comprises instruction-set texts and the corresponding speech corpus.
With reference to the first aspect, in one implementation, the step of constructing the speech recognition model based on the power-domain speech corpus and the constructed power voice instruction set specifically comprises:
obtaining the recognition accuracy of each voice instruction in the power voice instruction set;
when the recognition accuracy of every voice instruction in the set is greater than an accuracy threshold, constructing the speech recognition model for converting power-domain speech into text based on the power-domain speech corpus and the constructed power voice instruction set.
With reference to the first aspect, in one implementation, the step of performing multi-modal fusion processing on the acquired text and the original voice and executing different rejection strategies specifically comprises:
constructing a multi-modal rejection model over the human-machine interaction text and voice of the power scenario;
inputting the text of the human voice and the original voice into the multi-modal rejection model, and performing real-time semantic extraction through a text encoder and a speech encoder to obtain text features and speech features;
inputting the text features and speech features into a multi-modal fusion module to obtain multi-modal fusion features;
and inputting the multi-modal fusion features into the classifier, which controls the execution of different voice rejection strategies according to the designated power voice instruction set.
In a second aspect, the present application provides a multi-modal voice rejection system for an electric power human-machine interaction scenario, comprising:
a signal acquisition module for acquiring voice signals of the electric power human-machine interaction scenario;
a human-voice discrimination module, communicatively connected with the signal acquisition module, for performing human-voice discrimination on the acquired voice signal;
a voice conversion module, communicatively connected with the human-voice discrimination module, for converting human voice into text in real time and acquiring the text when the acquired voice signal is human voice;
and a voice rejection module, communicatively connected with the signal acquisition module and the voice conversion module, for performing multi-modal fusion processing on the acquired text and the original voice, obtaining multi-modal fusion features, and executing different multi-modal voice rejection strategies accordingly.
With reference to the second aspect, in one implementation, the human-voice discrimination module comprises:
a model construction unit for constructing the power-scenario non-human-voice classification model, comprising an energy-band-based non-human-voice discrimination model and a deep-learning-based non-human-voice discrimination model;
a first probability acquisition unit, communicatively connected with the model construction unit and the signal acquisition module, for inputting the voice signal into the energy-band-based model to obtain a first non-human-voice probability;
a second probability acquisition unit, communicatively connected with the model construction unit and the signal acquisition module, for inputting the voice signal into the deep-learning-based model to obtain a second non-human-voice probability;
a non-human-voice probability acquisition unit, communicatively connected with the first and second probability acquisition units, for calculating the non-human-voice probability of the acquired voice signal from the first and second probabilities;
a comparison unit, communicatively connected with the non-human-voice probability acquisition unit, for comparing the non-human-voice probability with the non-human-voice probability threshold to obtain a comparison result;
and a discrimination result acquisition unit, communicatively connected with the comparison unit, for obtaining the human-voice discrimination result of the voice signal from the comparison result.
In a third aspect, an embodiment of the present application provides a computer-readable storage medium storing a multi-modal voice rejection program for an electric power human-machine interaction scenario; when the program is executed by a processor, it implements the steps of the multi-modal voice rejection method for the electric power human-machine interaction scenario described above.
The beneficial effects of the technical solution provided by the embodiments of the application include at least the following:
according to the multi-mode voice rejection method for the electric power man-machine interaction scene, the voice signals are acquired through voice discrimination, then the voice signals are converted into texts, the texts and original voices are subjected to multi-mode fusion processing, multi-mode fusion characteristics are acquired, accordingly, rejection of invalid voices is executed, compared with irrelevant voice rejection of single modes such as traditional voices or texts, the multi-mode irrelevant voice rejection technology does not need complex characteristic engineering, learning can be directly conducted from original voices and text modes, and due to the fact that multi-mode information is fused, higher rejection accuracy can be provided by multi-mode rejection models through complementation among the modes, and the multi-mode voice rejection method has higher practicability;
the voice recognition performance of the voice assistant is enhanced, the voice interaction efficiency is effectively improved, continuous dialogue is facilitated, and the user experience is improved.
Drawings
Fig. 1 is a schematic flow chart of the multi-modal voice rejection method for an electric power human-machine interaction scenario provided in an embodiment of the present application;
Fig. 2 is a schematic flow chart of the implementation of the energy-band-based non-human-voice discrimination model provided in an embodiment of the present application;
Fig. 3 is a schematic flow chart of the implementation of the deep-learning-based non-human-voice discrimination model provided in an embodiment of the present application;
Fig. 4 is a schematic flow chart of the implementation of the power-scenario human-machine interaction multi-modal rejection model provided in an embodiment of the present application;
Fig. 5 is a schematic diagram of the multi-modal voice rejection procedure for an electric power human-machine interaction scenario provided in an embodiment of the present application;
Fig. 6 is a functional block diagram of the multi-modal voice rejection system for an electric power human-machine interaction scenario provided in an embodiment of the present application.
Detailed Description
To help those skilled in the art better understand the solution of the present application, the technical solution in the embodiments is described below clearly and completely with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present application. All other embodiments obtained by one of ordinary skill in the art based on the embodiments herein without inventive effort fall within the scope of the present application.
The terms "comprising" and "having," and any variations thereof, in the description and claims of the present application and in the foregoing drawings are intended to cover non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to those listed steps or elements, but may include other steps or elements not listed or inherent to it. The terms "first," "second," "third," etc. are used to distinguish different objects rather than to describe a sequential or chronological order, and do not limit the objects so labeled to being different.
In the description of embodiments of the present application, "exemplary," "such as," or "for example," etc., are used to indicate an example, instance, or illustration. Any embodiment or design described herein as "exemplary," "such as" or "for example" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary," "such as" or "for example," etc., is intended to present related concepts in a concrete fashion.
In the description of the embodiments of the present application, unless otherwise indicated, "/" means "or"; for example, A/B may represent A or B. "And/or" merely describes an association relation between associated objects and indicates that three relations may exist; for example, A and/or B may indicate: A alone, both A and B, or B alone. In addition, "plural" means two or more.
In some of the processes described in the embodiments of the present application, a plurality of operations or steps occurring in a particular order are included, but it should be understood that these operations or steps may be performed out of the order in which they occur in the embodiments of the present application or in parallel, the sequence numbers of the operations merely serve to distinguish between the various operations, and the sequence numbers themselves do not represent any order of execution. In addition, the processes may include more or fewer operations, and the operations or steps may be performed in sequence or in parallel, and the operations or steps may be combined.
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
In a first aspect, referring to fig. 1, an embodiment of the present application provides a multi-modal voice rejection method for an electric power human-machine interaction scenario, comprising the following steps:
Step S1, acquiring a voice signal of the electric power human-machine interaction scenario;
Step S2, performing human-voice discrimination on the acquired voice signal;
Step S3, when the acquired voice signal is human voice, converting it into text in real time and acquiring the text;
Step S4, performing multi-modal fusion processing on the acquired text and the original voice to obtain multi-modal fusion features, according to which different multi-modal voice rejection strategies are executed.
According to this method, human-voice signals are obtained through human-voice discrimination and converted into text; the text and the original voice undergo multi-modal fusion processing to obtain multi-modal fusion features, on the basis of which invalid voice is rejected. This enhances the speech recognition performance of the voice assistant, effectively improves voice interaction efficiency, facilitates continuous dialogue, and improves the user experience.
In an embodiment, given that human voice in power human-machine interaction is rich in regional accents and the types of non-human-voice signals in the interaction environment are relatively concentrated, the acquired voice signal must undergo human-voice discrimination. Step S2, performing human-voice discrimination on the acquired voice signal, specifically comprises the following steps:
s21, constructing an electric power scene non-human voice classification model, wherein the electric power scene non-human voice classification model comprises an energy band-based non-human voice discrimination model A and a deep learning-based non-human voice discrimination model B;
s22, inputting a voice signal to a non-human voice discrimination model based on an energy band, and acquiring a first non-human voice probability alpha;
s23, inputting a voice signal to a non-human voice discrimination model based on deep learning, and acquiring a second non-human voice probability beta;
step S24, calculating the non-human voice probability lambda of the acquired voice signal according to the acquired first non-human voice probability alpha and second non-human voice probability beta;
s25, comparing the non-human voice probability with a non-human voice probability threshold value to obtain a comparison result;
and S26, acquiring a voice judgment result of the voice signal according to the acquired comparison result.
The power-scenario non-human-voice classification model is mainly used to eliminate non-human-voice signals during power-scenario human-machine interaction. When a voice signal is judged to be non-human voice, the interaction system rejects it; when it is determined to be human voice, the subsequent steps S3 and S4 are performed.
In an embodiment, as shown in fig. 2, step S22, inputting the voice signal into the energy-band-based non-human-voice discrimination model to obtain the first non-human-voice probability α, specifically comprises the following steps:
resampling the voice signal acquired by the microphone of the power-scenario human-machine interaction system to 8000 Hz before it enters the non-human-voice classification model; designing a group of six band-pass filters with pass bands of 80-250 Hz, 250-500 Hz, 500-1000 Hz, 1000-2000 Hz, 2000-3000 Hz, and 3000-4000 Hz; dividing the voice signal by frequency, computing the energy of each sub-band signal, and normalizing to obtain ε_i with 0 ≤ ε_i ≤ 1, where i ranges over [1, 6].
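The sub-band energy computation above can be sketched as follows. This is a minimal pure-Python illustration assuming the six pass bands and the 8000 Hz sample rate stated in the text; the naive DFT and the normalization by total band energy are illustrative choices, not the patent's implementation.

```python
# Hedged sketch: normalized sub-band energies epsilon_i for the six bands
# named in the text, computed from a power spectrum of an 8 kHz signal.
import cmath

BANDS_HZ = [(80, 250), (250, 500), (500, 1000),
            (1000, 2000), (2000, 3000), (3000, 4000)]

def power_spectrum(samples):
    # Naive O(N^2) DFT, for illustration only (an FFT would be used in practice).
    n = len(samples)
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t in range(n))
        spec.append(abs(acc) ** 2)
    return spec

def subband_energies(samples, sample_rate=8000):
    spec = power_spectrum(samples)
    n = len(samples)
    energies = []
    for lo, hi in BANDS_HZ:
        k_lo = int(lo * n / sample_rate)   # map band edges to DFT bins
        k_hi = int(hi * n / sample_rate)
        energies.append(sum(spec[k_lo:k_hi]))
    total = sum(energies) or 1.0
    return [e / total for e in energies]   # each epsilon_i lies in [0, 1]
```

For a pure 312.5 Hz tone the second band (250-500 Hz) dominates, which matches the intent of per-band energy analysis.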
In an embodiment, as shown in fig. 3, the deep-learning-based non-human-voice discrimination model B mainly comprises a SpecAugment layer, a Convolution layer, a Linear layer, a Dropout layer, and a Softmax layer. Step S23, inputting the voice signal into the deep-learning-based model to obtain the second non-human-voice probability β, specifically comprises the following steps:
resampling the voice signal acquired by the microphone of the power-scenario human-machine interaction system to 16000 Hz before it enters the model; enhancing the signal through the SpecAugment layer; passing the enhanced signal through the Convolution, Linear, and Dropout layers in turn; and finally producing a two-class output through the Softmax layer, whose non-human-voice probability is β.
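The layer sequence above can be sketched at inference time as a minimal pure-Python forward pass. All weights, shapes, and the class ordering are hypothetical; SpecAugment is a training-time augmentation and Dropout is inactive at inference, so both are omitted here.

```python
import math

def conv1d(frames, kernel):
    # frames: a 1-D feature sequence (list of floats); valid convolution
    k = len(kernel)
    return [sum(frames[i + j] * kernel[j] for j in range(k))
            for i in range(len(frames) - k + 1)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def non_human_probability(frames, kernel, w, b):
    # Convolution -> ReLU -> mean pooling -> Linear -> Softmax.
    conv = [max(0.0, x) for x in conv1d(frames, kernel)]
    pooled = sum(conv) / len(conv)
    logits = [w[0] * pooled + b[0], w[1] * pooled + b[1]]
    probs = softmax(logits)
    return probs[1]  # index 1 = "non-human voice" class (assumed ordering)
```

A production model would use a deep-learning framework with trained parameters; the point here is only the data flow from features to the two-class softmax output β.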
In an embodiment, step S24, calculating the non-human-voice probability λ of the acquired voice signal from the first non-human-voice probability α and the second non-human-voice probability β, specifically comprises the following steps:
transforming the acquired probabilities α and β according to the following formula to obtain λ:
λ = μ·α + (1 − μ)·β
where μ and (1 − μ) are the weights of the non-human-voice discrimination probabilities given by model A and model B, respectively, in the power-scenario non-human-voice classification model designed in this application.
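The weighted fusion and the subsequent threshold test (steps S25/S26) can be written directly from the formula. The default values of μ and τ below are placeholders; as the text notes, both are tuned against the field operating environment.

```python
def fuse_non_human_probability(alpha, beta, mu=0.6):
    """Weighted fusion: lambda = mu*alpha + (1 - mu)*beta.
    mu=0.6 is a placeholder weight, not a value from the patent."""
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    return mu * alpha + (1.0 - mu) * beta

def is_non_human(alpha, beta, mu=0.6, tau=0.5):
    # lambda > tau -> non-human voice (reject); otherwise treat as human voice
    return fuse_non_human_probability(alpha, beta, mu) > tau
```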
In an embodiment, step S26, obtaining the human-voice discrimination result of the voice signal from the comparison result, specifically comprises the following steps:
setting the non-human-voice discrimination threshold to τ;
Step S26A, when λ > τ, judging the acquired voice signal of the electric power human-machine interaction scenario to be non-human voice;
Step S26B, when λ ≤ τ, judging the acquired voice signal of the electric power human-machine interaction scenario to be human voice.
In practice, the values of μ and τ are tested and adjusted according to the field operating environment of the electric power human-machine interaction application system.
In an embodiment, step S3, converting human voice into text in real time when the acquired voice signal is human voice, specifically comprises the following steps:
Step S31, constructing a speech recognition model for converting power-domain speech into text;
Step S32, when the acquired voice signal is human voice, inputting it into the constructed speech recognition model for real-time text conversion and acquiring the text.
In an embodiment, step S31, constructing a speech recognition model for converting power-domain speech into text, specifically comprises the following steps:
Step S311, constructing a power voice instruction set;
Step S312, constructing the speech recognition model based on a power-domain speech corpus and the constructed power voice instruction set.
In an embodiment, the power voice instruction set comprises instruction-set texts and the corresponding speech corpus. The instruction texts must be clear and unambiguous, concise and free of redundancy, and must involve power-domain terminology; the corresponding speech corpus must be read clearly and naturally.
In an embodiment, step S312, constructing the speech recognition model based on the power-domain speech corpus and the constructed power voice instruction set, specifically comprises the following steps:
Step S3121, obtaining the recognition accuracy acc of each voice instruction in the power voice instruction set;
Step S3122A, when the recognition accuracy acc of every voice instruction in the set is greater than the accuracy threshold θ, constructing a speech recognition model (DL-ASR) for converting power-domain speech into text based on the power-domain speech corpus and the constructed power voice instruction set;
Step S3122B, when the recognition accuracy acc of any voice instruction in the set is not greater than θ, returning to step S311, redesigning the instruction, adding it to the designated power voice instruction set once the design is complete, and updating the set.
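The accuracy gate of steps S3122A/S3122B amounts to a simple filter over per-instruction accuracies. The threshold value and the instruction names below are hypothetical; the patent does not give a value for θ.

```python
ACC_THRESHOLD = 0.95  # placeholder for theta; not specified in the patent

def validate_instruction_set(accuracies):
    """Return the instructions whose recognition accuracy fails acc > theta,
    i.e. the ones step S3122B sends back for redesign."""
    return [cmd for cmd, acc in accuracies.items() if acc <= ACC_THRESHOLD]

def build_model_or_redesign(accuracies):
    failed = validate_instruction_set(accuracies)
    if failed:
        return ("redesign", failed)   # S3122B: redesign, update set, retry
    return ("build_dl_asr", [])       # S3122A: build the DL-ASR model
```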
In an embodiment, as shown in fig. 4, step S4, performing multi-modal fusion processing on the acquired text and the original voice to obtain multi-modal fusion features and executing different multi-modal voice rejection strategies accordingly, specifically comprises the following steps:
Step S41, constructing a multi-modal rejection model over the text and voice of power-scenario human-machine interaction;
Step S42, inputting the text of the human voice and the original voice into the multi-modal rejection model, and performing real-time semantic extraction through a text encoder and a speech encoder to obtain text features and speech features; specifically, extracting high-order acoustic features of dimension (N, X) through an acoustic high-order feature extractor and high-order text features of dimension (N, Y) through a text high-order feature extractor;
Step S43, inputting the text features and speech features into the multi-modal fusion module to obtain multi-modal fusion features;
Step S44, inputting the multi-modal fusion features into the classifier and executing different voice rejection strategies according to the designated power voice instruction set, so as to reject non-power voice interaction instructions.
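A minimal sketch of steps S42-S44, assuming feature concatenation as the fusion strategy (the patent only names a "multi-modal fusion module") and a single linear-plus-softmax classifier with placeholder weights.

```python
import math

def fuse_features(acoustic, textual):
    """Concatenate an utterance's acoustic feature vector (length X) with its
    text feature vector (length Y) into one fused vector of length X + Y.
    Concatenation is an assumed fusion strategy, not the patent's design."""
    return list(acoustic) + list(textual)

def classify(fused, weights, biases):
    # One linear layer followed by softmax over {accept, reject};
    # the weights are illustrative placeholders, not trained parameters.
    logits = [sum(w * f for w, f in zip(row, fused)) + b
              for row, b in zip(weights, biases)]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```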
Through the construction of four core components, namely the power-scenario non-human-voice classification model (step S21), the power voice instruction set (step S311), the speech recognition model for converting power-domain speech into text (step S312), and the power-scenario human-machine interaction text-and-voice multi-modal rejection model (step S41), the method achieves continuous dialogue over the designated power voice instructions in the power scenario. It learns directly from the raw speech and text modalities and fuses multi-modal information; through the complementarity of the modalities, the multi-modal rejection model provides higher precision and realizes efficient, accurate rejection of invalid voice.
In a more specific embodiment, as shown in fig. 5, the system determines whether the acquired human-voice signal is a voice instruction from the designated instruction set. If so, the voice signal is not rejected; if not, it is rejected. After the invalid-voice rejection decision for one round of dialogue ends, the decision is applied to the voice signal of the next round: the non-human-voice discrimination model and the multi-modal voice rejection model are executed cyclically to determine whether each new round's voice signal is human voice and a designated voice instruction.
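The per-round decision loop of fig. 5 can be sketched as follows; the three callbacks are hypothetical stand-ins for the non-human-voice classifier, the DL-ASR model, and the multi-modal rejection model.

```python
def rejection_pipeline(utterance, is_non_human, transcribe, is_directed):
    """One round of the per-utterance decision flow:
    non-human-voice check -> speech-to-text -> instruction-set check."""
    if is_non_human(utterance):
        return "reject: non-human voice"
    text = transcribe(utterance)
    if not is_directed(text, utterance):
        return "reject: not a designated power voice instruction"
    return "accept: " + text
```

Each new round of dialogue simply calls the pipeline again, which mirrors the cyclic execution described above.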
In a second aspect, please refer to fig. 6. The present application provides a multi-modal voice rejection system for an electric power human-machine interaction scenario, comprising a signal acquisition module 100, a human-voice discrimination module 200, a voice conversion module 300, and a voice rejection module 400. The signal acquisition module 100 is configured to acquire voice signals of the electric power human-machine interaction scenario; the human-voice discrimination module 200 is communicatively connected with the signal acquisition module 100 and performs human-voice discrimination on the acquired voice signal; the voice conversion module 300 is communicatively connected with the human-voice discrimination module 200 and converts human voice into text in real time, acquiring the text, when the acquired voice signal is human voice; the voice rejection module 400 is communicatively connected with the signal acquisition module 100 and the voice conversion module 300, performs multi-modal fusion processing on the acquired text and the original voice, obtains multi-modal fusion features, and executes different multi-modal voice rejection strategies accordingly.
In an embodiment, the voice discriminating module includes:
the model construction unit is used for constructing a power scene non-human voice classification model, which comprises an energy band-based non-human voice discrimination model and a deep learning-based non-human voice discrimination model;
the first probability acquisition unit is in communication connection with the model construction unit and the signal acquisition module and is used for inputting a voice signal to the non-human voice discrimination model based on the energy band to acquire a first non-human voice probability;
the second probability acquisition unit is in communication connection with the model construction unit and the signal acquisition module and is used for inputting a voice signal to a non-human voice discrimination model based on deep learning to acquire a second non-human voice probability;
the non-human voice probability acquisition unit is in communication connection with the first probability acquisition unit and the second probability acquisition unit and is used for calculating the non-human voice probability of acquiring the voice signal according to the acquired first non-human voice probability and second non-human voice probability;
the comparison unit is in communication connection with the non-human voice probability acquisition unit and is used for comparing the non-human voice probability with a non-human voice probability threshold value to acquire a comparison result;
and the judging result acquisition unit is in communication connection with the comparison unit and is used for acquiring the voice judging result of the voice signal according to the acquired comparison result.
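The description does not pin the fusion of the two non-human voice probabilities to a specific formula. One plausible reading is a weighted average followed by a threshold test, as in this hypothetical sketch, where the weights and the 0.5 threshold are assumptions, not values from the patent.

```python
def fuse_non_human_probability(p_energy: float, p_deep: float,
                               w_energy: float = 0.4,
                               w_deep: float = 0.6) -> float:
    """Combine the energy-band and deep-learning non-human voice
    probabilities; a weighted average is one plausible choice, with
    the weights treated as tunable assumptions."""
    return w_energy * p_energy + w_deep * p_deep

def is_non_human(p_energy: float, p_deep: float,
                 threshold: float = 0.5) -> bool:
    # The comparison unit's role: fused probability vs. threshold.
    return fuse_non_human_probability(p_energy, p_deep) > threshold
```

The discrimination result then follows directly from the comparison: a signal whose fused probability exceeds the threshold is treated as non-human voice and filtered out before speech recognition.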
In a third aspect, embodiments of the present application also provide a readable storage medium.
The readable storage medium of the present application stores a multi-modal voice rejection program for an electric power man-machine interaction scene; when the program is executed by a processor, it implements the steps of the multi-modal voice rejection method for the electric power man-machine interaction scene described above.
For the method implemented when the multi-modal voice rejection program for the electric power man-machine interaction scene is executed, reference may be made to the various embodiments of the multi-modal voice rejection method of the present application, which are not repeated here.
It should be noted that the foregoing embodiment numbers are for description only and do not represent the relative merits of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or by hardware alone; in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present application, or the part of it that contributes over the prior art, may be embodied in the form of a software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and comprising several instructions for causing a terminal device to perform the methods described in the various embodiments of the present application.
The foregoing description covers only preferred embodiments of the present application and is not intended to limit its patent scope; any equivalent structure or equivalent process transformation made using the description and drawings of the present application, whether applied directly or indirectly in other related technical fields, likewise falls within the patent protection scope of the present application.

Claims (9)

1. The multi-mode voice refusing method for the electric power man-machine interaction scene is characterized by comprising the following steps of:
collecting and acquiring voice signals of an electric power man-machine interaction scene;
performing voice discrimination on the acquired voice signal;
when the acquired voice signal is voice, converting the voice into text in real time, and acquiring the text;
and carrying out multi-modal fusion processing on the acquired text and the original voice, acquiring multi-modal fusion characteristics, and controlling and executing different multi-modal voice refusal strategies according to the acquired multi-modal fusion characteristics.
2. The multi-modal voice rejection method for the electric power man-machine interaction scene of claim 1, wherein the step of performing human voice discrimination on the acquired voice signal comprises the following steps:
constructing a power scene non-human voice classification model, wherein the power scene non-human voice classification model comprises an energy band-based non-human voice discrimination model and a deep learning-based non-human voice discrimination model;
inputting a voice signal to a non-human voice discrimination model based on an energy band, and acquiring a first non-human voice probability;
inputting a voice signal to a non-human voice discrimination model based on deep learning, and acquiring a second non-human voice probability;
according to the acquired first non-human voice probability and second non-human voice probability, calculating the non-human voice probability of the acquired voice signal;
comparing the non-human voice probability with a non-human voice probability threshold value to obtain a comparison result;
and obtaining a voice judgment result of the voice signal according to the obtained comparison result.
3. The multi-modal voice rejection method for the electric power man-machine interaction scene of claim 1, wherein the step of converting human voice into text in real time and acquiring the text when the acquired voice signal is human voice, and executing different rejection strategies according to the acquired recognition result, comprises the following steps:
constructing a voice recognition model for converting voice into characters in the electric power field;
when the acquired voice signal is voice, inputting voice to the constructed voice recognition model to perform real-time text conversion, and acquiring text.
4. The multi-modal voice rejection method for the electric power man-machine interaction scene of claim 3, wherein the step of constructing a speech recognition model for converting electric power domain speech into text specifically comprises the following steps:
constructing a power voice instruction set;
and constructing a voice recognition model for converting the voice in the electric power field into characters based on the voice corpus in the electric power field and the constructed electric power voice instruction set.
5. The method of claim 4, wherein the power speech instruction set includes instruction set text and corresponding speech corpus.
6. The multi-modal speech rejection method for electric power man-machine interaction scenario as in claim 4, wherein the step of constructing a speech recognition model for converting electric power domain speech into text based on electric power domain speech corpus and constructed electric power speech instruction set specifically comprises the steps of:
acquiring the recognition accuracy of each voice command in the electric power voice command set;
when the recognition accuracy of each voice instruction in the acquired power voice instruction set is larger than an accuracy threshold, a voice recognition model for converting the power field voice into characters is built based on the power field voice corpus and the built power voice instruction set.
7. The multi-modal voice rejection method for the electric power man-machine interaction scene of claim 1, wherein the step of performing multi-modal fusion processing on the acquired text and the original voice, acquiring multi-modal fusion features, and controlling execution of different multi-modal voice rejection strategies according to the acquired multi-modal fusion features comprises the following steps:
constructing a multi-mode refusing model of human-computer interaction text and voice of an electric power scene;
respectively inputting a text of voice and an original voice into a multi-modal rejection model, and carrying out real-time semantic extraction through a text encoder and voice encoding to obtain text features and voice features;
inputting text features and voice features into a multi-modal fusion module to obtain multi-modal fusion features;
inputting the multi-mode fusion characteristics to the classifier, and controlling and executing different voice refusal strategies according to the appointed power voice instruction set.
8. A multi-modal voice rejection system for an electric power man-machine interaction scene, comprising:
the signal acquisition module is used for acquiring voice signals of the electric power man-machine interaction scene;
the voice judging module is in communication connection with the signal acquisition module and is used for judging the voice of the acquired voice signal;
the voice conversion module is in communication connection with the voice judging module and is used for converting voice into text in real time when the acquired voice signal is voice and acquiring the text;
the voice refusing module is in communication connection with the signal acquisition module and the voice conversion module and is used for carrying out multi-mode fusion processing on the acquired text and the original voice, acquiring multi-mode fusion characteristics and controlling and executing different multi-mode voice refusing strategies according to the acquired multi-mode fusion characteristics.
9. The multi-modal voice rejection system for the electric power man-machine interaction scene of claim 8, wherein the voice discrimination module comprises:
the model construction unit is used for constructing a power scene non-human voice classification model, which comprises an energy band-based non-human voice discrimination model and a deep learning-based non-human voice discrimination model;
the first probability acquisition unit is in communication connection with the model construction unit and the signal acquisition module and is used for inputting a voice signal to the non-human voice discrimination model based on the energy band to acquire a first non-human voice probability;
the second probability acquisition unit is in communication connection with the model construction unit and the signal acquisition module and is used for inputting a voice signal to a non-human voice discrimination model based on deep learning to acquire a second non-human voice probability;
the non-human voice probability acquisition unit is in communication connection with the first probability acquisition unit and the second probability acquisition unit and is used for calculating the non-human voice probability of acquiring the voice signal according to the acquired first non-human voice probability and second non-human voice probability;
the comparison unit is in communication connection with the non-human voice probability acquisition unit and is used for comparing the non-human voice probability with a non-human voice probability threshold value to acquire a comparison result;
and the judging result acquisition unit is in communication connection with the comparison unit and is used for acquiring the voice judging result of the voice signal according to the acquired comparison result.
CN202311851194.1A 2023-12-27 2023-12-27 Multi-mode voice refusing method and system for electric power man-machine interaction scene Pending CN117877468A (en)

Publications (1)

Publication Number Publication Date
CN117877468A (en) 2024-04-12



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination