CN114077840A - Method, device, equipment and storage medium for optimizing voice conversation system - Google Patents

Publication number
CN114077840A
Authority
CN
China
Prior art keywords
audio data
voice
false
triggering
voice conversation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010825282.4A
Other languages
Chinese (zh)
Inventor
刘波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Volkswagen Mobvoi Beijing Information Technology Co Ltd
Original Assignee
Volkswagen Mobvoi Beijing Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Volkswagen Mobvoi Beijing Information Technology Co Ltd filed Critical Volkswagen Mobvoi Beijing Information Technology Co Ltd
Priority to CN202010825282.4A priority Critical patent/CN114077840A/en
Publication of CN114077840A publication Critical patent/CN114077840A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/4401 Bootstrapping
    • G06F9/4418 Suspend and resume; Hibernate and awake
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Acoustics & Sound (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Evolutionary Biology (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The embodiment of the invention discloses an optimization method, apparatus, device and storage medium for a voice conversation system. The optimization method comprises the following steps: controlling the working mode of the voice conversation system to be in a long monitoring mode, and collecting trigger audio data that triggers the voice conversation model to start working; identifying false trigger audio data among the trigger audio data; and optimizing the voice conversation model in the voice conversation system using the false trigger audio data. The scheme of the embodiment improves the usability of the voice conversation system in the long monitoring working mode and reduces false recognition.

Description

Method, device, equipment and storage medium for optimizing voice conversation system
Technical Field
The embodiment of the invention relates to a voice data processing technology, in particular to an optimization method, device, equipment and storage medium of a voice conversation system.
Background
With the continuous development of computer technology, voice conversation systems have come into wide use. For example, smart speakers, smart assistants, smart phones, and in-vehicle terminals all use voice dialog systems.
At present, to reduce false recognition by the voice conversation system (for example, treating a conversation between two people as an instruction addressed to the system), the listening duration of the voice conversation system is generally set to be short (for example, 1 second). As a result, the user often has to wake up the voice dialog system multiple times; for example, the system may already have entered sleep mode before a dialog has ended, so the user must wake it up again to continue the dialog.
Therefore, how to improve the usability of the voice dialog system in the long monitoring working mode while reducing false recognition is a problem that urgently needs to be solved.
Disclosure of Invention
The embodiment of the invention provides an optimization method, device, equipment and storage medium of a voice conversation system, which are used for improving the usability of the voice conversation system in a long monitoring working mode and reducing the situation of false recognition.
In a first aspect, an embodiment of the present invention provides an optimization method for a voice dialog system, including:
controlling the working mode of a voice conversation system to be in a long monitoring mode, and collecting triggering audio data for triggering the voice conversation model to start working;
identifying false trigger audio data in each trigger audio data;
and optimizing a voice conversation model in the voice conversation system by adopting the false triggering audio data.
In a second aspect, an embodiment of the present invention further provides an optimization apparatus for a voice dialog system, including:
the trigger audio data collection module is used for controlling the working mode of the voice conversation system to be in the long monitoring mode and collecting trigger audio data that triggers the voice conversation model to start working;
the false trigger audio data identification module is used for identifying false trigger audio data in each trigger audio data;
and the model optimization module is used for optimizing the voice dialogue model in the voice dialogue system by adopting the false triggering audio data.
In a third aspect, an embodiment of the present invention further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the method for optimizing a voice dialog system according to any embodiment of the present invention.
The embodiment of the invention controls the working mode of the voice conversation system to be in a long monitoring mode, and collects the triggering audio data for triggering the voice conversation model to start working; identifying false trigger audio data in each trigger audio data; by adopting the false triggering audio data to optimize the voice conversation model in the voice conversation system, the usability of the voice conversation system in a long monitoring working mode is improved, and the situation of false recognition is reduced.
Drawings
Fig. 1 is a flowchart of an optimization method of a voice dialog system according to a first embodiment of the present invention;
Fig. 2 is a flowchart of a method for optimizing a voice dialog system according to a second embodiment of the present invention;
Fig. 3 is a flowchart of a method for optimizing a voice dialog system according to a third embodiment of the present invention;
Fig. 4 is a flowchart of a method for optimizing a voice dialog system according to a third embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an optimizing apparatus of a voice dialog system in a fourth embodiment of the present invention;
Fig. 6 is a schematic structural diagram of an electronic device in a fifth embodiment of the present invention.
Detailed Description
The embodiments of the present invention will be described in further detail with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of and not restrictive on the broad invention. It should be further noted that, for convenience of description, only some structures, not all structures, relating to the embodiments of the present invention are shown in the drawings.
Example one
Fig. 1 is a flowchart of an optimization method for a voice dialog system in the first embodiment of the present invention. This embodiment is applicable to optimizing a voice dialog system. The method may be executed by an optimization apparatus for the voice dialog system; the apparatus may be implemented in software and/or hardware and integrated in an electronic device, which may be a vehicle-mounted terminal, a computer, or a smart phone. Referring to fig. 1, the method includes the following steps:
Step 110, controlling the working mode of the voice conversation system to be in a long monitoring mode, and collecting trigger audio data that triggers the voice conversation model to start working.
The voice dialog model according to the embodiment of the present invention can be configured in the voice dialog system. When the voice dialog system receives audio data from a user, it can recognize the received audio data through the voice dialog model, so as to determine the user's question and resolve it. It should be noted that the trigger audio data that triggers the voice dialog model to start working may be any audio data, for example "hello", "please open", or "play music", which is not limited in the embodiment of the present invention.
In an optional implementation manner of the embodiment of the present invention, controlling the working mode of the voice dialog system to be in the long monitoring mode may include: waking up the voice dialog system, and setting its awake duration to a preset long monitoring duration. For example, the voice dialog system may be woken up by a preset wake-up word, a preset wake-up action, or a button provided in the voice dialog system.
In an optional implementation manner of the embodiment of the present invention, after the voice dialog system is woken up, it can be placed in the long monitoring mode by controlling its working duration; for example, the working duration of the voice dialog system may be set to 10 minutes, 1 hour, or 2 hours, which is not limited in the embodiment of the present invention.
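As a minimal sketch of this idea, the long monitoring mode can be modelled as a listening window whose length is a configurable duration. The class name `ListeningWindow` and its methods are hypothetical and are not taken from the patent; timestamps are passed in explicitly so that the example is deterministic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ListeningWindow:
    """Tracks how long the system keeps listening after a wake-up event."""
    duration_s: float                  # preset monitoring duration in seconds
    woken_at: Optional[float] = None   # time of the last wake-up, None if asleep

    def wake(self, now: float) -> None:
        # Any wake-up source (wake-up word, action, button) would call this.
        self.woken_at = now

    def is_listening(self, now: float) -> bool:
        # The system listens only while the window is still open.
        return self.woken_at is not None and (now - self.woken_at) < self.duration_s


short_window = ListeningWindow(duration_s=1.0)        # conventional short window
long_window = ListeningWindow(duration_s=10 * 60.0)   # long monitoring: 10 minutes

short_window.wake(now=0.0)
long_window.wake(now=0.0)

print(short_window.is_listening(now=5.0))  # False: the 1-second window has closed
print(long_window.is_listening(now=5.0))   # True: still inside the 10-minute window
```

With the short window, a follow-up utterance five seconds later would require a new wake-up; with the long window it would be captured, which is also why more false triggers can occur.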
Step 120, identifying false trigger audio data among the trigger audio data.
The false trigger audio data may be audio data to which the voice dialog system does not need to respond, such as external chat audio or environmental noise.
In an optional implementation manner of the embodiment of the present invention, after the trigger audio data that triggers the voice dialog model to start working is collected, the false trigger audio data may be identified among the trigger audio data.
For example, speech recognition may be performed on each piece of collected trigger audio data and the recognition results analyzed; whether each piece of trigger audio data is false trigger audio data is then determined according to its recognition result. For example, if a piece of trigger audio data is recognized as "barking", it can be determined to be false trigger audio data.
Step 130, optimizing the voice conversation model in the voice conversation system using the false trigger audio data.
In an optional implementation manner of the embodiment of the present invention, after the false trigger audio data is identified, the collected false trigger audio data may be further used to optimize a speech dialogue model in the speech dialogue system. For example, if a large amount (e.g., 200 pieces) of false trigger audio data is collected, it may be determined that the voice dialog system needs to be optimized, and in particular, the voice dialog model in the voice dialog system may be optimized by using the collected 200 pieces of false trigger audio data.
In the scheme of the embodiment, the working mode of the voice conversation system is controlled to be in the long monitoring mode, and trigger audio data for triggering the voice conversation model to start working is collected; identifying false trigger audio data in each trigger audio data; by adopting the false triggering audio data to optimize the voice conversation model in the voice conversation system, the usability of the voice conversation system in a long monitoring working mode is improved, and the situation of false recognition is reduced.
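The three steps of this embodiment can be sketched as a small pipeline. This is only an illustration under stated assumptions: the function name `optimize_dialog_system`, the `is_false_trigger` and `retrain` callbacks, and the `min_batch` parameter are hypothetical, and the clips are stand-in byte strings rather than real audio.

```python
from typing import Callable, List, Sequence


def optimize_dialog_system(
    trigger_clips: Sequence[bytes],
    is_false_trigger: Callable[[bytes], bool],
    retrain: Callable[[List[bytes]], None],
    min_batch: int = 200,
) -> int:
    """Filter false triggers (step 120) and retrain once enough exist (step 130).

    Returns the number of false trigger clips that were found.
    """
    false_triggers = [clip for clip in trigger_clips if is_false_trigger(clip)]
    if len(false_triggers) >= min_batch:
        retrain(false_triggers)
    return len(false_triggers)


# Toy usage: the "retraining" callback just records what it was given.
retrained_on: List[bytes] = []
clips = [b"play music", b"dog barking", b"background chat"]
found = optimize_dialog_system(
    clips,
    is_false_trigger=lambda clip: clip in (b"dog barking", b"background chat"),
    retrain=retrained_on.extend,
    min_batch=2,
)
print(found, retrained_on)
```

The batch size of 200 mirrors the example figure in the text; it is a tunable value, not a requirement.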
Example two
Fig. 2 is a flowchart of an optimization method of a speech dialog system in the second embodiment of the present invention, which is a further refinement of the above technical solutions, and the technical solutions in this embodiment may be combined with various alternatives in one or more of the above embodiments. As shown in fig. 2, the optimization method of the voice dialog system may include the steps of:
Step 210, controlling the working mode of the voice dialogue system to be in a long monitoring mode, and collecting trigger audio data that triggers the voice dialogue model to start working.
Step 220, acquiring text information corresponding to each trigger audio data; calculating the semantic integrity of each text message; and when the semantic integrity degree of the target text information is smaller than a first set threshold value, determining the trigger audio data corresponding to the target text information as false trigger audio data.
In an optional implementation manner of the embodiment of the present invention, after the trigger audio data that triggers the voice dialog model to start working is collected, text information corresponding to each piece of trigger audio data may be further acquired. Illustratively, if three pieces of trigger audio data are collected, the text information of each of the three pieces can be recognized; for example, the recognized text information may be "weather", "How is the weather today?", or "little red", which is not limited in the embodiments of the present invention.
Furthermore, the semantic integrity of each piece of text information can be calculated; in the above example, this means calculating the semantic integrity of the three pieces of text information "weather", "How is the weather today?", and "little red". Illustratively, the semantic integrity of the three pieces may be 0.2, 0.9, and 0.15, respectively, which is not limited in the embodiment of the present invention.
Further, whether the semantic integrity of each piece of text information is smaller than a first set threshold can be determined; the first set threshold may be a value such as 0.4, 0.5, or 0.6, which is not limited in this embodiment.
In an optional implementation manner of this embodiment, when the semantic integrity of the target text information is smaller than the first set threshold, it may be determined that the trigger audio data corresponding to the target text information is false trigger audio data. The target text information is the text information corresponding to the target trigger audio data; the target trigger audio data may be one or more pieces of the collected trigger audio data, which is not limited in the embodiment of the present invention.
For example, if the semantic integrity of the target text information is 0.2 and the first set threshold is 0.4, it may be determined that the trigger audio data corresponding to the target text information is false trigger audio data.
In the above example, when the first set threshold is 0.4, the trigger audio data corresponding to the text information "weather" and "little red" may be determined to be false trigger audio data.
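The threshold comparison in the example above can be sketched as follows. The scores are the illustrative values from the text; the patent does not specify how semantic integrity is computed, so the scores are supplied as input here, and the function name `find_false_triggers` is hypothetical.

```python
from typing import Dict, List


def find_false_triggers(scores: Dict[str, float], threshold: float) -> List[str]:
    """Return the texts whose semantic integrity is below the first set threshold."""
    return [text for text, score in scores.items() if score < threshold]


# Semantic integrity per recognized text (illustrative values from the text).
scores = {
    "weather": 0.2,
    "How is the weather today?": 0.9,
    "little red": 0.15,
}

print(find_false_triggers(scores, threshold=0.4))
# ['weather', 'little red']
```

Raising the threshold to 0.5 or 0.6 would flag the same two texts in this example, since the complete question scores 0.9.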
Step 230, inputting the false trigger audio data as negative samples into the voice dialogue model in the voice dialogue system for training, to obtain an optimized voice dialogue model.
In an optional implementation manner of this embodiment, after determining that the trigger audio data corresponding to the target text information is false trigger audio data, each piece of false trigger audio data may be further input as a negative sample into the voice dialog model in the voice dialog system for training, so as to obtain an optimized voice dialog model.
For example, in the above example, if it is determined that the trigger audio data corresponding to the text information "weather" and "little red" is false trigger audio data, that trigger audio data may be input as negative samples into the voice dialog model, and the voice dialog model may be trained again, so as to obtain the optimized voice dialog model.
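One way to picture "input as negative samples" is to attach a negative label to each false trigger clip before retraining. The sketch below only assembles the labelled data; the `TrainingSample` type, the file names, and the 1/0 label convention are assumptions for illustration, and the actual training procedure is not described in the patent.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TrainingSample:
    audio_id: str   # hypothetical identifier of a stored audio clip
    label: int      # 1 = the system should respond, 0 = it should stay silent


def with_negative_samples(training_set: List[TrainingSample],
                          false_trigger_ids: List[str]) -> List[TrainingSample]:
    """Return a new training set extended with false triggers labelled 0."""
    negatives = [TrainingSample(audio_id, 0) for audio_id in false_trigger_ids]
    return training_set + negatives


existing = [TrainingSample("how_is_the_weather_today.wav", 1)]
augmented = with_negative_samples(existing, ["weather.wav", "little_red.wav"])
print([(sample.audio_id, sample.label) for sample in augmented])
```

The original list is left unchanged, so the same baseline set can be reused for later optimization rounds.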
In the scheme of the embodiment, the text information corresponding to each trigger audio data is acquired; calculating the semantic integrity of each text message; when the semantic integrity degree of the target text information is smaller than a first set threshold, determining that the trigger audio data corresponding to the target text information is false trigger audio data; and inputting the false trigger audio data serving as negative samples into a voice dialogue model in the voice dialogue system for training to obtain an optimized voice dialogue model, so that the training of the voice dialogue model in the voice dialogue system is realized, and a basis is provided for improving the usability of the voice dialogue system in a long monitoring working mode.
Example three
Fig. 3 is a flowchart of an optimization method of a speech dialog system in a third embodiment of the present invention, which is a further refinement of the above technical solutions, and the technical solutions in this embodiment may be combined with various alternatives in one or more of the above embodiments. As shown in fig. 3, the optimization method of the voice dialog system may include the steps of:
Step 310, controlling the working mode of the voice conversation system to be in a long monitoring mode, and collecting trigger audio data that triggers the voice conversation model to start working.
Step 320, continuously playing noise audio data in the environment of the voice conversation system.
The noise audio data does not include audio data capable of triggering the voice dialogue system. For example, the noise audio data may be pre-recorded audio data from various scenes that is played on a loop, such as audio generated during conversations between users, car horns on the road, or the sounds of various animals, which is not limited in the embodiments of the present invention.
In an optional implementation manner of the embodiment of the present invention, while the working mode of the voice dialogue system is controlled to be in the long monitoring mode, the pre-recorded noise audio data can be continuously played in the environment where the voice dialogue system is located; the advantage of this is that false trigger audio data can be determined with greater certainty, providing a basis for optimizing the voice dialog model.
Step 330, identifying false trigger audio data among the trigger audio data.
Step 340, optimizing the voice dialogue model in the voice dialogue system using the false trigger audio data.
Step 350, continuously playing the noise audio data in the environment of the voice conversation system.
In an optional implementation manner of the embodiment of the present invention, after the voice dialog model to be optimized is optimized, the pre-recorded noise audio data may be continuously played in the environment where the voice dialog system is located; this has the advantage that a test of the performance of the optimized speech dialogue model can be carried out to determine whether further optimization of the speech dialogue model is required.
Step 360, calculating the false trigger frequency of the voice conversation system according to how the played noise audio data triggers the voice conversation system.
Furthermore, the frequency with which the voice dialogue system is falsely triggered can be calculated according to how the played noise audio data triggers the voice dialogue system. It should be noted that, since the noise audio data does not include audio data capable of triggering the voice dialog system, the frequency at which the voice dialog system is triggered during this process (that is, while the noise audio data is continuously played in the environment of the voice dialog system) is the frequency at which it is falsely triggered.
Step 370, when the false trigger frequency of the voice dialog system is greater than or equal to the second set threshold, returning to the operation of controlling the working mode of the voice dialog system to be in the long monitoring mode and collecting trigger audio data that triggers the voice dialog model to start working, until the false trigger frequency of the voice dialog system is less than the second set threshold.
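Because every trigger during noise playback counts as a false trigger, the calculation reduces to counting trigger events inside the playback interval. The sketch below assumes that trigger events are logged as timestamps in seconds; the function name and the log format are hypothetical.

```python
from typing import Iterable


def false_trigger_count(trigger_times: Iterable[float],
                        playback_start: float,
                        playback_end: float) -> int:
    """Count triggers that occurred while the noise audio was playing."""
    return sum(1 for t in trigger_times if playback_start <= t < playback_end)


# Trigger timestamps (seconds) logged by the system; noise played from 0 s to 120 s.
logged_triggers = [3.0, 40.5, 75.2, 130.0]
print(false_trigger_count(logged_triggers, playback_start=0.0, playback_end=120.0))
# 3
```

The trigger at 130.0 s falls outside the playback interval and is therefore not counted.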
The second set threshold may be 20, 50, or 100, which is not limited in the embodiment of the present invention.
In an optional implementation manner of this embodiment, after the false trigger frequency of the voice dialog system is calculated, it may be further compared with the second set threshold. When the false trigger frequency of the voice dialog system is greater than or equal to the second set threshold (e.g., 100), the method may return to the operation of controlling the working mode of the voice dialog system to be in the long monitoring mode and collecting trigger audio data that triggers the voice dialog model to start working, until the false trigger frequency of the voice dialog system is less than the second set threshold.
It should be noted that, when the false triggering frequency of the voice dialog system is greater than or equal to the second set threshold, it may be considered that the triggering accuracy of the model of the voice dialog in the voice dialog system does not reach the set standard, and it is also necessary to continuously optimize the model.
The advantage of this arrangement is that the usability of the voice dialogue system in the long monitoring operation mode can be further improved, and the situation of false recognition can be reduced.
In an optional implementation manner of the embodiment of the present invention, after calculating the false trigger frequency of the voice dialog system, the method may further include: when the false trigger frequency of the voice conversation system is less than the second set threshold, bringing the voice conversation system online in the long monitoring mode.
For example, if the false trigger frequency of the voice dialog system is 1, which is less than the second set threshold of 100, it may be determined that the voice dialog model in the voice dialog system meets the set requirement. The voice dialog system can then be brought online, with a low risk of false triggering in the long monitoring working mode.
According to the scheme of this embodiment, the noise audio data is continuously played in the environment where the voice conversation system is located; the false trigger frequency of the voice dialogue system is calculated according to how the played noise audio data triggers the voice dialogue system; and when the false trigger frequency is greater than or equal to the second set threshold, the method returns to the operation of controlling the working mode of the voice conversation system to be in the long monitoring mode and collecting trigger audio data that triggers the voice conversation model to start working, until the false trigger frequency is less than the second set threshold. In this way, the usability of the voice conversation system in the long monitoring working mode can be further improved and false recognition reduced.
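The loop formed by steps 310 to 370 can be sketched as below. The callbacks `collect_and_retrain` and `run_noise_test` are hypothetical stand-ins for the collection/optimization steps and the noise playback test; the `max_rounds` cap is an added safeguard that is not in the patent, which simply repeats until the threshold is met.

```python
from typing import Callable


def optimize_until_stable(collect_and_retrain: Callable[[], None],
                          run_noise_test: Callable[[], int],
                          threshold: int,
                          max_rounds: int = 10) -> bool:
    """Repeat collection and retraining until the false trigger count drops
    below the second set threshold. Returns True if the system may go online."""
    for _ in range(max_rounds):
        collect_and_retrain()               # steps 310-340
        if run_noise_test() < threshold:    # steps 350-370
            return True
    return False


# Toy usage: the measured false trigger count improves with each round.
measured = iter([150, 120, 40])
rounds_run = []
ready = optimize_until_stable(
    collect_and_retrain=lambda: rounds_run.append("round"),
    run_noise_test=lambda: next(measured),
    threshold=100,
)
print(ready, len(rounds_run))
```

In this toy run the first two tests (150 and 120) are at or above the threshold of 100, so a third round is needed before the count of 40 allows the system to go online.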
In order to make those skilled in the art better understand the optimization method of the voice dialog system of the embodiment, a specific example is used for description below, and the specific process includes:
Step 1, selecting background noise. False trigger audio can be collected directly at noisy sites (including scenes such as office chatter, noise inside a moving car, birdsong, cicadas, and running water), or a high-fidelity loudspeaker can be used to play back high-fidelity audio previously recorded on site in each scene, simulating the noise conditions of each scene. Using recordings makes it convenient to run regression tests that verify the optimization effect of the voice algorithm.
Step 2, waking up the voice dialog system so that it enters the long monitoring state for a period of time. The long monitoring duration can be configured, and the voice dialog system can be put into the long monitoring state using a wake-up word or other means. The duration may be shorter in the initial stage of voice algorithm optimization and may be increased (e.g., to 10 seconds, 30 minutes, or 90 minutes) as resistance to false triggering improves, which is not limited in this embodiment.
Step 3, recording how the voice dialogue system responds to the ambient noise, including its speech recognition of the noise and any other feedback it gives in response. The noise segments that produced feedback are saved, together with information such as the speech recognition result for each noise segment and the reaction of the voice dialogue system, and the information is organized into a table or another record format. If the voice dialogue system shows no reaction within the long monitoring duration, the process restarts from step 1.
Step 4, collating the false trigger results of the voice conversation system for noise in the long monitoring state, including information such as the audio, the recognition result, and the response feedback, and handing them to the voice algorithm team to train the voice conversation system model.
Step 5, after the voice algorithm team has optimized the voice dialogue model algorithm and produced a new voice dialogue system, collecting false trigger audio for the new voice dialogue system again, preferably using the recordings from the previous test or noise from similar scenes. The false trigger results under long monitoring are collated again and compared with the previous results, the optimization effect is confirmed with the voice algorithm team, and it is decided whether to submit the new false trigger results to the algorithm team for further optimization of the model.
It can be seen from the above example that the scheme of this embodiment can quickly provide the voice algorithm team with a large amount of targeted false trigger audio for optimizing the voice conversation system. The process can readily be automated and does not require dedicated personnel throughout, which saves labor cost.
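The table of false trigger results described in steps 3 and 4 could, for instance, be kept as CSV. The sketch below is one possible layout: the record fields (audio file, recognized text, system response) follow the information listed in the steps, while the type name, field names, and sample values are hypothetical.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import List


@dataclass
class FalseTriggerRecord:
    audio_file: str        # saved noise segment that caused the trigger
    recognized_text: str   # speech recognition result for the segment
    system_response: str   # feedback the system gave in response


def records_to_csv(records: List[FalseTriggerRecord]) -> str:
    """Serialize false trigger records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[f.name for f in fields(FalseTriggerRecord)],
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


report = records_to_csv([
    FalseTriggerRecord("office_0001.wav", "weather", "asked user to repeat"),
    FalseTriggerRecord("car_0007.wav", "little red", "started navigation"),
])
print(report)
```

Keeping the same columns across test rounds makes it straightforward to compare the results before and after the model is optimized, as step 5 requires.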
For better understanding of the embodiment of the present invention, a specific application scenario of the embodiment of the present invention may be as follows:
After the voice conversation system is woken up normally, noise may be recognized as speech, causing the voice conversation system to misrecognize. Optimizing the voice algorithm requires a large amount of misrecognized audio for training the model. The embodiment of the invention can also collect misrecognized audio in a conventional voice flow, in the following steps:
1. Play the noise automatically.
2. Wake up the in-vehicle head unit automatically by voice.
3. Collect the false trigger audio.
4. Collate the false trigger audio.
5. Submit it to the algorithm developers for optimization.
6. After the algorithm is optimized, perform regression verification.
Fig. 4 is a flowchart of an optimization method of a voice dialog system in a third embodiment of the present invention, and referring to fig. 4, the method specifically includes the following steps:
Step 410: wake up the voice dialog system by a wake-up word or other means.
Step 420: set the voice dialog system to enter the long monitoring mode.
Step 430: determine whether a false trigger has occurred.
If yes, go to step 440;
if not, return to step 410.
Step 440: record the false-trigger result and store the false-trigger audio data.
Step 450: optimize the voice dialog model in the voice dialog system using the false-trigger audio data.
Step 460: verify whether the optimized voice dialog system is usable.
If yes, go to step 470;
if not, return to step 410.
Step 470: bring the voice dialog system online, with its working mode set to the long monitoring mode.
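The control flow of Fig. 4 can be sketched as a loop. This is an illustration under stated assumptions: the three callables stand in for the wake-up and collection stage (steps 410 to 440), the model optimization (step 450), and the verification (step 460), and the `max_rounds` limit is added here only so that the example terminates; none of these names come from the patent.

```python
def run_optimization_rounds(collect_false_triggers, optimize_model, verify_system, max_rounds=10):
    """Return the round in which the system passed verification, or None."""
    for round_index in range(1, max_rounds + 1):
        # Steps 410-440: wake up, enter long monitoring, record false triggers.
        false_trigger_audio = collect_false_triggers()
        if not false_trigger_audio:
            continue  # Step 430 "no": go back to step 410.
        # Step 450: optimize the voice dialog model with the collected audio.
        optimize_model(false_trigger_audio)
        # Step 460: verify the optimized system.
        if verify_system():
            return round_index  # Step 470: the system can go online.
        # Step 460 "no": go back to step 410.
    return None
```

Passing the stages in as functions keeps the sketch independent of any particular recorder, trainer, or test bench.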
In an existing voice dialog system, if the speech recognition window is left open so that a conversation can take place at any time without a wake-up, the system works in the long monitoring mode. In this mode, external noise and nearby conversation can cause the voice dialog system to misrecognize speech and give unnecessary, erroneous feedback, leaving the system in a disordered state in which it can hardly be used normally. The embodiment of the present invention helps to remedy this defect and reduces misrecognition and false triggering under long monitoring. By collecting false-trigger audio while the voice dialog system is in the long monitoring state and supplying it for voice dialog model training, the probability of false triggering during use is reduced and the usability of the voice dialog system in the long monitoring state is greatly improved.
Example four
Fig. 5 is a schematic structural diagram of an optimization apparatus of a voice dialog system according to a fourth embodiment of the present invention, which can execute the optimization method of the voice dialog system according to the foregoing embodiments. Referring to fig. 5, the apparatus includes: a trigger audio data module 510, a false trigger audio data identification module 520, and a model optimization module 530.
The trigger audio data module 510 is configured to control the working mode of the voice dialog system to be in a long monitoring mode and to collect trigger audio data that triggers the voice dialog model to start working;
the false trigger audio data identification module 520 is configured to identify false trigger audio data among the trigger audio data;
and the model optimization module 530 is configured to optimize the voice dialog model in the voice dialog system by using the false trigger audio data.
In the scheme of this embodiment, the trigger audio data module controls the working mode of the voice dialog system to be in the long monitoring mode and collects trigger audio data that triggers the voice dialog model to start working; the false trigger audio data identification module identifies the false trigger audio data among the trigger audio data; and the model optimization module uses the false trigger audio data to optimize the voice dialog model in the voice dialog system. This improves the usability of the voice dialog system in the long monitoring working mode and reduces misrecognition.
Optionally, the false trigger audio data identification module 520 is specifically configured to: acquire the text information corresponding to each piece of trigger audio data; calculate the semantic integrity of each piece of text information; and, when the semantic integrity of a target piece of text information is smaller than a first set threshold, determine that the trigger audio data corresponding to the target text information is false trigger audio data.
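The threshold test on semantic integrity might be sketched as below. The text does not say how semantic integrity is computed, so the scoring function is passed in as a parameter; `score_fn`, the audio identifiers, and the function name are assumptions for illustration only.

```python
def find_false_triggers(transcripts, score_fn, first_threshold):
    """Return the IDs of trigger audio whose text scores below the first set threshold.

    transcripts: dict mapping an audio ID to its recognized text.
    score_fn: callable returning a semantic integrity score for a text
              (hypothetical; any scorer with a comparable scale can be used).
    """
    return [
        audio_id
        for audio_id, text in transcripts.items()
        if score_fn(text) < first_threshold
    ]
```

A text that scores exactly at the threshold is not flagged, matching the "smaller than" wording of the condition.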
Optionally, the model optimization module 530 is specifically configured to input each false trigger audio data as a negative sample to a voice dialog model in the voice dialog system for training, so as to obtain an optimized voice dialog model.
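Using false trigger audio as negative samples could be prepared as in the sketch below. The text only states that false trigger audio is fed in as negative samples; the presence of positive clips, the 1/0 label encoding, and the function name are assumptions, and the actual training step is left to whatever framework the model uses.

```python
def build_training_set(positive_clips, false_trigger_clips):
    """Pair each clip with a label: 1 for genuine requests, 0 for false triggers."""
    labeled = [(clip, 1) for clip in positive_clips]
    # False trigger audio is added as negative samples (label 0).
    labeled += [(clip, 0) for clip in false_trigger_clips]
    return labeled
```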
Optionally, the optimizing device of the voice dialog system further includes: the noise audio data playing module is used for continuously playing the noise audio data in the environment where the voice conversation system is located; wherein, the noise audio data does not include audio data capable of triggering the voice dialogue system.
Optionally, the optimizing device of the voice dialog system further includes: a false trigger frequency calculation module, configured to continuously play noise audio data in the environment where the voice dialog system is located; calculate the false triggering frequency of the voice dialog system according to how the played noise audio data triggers the voice dialog system; and, when the false triggering frequency of the voice dialog system is greater than or equal to a second set threshold, return to the operation of controlling the working mode of the voice dialog system to be in the long monitoring mode and collecting trigger audio data that triggers the voice dialog model to start working, until the false triggering frequency of the voice dialog system is less than the second set threshold.
Optionally, the optimizing device of the voice dialog system further includes: and the online processing module is used for performing online processing on the voice conversation system working in the long monitoring mode when the false triggering frequency of the voice conversation system is smaller than a second set threshold value.
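The frequency check and the resulting decision can be illustrated as follows. The text does not define the unit of the false triggering frequency, so this sketch assumes false triggers per hour of noise playback; the function names and the returned action strings are likewise hypothetical.

```python
def false_trigger_frequency(false_trigger_count, playback_hours):
    """False triggers per hour of noise playback (assumed definition)."""
    if playback_hours <= 0:
        raise ValueError("playback_hours must be positive")
    return false_trigger_count / playback_hours


def next_action(frequency, second_threshold):
    """Decide between another collection round and going online."""
    if frequency >= second_threshold:
        return "collect_again"  # back to long monitoring and audio collection
    return "go_online"          # frequency is below the second set threshold
```

For instance, 6 false triggers in 2 hours of playback give a frequency of 3.0 per hour, which with a threshold of 3.0 still leads to another collection round because the comparison is "greater than or equal to".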
Optionally, the trigger audio data module 510 is specifically configured to wake up the voice dialog system and set the wake-up duration of the voice dialog system to a preset long monitoring duration.
The optimization device of the voice dialog system provided by the embodiment of the invention can execute the optimization method of the voice dialog system provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
Example five
Fig. 6 is a schematic structural diagram of an electronic device according to a fifth embodiment of the present invention, as shown in fig. 6, the electronic device includes a processor 60, a memory 61, an input device 62, and an output device 63; the number of the processors 60 in the electronic device may be one or more, and one processor 60 is taken as an example in fig. 6; the processor 60, the memory 61, the input device 62 and the output device 63 in the electronic apparatus may be connected by a bus or other means, and the bus connection is exemplified in fig. 6.
The memory 61 is a computer-readable storage medium, and can be used for storing software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the optimization method of the voice dialog system in the embodiment of the present invention (for example, the trigger audio data module 510, the false trigger audio data recognition module 520, and the model optimization module 530 in the optimization device of the voice dialog system). The processor 60 executes various functional applications and data processing of the electronic device by executing software programs, instructions and modules stored in the memory 61, that is, implements the above-described optimization method of the voice dialog system.
The memory 61 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created according to the use of the terminal, and the like. Further, the memory 61 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the memory 61 may further include memory located remotely from the processor 60, which may be connected to the electronic device through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 62 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function controls of the electronic apparatus. The output device 63 may include a display device such as a display screen.
Example six
An embodiment of the present invention further provides a storage medium containing computer-executable instructions, which when executed by a computer processor, perform a method for optimizing a voice dialog system, the method including:
controlling the working mode of a voice conversation system to be in a long monitoring mode, and collecting triggering audio data for triggering the voice conversation model to start working;
identifying false trigger audio data in each trigger audio data;
and optimizing a voice conversation model in the voice conversation system by adopting the false triggering audio data.
Of course, the storage medium containing the computer-executable instructions provided by the embodiments of the present invention is not limited to the method operations described above, and may also perform related operations in the optimization method of the voice dialog system provided by any embodiment of the present invention.
From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
It should be noted that, in the embodiment of the optimization apparatus of the voice dialog system, the included units and modules are only divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (13)

1. A method for optimizing a voice dialog system, comprising:
controlling the working mode of a voice conversation system to be in a long monitoring mode, and collecting triggering audio data for triggering the voice conversation model to start working;
identifying false trigger audio data in each trigger audio data;
and optimizing a voice conversation model in the voice conversation system by adopting the false triggering audio data.
2. The method of claim 1, wherein the identifying false trigger audio data in each of the trigger audio data comprises:
acquiring text information corresponding to each trigger audio data;
calculating the semantic integrity of each text message;
and when the semantic integrity degree of the target text information is smaller than a first set threshold value, determining that the trigger audio data corresponding to the target text information is false trigger audio data.
3. The method of claim 1, wherein the optimizing a voice dialog model in the voice dialog system by using the false trigger audio data comprises:
and inputting each false trigger audio data serving as a negative sample into a voice dialogue model in a voice dialogue system for training to obtain an optimized voice dialogue model.
4. The method of claim 1, after controlling the operating mode of the voice dialog system to be in the long listening mode, further comprising:
continuously playing noise audio data in the environment where the voice conversation system is located;
wherein the noise audio data does not include audio data that can trigger the voice dialog system.
5. The method of claim 4, wherein after optimizing the speech dialog model in the speech dialog system, the method further comprises:
continuously playing noise audio data in the environment where the voice conversation system is located;
calculating a false triggering frequency corresponding to the voice dialog system according to how the played noise audio data triggers the voice dialog system;
and when the false triggering frequency of the voice dialog system is greater than or equal to a second set threshold, returning to the operation of controlling the working mode of the voice dialog system to be in the long monitoring mode and collecting trigger audio data that triggers the voice dialog model to start working, until the false triggering frequency of the voice dialog system is less than the second set threshold.
6. The method of claim 5, after said calculating a false trigger frequency corresponding to the voice dialog system, further comprising:
and when the false triggering frequency of the voice conversation system is smaller than the second set threshold value, performing online processing on the voice conversation system working in the long monitoring mode.
7. The method according to any one of claims 1-6, wherein the controlling the operating mode of the voice dialog system in a long listening mode comprises:
waking up the voice dialog system, and setting the wake-up duration of the voice dialog system to a preset long monitoring duration.
8. An apparatus for optimizing a voice dialog system, comprising:
a trigger audio data module, configured to control the working mode of the voice dialog system to be in a long monitoring mode and to collect trigger audio data that triggers the voice dialog model to start working;
the false trigger audio data identification module is used for identifying false trigger audio data in each trigger audio data;
and the model optimization module is used for optimizing the voice dialogue model in the voice dialogue system by adopting the false triggering audio data.
9. The apparatus according to claim 8, wherein the false trigger audio data identification module is specifically configured to:
acquire text information corresponding to each trigger audio data;
calculate the semantic integrity of each piece of text information;
and when the semantic integrity of target text information is smaller than a first set threshold, determine that the trigger audio data corresponding to the target text information is false trigger audio data.
10. The apparatus according to claim 8, wherein the model optimization module is specifically configured to:
input each false trigger audio data as a negative sample into the voice dialog model in the voice dialog system for training, to obtain an optimized voice dialog model.
11. The apparatus of claim 8, further comprising:
the noise audio data playing module is used for continuously playing the noise audio data in the environment where the voice conversation system is located;
wherein the noise audio data does not include audio data that can trigger the voice dialog system.
12. The apparatus of claim 11, further comprising:
the false trigger frequency calculation module is used for continuously playing noise audio data in the environment where the voice conversation system is located;
calculating a false triggering frequency corresponding to the voice dialog system according to how the played noise audio data triggers the voice dialog system;
and when the false triggering frequency of the voice dialog system is greater than or equal to a second set threshold, returning to the operation of controlling the working mode of the voice dialog system to be in the long monitoring mode and collecting trigger audio data that triggers the voice dialog model to start working, until the false triggering frequency of the voice dialog system is less than the second set threshold.
13. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of optimizing a speech dialog system according to any one of claims 1 to 7 when executing the program.
CN202010825282.4A 2020-08-17 2020-08-17 Method, device, equipment and storage medium for optimizing voice conversation system Pending CN114077840A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010825282.4A CN114077840A (en) 2020-08-17 2020-08-17 Method, device, equipment and storage medium for optimizing voice conversation system

Publications (1)

Publication Number Publication Date
CN114077840A true CN114077840A (en) 2022-02-22

Family

ID=80281007

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010825282.4A Pending CN114077840A (en) 2020-08-17 2020-08-17 Method, device, equipment and storage medium for optimizing voice conversation system

Country Status (1)

Country Link
CN (1) CN114077840A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117332072A (en) * 2023-12-01 2024-01-02 阿里云计算有限公司 Dialogue processing, voice abstract extraction and target dialogue model training method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105070290A (en) * 2015-07-08 2015-11-18 苏州思必驰信息科技有限公司 Man-machine voice interaction method and system
CN107221326A (en) * 2017-05-16 2017-09-29 百度在线网络技术(北京)有限公司 Voice awakening method, device and computer equipment based on artificial intelligence
CN109461446A (en) * 2018-12-24 2019-03-12 出门问问信息科技有限公司 Method, device, system and storage medium for identifying user target request
CN110661927A (en) * 2019-09-18 2020-01-07 平安科技(深圳)有限公司 Voice interaction method and device, computer equipment and storage medium
CN111179907A (en) * 2019-12-31 2020-05-19 深圳Tcl新技术有限公司 Voice recognition test method, device, equipment and computer readable storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117332072A (en) * 2023-12-01 2024-01-02 阿里云计算有限公司 Dialogue processing, voice abstract extraction and target dialogue model training method
CN117332072B (en) * 2023-12-01 2024-02-13 阿里云计算有限公司 Dialogue processing, voice abstract extraction and target dialogue model training method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination