WO2021169615A1 - Artificial intelligence-based voice response processing method, apparatus, device, and medium - Google Patents
Artificial intelligence-based voice response processing method, apparatus, device, and medium
- Publication number
- WO2021169615A1 (PCT/CN2021/070450)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- voice
- target
- response
- analyzed
- intent
- Prior art date: 2020-02-27
Classifications
- G10L15/22 — Speech recognition: procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L15/1822 — Speech classification or search using natural language modelling: parsing for meaning understanding
- G10L15/30 — Constructional details of speech recognition systems: distributed recognition, e.g. in client-server systems, for mobile phones or network applications
- G10L2015/223 — Execution procedure of a spoken command
Definitions
- This application relates to the field of voice processing technology, and in particular to a voice response processing method, device, equipment, and medium based on artificial intelligence.
- an intelligent interaction device with voice interaction function can collect and recognize the user's real-time voice, and respond based on the real-time voice recognition result to achieve the purpose of human-computer interaction.
- the current intelligent interactive device responds to real-time voice through ASR speech recognition, NLP semantic analysis, and TTS speech synthesis.
- the time required for this process is the pause response time of the intelligent interactive device interacting with the user.
- This pause response time is specifically the time difference from when the user finishes speaking a segment of real-time voice to when the intelligent interactive device responds based on that real-time voice.
- the inventor realizes that the pause response time of the voice interaction between the current smart interactive device and the user is relatively long, which makes the user feel a delay and affects the user's experience of voice interaction.
- the embodiments of the present application provide an artificial intelligence-based voice response processing method, apparatus, device, and medium, to solve the problem of an excessively long pause response time in voice interaction between smart interactive devices and users.
- a voice response processing method based on artificial intelligence, including: acquiring a to-be-processed voice stream collected in real time by a voice recording module; performing sentence integrity analysis on the to-be-processed voice stream to obtain a to-be-analyzed voice stream; executing a first processing process and a second processing process in parallel, invoking the first processing process to control a voice playback module to play a target modal particle recording, and invoking the second processing process to recognize the to-be-analyzed voice stream and obtain a target response voice; and monitoring in real time the playback status of the target modal particle recording played by the voice playback module, and if the playback status is playback ended, controlling the voice playback module to play the target response voice.
- a voice response processing device based on artificial intelligence including:
- the to-be-processed voice stream acquisition module is used to acquire the to-be-processed voice stream collected by the voice recording module in real time;
- the to-be-analyzed voice stream acquisition module is configured to perform sentence integrity analysis on the to-be-processed voice stream to obtain the to-be-analyzed voice stream;
- the playback analysis parallel processing module is used to execute the first processing process and the second processing process in parallel, call the first processing process to control the voice playback module to play the target modal particle recording, and call the second processing process to recognize the to-be-analyzed voice stream and obtain the target response voice;
- the response voice real-time playback module is configured to monitor the playback status of the target modal particle recording by the voice playback module in real time, and if the playback status is the end of playback, control the voice playback module to play the target response voice.
- An intelligent interactive device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the steps of the above method when executing the computer-readable instructions.
- One or more readable storage media storing computer-readable instructions, wherein when the computer-readable instructions are executed by one or more processors, the one or more processors are caused to execute the steps of the above method.
- the above artificial intelligence-based voice response processing method, apparatus, device, and medium first perform sentence integrity analysis on the to-be-processed voice stream collected in real time during voice interaction to determine the to-be-analyzed voice stream, which helps improve the accuracy and timeliness of subsequent recognition and analysis.
- the target modal particle recording is played while the to-be-analyzed voice stream is recognized, and the target response voice is played after the playback of the target modal particle recording ends, so that the recognition of the to-be-analyzed voice stream and the playback of the target modal particle recording proceed at the same time.
- because the target modal particle recording is played within the pause response time of the analysis of the to-be-analyzed voice stream, the transition between the target modal particle recording and the playback of the target response voice is natural, which improves the response time and response effect of voice interaction.
- FIG. 1 is a schematic diagram of an application environment of a voice response processing method based on artificial intelligence in an embodiment of the present application;
- FIG. 2 is a flowchart of a voice response processing method based on artificial intelligence in an embodiment of the present application;
- FIG. 3 is another flowchart of a voice response processing method based on artificial intelligence in an embodiment of the present application;
- FIG. 4 is another flowchart of a voice response processing method based on artificial intelligence in an embodiment of the present application;
- FIG. 5 is another flowchart of a voice response processing method based on artificial intelligence in an embodiment of the present application;
- FIG. 6 is another flowchart of a voice response processing method based on artificial intelligence in an embodiment of the present application;
- FIG. 7 is another flowchart of a voice response processing method based on artificial intelligence in an embodiment of the present application;
- FIG. 8 is another flowchart of a voice response processing method based on artificial intelligence in an embodiment of the present application;
- FIG. 9 is a schematic diagram of a voice response processing device based on artificial intelligence in an embodiment of the present application;
- FIG. 10 is a schematic diagram of a smart interactive device in an embodiment of the present application.
- the artificial intelligence-based voice response processing method provided by the embodiment of the present application can be applied to an independently set intelligent interactive device, and can also be applied to the application environment as shown in FIG. 1.
- when the artificial intelligence-based voice response processing method is applied to an independent intelligent interactive device, the intelligent interactive device is provided with a processor and with a voice recording module and a voice playback module connected to the processor, and the method can be executed on the processor, so that during the voice interaction between the user and the smart interactive device, each pause response time of the smart interactive device is short, the user does not perceive a delay in the voice interaction process, and the experience is good.
- the artificial intelligence-based voice response processing method is applied in an artificial intelligence-based voice response processing system.
- the artificial intelligence-based voice response processing system includes an intelligent interactive device and a server, as shown in FIG. 1, and the intelligent interactive device communicates with the server through a network.
- the intelligent interactive device is equipped with a voice recording module and a voice playback module.
- the voice response processing method based on artificial intelligence can be executed on the server, so that during the user's voice interaction with the intelligent interactive device, each pause response time of the interactive device is short, the user does not perceive a delay in the voice interaction process, and the experience is good.
- the server can be implemented as an independent server or a server cluster composed of multiple servers.
- the intelligent interaction device may be a robot that can realize human-computer interaction.
- an artificial intelligence-based voice response processing method is provided, and is described here by taking as an example its application to the processor of an independent intelligent interactive device, or to a server connected to the intelligent interactive device; the method includes the following steps:
- S201 Acquire a voice stream to be processed collected by the voice recording module in real time.
- the voice recording module is a module that can realize the recording function.
- the voice recording module may be a recording chip integrated on an intelligent interactive device or a client for realizing the recording function.
- the voice stream to be processed is the voice stream collected in real time by the voice recording module and needs subsequent recognition processing.
- the processor of the smart interactive device, or the server connected to the smart interactive device, can acquire in real time the to-be-processed voice stream that the voice recording module collects while the user speaks.
- the voice stream to be processed is specifically the voice stream with which the user wants to interact with the smart interactive device.
- S202 Perform sentence integrity analysis on the voice stream to be processed to obtain the voice stream to be analyzed.
- the voice stream to be analyzed is a voice stream that is determined from the voice stream to be processed and can reflect that the user has finished speaking a paragraph.
- Sentence integrity analysis of the voice stream to be processed refers to separating the voice stream to be processed based on a complete sentence, so that each voice stream to be analyzed can completely and accurately reflect the user's intention.
- the processor of the smart interactive device, or the server connected to the smart interactive device, can intercept, from the voice stream to be processed recorded in real time by the voice recording module, the voice stream that reflects that the user has finished speaking a certain passage, and take it as the to-be-analyzed voice stream, so that the to-be-analyzed voice stream can later be recognized and analyzed to determine the user's intention reflected in it and to respond based on that intention, thereby achieving the purpose of human-computer interaction.
- intercepting the to-be-analyzed voice stream from the to-be-processed voice stream so that it reflects that the user has finished speaking a certain passage ensures the accuracy and timeliness of subsequent recognition and analysis, and avoids dividing one passage spoken by the user into several segments that are processed separately, which would lower the accuracy and timeliness of speech recognition and analysis.
- S203 Execute the first processing process and the second processing process in parallel, invoke the first processing process to control the voice playback module to play the target modal particle recording, and invoke the second processing process to recognize the voice stream to be analyzed and obtain the target response voice.
- the target modal particle recording refers to the modal particle recording that needs to be played this time.
- the modal particle recording is a pre-recorded recording related to the modal particle, for example, a pre-recorded recording corresponding to a modal particle such as "Hmm".
- the target response voice is the voice that responds to the user's intention determined by recognizing and analyzing the voice stream to be analyzed. For example, if the user intent corresponding to the speech content of the voice stream to be analyzed is "I want to know the profit rate of product A", then the target response voice is "The profit rate of product A is ..."; responding intelligently to the user's intention in the to-be-analyzed voice stream in this way replaces manual responses and helps save labor costs.
- the first processing process is a process created on the processor of the intelligent interactive device or the processor of the server for controlling the work of the voice playback module.
- the second processing process is a process that is created on the processor of the intelligent interactive device or the processor of the server to perform recognition processing on the voice stream to be recognized.
- after obtaining the voice stream to be analyzed, the processor of the smart interactive device, or the server connected to the smart interactive device, creates (or calls the pre-created) first processing process and second processing process and executes them in parallel, so that the first processing process controls the voice playback module to play the target modal particle recording while the second processing process recognizes the to-be-analyzed voice stream to obtain the target response voice. The playback of the target modal particle recording and the recognition of the to-be-analyzed voice stream are thus processed in parallel, and the target modal particle recording is played within the pause response time of the recognition and analysis of the voice stream to be analyzed, so that the intelligent interactive device responds in a timely manner and an overly long pause response time with a poor user experience is avoided.
- the pause response time here can be understood as the processing time for identifying and analyzing the voice stream to be analyzed to determine and play the target response voice.
- for example, if the pause response time for recognizing and analyzing a certain to-be-analyzed voice stream is 3 s, and the processor of the smart interactive device or the server connected to it plays a 2 s target modal particle recording within 1 s of obtaining the voice stream to be analyzed, the perceived pause response time of the smart interactive device is shortened to within 1 s, so that the user does not feel the response delay, which helps improve the user experience.
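- As an illustration only (not part of the claimed method), the parallel structure of step S203 can be sketched in Python, with threads standing in for the two processing processes, and `playback`, `recognizer`, and `filler_recording` as hypothetical stand-ins for the voice playback module, the recognition pipeline, and the target modal particle recording:

```python
import threading

def respond(to_be_analyzed_stream, filler_recording, playback, recognizer):
    """Minimal sketch of step S203: run filler playback and recognition
    in parallel. All interfaces here are illustrative assumptions."""
    result = {}

    # Second processing process: recognize the to-be-analyzed voice stream
    # and produce the target response voice.
    def recognize():
        result["voice"] = recognizer.recognize(to_be_analyzed_stream)

    # First processing process: play the target modal particle recording
    # (e.g. a pre-recorded "Hmm...") to cover the pause response time.
    t_play = threading.Thread(target=playback.play, args=(filler_recording,))
    t_rec = threading.Thread(target=recognize)

    t_play.start()
    t_rec.start()
    t_play.join()   # S204: wait for the filler recording to finish playing...
    t_rec.join()    # ...and for the target response voice to be ready,
    playback.play(result["voice"])  # then play the target response voice.
```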
- S204 Monitor the playback status of the target modal particle recording played by the voice playback module in real time, and if the playback status is the end of playback, control the voice playback module to play the target response voice.
- the target modal particle recording can be understood as the recording played during the recognition and analysis of the voice stream to be analyzed.
- the playback duration of the target modal particle recording will fall within the pause response time corresponding to the voice stream to be analyzed. Therefore, the intelligent interactive device can play the target response voice in real time after controlling the voice playback module to play the target modal particle recording, so as to respond in a timely manner to the user's intention determined from the voice stream to be analyzed.
- after the processor of the smart interactive device, or the server connected to the smart interactive device, obtains the voice stream to be analyzed and controls the voice playback module to play the target modal particle recording based on the first processing process, it calls the state monitoring tool to monitor in real time the playback status of the target modal particle recording on the voice playback module.
- the playback status includes the playback end and the playback not end.
- if the playback status is playback ended, the first processing process can be called to control the voice playback module to play the target response voice corresponding to the voice stream to be analyzed, so that the target response voice follows on naturally after the target modal particle recording finishes, avoiding an excessive pause response time that would affect the user experience.
- the status monitoring tool is a preset tool for monitoring the playback status of the voice playback module.
- in this embodiment, sentence integrity analysis is performed on the voice stream to be processed collected in real time during the voice interaction process to determine the voice stream to be analyzed, which helps improve the accuracy and timeliness of subsequent recognition and analysis.
- the recognition of the to-be-analyzed voice stream can then proceed simultaneously with the playback of the target modal particle recording, and the target modal particle recording can be played within the pause response time of the to-be-analyzed voice stream, which improves the response time and response effect of voice interaction.
- after real-time monitoring shows that the playback status of the target modal particle recording is playback ended, the voice playback module is controlled to play the target response voice, so that the target modal particle recording and the playback of the target response voice are naturally connected, which helps improve the response effect of voice interaction.
- step S202, namely performing sentence integrity analysis on the voice stream to be processed to obtain the voice stream to be analyzed, specifically includes the following steps:
- S301 Use a voice activation detection algorithm to monitor the voice stream to be processed, and obtain the voice pause point and the corresponding pause duration.
- the Voice Activity Detection (VAD) algorithm aims to detect whether the current audio signal contains speech, that is, it judges the input signal and distinguishes the speech signal from various background noise signals.
- a speech pause point is a position in the voice stream to be processed, identified by the VAD algorithm, at which the user pauses while speaking.
- the pause duration corresponding to the speech pause point refers to the time difference between the start time and the end time of the speech pause recognized by the VAD algorithm.
- the smart interactive device can use a voice activation detection algorithm to perform silent monitoring of the voice stream to be processed to determine the corresponding voice pause point in the voice stream to be processed when the user pauses and the pause duration corresponding to any voice pause point, so that Analyze whether the user has finished a sentence based on the pause duration corresponding to the speech pause point, thereby performing sentence integrity analysis.
- S302 Determine a speech pause point whose pause duration is greater than a preset duration threshold as a target pause point.
- the preset duration threshold is a preset duration threshold used to judge whether a pause indicates that the user has finished a sentence.
- the target pause point is the pause position, determined by analyzing the voice stream to be processed, at which the user finishes a sentence.
- the smart interactive device compares the pause duration corresponding to each speech pause point with the preset duration threshold; if the pause duration is greater than the preset duration threshold, it is determined that the user has finished a sentence, and the speech pause point corresponding to that pause duration is determined as a target pause point; if the pause duration is not greater than the preset duration threshold, it is determined that the user has not finished a sentence.
- in the latter case, the speech pause point is merely a short pause during the user's speaking process; therefore, the speech pause point corresponding to that pause duration is not determined as a target pause point.
- S303 Obtain a voice stream to be analyzed based on two adjacent target pause points.
- the intelligent interactive device determines the voice stream between two adjacent target pause points as the voice stream to be analyzed, so that the voice stream to be analyzed reflects the complete sentence that the user wants to express, which improves the accuracy and timeliness of subsequent recognition and analysis: subsequent recognition does not need to analyze the pause signal between the target pause points, which guarantees timeliness, and since each voice stream to be analyzed reflects a complete sentence that the user wants to express, the accuracy of subsequent recognition and response is higher.
- specifically, the smart interactive device determines the starting point of recording the to-be-processed voice stream as the initial target pause point; then determines the next target pause point after the initial target pause point as the end target pause point, and determines one voice stream to be analyzed based on the initial target pause point and the end target pause point; finally, the end target pause point is updated to be the new initial target pause point, and the steps of determining the next target pause point after the initial target pause point and determining a voice stream to be analyzed from the two pause points are repeated, so that multiple voice streams to be analyzed are divided from the voice stream to be processed in real time. This ensures the real-time nature of determining the voice stream to be analyzed, which helps improve the accuracy and timeliness of its subsequent recognition and analysis.
- the VAD algorithm is first used to monitor the voice pause points and the corresponding pause duration in the voice stream to be processed collected in real time to ensure objectivity in the processing process.
- the speech pause points whose pause duration is greater than the preset duration threshold are determined as target pause points, so that the voice is not divided at speech pause points whose pause duration is not greater than the preset duration threshold, which could make the subsequent recognition and analysis inaccurate.
- the voice stream to be analyzed is determined based on two adjacent target pause points, so that the voice stream to be analyzed can reflect the complete sentence that the user wants to express, so as to improve the accuracy and timeliness of subsequent recognition and analysis.
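- For illustration, a minimal sketch of the segmentation in steps S301–S303, assuming a VAD stage has already produced the pause intervals as (start, end) time pairs; the 0.8 s threshold is an illustrative value, not one taken from this application:

```python
def segment_utterances(pause_points, pause_threshold_s=0.8):
    """Sketch of S301-S303: keep pauses long enough to end a sentence,
    then slice the stream between adjacent target pause points.

    pause_points: list of (start_time, end_time) pairs from a VAD stage
                  (an assumed input format).
    """
    # S302: a pause longer than the threshold marks the end of a sentence.
    targets = [(s, e) for (s, e) in pause_points if e - s > pause_threshold_s]

    # S303: each to-be-analyzed stream runs from the end of one target
    # pause to the start of the next adjacent target pause.
    segments = []
    for (prev, nxt) in zip(targets, targets[1:]):
        start = prev[1]   # end time of the previous target pause point
        end = nxt[0]      # start time of the next target pause point
        segments.append((start, end))
    return segments
```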
- invoking the first processing process in step S203 to control the voice playback module to play the target modal particle recording specifically includes the following steps performed by invoking the first processing process:
- S401 Acquire the voice duration corresponding to the voice stream to be analyzed.
- the intelligent interactive device may call the first processing process to determine two adjacent target pause points corresponding to the voice stream to be analyzed, and obtain the voice duration corresponding to the voice stream to be analyzed based on the two target pause points.
- the intelligent interactive device determines the voice stream to be analyzed based on two adjacent target pause points: specifically, the voice stream in the to-be-processed voice stream between the end time of the previous target pause point and the start time of the next target pause point is determined as the voice stream to be analyzed. The time difference between the end time of the previous target pause point and the start time of the next target pause point can then be determined as the voice duration corresponding to the voice stream to be analyzed. Understandably, because the voice duration can be determined from the start and end times of two adjacent target pause points, the process of determining the voice duration is simple and convenient, which helps improve the efficiency of subsequent processing.
- S402 Query the system database based on the voice duration, determine the target modal particle recording based on the original modal particle recording that matches the voice duration, and control the voice playback module to play the target modal particle recording.
- the system database is a database set on or connected to the intelligent interactive device and used to store relevant data involved in the voice interaction process.
- the original modal particle recording is a pre-recorded modal particle-related recording that is used to make the intelligent interactive device and the user perform human-computer interaction.
- the target modal particle recording is one of the original modal particle recordings, specifically an original modal particle recording that matches the speech duration corresponding to the voice stream to be analyzed.
- the original modal particle recordings corresponding to different playback durations can be pre-recorded in the system database.
- first, the estimated processing time required to recognize and analyze the voice stream to be analyzed is estimated based on the voice duration corresponding to that voice stream; then, the original modal particle recording whose playback duration matches the estimated processing time is selected from the system database as the target modal particle recording, and the voice playback module is controlled to play the target modal particle recording.
- a time length comparison table is pre-stored in the system database for the correspondence between the speech time length of the voice stream to be analyzed and its estimated processing time length, so that the estimated processing time length can be quickly determined through table look-up operations.
- a match between the playback duration and the estimated processing duration can be understood as meaning that the time difference between the playback duration and the estimated processing duration is smallest, or that the time difference is within a preset error range. In this way, within the pause response time of the recognition and analysis of the voice stream to be analyzed, the target response voice can be played more naturally after the target modal particle recording finishes, which helps improve the efficiency of response processing.
- if at least two original modal particle recordings selected from the system database have a playback duration matching the estimated processing duration, that is, the time difference is within the preset error range for at least two recordings, one of them needs to be chosen as the target modal particle recording, either by selecting one at random or by selecting one that differs from the target modal particle recording chosen last time.
- in the artificial intelligence-based voice response processing method, the voice duration corresponding to the voice stream to be analyzed can be quickly determined from the two adjacent target pause points corresponding to that voice stream, so that the acquisition process is simple, convenient, and efficient; the target modal particle recording is determined based on the original modal particle recording that matches the speech duration, so that the target response voice plays more naturally after the target modal particle recording finishes, which helps improve the efficiency of response processing.
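- A possible reading of steps S401–S402 in code, assuming the system database holds a duration look-up table and the original modal particle recordings keyed by playback duration; the names, data shapes, and the 0.5 s tolerance are illustrative assumptions:

```python
import bisect
import random

def pick_filler(speech_duration, duration_table, recordings, last_pick=None,
                tolerance=0.5):
    """Sketch of S401-S402: match a modal particle recording's playback
    duration to the estimated processing duration.

    duration_table: {speech duration -> estimated processing duration},
                    the pre-stored look-up table.
    recordings:     {recording id -> playback duration} for the original
                    modal particle recordings.
    """
    # Table look-up: nearest entry for the observed speech duration.
    keys = sorted(duration_table)
    idx = min(bisect.bisect_left(keys, speech_duration), len(keys) - 1)
    estimated = duration_table[keys[idx]]

    # Candidates whose playback duration is within the preset error range.
    candidates = [r for r, d in recordings.items()
                  if abs(d - estimated) <= tolerance]
    if not candidates:
        # Fall back to the recording with the smallest time difference.
        return min(recordings, key=lambda r: abs(recordings[r] - estimated))
    if last_pick in candidates and len(candidates) > 1:
        candidates.remove(last_pick)  # avoid repeating the last selection
    return random.choice(candidates)
```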
- invoking the second processing process in step S203 to recognize the voice stream to be analyzed and obtaining the target response voice specifically includes the following steps performed by invoking the second processing process:
- S501 Perform voice recognition on the voice stream to be analyzed to obtain the corresponding text to be analyzed.
- the text to be analyzed refers to the text content determined after voice recognition of the voice stream to be analyzed.
- the process of performing voice recognition on the voice stream to be analyzed to obtain the text to be analyzed corresponding to the voice stream to be analyzed can be understood as a process of converting the voice signal of the voice stream to be analyzed into text information that can be subsequently recognized.
- an intelligent interactive device can use ASR (Automatic Speech Recognition) technology, or a pre-trained static decoding network capable of speech-to-text conversion, to obtain the text to be analyzed corresponding to the voice stream to be analyzed for subsequent semantic analysis.
- S502 Perform semantic analysis on the text to be analyzed, and obtain the target intention corresponding to the text to be analyzed.
- the target intention is the user's intention determined after semantic analysis of the text to be analyzed.
- the process of performing semantic analysis on the text to be analyzed to obtain the target intention corresponding to the text to be analyzed can be understood as the process of using artificial intelligence technology to extract the user's intention from the text information of the text to be analyzed, which is analogous to the way the human brain extracts a speaker's intention from an utterance.
- the intelligent interactive device can use NLP (Natural Language Processing) technology, or a semantic analysis model constructed in advance based on a neural network, to perform semantic analysis on the text to be analyzed, so as to determine the target intent accurately and quickly.
- S503 Query the system database based on the target intention, and obtain the target response words corresponding to the target intention.
- the target response words are the words that the intelligent interactive device responds based on the analyzed target intentions.
- the target response words exist in the form of text and are the response to the target intention identified from the text to be analyzed. For example, if the target intent identified from the text to be analyzed is "the yield of product A", the corresponding target response phrase is "the yield of product A is ..."; or, if the target intent identified from the text to be analyzed is "what is the amount of my loan to be repaid this month", the corresponding target response phrase is "the amount of your loan to be repaid this month is ...", and so on.
- the smart interactive device queries the system database based on the target intent, and either directly obtains the target response words corresponding to the target intent from the system database, or obtains the response content corresponding to the target intent from the system database and forms the target response words based on that response content.
- S504 Obtain a target response voice based on the target response speech technique.
- the target response voice is the voice corresponding to the target response speech.
- the target response voice can be understood as the voice that, when the intelligent interactive device interacts with the user, needs to be played in real time after the pause response time corresponding to the voice stream to be analyzed, specifically as a response to the target intention identified from the voice stream to be analyzed.
- the target response voice can be determined from the target response words by querying the system database for the pre-recorded target response voice corresponding to the target response words, so that the target response voice is obtained more efficiently.
- the text-to-speech technology here is a technology used to convert text content into voice content, such as TTS speech synthesis technology.
- in this embodiment, the target intention can be quickly determined by performing voice recognition and semantic analysis on the voice stream to be analyzed; the target response words and the corresponding target response voice can then be determined based on the target intention. Recognition, analysis, and response are thus performed on the voice stream to be analyzed, intercepted from the real-time collection of the voice recording module, to realize intelligent interaction, so that the intelligent interactive device can be widely used in scenarios that would otherwise require manual answers, for example, intelligent interactive equipment set up in public places to facilitate user consultation, which saves labor costs.
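- Viewed end to end, steps S501–S504 form a linear pipeline. A minimal sketch, in which `asr`, `nlu`, `phrase_db`, and `tts` are hypothetical stand-ins for the ASR engine, the semantic analysis model, the system database, and the speech synthesizer:

```python
def build_response_voice(voice_stream, asr, nlu, phrase_db, tts):
    """Sketch of S501-S504 as a linear pipeline; all four collaborators
    are assumed interfaces, not components defined by this application."""
    text = asr.transcribe(voice_stream)    # S501: speech -> to-be-analyzed text
    intent = nlu.parse(text)               # S502: text -> target intent
    words = phrase_db.lookup(intent)       # S503: intent -> target response words
    return tts.synthesize(words)           # S504: words -> target response voice
```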
- step S503, namely querying the system database based on the target intent to obtain the target response words corresponding to the target intent, specifically includes the following steps:
- S601 Determine the intention type based on the target intention.
- the intent type is the type to which the target intent belongs.
- intent types can be divided into general intents and dedicated intents.
- a general intent refers to an intent to query general information, that is, information that has nothing to do with specific user information, for example, an intent to query the profitability of product A.
- a dedicated intention refers to an intention for querying dedicated information, that is, an intention for querying dedicated information related to specific user information, for example, an intention for querying dedicated information such as user 1's loan amount and repayment period.
- the general phrase database is a database dedicated to storing general response words, and is a sub-database of the system database.
- General response words are pre-set words for responding to common questions.
- if the target intention identified from the text to be analyzed is a general intent, the user wants to query general information that has nothing to do with specific user information. Such general information can be stored in the general phrase database together with the corresponding general response words; therefore, the intelligent interactive terminal can query the general phrase database based on the target intention and use the general response words corresponding to the target intention as the target response words, so that the target response words are acquired more efficiently.
- the dedicated information database is a database dedicated to storing user dedicated information, and is a sub-database in the system database.
- User-specific information is used to store information related to the user, for example, the user's account balance or loan amount.
- the phrase template corresponding to a dedicated intent is a preset template used to respond to that dedicated intent. For example, for "I want to know my monthly repayment information", the corresponding phrase template is "Your monthly repayment amount is ..., and the repayment date is ...", and so on.
- if the target intention identified from the text to be analyzed is a dedicated intent, the user wants to query dedicated information related to specific user information. Such dedicated information is generally stored in the dedicated information database; therefore, the intelligent interactive device can query the dedicated information database based on the target intent to quickly obtain the intent query result corresponding to the dedicated intent, and then fill the intent query result into the phrase template corresponding to the dedicated intent to obtain the target response words corresponding to the target intent, which ensures the real-time nature of target response word acquisition.
- different processing methods are used for general intents and dedicated intents to determine the corresponding target response words, in order to ensure both the efficiency and the real-time performance of target response word acquisition.
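- The two-way split described above could look as follows in code; the dict-shaped `intent` and the database and template interfaces are illustrative assumptions, not structures defined in this application:

```python
def resolve_response_words(intent, general_db, dedicated_db, templates):
    """Sketch of step S601 onward: map a target intent to target
    response words, branching on the intent type."""
    if intent["type"] == "general":
        # General intent: the answer is user-independent, so the pre-set
        # general response words can be returned directly.
        return general_db[intent["name"]]
    # Dedicated intent: query the user-specific information first, then
    # fill the intent query result into the matching phrase template.
    result = dedicated_db.query(intent["name"], intent["user_id"])
    return templates[intent["name"]].format(**result)
```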
- step S504, namely obtaining the target response voice based on the target response words, specifically includes the following steps:
- if the intent type is a general intent, the general response words corresponding to the target intent in the general phrase database can be determined as the target response words, so that the target response words are obtained quickly. To further improve response efficiency, a general response recording corresponding to each general response phrase can be recorded in advance and stored in the system database, and the general response words can be determined as the target response words.
- the general response recording pre-recorded for the general response words can then be directly determined as the target response voice, so as to improve the efficiency of obtaining the target response voice.
- if the intent type is a dedicated intent, the target response words determined based on the target intent are the text content formed by filling the intent query result corresponding to the target intent into the phrase template.
- in this case, text-to-speech technology needs to be used to convert the target response words into the corresponding target response voice, which ensures the real-time nature of the target response voice. The text-to-speech technology here is a technology used to convert text content into voice content, such as TTS speech synthesis technology.
- in the voice response processing method based on artificial intelligence provided in this embodiment, for the intent type corresponding to the target intent identified from the text to be analyzed: if the intent type is a general intent, the general response recording can be used directly as the target response voice, improving the acquisition efficiency of the target response voice; if the intent type is a dedicated intent, text-to-speech conversion is performed on the target response words to obtain the target response voice, improving the real-time performance of the target response voice.
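- This branch of step S504 can be summarized in a short sketch; `recording_db` and `tts` are hypothetical interfaces for the pre-recorded general response recordings and the TTS engine:

```python
def resolve_response_voice(intent_type, response_words, recording_db, tts):
    """Sketch of the branch in S504: reuse a pre-recorded general
    response when possible, synthesize otherwise."""
    if intent_type == "general":
        # General responses have pre-recorded counterparts, so no
        # text-to-speech conversion is needed at response time.
        return recording_db[response_words]
    # Dedicated responses embed per-user query results and must be
    # synthesized on the fly, e.g. with a TTS engine.
    return tts.synthesize(response_words)
```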
- step S204, namely monitoring in real time the playback status of the target modal particle recording played by the voice playback module and, if the playback status is playback ended, controlling the voice playback module to play the target response voice, specifically includes the following steps:
- S801 Monitor the playback status of the target modal particle recording played by the voice playback module in real time, and if the playback status is the end of playback, it is determined whether the target response voice can be obtained within a preset time period.
- the target modal particle recording is played to the user through the voice playback module during the pause response time between collecting the voice stream to be analyzed and playing the target response voice, so as to achieve a natural transition and avoid an excessive pause response time that would make the user experience poor. It is therefore necessary to ensure that, after the target modal particle recording finishes, playback can switch to the target response voice in real time.
- due to a malfunction, however, the smart interactive device may fail to obtain the target response voice in time after the playback of the target modal particle recording ends, making it impossible to switch to and play the target response voice. At this time, if there is no other response mechanism, the smart interactive device will be unresponsive for a long time, which will affect the user experience.
- the intelligent interactive device can call the preset state monitoring tool to monitor in real time the playback status of the target modal particle recording on the voice playback module; if the playback status is playback ended, it needs to determine whether the target response voice can be obtained within a preset time period, for subsequent processing based on the judgment result.
- the preset time period is a preset length of time; if the playback status is not ended, the device continues to wait until the playback status is monitored as playback ended, and then judges whether the target response voice can be obtained within the preset time period.
- if the smart interactive device can obtain the target response voice within the preset time period, the target response voice is played in real time once obtained, realizing a real-time switch from playing the target modal particle recording to playing the target response voice, so that the intelligent interactive equipment responds promptly and an excessive pause response time that affects the user experience is avoided.
- the emergency handling mechanism is a preset handling mechanism used when the target response voice cannot be obtained within a preset time.
- if the smart interactive device cannot obtain the target response voice within the preset time period, the number of modal particle recordings already played is obtained; if this play count is less than the preset threshold, the next modal particle recording is played at random, so that the intelligent interactive device keeps responding before the target response voice is played and the user is not left waiting for a long time without any response; if the play count is not less than the preset threshold, a failure prompt voice is played, so that the user can learn in time that the intelligent interactive device is faulty and does not keep waiting for a response.
- the number of modal particle plays refers to the number of times that modal particle recordings have been played.
- the fault prompt voice is a pre-set voice used to prompt that the device has a fault, and the fault prompt voice may specifically correspond to the cause of the failure that the target response voice cannot be obtained.
- according to the judgment of whether the target response voice can be obtained within the preset time period after the target modal particle recording is played, either the target response voice or the voice corresponding to the emergency processing mechanism is played, so as to respond in a timely manner to the voice stream to be analyzed formed by the user's speech and improve response efficiency.
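- A minimal sketch of the monitoring and emergency handling in step S801 onward, assuming the response voice arrives on a queue; `playback` is a hypothetical player exposing `is_finished()`, `play()`, `play_random_filler()`, and `play_failure_prompt()`, and the two thresholds are illustrative values:

```python
import queue
import time

def play_when_ready(playback, response_queue, max_fillers=2, timeout_s=2.0):
    """Sketch of S801 onward: wait for filler playback to end, then play
    the target response voice or fall back to the emergency mechanism."""
    # Poll the playback status until the filler recording has ended.
    while not playback.is_finished():
        time.sleep(0.05)

    fillers_played = 1  # the target modal particle recording already played
    while True:
        try:
            # Wait a bounded time for the target response voice.
            voice = response_queue.get(timeout=timeout_s)
        except queue.Empty:
            if fillers_played < max_fillers:
                # Emergency mechanism: bridge the gap with another filler.
                playback.play_random_filler()
                fillers_played += 1
            else:
                # Give up and tell the user the device has a fault.
                playback.play_failure_prompt()
                return
        else:
            playback.play(voice)  # normal case: respond as soon as ready
            return
```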
- an artificial intelligence-based voice response processing device is provided, which corresponds one-to-one to the artificial intelligence-based voice response processing method in the above embodiment.
- the artificial intelligence-based voice response processing device includes a voice stream to be processed acquisition module 901, a voice stream to be analyzed acquisition module 902, a playback analysis parallel processing module 903, and a response voice real-time playback module 904.
- the detailed description of each functional module is as follows:
- the to-be-processed voice stream acquiring module 901 is configured to acquire the to-be-processed voice stream collected by the voice recording module in real time.
- the to-be-analyzed voice stream acquisition module 902 is configured to perform sentence integrity analysis on the to-be-processed voice stream to obtain the to-be-analyzed voice stream.
- the play analysis parallel processing module 903 is used to execute the first processing process and the second processing process in parallel, call the first processing process to control the voice playback module to play the target modal particle recording, call the second processing process to recognize the voice stream to be analyzed, and obtain the target Respond to voice.
- the response voice real-time playback module 904 is used to monitor the playback status of the voice playback module to play the target modal particle recording in real time. If the playback status is the playback end, the voice playback module is controlled to play the target response voice.
- the to-be-analyzed voice stream acquisition module 902 includes a pause duration acquisition unit, a target pause point determination unit, and a to-be-analyzed voice stream acquisition unit.
- the pause duration acquisition unit is used to monitor the voice stream to be processed by using a voice activation detection algorithm to acquire the voice pause point and the corresponding pause duration.
- the target pause point determination unit is used to determine the voice pause point whose pause duration is greater than the preset duration threshold as the target pause point.
- the to-be-analyzed voice stream acquisition unit is used to obtain the to-be-analyzed voice stream based on two adjacent target pause points.
- the playback analysis parallel processing module 903 includes a voice duration acquisition unit and a modal particle playback control unit.
- the voice duration acquiring unit is used to acquire the voice duration corresponding to the voice stream to be analyzed.
- the modal particle playback control unit is used to query the system database based on the voice duration, determine the target modal particle recording based on the original modal particle recording matching the voice duration, and control the voice playback module to play the target modal particle recording.
- the playback analysis parallel processing module 903 includes a text acquisition unit to be analyzed, a target intention acquisition unit, a target response speech acquisition unit, and a target response voice acquisition unit.
- the to-be-analyzed text obtaining unit is used to perform voice recognition on the to-be-analyzed voice stream and obtain the to-be-analyzed text corresponding to the to-be-analyzed voice stream.
- the target intention acquisition unit is used to perform semantic analysis on the text to be analyzed and obtain the target intention corresponding to the text to be analyzed.
- the target response words acquisition unit is used to query the system database based on the target intention to obtain the target response words corresponding to the target intention.
- the target response voice acquisition unit is used to acquire the target response voice based on the target response speech.
- the target response speech acquisition unit includes an intention type determination subunit, a general speech determination subunit, and a dedicated speech determination subunit.
- the intent type determination subunit is used to determine the intent type based on the target intent.
- the general phrase determination subunit is used to, if the intent type is a general intent, query the general phrase database based on the target intention and obtain the target response words corresponding to the target intention.
- the dedicated phrase determination subunit is used to, if the intent type is a dedicated intent, query the dedicated information database based on the target intention to obtain the intent query result, and obtain the target response words corresponding to the target intention based on the phrase template corresponding to the dedicated intent and the intent query result.
- the target response voice acquisition unit includes a general voice determination subunit and a dedicated voice determination subunit.
- the general voice determination subunit is used to, if the intent type is a general intent, query the system database based on the target response words and determine the general response recording corresponding to the target response words as the target response voice.
- the dedicated voice determination subunit is used to perform voice synthesis on the target response speech if the intention type is a dedicated intention to obtain the target response voice.
- the response voice real-time playback module 904 includes a response voice reception judgment unit, a first response processing unit, and a second response processing unit.
- the response voice receiving judgment unit is used for real-time monitoring of the playback status of the target modal particle recording played by the voice playback module. If the playback status is the end of playback, it is determined whether the target response voice can be obtained within the preset time period.
- the first response processing unit is configured to play the target response voice in real time if the target response voice can be acquired within the preset time period.
- the second response processing unit is configured to execute an emergency response mechanism if the target response voice cannot be obtained within the preset time period.
- each module in the above artificial intelligence-based voice response processing device can be implemented in whole or in part by software, hardware, and a combination thereof.
- the above-mentioned modules may be embedded in the form of hardware or independent of the processor in the intelligent interactive device, or may be stored in the memory of the intelligent interactive device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
- an intelligent interactive device may be a server, and its internal structure diagram may be as shown in FIG. 10.
- the intelligent interactive device includes a processor, a memory, a network interface and a database connected through a system bus. Among them, the processor of the intelligent interactive device is used to provide calculation and control capabilities.
- the memory of the smart interactive device includes a readable storage medium and an internal memory.
- the readable storage medium stores an operating system, computer readable instructions, and a database.
- the internal memory provides an environment for the operation of the operating system and computer readable instructions in the readable storage medium.
- the database of the intelligent interactive device is used to store data adopted or generated during the process of executing the voice response processing method based on artificial intelligence.
- the network interface of the intelligent interactive device is used to communicate with an external terminal through a network connection.
- the computer-readable instructions are executed by the processor to realize a voice response processing method based on artificial intelligence.
- the readable storage medium may be a non-volatile readable storage medium or a volatile readable storage medium.
- an intelligent interactive device including a memory, a processor, and computer-readable instructions stored in the memory and capable of running on the processor.
- when the processor executes the computer-readable instructions, it implements the artificial intelligence-based voice response processing method in the above embodiment, such as S201-S204 shown in FIG. 2, or the steps shown in FIG. 3 to FIG. 8, which are not repeated here to avoid repetition.
- alternatively, when the processor executes the computer-readable instructions, it implements the functions of each module/unit in the embodiment of the artificial intelligence-based voice response processing device, for example, the functions of the to-be-processed voice stream acquisition module 901, the to-be-analyzed voice stream acquisition module 902, the playback analysis parallel processing module 903, and the response voice real-time playback module 904 shown in FIG. 9, which are not repeated here to avoid repetition.
- one or more readable storage media storing computer readable instructions are provided.
- the readable storage medium stores computer readable instructions.
- when the computer-readable instructions are executed by one or more processors, the one or more processors execute the artificial intelligence-based voice response processing method in the foregoing embodiment, such as S201-S204 shown in FIG. 2 or the steps shown in FIG. 3 to FIG. 8; to avoid repetition, details are not described here again.
- alternatively, when the computer-readable instructions are executed, the functions of each module/unit in the embodiment of the artificial intelligence-based voice response processing device are realized, for example, the functions of the to-be-processed voice stream acquisition module 901, the to-be-analyzed voice stream acquisition module 902, the playback analysis parallel processing module 903, and the response voice real-time playback module 904, which are not repeated here to avoid repetition.
- Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Abstract
An artificial-intelligence-based voice response processing method, apparatus, device, and medium are provided. The method includes: acquiring a to-be-processed voice stream collected in real time by a voice recording module (S201); performing sentence completeness analysis on the to-be-processed voice stream to obtain a to-be-analyzed voice stream (S202); executing a first process and a second process in parallel, controlling a voice playback module to play a target modal particle recording via the first process, and recognizing the to-be-analyzed voice stream via the second process to obtain a target response voice (S203); and monitoring in real time the playback state of the target modal particle recording on the voice playback module and, once the playback state is playback-ended, controlling the voice playback module to play the target response voice (S204). The method enables an intelligent interactive device to respond in real time during human-machine interaction, shortening the pause response time and improving the response effect of voice interaction.
Description
This application claims priority to Chinese Patent Application No. 202010122179.3, filed with the China National Intellectual Property Administration on February 27, 2020 and entitled "Artificial-Intelligence-Based Voice Response Processing Method, Apparatus, Device, and Medium", the entire contents of which are incorporated herein by reference.
This application relates to the field of speech processing technology, and in particular to an artificial-intelligence-based voice response processing method, apparatus, device, and medium.
With the rapid development of artificial intelligence technology, various intelligent interactive devices applying it have emerged to make people's work and life more convenient. For example, an intelligent interactive device with a voice interaction function can collect and recognize a user's real-time speech and respond based on the recognition result, achieving human-machine interaction. To respond to real-time speech, current intelligent interactive devices must pass through ASR speech recognition, NLP semantic analysis, TTS speech synthesis, and similar processing. The time this processing takes is the pause response time of the interaction: the interval from the moment the user finishes a segment of real-time speech to the moment the device responds to it. The inventors realized that the pause response time of current intelligent interactive devices is long enough that users perceive a delay, which degrades the voice interaction experience.
SUMMARY
Embodiments of this application provide an artificial-intelligence-based voice response processing method, apparatus, device, and medium, to solve the problem that the pause response time of current intelligent interactive devices during voice interaction with users is too long.
An artificial-intelligence-based voice response processing method includes:
acquiring a to-be-processed voice stream collected in real time by a voice recording module;
performing sentence completeness analysis on the to-be-processed voice stream to obtain a to-be-analyzed voice stream;
executing a first process and a second process in parallel, invoking the first process to control a voice playback module to play a target modal particle recording, and invoking the second process to recognize the to-be-analyzed voice stream and obtain a target response voice; and
monitoring in real time the playback state of the target modal particle recording on the voice playback module and, if the playback state is playback-ended, controlling the voice playback module to play the target response voice.
An artificial-intelligence-based voice response processing apparatus includes:
a to-be-processed voice stream acquisition module, configured to acquire a to-be-processed voice stream collected in real time by a voice recording module;
a to-be-analyzed voice stream acquisition module, configured to perform sentence completeness analysis on the to-be-processed voice stream to obtain a to-be-analyzed voice stream;
a playback-analysis parallel processing module, configured to execute a first process and a second process in parallel, invoke the first process to control a voice playback module to play a target modal particle recording, and invoke the second process to recognize the to-be-analyzed voice stream and obtain a target response voice; and
a response voice real-time playback module, configured to monitor in real time the playback state of the target modal particle recording on the voice playback module and, if the playback state is playback-ended, control the voice playback module to play the target response voice.
An intelligent interactive device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When executing the computer-readable instructions, the processor implements the following steps:
acquiring a to-be-processed voice stream collected in real time by a voice recording module;
performing sentence completeness analysis on the to-be-processed voice stream to obtain a to-be-analyzed voice stream;
executing a first process and a second process in parallel, invoking the first process to control a voice playback module to play a target modal particle recording, and invoking the second process to recognize the to-be-analyzed voice stream and obtain a target response voice; and
monitoring in real time the playback state of the target modal particle recording on the voice playback module and, if the playback state is playback-ended, controlling the voice playback module to play the target response voice.
One or more readable storage media storing computer-readable instructions are provided. When executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the following steps:
acquiring a to-be-processed voice stream collected in real time by a voice recording module;
performing sentence completeness analysis on the to-be-processed voice stream to obtain a to-be-analyzed voice stream;
executing a first process and a second process in parallel, invoking the first process to control a voice playback module to play a target modal particle recording, and invoking the second process to recognize the to-be-analyzed voice stream and obtain a target response voice; and
monitoring in real time the playback state of the target modal particle recording on the voice playback module and, if the playback state is playback-ended, controlling the voice playback module to play the target response voice.
In the above artificial-intelligence-based voice response processing method, apparatus, device, and medium, sentence completeness analysis is first performed on the to-be-processed voice stream collected in real time during voice interaction to determine the to-be-analyzed voice stream, which helps improve the accuracy and timeliness of the subsequent recognition and analysis. The target modal particle recording is played while the to-be-analyzed voice stream is being recognized, and the target response voice is played once the modal particle recording finishes, so that recognition of the to-be-analyzed voice stream and playback of the modal particle recording proceed simultaneously. The modal particle recording thus fills the pause response time in which the to-be-analyzed voice stream is analyzed, the target response voice follows the modal particle recording naturally, and both the response time and the response effect of the voice interaction are improved.
To describe the technical solutions of the embodiments of this application more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of this application, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an application environment of the artificial-intelligence-based voice response processing method in an embodiment of this application;
FIG. 2 is a flowchart of the artificial-intelligence-based voice response processing method in an embodiment of this application;
FIG. 3 is another flowchart of the artificial-intelligence-based voice response processing method in an embodiment of this application;
FIG. 4 is another flowchart of the artificial-intelligence-based voice response processing method in an embodiment of this application;
FIG. 5 is another flowchart of the artificial-intelligence-based voice response processing method in an embodiment of this application;
FIG. 6 is another flowchart of the artificial-intelligence-based voice response processing method in an embodiment of this application;
FIG. 7 is another flowchart of the artificial-intelligence-based voice response processing method in an embodiment of this application;
FIG. 8 is another flowchart of the artificial-intelligence-based voice response processing method in an embodiment of this application;
FIG. 9 is a schematic diagram of the artificial-intelligence-based voice response processing apparatus in an embodiment of this application;
FIG. 10 is a schematic diagram of the intelligent interactive device in an embodiment of this application.
The technical solutions in the embodiments of this application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some rather than all of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort fall within the protection scope of this application.
The artificial-intelligence-based voice response processing method provided by the embodiments of this application can be applied on a stand-alone intelligent interactive device, or in an application environment as shown in FIG. 1.
As one example, when the method is applied on a stand-alone intelligent interactive device, the device is provided with a processor and with a voice recording module and a voice playback module connected to the processor, and the method is executed on the processor, so that each pause response time during voice interaction between the user and the device is short, the user perceives no delay in the interaction, and the experience is better.
As another example, the method is applied in an artificial-intelligence-based voice response processing system that includes an intelligent interactive device and a server, as shown in FIG. 1, which communicate over a network. The intelligent interactive device carries the voice recording module and the voice playback module, and the method is executed on the server, again keeping each pause response time short so that the user perceives no delay and the experience is better. The server may be implemented as a stand-alone server or as a cluster of multiple servers. The intelligent interactive device may be a robot capable of human-machine interaction.
In an embodiment, as shown in FIG. 2, an artificial-intelligence-based voice response processing method is provided. Taking as an example its application on the processor of a stand-alone intelligent interactive device or on a server connected to the device, the method includes the following steps:
S201: acquire the to-be-processed voice stream collected in real time by the voice recording module.
The voice recording module is a module that implements a recording function. As an example, it may be a recording chip integrated in the intelligent interactive device or in a client for implementing the recording function.
The to-be-processed voice stream is the voice stream collected in real time by the voice recording module that requires subsequent recognition. As an example, the processor of the intelligent interactive device, or the server connected to it, acquires the voice stream formed in real time while the user speaks; this stream reflects the intent of a user who wants to interact with the device.
S202: perform sentence completeness analysis on the to-be-processed voice stream to obtain the to-be-analyzed voice stream.
The to-be-analyzed voice stream is the portion of the to-be-processed voice stream determined to reflect that the user has finished a passage of speech. Sentence completeness analysis means splitting the to-be-processed voice stream at complete-sentence boundaries, so that each to-be-analyzed voice stream reflects the user's intent completely and accurately.
As an example, the processor of the intelligent interactive device, or the server connected to it, cuts out of the voice stream being recorded in real time those segments that show the user has finished a given passage, and treats each as a to-be-analyzed voice stream, so that it can later be recognized and analyzed to determine the user intent it reflects and to respond to that intent, achieving human-machine interaction. Understandably, cutting out segments in which the user has finished speaking safeguards the accuracy and timeliness of the subsequent recognition and analysis; it avoids splitting one passage of the user's speech into several fragments that are processed separately, which would lower both.
S203: execute the first process and the second process in parallel, invoke the first process to control the voice playback module to play the target modal particle recording, and invoke the second process to recognize the to-be-analyzed voice stream and obtain the target response voice.
The target modal particle recording is the modal particle recording to be played this time; a modal particle recording is a pre-recorded recording of a filler expression, for example a recording corresponding to an interjection such as "mm-hmm".
The target response voice is the voice that responds to the user intent determined by recognizing and analyzing the to-be-analyzed voice stream. For example, if the user intent behind the to-be-analyzed voice stream is "I want to know the yield of product A", the target response voice is "The yield of product A is...". This responds intelligently to the user intent in the stream, replacing a human response and helping save labor costs.
The first process is a process created on the processor of the intelligent interactive device or of the server for controlling the voice playback module. The second process is a process created on the same processor for recognizing the to-be-analyzed voice stream.
As an example, after obtaining the to-be-analyzed voice stream, the processor of the intelligent interactive device or the connected server creates, or invokes previously created, first and second processes and runs them in parallel: the first process controls the voice playback module to play the target modal particle recording, while the second process recognizes the to-be-analyzed voice stream to obtain the target response voice. Playing the modal particle recording in parallel with recognition means the recording is played within the pause response time needed to recognize and analyze the stream, so the device responds promptly and an over-long pause that degrades the user experience is avoided; moreover, playing a modal particle recording makes the interaction more conversational, which also improves the experience. The pause response time here is the processing time needed to recognize and analyze the to-be-analyzed voice stream and to determine and play the target response voice. For example, if recognizing and analyzing a given stream takes 3 s, and within 1 s of obtaining the stream the device plays a 2 s modal particle recording, the perceived pause shrinks to under 1 s, so the user does not notice a response delay.
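To make the parallel structure of S203 easier to follow, here is a minimal, purely illustrative Python sketch (not part of the original disclosure); the `playback` and `recognizer` interfaces, the `duration` attribute, and the queue hand-off are all assumptions:

```python
import threading
import queue

def respond(voice_stream_to_analyze, playback, recognizer):
    """Illustrative sketch of S203/S204: play a filler recording while
    the recognition pipeline runs in parallel."""
    response_q = queue.Queue(maxsize=1)

    def first_process():
        # First process: select and play the target modal particle recording.
        recording = playback.pick_filler_recording(voice_stream_to_analyze.duration)
        playback.play(recording)

    def second_process():
        # Second process: ASR -> semantic analysis -> target response voice.
        text = recognizer.speech_to_text(voice_stream_to_analyze)
        intent = recognizer.parse_intent(text)
        response_q.put(recognizer.build_response_voice(intent))

    t1 = threading.Thread(target=first_process)
    t2 = threading.Thread(target=second_process)
    t1.start(); t2.start()
    t1.join()                    # playback of the filler recording has ended
    response = response_q.get()  # target response voice from the second process
    playback.play(response)      # S204: play the response right after the filler
```

Threads are used here only for brevity; separate operating-system processes, as the claim language literally suggests, would follow the same structure.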
S204: monitor in real time the playback state of the target modal particle recording on the voice playback module and, if the playback state is playback-ended, control the voice playback module to play the target response voice.
As an example, the target modal particle recording can be understood as the recording played while the to-be-analyzed voice stream is being recognized and analyzed; in general its playback duration falls within the pause response time of that stream. The intelligent interactive device can therefore play the target response voice immediately after the modal particle recording finishes, responding promptly to the user intent determined from the stream.
Understandably, after obtaining the to-be-analyzed voice stream and having the first process start playback of the target modal particle recording, the processor of the intelligent interactive device or the connected server invokes a state monitoring tool to monitor the playback state in real time; the state is either playback-ended or playback-not-ended. When the state becomes playback-ended, the first process controls the voice playback module to play the target response voice corresponding to the to-be-analyzed voice stream, so that the response voice follows the modal particle recording naturally and an over-long pause does not hurt the user experience. The state monitoring tool is a preconfigured tool for monitoring the playback state of the voice playback module.
In the artificial-intelligence-based voice response processing method provided by this embodiment, sentence completeness analysis is first performed on the to-be-processed voice stream collected in real time during voice interaction to determine the to-be-analyzed voice stream, which helps improve the accuracy and timeliness of subsequent recognition and analysis. Executing the first and second processes in parallel lets recognition of the to-be-analyzed voice stream and playback of the target modal particle recording proceed simultaneously, so that the recording is played within the pause response time of the analysis, improving both the response time and the response effect of the voice interaction. Once the playback state of the modal particle recording is detected as ended, the voice playback module is controlled to play the target response voice, so the two recordings join naturally, which further improves the response effect.
In an embodiment, as shown in FIG. 3, step S202 of performing sentence completeness analysis on the to-be-processed voice stream to obtain the to-be-analyzed voice stream specifically includes the following steps:
S301: monitor the to-be-processed voice stream with a voice activity detection algorithm to obtain voice pause points and their corresponding pause durations.
A voice activity detection (VAD) algorithm detects whether the current speech signal contains voice, i.e., it judges the input signal and separates the voice signal from the various background noise signals.
A voice pause point is a position in the to-be-processed voice stream where the VAD algorithm identifies a pause in speech, i.e., the position in the stream at which the user paused while speaking. The pause duration of a pause point is the time difference between the start and end of the pause identified by the VAD algorithm.
As an example, the intelligent interactive device may apply the voice activity detection algorithm to monitor silence in the to-be-processed voice stream, determining the pause points at which the user paused while speaking and the duration of each pause, so that the pause durations can be used to analyze whether the user has finished a sentence, thereby performing the sentence completeness analysis.
S302: determine voice pause points whose pause duration exceeds a preset duration threshold as target pause points.
The preset duration threshold is a preconfigured duration threshold for judging that the user has paused after finishing a sentence. A target pause point is a pause position in the to-be-processed voice stream at which analysis determines the user finished a sentence.
As an example, the intelligent interactive device compares the pause duration of each pause point with the preset duration threshold. If the duration exceeds the threshold, the user is deemed to have finished a sentence and that pause point is determined as a target pause point; if the duration does not exceed the threshold, the user is deemed not to have finished, the pause point is a brief pause mid-speech, and it is not determined as a target pause point.
S303: obtain the to-be-analyzed voice stream based on two adjacent target pause points.
Specifically, after determining at least two target pause points in the stream being collected in real time, the intelligent interactive device takes the voice stream between two adjacent target pause points as a to-be-analyzed voice stream, so that each to-be-analyzed stream reflects a complete sentence the user intended to express. This improves the accuracy and timeliness of the subsequent recognition and analysis: the silent signal at the target pause points need not itself be recognized and analyzed, safeguarding timeliness, and because each to-be-analyzed stream reflects a complete sentence, the subsequent recognition and response are more accurate.
As an example, the device takes the position of the moment recording started as the initial target pause point; it then takes the next target pause point after the initial one as the ending target pause point and determines one to-be-analyzed voice stream from the initial and ending target pause points; finally it updates the ending target pause point to be the new initial target pause point and repeats the step, thereby splitting multiple to-be-analyzed voice streams out of the to-be-processed stream in real time. This guarantees that to-be-analyzed streams are determined in real time and helps improve the accuracy and timeliness of their subsequent recognition and analysis.
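Purely as an illustration of S301-S303, the sketch below uses a naive frame-energy VAD; a production system would use a proper VAD algorithm, and the frame length, energy threshold, and pause threshold here are assumed values:

```python
import numpy as np

FRAME_MS = 20             # analysis frame length in ms (assumption)
ENERGY_THRESHOLD = 1e-4   # silence energy threshold (assumption; tune per mic)
PAUSE_THRESHOLD_MS = 600  # preset duration threshold for "sentence finished"

def segment_by_pauses(samples, sample_rate):
    """Sketch of S301-S303: mark pause points with an energy-based VAD,
    promote pauses longer than the preset threshold to target pause
    points, and return the audio between adjacent target pause points
    as to-be-analyzed voice streams."""
    samples = np.asarray(samples, dtype=float)  # 1-D mono signal
    frame_len = int(sample_rate * FRAME_MS / 1000)
    n_frames = len(samples) // frame_len
    energies = [np.mean(samples[i * frame_len:(i + 1) * frame_len] ** 2)
                for i in range(n_frames)]
    silent = [e < ENERGY_THRESHOLD for e in energies]

    # Collect target pause points: silence runs longer than the threshold.
    target_points = [0]
    run_start = None
    for i, is_silent in enumerate(silent + [False]):  # sentinel flushes a tail run
        if is_silent and run_start is None:
            run_start = i
        elif not is_silent and run_start is not None:
            if (i - run_start) * FRAME_MS >= PAUSE_THRESHOLD_MS:
                target_points.append(run_start * frame_len)  # pause start
            run_start = None
    target_points.append(len(samples))

    # Each pair of adjacent target pause points bounds one segment.
    return [samples[a:b] for a, b in zip(target_points, target_points[1:]) if b > a]
```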
In the method provided by this embodiment, a VAD algorithm first monitors the pause points and pause durations in the stream collected in real time, ensuring the objectivity of the processing. Pause points whose duration exceeds the preset threshold are determined as target pause points, avoiding splits at pause points with shorter durations that would make the subsequent recognition and analysis inaccurate. The to-be-analyzed voice stream is then determined from two adjacent target pause points, so each stream reflects a complete sentence the user intended to express, improving the accuracy and timeliness of the subsequent recognition and analysis.
In an embodiment, as shown in FIG. 4, invoking the first process in step S203 to control the voice playback module to play the target modal particle recording specifically includes invoking the first process to perform the following steps:
S401: obtain the speech duration corresponding to the to-be-analyzed voice stream.
Specifically, the intelligent interactive device invokes the first process to determine the two adjacent target pause points corresponding to the to-be-analyzed voice stream and obtains the stream's speech duration from them. As an example, determining the to-be-analyzed stream from two adjacent target pause points specifically means taking as the stream the audio between the end moment of the previous target pause point and the start moment of the next one; the time difference between those two moments can then be taken as the stream's speech duration. Understandably, the speech duration can be determined directly from the start and end moments of the two adjacent target pause points, which makes its determination simple and helps improve the efficiency of subsequent processing.
S402: query the system database based on the speech duration, determine the target modal particle recording from the original modal particle recording matching the speech duration, and control the voice playback module to play the target modal particle recording.
The system database is a database set on, or connected to, the intelligent interactive device for storing the data involved in voice interaction. Original modal particle recordings are pre-recorded modal particle recordings to be played while the device interacts with the user. The target modal particle recording is one of the original recordings, specifically the one matching the speech duration of the to-be-analyzed voice stream.
As an example, original modal particle recordings with different playback durations may be pre-recorded in the system database. After obtaining the speech duration of the to-be-analyzed stream, the device estimates from it the processing duration that recognizing and analyzing the stream will need; it then selects from the database the original recording whose playback duration matches the estimated processing duration as the target recording and controls the voice playback module to play it. For example, the database may store a duration lookup table mapping the speech duration of a to-be-analyzed stream to its estimated processing duration, so that the estimate can later be found quickly by a table lookup. "Matches" here means the time difference between playback duration and estimated processing duration is minimal, or lies within a preset error range, so that after the modal particle recording is played within the pause response time of the analysis, the target response voice follows it more naturally, improving response efficiency.
Further, when at least two original recordings have playback durations whose difference from the estimated processing duration lies within the preset error range, at least two matching recordings exist; in that case one of them is picked at random as the target modal particle recording, or one that differs from the recording used last time is picked.
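The duration-matching logic of S402 can be pictured with the following sketch; the lookup table, the recording library, and the tolerance are hypothetical values invented for illustration:

```python
import random

# Hypothetical duration lookup table: speech duration of the
# to-be-analyzed stream (s) -> estimated processing duration (s).
DURATION_TABLE = [(2.0, 1.0), (5.0, 2.0), (10.0, 3.5)]

# Hypothetical library of original modal particle recordings:
# (file, playback duration in seconds).
FILLER_RECORDINGS = [("mm.wav", 0.8), ("uh_huh.wav", 1.2), ("i_see.wav", 2.1)]

TOLERANCE = 0.5  # preset error range in seconds (assumption)

def pick_filler_recording(speech_duration, last_used=None):
    # Look up the estimated processing duration for this speech duration.
    estimated = next((t for limit, t in DURATION_TABLE
                      if speech_duration <= limit), DURATION_TABLE[-1][1])
    # Keep recordings whose playback duration falls within the error range.
    candidates = [f for f in FILLER_RECORDINGS
                  if abs(f[1] - estimated) <= TOLERANCE]
    if not candidates:  # fall back to the closest match
        candidates = [min(FILLER_RECORDINGS, key=lambda f: abs(f[1] - estimated))]
    # With several matches, avoid repeating the recording used last time.
    fresh = [f for f in candidates if f[0] != last_used] or candidates
    return random.choice(fresh)
```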
In the method provided by this embodiment, the speech duration of the to-be-analyzed voice stream can be determined quickly from its two adjacent target pause points, making its acquisition simple and efficient; the target modal particle recording is then determined from the original recording matching that duration, so that the target response voice follows the modal particle recording more naturally, improving the efficiency of response processing.
In an embodiment, as shown in FIG. 5, invoking the second process in step S203 to recognize the to-be-analyzed voice stream and obtain the target response voice specifically includes invoking the second process to perform the following steps:
S501: perform speech recognition on the to-be-analyzed voice stream to obtain the to-be-analyzed text corresponding to it.
The to-be-analyzed text is the textual content determined by performing speech recognition on the to-be-analyzed voice stream. In this embodiment, this step can be understood as converting the speech signal of the to-be-analyzed voice stream into textual information on which subsequent recognition can run.
As an example, the intelligent interactive device may apply ASR (Automatic Speech Recognition) technology, or a pre-trained static decoding network capable of speech-to-text conversion, to recognize the to-be-analyzed voice stream, quickly obtaining the corresponding to-be-analyzed text for subsequent semantic analysis.
S502: perform semantic analysis on the to-be-analyzed text to obtain the target intent corresponding to it.
The target intent is the user intent determined by semantic analysis of the to-be-analyzed text. In this embodiment, this step can be understood as using artificial intelligence technology to extract the user intent from the textual information, analogous to how a human brain identifies intent from a user's words.
As an example, the intelligent interactive device may apply NLP (Natural Language Processing) technology, or a semantic analysis model built in advance on a neural network, to analyze the to-be-analyzed text and determine the target intent from it accurately and quickly.
S503: query the system database based on the target intent to obtain the target response script corresponding to it.
The target response script is the script with which the intelligent interactive device responds to the analyzed target intent; it exists as text and is the response to the target intent recognized from the to-be-analyzed text. For example, if the recognized target intent is "the yield of product A", the corresponding target response script is "The yield of product A is..."; if the recognized target intent is "how much do I still owe this month", the corresponding script is "Your outstanding repayment this month is...", and so on.
As an example, after determining the target intent of the to-be-analyzed text, the device queries the system database based on it, either fetching the target response script corresponding to the intent directly, or fetching the response content corresponding to the intent and forming the target response script from that content.
S504: obtain the target response voice based on the target response script.
The target response voice is the voice corresponding to the target response script. Understandably, it is the voice that must be played in real time after the pause response time of the to-be-analyzed stream during human-machine interaction, specifically the voice responding to the target intent recognized in that stream.
As an example, the target response voice may be determined from the target response script in either of two ways: by querying the system database for a pre-recorded target response voice corresponding to the script, which makes acquisition fast, or by applying text-to-speech conversion to the script to obtain the corresponding voice, which safeguards its real-time nature. Text-to-speech conversion here is technology for converting textual content into speech content, such as TTS speech synthesis.
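The whole second process of S501-S504 then reduces to a four-step pipeline. In this illustrative sketch, `asr`, `nlu`, `system_db`, and `tts` are assumed interfaces standing in for the ASR engine, the semantic analyzer, the system database, and the speech synthesizer:

```python
def second_process(voice_stream, asr, nlu, system_db, tts):
    """Sketch of S501-S504; all four injected interfaces are assumptions."""
    # S501: speech recognition -> to-be-analyzed text.
    text = asr.transcribe(voice_stream)
    # S502: semantic analysis -> target intent.
    intent = nlu.parse(text)
    # S503: query the system database -> target response script.
    script = system_db.response_script_for(intent)
    # S504: script -> target response voice; prefer a pre-recorded voice
    # (assumed to return None when absent), else synthesize with TTS.
    return system_db.recorded_voice_for(script) or tts.synthesize(script)
```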
In the method provided by this embodiment, speech recognition and semantic analysis of the to-be-analyzed voice stream quickly determine its target intent; the target response script and the corresponding target response voice are then determined from that intent, so that the streams collected and cut out in real time by the voice recording module are recognized, analyzed, and responded to, achieving intelligent interaction. This lets intelligent interactive devices be deployed widely wherever human questions must be answered, for example devices placed in public venues for user inquiries, saving labor costs.
In an embodiment, as shown in FIG. 6, step S503 of querying the system database based on the target intent to obtain the corresponding target response script specifically includes the following steps:
S601: determine the intent type based on the target intent.
The intent type is the category to which the target intent belongs. As an example, intent types may be divided into general intents and specific intents. A general intent queries general information, i.e., information unrelated to any particular user's data, for example the yield of product A. A specific intent queries specific information, i.e., information tied to a particular user's data, for example user 1's loan amount and repayment term.
S602: if the intent type is a general intent, query the general script database based on the target intent to obtain the corresponding target response script.
The general script database is a sub-database of the system database dedicated to storing general response scripts, which are preconfigured scripts for answering general questions.
As an example, when the target intent recognized from the to-be-analyzed text is a general intent, the user wants general information unrelated to any particular user's data, and the general script database can hold a corresponding general response script for it; the intelligent interactive terminal therefore queries the general script database based on the target intent and uses the matching general response script as the target response script, making acquisition efficient.
S603: if the intent type is a specific intent, query the specific information database based on the target intent to obtain an intent query result, and obtain the target response script corresponding to the target intent from the script template of the specific intent and the intent query result.
The specific information database is a sub-database of the system database dedicated to storing user-specific information, i.e., information tied to a user, such as account balance or loan amount. The script template corresponding to a specific intent is a preconfigured template for the script that answers that intent. For example, for "I want to know my monthly repayment information", the corresponding template is "Your monthly repayment amount is ..., due on ...", and so on.
As an example, when the target intent recognized from the to-be-analyzed text is a specific intent, the user wants specific information tied to their own data, which is generally stored in the specific information database; the intelligent interactive device therefore queries that database based on the target intent to quickly obtain the intent query result, then fills the result into the script template of the specific intent to obtain the target response script, safeguarding the real-time nature of its acquisition.
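A minimal sketch of the branching in S601-S603 follows; the intent attributes, the two databases (plain dictionaries here), the example entries, and the template keys are all illustrative assumptions:

```python
from dataclasses import dataclass

GENERAL_SCRIPTS = {  # hypothetical general script database
    "yield_of_product_A": "The annual yield of product A is 3.5%.",
}
SCRIPT_TEMPLATES = {  # hypothetical script templates for specific intents
    "monthly_repayment": "Your repayment this month is {amount}, due on {date}.",
}

@dataclass
class Intent:          # illustrative stand-in for the parsed target intent
    name: str
    is_general: bool
    user_id: str = ""

def response_script_for(intent, user_db):
    """Sketch of S601-S603; `user_db.query` is an assumed interface that
    returns a dict matching the template's placeholders."""
    if intent.is_general:
        # S602: general intent -> fetch the general response script directly.
        return GENERAL_SCRIPTS[intent.name]
    # S603: specific intent -> query the specific information database and
    # fill the intent query result into the intent's script template.
    result = user_db.query(intent.user_id, intent.name)
    return SCRIPT_TEMPLATES[intent.name].format(**result)
```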
In the method provided by this embodiment, according to the intent type of the target intent recognized from the to-be-analyzed text, processing matched to general intents and to specific intents respectively determines the corresponding target response script, safeguarding both the efficiency and the real-time nature of its acquisition.
In an embodiment, as shown in FIG. 7, step S504 of obtaining the target response voice based on the target response script specifically includes the following steps:
S701: if the intent type is a general intent, query the system database based on the target response script and determine the general response recording corresponding to it as the target response voice.
As an example, when the target intent recognized from the text is a general intent, the general response script in the general script database matching the intent can be used as the target response script, making its acquisition fast. To further improve the efficiency of obtaining the response voice, general response recordings corresponding to the general response scripts can be pre-recorded and stored in the system database; when a general response script is chosen as the target response script, its pre-recorded general response recording is used directly as the target response voice, improving acquisition efficiency.
S702: if the intent type is a specific intent, perform speech synthesis on the target response script to obtain the target response voice.
As an example, when the target intent recognized from the text is a specific intent, the target response script determined from it is text formed by filling the intent query result into the script template; no target response voice corresponding to that script will exist in the system database. Text-to-speech conversion must therefore be applied to the script to obtain the target response voice, safeguarding its real-time nature. Text-to-speech conversion here is technology for converting textual content into speech content, such as TTS speech synthesis.
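S701-S702 can likewise be sketched as a two-way branch; the recording table and the `tts.synthesize` call are assumptions:

```python
RECORDED_RESPONSES = {  # hypothetical pre-recorded general response audio
    "The annual yield of product A is 3.5%.": "yield_a.wav",
}

def response_voice_for(script, is_general, tts):
    """Sketch of S701-S702; `tts` is an assumed TTS interface."""
    if is_general:
        # S701: a general script has a pre-recorded response; reuse it.
        return RECORDED_RESPONSES[script]
    # S702: a specific script was filled from a template, so no recording
    # can exist in the system database; synthesize it on the fly.
    return tts.synthesize(script)
```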
In the method provided by this embodiment, according to the intent type of the recognized target intent: when the type is a general intent, the general response recording can be used directly as the target response voice, improving acquisition efficiency; when the type is a specific intent, text-to-speech conversion is applied to the target response script to obtain the target response voice, improving its real-time nature.
In an embodiment, as shown in FIG. 8, step S204 of monitoring in real time the playback state of the target modal particle recording and, if the playback state is playback-ended, controlling the voice playback module to play the target response voice, specifically includes the following steps:
S801: monitor in real time the playback state of the target modal particle recording on the voice playback module; if the playback state is playback-ended, judge whether the target response voice can be obtained within a preset time period.
Because the target modal particle recording is the voice played to the user during the pause response time between collecting the to-be-analyzed voice stream and playing the target response voice, making the hand-off natural and preventing an over-long pause from degrading the user experience, it must be guaranteed that playback can switch to the target response voice in real time once the modal particle recording ends. However, after playing the recording, the device may fail to obtain the target response voice in time because of a malfunction, making the switch impossible; without another response mechanism the device would stay unresponsive for a long time, hurting the user experience.
Therefore, the intelligent interactive device invokes the preconfigured state monitoring tool to monitor the playback state of the target modal particle recording in real time. If the state is playback-ended, it judges whether the target response voice can be obtained within the preset time period, a preconfigured period, and proceeds according to the result of that judgment; if the state is playback-not-ended, it keeps waiting and performs the judgment only once the state becomes playback-ended.
S802: if the target response voice can be obtained within the preset time period, play it in real time.
As an example, if the device can obtain the target response voice within the preset time period, it plays the voice in real time as soon as it is obtained, switching from the modal particle recording to the target response voice in real time, so the device responds promptly and an over-long pause does not hurt the user experience.
S803: if the target response voice cannot be obtained within the preset time period, execute the emergency handling mechanism.
The emergency handling mechanism is a preconfigured mechanism for the case where the target response voice cannot be obtained within the preset period. As an example, if the device cannot obtain the target response voice within the preset period, it fetches the modal particle play count; if the count is below a preset count threshold, it plays the next modal particle recording chosen at random, so the device responds before the target response voice is played instead of leaving the user waiting for a long time with no response; if the count is not below the threshold, it plays a fault prompt voice so the user learns promptly whether the device is faulty and stops waiting for a response. The modal particle play count is the number of modal particle recordings already played so far. The fault prompt voice is a preconfigured voice indicating a device fault; it may correspond specifically to the cause preventing acquisition of the target response voice.
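One way to realize S801-S803 is sketched below, assuming the second process delivers the target response voice through a queue and that `playback` exposes a blocking play call, a random filler, and a fault prompt (all assumed interfaces):

```python
import queue

PRESET_WAIT_S = 1.0    # preset time period (assumption)
MAX_FILLER_PLAYS = 2   # preset count threshold (assumption)

def play_response_or_fallback(playback, response_q):
    """Sketch of S801-S803."""
    filler_plays = 1                 # the target filler is already playing
    playback.wait_until_finished()   # S801: playback state = playback-ended
    while True:
        try:
            # Can the target response voice be obtained within the preset period?
            voice = response_q.get(timeout=PRESET_WAIT_S)
        except queue.Empty:
            if filler_plays < MAX_FILLER_PLAYS:
                # S803: stay responsive by playing another random filler.
                playback.play(playback.random_filler())
                filler_plays += 1
                continue
            # Too many fillers already: report a possible device fault.
            playback.play(playback.fault_prompt())
            return None
        playback.play(voice)         # S802: play the response immediately
        return voice
```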
In the method provided by this embodiment, depending on whether the target response voice can be obtained within the preset period after the modal particle recording finishes playing, either the target response voice or the voice corresponding to the emergency handling mechanism is played, responding promptly to the to-be-analyzed voice stream formed by the user's speech and improving response efficiency.
It should be understood that the step numbers in the above embodiments do not imply an execution order; the execution order of each step should be determined by its function and internal logic and does not limit the implementation of the embodiments of this application in any way.
In an embodiment, an artificial-intelligence-based voice response processing apparatus is provided, corresponding one-to-one to the artificial-intelligence-based voice response processing method of the above embodiments. As shown in FIG. 9, the apparatus includes a to-be-processed voice stream acquisition module 901, a to-be-analyzed voice stream acquisition module 902, a playback-analysis parallel processing module 903, and a response voice real-time playback module 904. The functional modules are described in detail as follows:
The to-be-processed voice stream acquisition module 901 is configured to acquire the to-be-processed voice stream collected in real time by the voice recording module.
The to-be-analyzed voice stream acquisition module 902 is configured to perform sentence completeness analysis on the to-be-processed voice stream to obtain the to-be-analyzed voice stream.
The playback-analysis parallel processing module 903 is configured to execute the first process and the second process in parallel, invoke the first process to control the voice playback module to play the target modal particle recording, and invoke the second process to recognize the to-be-analyzed voice stream and obtain the target response voice.
The response voice real-time playback module 904 is configured to monitor in real time the playback state of the target modal particle recording on the voice playback module and, if the playback state is playback-ended, control the voice playback module to play the target response voice.
Preferably, the to-be-analyzed voice stream acquisition module 902 includes a pause duration acquisition unit, a target pause point determination unit, and a to-be-analyzed voice stream acquisition unit.
The pause duration acquisition unit is configured to monitor the to-be-processed voice stream with a voice activity detection algorithm to obtain voice pause points and their corresponding pause durations.
The target pause point determination unit is configured to determine voice pause points whose pause duration exceeds the preset duration threshold as target pause points.
The to-be-analyzed voice stream acquisition unit is configured to obtain the to-be-analyzed voice stream based on two adjacent target pause points.
Preferably, the playback-analysis parallel processing module 903 includes a speech duration acquisition unit and a modal particle playback control unit.
The speech duration acquisition unit is configured to obtain the speech duration corresponding to the to-be-analyzed voice stream.
The modal particle playback control unit is configured to query the system database based on the speech duration, determine the target modal particle recording from the original modal particle recording matching the speech duration, and control the voice playback module to play the target modal particle recording.
Preferably, the playback-analysis parallel processing module 903 includes a to-be-analyzed text acquisition unit, a target intent acquisition unit, a target response script acquisition unit, and a target response voice acquisition unit.
The to-be-analyzed text acquisition unit is configured to perform speech recognition on the to-be-analyzed voice stream to obtain the to-be-analyzed text corresponding to it.
The target intent acquisition unit is configured to perform semantic analysis on the to-be-analyzed text to obtain the target intent corresponding to it.
The target response script acquisition unit is configured to query the system database based on the target intent to obtain the target response script corresponding to it.
The target response voice acquisition unit is configured to obtain the target response voice based on the target response script.
Preferably, the target response script acquisition unit includes an intent type determination subunit, a general script determination subunit, and a specific script determination subunit.
The intent type determination subunit is configured to determine the intent type based on the target intent.
The general script determination subunit is configured to, if the intent type is a general intent, query the general script database based on the target intent to obtain the target response script corresponding to it.
The specific script determination subunit is configured to, if the intent type is a specific intent, query the specific information database based on the target intent to obtain the intent query result, and obtain the target response script corresponding to the target intent based on the script template of the specific intent and the intent query result.
Preferably, the target response voice acquisition unit includes a general voice determination subunit and a specific voice determination subunit.
The general voice determination subunit is configured to, if the intent type is a general intent, query the system database based on the target response script and determine the general response recording corresponding to it as the target response voice.
The specific voice determination subunit is configured to, if the intent type is a specific intent, perform speech synthesis on the target response script to obtain the target response voice.
Preferably, the response voice real-time playback module 904 includes a response voice reception judgment unit, a first response processing unit, and a second response processing unit.
The response voice reception judgment unit is configured to monitor in real time the playback state of the target modal particle recording on the voice playback module and, if the playback state is playback-ended, judge whether the target response voice can be obtained within the preset time period.
The first response processing unit is configured to play the target response voice in real time if it can be obtained within the preset time period.
The second response processing unit is configured to execute the emergency handling mechanism if the target response voice cannot be obtained within the preset time period.
For the specific limitations of the artificial-intelligence-based voice response processing apparatus, see the limitations of the artificial-intelligence-based voice response processing method above; they are not repeated here. Each module of the apparatus may be implemented wholly or partly by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, the processor of the intelligent interactive device in hardware form, or stored in the memory of the device in software form, so that the processor can invoke them and execute the operations corresponding to each module.
In one embodiment, an intelligent interactive device is provided; it may be a server, and its internal structure may be as shown in FIG. 10. The intelligent interactive device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the device provides computing and control capabilities. The memory of the device includes a readable storage medium and an internal memory; the readable storage medium stores an operating system, computer-readable instructions, and a database, and the internal memory provides an environment for running that operating system and those computer-readable instructions. The database of the device stores the data used or generated while executing the artificial-intelligence-based voice response processing method. The network interface of the device communicates with external terminals over a network. When executed by the processor, the computer-readable instructions implement an artificial-intelligence-based voice response processing method. In this example, the readable storage medium may be a non-volatile readable storage medium or a volatile readable storage medium.
In one embodiment, an intelligent interactive device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When the processor executes the computer-readable instructions, it implements the artificial-intelligence-based voice response processing method of the above embodiments, for example S201-S204 shown in FIG. 2 or the steps shown in FIG. 3 to FIG. 8; to avoid repetition, details are not described here again. Alternatively, when the processor executes the computer-readable instructions, it implements the functions of each module/unit in the embodiment of the artificial-intelligence-based voice response processing apparatus, for example the functions of the to-be-processed voice stream acquisition module 901, the to-be-analyzed voice stream acquisition module 902, the playback-analysis parallel processing module 903, and the response voice real-time playback module 904 shown in FIG. 9; to avoid repetition, details are not described here again.
In an embodiment, one or more readable storage media storing computer-readable instructions are provided. When the computer-readable instructions are executed by one or more processors, the one or more processors execute the artificial-intelligence-based voice response processing method of the above embodiments, for example S201-S204 shown in FIG. 2 or the steps shown in FIG. 3 to FIG. 8, or implement the functions of each module/unit in the embodiment of the artificial-intelligence-based voice response processing apparatus, for example the functions of the to-be-processed voice stream acquisition module 901, the to-be-analyzed voice stream acquisition module 902, the playback-analysis parallel processing module 903, and the response voice real-time playback module 904 shown in FIG. 9; to avoid repetition, details are not described here again.
A person of ordinary skill in the art can understand that all or part of the flows of the methods in the above embodiments can be completed by computer-readable instructions instructing the relevant hardware; the computer-readable instructions may be stored in a non-volatile computer-readable storage medium and, when executed, may include the flows of the embodiments of the above methods. Any reference to memory, storage, a database, or another medium used in the embodiments provided by this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Those skilled in the art can clearly understand that, for convenience and brevity of description, only the division into the above functional units and modules is used as an example; in practice, the above functions can be allocated to different functional units and modules as needed, i.e., the internal structure of the apparatus can be divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are only intended to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be equivalently replaced; such modifications or replacements do not make the essence of the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of this application, and shall all fall within the protection scope of this application.
Claims (20)
- An artificial-intelligence-based voice response processing method, comprising: acquiring a to-be-processed voice stream collected in real time by a voice recording module; performing sentence completeness analysis on the to-be-processed voice stream to obtain a to-be-analyzed voice stream; executing a first process and a second process in parallel, invoking the first process to control a voice playback module to play a target modal particle recording, and invoking the second process to recognize the to-be-analyzed voice stream to obtain a target response voice; and monitoring in real time the playback state of the target modal particle recording on the voice playback module and, if the playback state is playback-ended, controlling the voice playback module to play the target response voice.
- The artificial-intelligence-based voice response processing method according to claim 1, wherein performing sentence completeness analysis on the to-be-processed voice stream to obtain the to-be-analyzed voice stream comprises: monitoring the to-be-processed voice stream with a voice activity detection algorithm to obtain voice pause points and their corresponding pause durations; determining voice pause points whose pause duration exceeds a preset duration threshold as target pause points; and obtaining the to-be-analyzed voice stream based on two adjacent target pause points.
- The artificial-intelligence-based voice response processing method according to claim 1, wherein invoking the first process to control the voice playback module to play the target modal particle recording comprises: obtaining the speech duration corresponding to the to-be-analyzed voice stream; and querying a system database based on the speech duration, determining the target modal particle recording from the original modal particle recording matching the speech duration, and controlling the voice playback module to play the target modal particle recording.
- The artificial-intelligence-based voice response processing method according to claim 1, wherein invoking the second process to recognize the to-be-analyzed voice stream to obtain the target response voice comprises: performing speech recognition on the to-be-analyzed voice stream to obtain the to-be-analyzed text corresponding to it; performing semantic analysis on the to-be-analyzed text to obtain the target intent corresponding to it; querying a system database based on the target intent to obtain the target response script corresponding to it; and obtaining the target response voice based on the target response script.
- The artificial-intelligence-based voice response processing method according to claim 4, wherein querying the system database based on the target intent to obtain the target response script corresponding to it comprises: determining an intent type based on the target intent; if the intent type is a general intent, querying a general script database based on the target intent to obtain the target response script corresponding to it; and if the intent type is a specific intent, querying a specific information database based on the target intent to obtain an intent query result, and obtaining the target response script corresponding to the target intent based on the script template of the specific intent and the intent query result.
- The artificial-intelligence-based voice response processing method according to claim 5, wherein obtaining the target response voice based on the target response script comprises: if the intent type is a general intent, querying the system database based on the target response script and determining the general response recording corresponding to it as the target response voice; and if the intent type is a specific intent, performing speech synthesis on the target response script to obtain the target response voice.
- The artificial-intelligence-based voice response processing method according to claim 1, wherein monitoring in real time the playback state of the target modal particle recording on the voice playback module and, if the playback state is playback-ended, controlling the voice playback module to play the target response voice comprises: monitoring in real time the playback state of the target modal particle recording on the voice playback module and, if the playback state is playback-ended, judging whether the target response voice can be obtained within a preset time period; if the target response voice can be obtained within the preset time period, playing it in real time; and if the target response voice cannot be obtained within the preset time period, executing an emergency handling mechanism.
- An artificial-intelligence-based voice response processing apparatus, comprising: a to-be-processed voice stream acquisition module, configured to acquire a to-be-processed voice stream collected in real time by a voice recording module; a to-be-analyzed voice stream acquisition module, configured to perform sentence completeness analysis on the to-be-processed voice stream to obtain a to-be-analyzed voice stream; a playback-analysis parallel processing module, configured to execute a first process and a second process in parallel, invoke the first process to control a voice playback module to play a target modal particle recording, and invoke the second process to recognize the to-be-analyzed voice stream to obtain a target response voice; and a response voice real-time playback module, configured to monitor in real time the playback state of the target modal particle recording on the voice playback module and, if the playback state is playback-ended, control the voice playback module to play the target response voice.
- An intelligent interactive device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the following steps: acquiring a to-be-processed voice stream collected in real time by a voice recording module; performing sentence completeness analysis on the to-be-processed voice stream to obtain a to-be-analyzed voice stream; executing a first process and a second process in parallel, invoking the first process to control a voice playback module to play a target modal particle recording, and invoking the second process to recognize the to-be-analyzed voice stream to obtain a target response voice; and monitoring in real time the playback state of the target modal particle recording on the voice playback module and, if the playback state is playback-ended, controlling the voice playback module to play the target response voice.
- The intelligent interactive device according to claim 9, wherein performing sentence completeness analysis on the to-be-processed voice stream to obtain the to-be-analyzed voice stream comprises: monitoring the to-be-processed voice stream with a voice activity detection algorithm to obtain voice pause points and their corresponding pause durations; determining voice pause points whose pause duration exceeds a preset duration threshold as target pause points; and obtaining the to-be-analyzed voice stream based on two adjacent target pause points.
- The intelligent interactive device according to claim 9, wherein invoking the first process to control the voice playback module to play the target modal particle recording comprises: obtaining the speech duration corresponding to the to-be-analyzed voice stream; and querying a system database based on the speech duration, determining the target modal particle recording from the original modal particle recording matching the speech duration, and controlling the voice playback module to play the target modal particle recording.
- The intelligent interactive device according to claim 9, wherein invoking the second process to recognize the to-be-analyzed voice stream to obtain the target response voice comprises: performing speech recognition on the to-be-analyzed voice stream to obtain the to-be-analyzed text corresponding to it; performing semantic analysis on the to-be-analyzed text to obtain the target intent corresponding to it; querying a system database based on the target intent to obtain the target response script corresponding to it; and obtaining the target response voice based on the target response script.
- The intelligent interactive device according to claim 12, wherein querying the system database based on the target intent to obtain the target response script corresponding to it comprises: determining an intent type based on the target intent; if the intent type is a general intent, querying a general script database based on the target intent to obtain the target response script corresponding to it; and if the intent type is a specific intent, querying a specific information database based on the target intent to obtain an intent query result, and obtaining the target response script corresponding to the target intent based on the script template of the specific intent and the intent query result.
- The intelligent interactive device according to claim 9, wherein monitoring in real time the playback state of the target modal particle recording on the voice playback module and, if the playback state is playback-ended, controlling the voice playback module to play the target response voice comprises: monitoring in real time the playback state of the target modal particle recording on the voice playback module and, if the playback state is playback-ended, judging whether the target response voice can be obtained within a preset time period; if the target response voice can be obtained within the preset time period, playing it in real time; and if the target response voice cannot be obtained within the preset time period, executing an emergency handling mechanism.
- One or more readable storage media storing computer-readable instructions, characterized in that, when executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the following steps: acquiring a to-be-processed voice stream collected in real time by a voice recording module; performing sentence completeness analysis on the to-be-processed voice stream to obtain a to-be-analyzed voice stream; executing a first process and a second process in parallel, invoking the first process to control a voice playback module to play a target modal particle recording, and invoking the second process to recognize the to-be-analyzed voice stream to obtain a target response voice; and monitoring in real time the playback state of the target modal particle recording on the voice playback module and, if the playback state is playback-ended, controlling the voice playback module to play the target response voice.
- The readable storage medium according to claim 15, wherein performing sentence completeness analysis on the to-be-processed voice stream to obtain the to-be-analyzed voice stream comprises: monitoring the to-be-processed voice stream with a voice activity detection algorithm to obtain voice pause points and their corresponding pause durations; determining voice pause points whose pause duration exceeds a preset duration threshold as target pause points; and obtaining the to-be-analyzed voice stream based on two adjacent target pause points.
- The readable storage medium according to claim 15, wherein invoking the first process to control the voice playback module to play the target modal particle recording comprises: obtaining the speech duration corresponding to the to-be-analyzed voice stream; and querying a system database based on the speech duration, determining the target modal particle recording from the original modal particle recording matching the speech duration, and controlling the voice playback module to play the target modal particle recording.
- The readable storage medium according to claim 15, wherein invoking the second process to recognize the to-be-analyzed voice stream to obtain the target response voice comprises: performing speech recognition on the to-be-analyzed voice stream to obtain the to-be-analyzed text corresponding to it; performing semantic analysis on the to-be-analyzed text to obtain the target intent corresponding to it; querying a system database based on the target intent to obtain the target response script corresponding to it; and obtaining the target response voice based on the target response script.
- The readable storage medium according to claim 18, wherein querying the system database based on the target intent to obtain the target response script corresponding to it comprises: determining an intent type based on the target intent; if the intent type is a general intent, querying a general script database based on the target intent to obtain the target response script corresponding to it; and if the intent type is a specific intent, querying a specific information database based on the target intent to obtain an intent query result, and obtaining the target response script corresponding to the target intent based on the script template of the specific intent and the intent query result.
- The readable storage medium according to claim 15, wherein monitoring in real time the playback state of the target modal particle recording on the voice playback module and, if the playback state is playback-ended, controlling the voice playback module to play the target response voice comprises: monitoring in real time the playback state of the target modal particle recording on the voice playback module and, if the playback state is playback-ended, judging whether the target response voice can be obtained within a preset time period; if the target response voice can be obtained within the preset time period, playing it in real time; and if the target response voice cannot be obtained within the preset time period, executing an emergency handling mechanism.