CN118116378A - Server, terminal equipment and voice interaction method - Google Patents

Server, terminal equipment and voice interaction method

Info

Publication number
CN118116378A
Authority
CN
China
Prior art keywords
voice
association
intention
text
service
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311847031.6A
Other languages
Chinese (zh)
Inventor
张路伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hisense Visual Technology Co Ltd
Original Assignee
Hisense Visual Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hisense Visual Technology Co Ltd filed Critical Hisense Visual Technology Co Ltd
Priority to CN202311847031.6A priority Critical patent/CN118116378A/en
Publication of CN118116378A publication Critical patent/CN118116378A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 - Speech to text systems
    • G10L15/28 - Constructional details of speech recognition systems
    • G10L15/30 - Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L2015/223 - Execution procedure of a spoken command
    • G10L2015/225 - Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Some embodiments of the present application provide a server, a terminal device, and a voice interaction method, where the method includes: receiving voice data input by a user and sent by a terminal device; recognizing the voice data to determine a voice service; invoking the service corresponding to the voice service to determine a reply text corresponding to the voice data, and, if an association intention associated with the voice service exists, determining a guide language text based on the association intention; generating a broadcasting text based on the reply text and the guide language text; synthesizing broadcasting voice based on the broadcasting text; and sending the broadcasting voice to the terminal device so that the terminal device plays the broadcasting voice. According to the embodiments of the present application, the association intention of the voice service is obtained, and the guide language text is determined according to the association intention and broadcast to the user, which helps the user to accurately grasp voice phrasing in a natural manner and improves the accuracy with which the user's voice requests hit the intended services.

Description

Server, terminal equipment and voice interaction method
Technical Field
The present application relates to the field of voice interaction technologies, and in particular, to a server, a terminal device, and a voice interaction method.
Background
With the advancement of technology, intelligent voice products are receiving more and more attention. A user can search by voice for a desired TV series, movie, song, weather forecast, stock, news item, air ticket, scenic spot, hotel, and the like. However, because users are not familiar with the scope of voice services and with voice phrasing, they often cannot accurately hit the content they want to search for through voice interaction. As a result, many users habitually reach a few well-known services with fixed utterances, or even give up using voice functions altogether, so that the powerful voice capabilities cannot be fully exploited. In this mode, the usage of services that users have already touched grows larger and larger, while the data volume of untouched services stays small, which limits the popularization of voice functions. For this problem, existing solutions mostly either randomly display guidance utterances for the current service type within the terminal, or showcase the voice functions in a concentrated manner at a specific place. Such presentation is relatively rigid, is weakly associated with the user's context, and the guidance utterances presented may be irrelevant to the utterance the user has just entered, so the guidance language cannot perform its service guidance and promotion function in the most efficient manner.
Disclosure of Invention
Some embodiments of the present application provide a server, a terminal device, and a voice interaction method, which obtain the association intention of a voice service, determine a guide language text according to the association intention, and broadcast it to the user, so as to help the user accurately grasp voice phrasing in a natural manner and improve the accuracy with which the user's voice requests hit the intended services.
In a first aspect, some embodiments of the present application provide a server configured to:
Receiving voice data input by a user and sent by terminal equipment;
Identifying the voice data to determine a voice service;
Invoking a service corresponding to a voice service to determine a reply text corresponding to the voice data, and determining a guide language text based on the association intention if the association intention associated with the voice service exists, wherein the guide language text is used for guiding a user to use sentences related to the voice data;
Generating a broadcasting text based on the reply text and the guiding language text;
Synthesizing broadcasting voice based on the broadcasting text;
And sending the broadcasting voice to the terminal equipment so that the terminal equipment plays the broadcasting voice.
In some embodiments, the server performs, if there is an associated intent associated with the voice service, determining a guide text based on the associated intent, further configured to:
if an entity association intention associated with the voice service exists in an entity association map, determining the guide language text based on the entity association intention, wherein the entity association map is used for representing the association relation between entities.
In some embodiments, the server is configured to:
If the entity association intention associated with the voice service does not exist in the entity association map, judging whether a time sequence association intention associated with the voice service exists in a time sequence association map, wherein the time sequence association map is used for representing the time sequence association relation between services or intentions;
if the time sequence association intention associated with the voice service exists in the time sequence association map, judging whether the voice triggering frequency/times of the time sequence association intention is lower than a first preset threshold value;
And if the voice trigger frequency/times of the time sequence association intention are lower than a first preset threshold value, determining the guide language text based on the time sequence association intention.
In some embodiments, the server is configured to:
If the time sequence association intention associated with the voice service does not exist in the time sequence association map, judging whether a scene association intention associated with the voice service exists in a scene association map, wherein the scene association map is used for representing the association relation between service use scenes;
If scene association intention associated with the voice service exists in the scene association map, judging whether voice triggering frequency/times of the scene association intention are lower than a first preset threshold value or not;
And if the voice trigger frequency/times of the scene association intention are lower than a first preset threshold value, determining the guide text based on the scene association intention.
In some embodiments, the server is configured to:
And if the scene association intention associated with the voice service does not exist in the scene association map, determining the guide text based on the voice service.
In some embodiments, the server performs determining a guide text based on the intent to associate, and is further configured to:
Judging whether a combined slot position exists in the association intention;
if the association intention has a combined slot, judging whether the voice trigger frequency/times of the combined slot is lower than a second preset threshold value;
And if the voice trigger frequency/times of the combined slot is lower than a second preset threshold value, determining the text of the guide language based on the information corresponding to the combined slot.
In some embodiments, the server is configured to:
if the association intention does not have a combined slot, or if the voice trigger frequency/times of the combined slot is not lower than a second preset threshold, judging whether the voice trigger frequency/times of a single slot is lower than the second preset threshold;
And if the voice trigger frequency/times of the single slot is lower than a second preset threshold value, determining the text of the guide language based on the information corresponding to the single slot.
In a second aspect, some embodiments of the present application provide a terminal device, including:
a sound collector configured to collect voice data input by a user;
a communicator configured to communicate data with the server;
An audio output interface configured to play speech;
A controller configured to:
Acquiring voice data input by a user;
Transmitting the voice data to a server;
Receiving broadcasting voice generated by the server based on the voice data;
and controlling the audio output interface to play the broadcasting voice.
In a third aspect, some embodiments of the present application provide a voice interaction method, which is applied to a server, and includes:
Receiving voice data input by a user and sent by terminal equipment;
Identifying the voice data to determine a voice service;
Invoking a service corresponding to a voice service to determine a reply text corresponding to the voice data, and determining a guide language text based on the association intention if the association intention associated with the voice service exists, wherein the guide language text is used for guiding a user to use sentences related to the voice data;
Generating a broadcasting text based on the reply text and the guiding language text;
Synthesizing broadcasting voice based on the broadcasting text;
And sending the broadcasting voice to the terminal equipment so that the terminal equipment plays the broadcasting voice.
In a fourth aspect, some embodiments of the present application provide a voice interaction method, which is applied to a terminal device, and includes:
Acquiring voice data input by a user;
Transmitting the voice data to a server;
Receiving broadcasting voice generated by the server based on the voice data;
And controlling the audio output interface to play the broadcasting voice.
Some embodiments of the application provide a server, a terminal device, and a voice interaction method. After receiving voice data input by a user, the terminal device sends the voice data to the server. The server recognizes the voice data to determine a voice service, invokes the service corresponding to the voice service to determine a reply text corresponding to the voice data, and, if an association intention associated with the voice service exists, determines a guide language text based on the association intention. The guide language text is used for guiding the user to use sentences related to the voice data. The reply text and the guide language text are combined and spliced to obtain a broadcasting text; broadcasting voice is synthesized based on the broadcasting text; and the broadcasting voice is sent to the terminal device so that the terminal device plays the broadcasting voice. According to the embodiments of the application, the association intention of the voice service is obtained, and the guide language text is determined according to the association intention and broadcast to the user, which helps the user to accurately grasp voice phrasing in a natural manner and improves the accuracy with which the user's voice requests hit the intended services.
Drawings
FIG. 1 illustrates a system architecture diagram of voice interactions according to some embodiments;
fig. 2 illustrates a hardware configuration block diagram of a terminal device according to some embodiments;
FIG. 3 illustrates a software configuration diagram of a terminal device according to some embodiments;
FIG. 4 illustrates a schematic diagram of a voice interaction network architecture provided in accordance with some embodiments;
FIG. 5 illustrates a flow chart of a method of voice interaction provided in accordance with some embodiments;
FIG. 6 illustrates a flow chart of a first method of intent-to-associate determination provided in accordance with some embodiments;
FIG. 7 illustrates a flow chart of a second method of determining intent to associate provided in accordance with some embodiments;
FIG. 8 illustrates a flow chart of a third method of determining intent to associate provided in accordance with some embodiments;
FIG. 9 illustrates a flow chart of a fourth method of determining intent to associate provided in accordance with some embodiments;
Fig. 10 illustrates a timing diagram of a voice interaction method provided in accordance with some embodiments.
Detailed Description
For the purposes of making the objects and embodiments of the present application more apparent, an exemplary embodiment of the present application will be described in detail below with reference to the accompanying drawings in which exemplary embodiments of the present application are illustrated, it being apparent that the exemplary embodiments described are only some, but not all, of the embodiments of the present application.
It should be noted that the brief description of the terminology in the present application is for the purpose of facilitating understanding of the embodiments described below only and is not intended to limit the embodiments of the present application. Unless otherwise indicated, these terms should be construed in their ordinary and customary meaning.
The terms first, second, third and the like in the description, in the claims, and in the above-described figures are used for distinguishing between similar objects or entities and are not necessarily intended to describe a particular sequential or chronological order, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances.
FIG. 1 illustrates an exemplary system architecture to which the voice interaction method and apparatus of the present application may be applied. As shown in fig. 1, 100 is a server, and 200 is a terminal device. The terminal device includes a smart television 200a, a mobile device 200b, and a smart speaker 200c.
In the present application, the server 100 and the terminal device 200 perform data communication through various communication modes. The terminal device 200 may be allowed to establish a communication connection through a Local Area Network (LAN), a Wireless Local Area Network (WLAN), or other networks. The server 100 may provide various contents and interactions to the terminal device 200. For example, the terminal device 200 and the server 100 can transmit and receive information, and receive software program updates.
The server 100 may be a server providing various services, such as a background server providing support for audio data collected by the terminal device 200. The background server may perform analysis and other processing on the received data such as audio, and feed back the processing result (e.g., endpoint information) to the terminal device. The server 100 may be a server cluster, or may be a plurality of server clusters, and may include one or more types of servers.
The terminal device 200 may be hardware or software. When the terminal device 200 is hardware, it may be various electronic devices having a sound collection function, including but not limited to a smart speaker, a smart phone, a television, a tablet computer, an electronic book reader, a smart watch, a player, a computer, an AI device, a robot, a smart vehicle, and the like. When the terminal device 200 is software, it can be installed in the above-listed electronic devices. Which may be implemented as a plurality of software or software modules (e.g. for providing sound collection services) or as a single software or software module. The present invention is not particularly limited herein.
It should be noted that, the voice interaction method provided by the embodiment of the present application may be executed by the server 100, may be executed by the terminal device 200, or may be executed by both the server 100 and the terminal device 200, which is not limited in this aspect of the present application.
Fig. 2 shows a hardware configuration block diagram of the terminal device 200 in accordance with the exemplary embodiment. The terminal device 200 as shown in fig. 2 includes at least one of a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, a memory, a power supply, and a user interface 280.
The display 260 includes a display screen component for presenting a picture, and a driving component for driving an image display, a component for receiving an image signal from the controller output, displaying video content, image content, and a menu manipulation interface, and a user manipulation UI interface.
The display 260 may be a liquid crystal display, an OLED display, a projection device, or a projection screen.
The communicator 220 is a component for communicating with external devices or servers according to various communication protocol types. For example, the communicator may include at least one of a Wi-Fi module, a Bluetooth module, a wired Ethernet module, another network communication protocol chip or near field communication protocol chip, and an infrared receiver. The terminal device 200 can transmit and receive control signals and data signals to and from the server 100 through the communicator 220.
The user interface 280 may be used to receive external control signals.
The detector 230 is used to collect signals of the external environment or signals of interaction with the outside. For example, the detector 230 includes a light receiver, i.e., a sensor for capturing the intensity of ambient light; or the detector 230 includes an image collector, such as a camera, which may be used to collect external environmental scenes, user attributes or user interaction gestures; or the detector 230 includes a sound collector, such as a microphone, for receiving external sounds.

The sound collector may be a microphone, which may be used to receive the sound of a user and convert the sound signal into an electrical signal. The terminal device 200 may be provided with at least one microphone. In other embodiments, the terminal device 200 may be provided with two microphones, which can implement a noise reduction function in addition to collecting sound signals. In other embodiments, the terminal device 200 may further be provided with three, four or more microphones to collect sound signals, reduce noise, identify the source of sound, implement directional recording functions, and the like.

Further, the microphone may be built into the terminal device 200, or the microphone may be connected to the terminal device 200 by wired or wireless means. Of course, the location of the microphone on the terminal device 200 is not limited in the embodiments of the present application. Alternatively, the terminal device 200 may not include a microphone, i.e., no microphone is provided in the terminal device 200. In this case, the terminal device 200 may be coupled to an external microphone via an interface such as the USB interface 130. The external microphone may be secured to the terminal device 200 by external fasteners such as a camera mount with a clip.
The controller 250 controls the operation of the terminal device and responds to the user's operations through various software control programs stored in the memory. The controller 250 controls the overall operation of the terminal device 200.
Illustratively, the controller includes at least one of a central processing unit (Central Processing Unit, CPU), an audio processor, a graphics processor (Graphics Processing Unit, GPU), a RAM (Random Access Memory), a ROM (Read-Only Memory), first to nth interfaces for input/output, a communication bus (Bus), and the like.
In some embodiments, the operating system of the terminal device is, for example, an Android system, as shown in fig. 3, the terminal device 200 may be logically divided into an application layer (Applications) 21, a kernel layer 22 and a hardware layer 23.
Wherein, as shown in fig. 3, the hardware layers may include the controller 250, the communicator 220, the detector 230, etc. shown in fig. 2. The application layer 21 includes one or more applications. The application may be a system application or a third party application. For example, the application layer 21 includes a voice recognition application that can provide a voice interactive interface and services for enabling connection of the terminal device 200 with the server 100.
The kernel layer 22 acts as software middleware between the hardware layer and the application layer 21 for managing and controlling hardware and software resources.
In some embodiments, the kernel layer 22 includes a detector driver for sending the voice data collected by the detector 230 to a voice recognition application. Illustratively, the voice recognition application in the terminal device 200 is started, and in the case where the terminal device 200 establishes a communication connection with the server 100, the detector driver is configured to send the voice data input by the user and collected by the detector 230 to the voice recognition application. The speech recognition application then sends the query information containing the speech data to the intent recognition module 102 in the server. The intention recognition module 102 is used to input voice data transmitted from the terminal device 200 to the language model.
In some embodiments, referring to fig. 4, fig. 4 is a schematic diagram of a voice interaction network architecture according to an embodiment of the present application. In fig. 4, the terminal device is configured to receive input information and output a processing result of the information. The voice recognition module is provided with a voice recognition service for recognizing the audio as a text; the semantic understanding module is provided with semantic understanding service for carrying out semantic analysis on the text; the business management module is deployed with business instruction management service for providing business instructions; the language generation module is provided with a language generation service (NLG) for converting instructions which instruct the terminal equipment to execute into a text language; the voice synthesis module is provided with a voice synthesis (TTS) service, and is used for processing the text language corresponding to the instruction and then sending the processed text language to a loudspeaker for broadcasting. In some embodiments, there may be multiple entity service devices deployed with different service services in the architecture shown in fig. 4, and one or more entity service devices may also aggregate one or more functional services.
In some embodiments, the following describes an example of a process of processing information input to a terminal device based on the architecture shown in fig. 4, taking the information input to the terminal device as a query sentence input through voice as an example:
[ Speech recognition ]
The terminal device may perform noise reduction processing and feature extraction on the audio of the query sentence after receiving the query sentence input through the voice, where the noise reduction processing may include steps of removing echo and environmental noise.
[ Semantic understanding ]
Natural language understanding is performed, using acoustic models and language models, on the recognized candidate text and the associated context information, and the text is parsed into structured, machine-readable information such as business field, intention and word slots to express its semantics. An intention confidence score is derived for each actionable intention, and the semantic understanding module selects one or more candidate actionable intentions based on the determined intention confidence scores.
[ Business management ]
The semantic understanding module issues a query instruction to the corresponding service management module according to the semantic analysis result of the text of the query statement to acquire the query result given by the service, performs actions required by the user to finish the final request, and feeds back the device execution instruction corresponding to the query result.
[ Language Generation ]
Natural Language Generation (NLG) is configured to turn information or instructions into language text. Dialogues can be divided into chit-chat, task-oriented, knowledge question-answering, and recommendation types. In a chit-chat dialogue, the NLG performs intention recognition, emotion analysis and the like according to the context, and then generates an open-ended reply; in a task-oriented dialogue, the dialogue reply is generated according to the learned policy, and typical replies include clarifying the requirement, guiding the user, asking questions, confirming, and ending the dialogue; in a knowledge question-answering dialogue, the knowledge required by the user (knowledge, entities, fragments, etc.) is generated according to question-type recognition and classification, information retrieval or text matching; and in a recommendation dialogue system, interest matching and ranking of candidate recommended contents are performed according to the user's preferences, and the recommended content for the user is then generated.
[ Speech Synthesis ]
The speech synthesis is configured as a speech output presented to the user. The speech synthesis processing module synthesizes a speech output based on text provided by the digital assistant. For example, the generated dialog response is in the form of a text string. The speech synthesis module converts the text string into audible speech output.
It should be noted that the architecture shown in fig. 4 is only an example, and is not intended to limit the scope of the present application. Other architectures may be employed in embodiments of the present application to achieve similar functionality, for example: all or part of the above processes may be completed by the intelligent terminal, and will not be described herein.
With the advancement of technology, intelligent voice products are receiving more and more attention. A user can search by voice for a desired TV series, movie, song, weather forecast, stock, news item, air ticket, scenic spot, hotel, and the like. However, because users are not familiar with the scope of voice services and with voice phrasing, they often cannot accurately hit the content they want to search for through voice interaction. As a result, many users habitually reach a few well-known services with fixed utterances, or even give up using voice functions altogether, so that the powerful voice capabilities cannot be fully exploited. In this mode, the usage of services that users have already touched grows larger and larger, while the data volume of untouched services stays small, which limits the popularization of voice functions. For this problem, existing solutions mostly either randomly display guidance utterances for the current service type within the terminal, or showcase the voice functions in a concentrated manner at a specific place. Such presentation is relatively rigid, is weakly associated with the user's context, and the guidance utterances presented may be irrelevant to the utterance the user has just entered, so the guidance language cannot perform its service guidance and promotion function in the most efficient manner.
Because users are unfamiliar with the scope of voice services and with voice phrasing, when using a terminal device they habitually search by voice for service content they already know well, while many other, unknown services cannot be reached. For example, some newly added services that should be promoted with emphasis never reach users, which limits both the user's experience of the product's full service functions and the room for product optimization.
In addition, for identical or partly identical expressions, only the result of one service is displayed, so the search result the user expects may not be satisfied. For example, from the voice input 'Full River Red' alone it cannot be judged whether the user wants to search for the ancient poem or the movie; only one service result is given, and the user's requirement cannot be met.
In some embodiments, voice services are promoted in the following ways: 1) novice guidance, i.e., when the voice interaction system is used for the first time or has just been used, preliminary guidance is given in text or animation form; 2) function guidance, i.e., related voice utterances are randomly displayed at a specific position of a terminal device page; 3) flow guidance, i.e., in a specific interaction flow, the terminal displays the voice utterances the user can use; 4) help guidance, i.e., the voice service functions are presented together in the help description of the voice interaction system. These voice guidance modes suffer from stiffness, poor guidance effect, limited coverage of voice phrasing, lack of individual differences in guidance content, and other problems.
In order to solve the above technical problems, embodiments of the present application provide a server 100. As shown in fig. 5, the server 100 performs the steps of:
Step S501: receiving voice data input by a user and sent by terminal equipment;
the terminal device 200 receives voice data input by a user, including:
receiving an instruction of starting a voice interaction function input by a user;
and responding to the instruction of starting the voice interaction function, driving the voice collector to start so that the voice collector starts collecting voice data input by a user.
In some embodiments, an instruction input by the user to select a voice interaction application control is received while the display 260 of the terminal device 200 displays a user interface containing the voice interaction application control. The voice interaction application includes a chat mode, that is, the user chats with the terminal device 200, and a dialog box between the user and the terminal device 200 may be displayed. In response to the instruction of selecting the voice interaction application control, the display 260 is controlled to display the dialog box, and the sound collector is driven to start, so that the voice data input by the user is collected through the sound collector.
In some embodiments, the environmental voice data collected by the sound collector is received in real time;
Detecting whether the volume of the environmental voice data is smaller than a preset volume, or whether the duration of the sound signal in the environmental voice data is smaller than a preset threshold;

If the volume of the detected environmental voice data is not less than the preset volume, or the duration of the sound signal in the detected environmental voice data is not less than the preset threshold, judging whether the environmental voice data includes a voice wake-up word; the voice wake-up word is a specified word used for starting the voice interaction function, that is, collecting voice data through the sound collector and sending the voice data to the server 100.
The voice wake-up word can be set by default and can be customized by a user. The terminal device 200 may install different voice assistant applications, and by setting different wake words for different voice assistants, different voice assistants may be woken up according to the wake words.
If the environmental voice data includes a voice wake-up word, the terminal device 200 controls the audio output interface 270 to play a prompt tone for prompting the user that a voice command can now be input, and controls the sound collector to start collecting voice data input by the user. For example, when environmental voice data including the voice wake-up word is detected, an acknowledgement prompt tone such as "I'm here" may be broadcast.
In some embodiments, if the terminal device 200 includes the display 260, the display 260 is controlled to display a voice receiving frame on the current user interface floating layer to prompt the user to be currently in the radio reception state.
When the floating layer of the current user interface displays a voice receiving frame, the voice collector is controlled to start collecting voice data input by a user. If voice data is not received for a long time, the voice interactive program may be turned off and the display of the voice receiving frame may be canceled.
If the environment voice data does not comprise the voice wake-up word, the related operations of playing the prompt tone and controlling the sound collector to start collecting the voice data input by the user are not executed.
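For illustration only, the following is a minimal Python sketch of the wake-up gating described above (volume/duration check followed by a wake-word check). The threshold values, the wake-word list, and the helper name should_start_pickup are assumptions introduced for this sketch and are not part of the original disclosure.

from typing import Iterable

WAKE_WORDS = ("hi assistant",)   # assumed, user-customizable wake words
MIN_VOLUME = 500.0               # assumed "preset volume" (mean absolute amplitude)
MIN_DURATION_S = 0.3             # assumed "preset threshold" for signal duration

def should_start_pickup(samples: Iterable[int], duration_s: float, transcript: str) -> bool:
    """Decide whether to play the prompt tone and start collecting voice data."""
    samples = list(samples)
    volume = sum(abs(s) for s in samples) / max(len(samples), 1)
    # only audio that is loud enough or long enough is checked for a wake word
    if volume >= MIN_VOLUME or duration_s >= MIN_DURATION_S:
        return any(w in transcript.lower() for w in WAKE_WORDS)
    return False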
In some embodiments, the body of the terminal device 200 is provided with a voice key, and after the user starts to press the voice key of the terminal device 200, the voice collector is started to collect voice data, and after the user stops to press the voice key of the terminal device 200, the voice collector is closed to end collecting voice data.
In some embodiments, the terminal device 200 is a smart tv, and the voice data may be received through a control device, such as a remote controller. The voice data collection is started after the user starts to press the voice key of the control device, and the voice data collection is ended after the user stops pressing the voice key of the control device. The control device transmits the voice data to the terminal device 200 in the form of streaming data during the voice data acquisition process.
In some embodiments, the voice data received by the terminal device 200 input by the user is streaming audio data in nature. After receiving the voice data, the terminal device 200 transmits the voice data to the sound processing module, and performs acoustic processing on the voice data through the sound processing module. The acoustic processing includes sound source localization, denoising, sound quality enhancement, and the like. The sound source localization is used for enhancing or preserving the signals of target speakers under the condition of multi-person speaking, suppressing the signals of other speakers, tracking the speakers and carrying out subsequent voice directional pickup. Denoising is used to remove environmental noise in speech data, and the like. The sound quality enhancement is used to increase the intensity of the speaker's voice when it is low. The purpose of the acoustic processing is to obtain a cleaner and clearer sound of the target speaker in the voice data. The acoustically processed voice data is transmitted to the server 100.
In some embodiments, the terminal device 200 directly transmits to the server 100 after receiving voice data input by a user, performs acoustic processing on the voice data by the server 100, and transmits the voice data after acoustic processing to the semantic service. After performing processing such as voice recognition on the received voice data, the semantic service sends the processed voice data to the terminal device 200.
Step S502: identifying voice data to determine voice traffic;
after receiving the voice data, the voice data is processed by a semantic understanding system. The semantic understanding system obtains corresponding slot information, intention and service scene by carrying out semantic analysis, service distribution, service analysis, text generation and other processes on the obtained voice text. The semantic analysis module performs lexical, syntactic and semantic analysis on the input text, understands the intention of the user, and sends the service processing result to the terminal device 200 for display.
After receiving the voice data, the server 100 recognizes the text corresponding to the voice data, that is, the voice text by using a voice recognition technology, and performs semantic understanding on the voice text to obtain a voice service corresponding to the voice data.
The step of carrying out semantic understanding on the voice text to obtain voice service corresponding to the voice data comprises the following steps:
Performing word segmentation labeling processing on the voice text to obtain word segmentation information;
Illustratively, the voice text is 'song of LiuXX', and word segmentation labeling is performed on 'song of LiuXX' to obtain the word segmentation information [ { LiuXX-LiuXX [ actor-1.0, singer-0.8, roleFeeable-1.0, officialAccount-1.0] }, { — [ funcwordStructuralParticle-1.0] }, { song-song [ musicKey-1.0] } ].
Carrying out syntactic analysis and semantic analysis on the word segmentation information to obtain slot position information;
Illustratively, syntactic analysis and semantic analysis are performed on the word segmentation information; the obtained head word is 'song', the modifier is 'LiuXX', and the relation is an attributive modification relation. In semantic analysis, it is known that there is a strong semantic relation between the song (musicKey) and the singer, so the result of semantic slot analysis, obtained by fusing the word segmentation information, is: [ { LiuXX-LiuXX [ singer-1.0] }, { Song-Song [ musicKey-1.0] } ].
The semantic scene corresponding to the slot position information, namely the business intention, is positioned through the vertical domain classification;
the central control system combines various service scores to obtain the optimal vertical domain service, namely the voice service.
Illustratively, the music search intention is located to the music service through vertical-domain classification. The central control intention set only includes MUSIC_TOPIC (music theme), with a score of 0.999999393: { topicSet: [ MUSIC_TOPIC ], query: [ 'song of Liu XX' ], task: 0.9999393 }, so the optimal service is the music service.
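As an illustration only, the Python sketch below traces the data produced by the three steps above (word segmentation, slot fusion, vertical-domain selection) for the 'song of LiuXX' example. The dictionary layout and the helper names fuse_slots and pick_vertical_domain are assumptions; the actual semantic understanding system is far richer than this toy version.

segmentation = [
    {"word": "LiuXX", "tags": {"actor": 1.0, "singer": 0.8, "officialAccount": 1.0}},
    {"word": "de", "tags": {"funcwordStructuralParticle": 1.0}},
    {"word": "song", "tags": {"musicKey": 1.0}},
]

def fuse_slots(segmentation):
    """Toy slot fusion: since 'song' (musicKey) is strongly related to 'singer',
    the ambiguous 'LiuXX' is resolved to the singer slot with score 1.0."""
    has_music_key = any("musicKey" in seg["tags"] for seg in segmentation)
    slots = []
    for seg in segmentation:
        if "musicKey" in seg["tags"]:
            slots.append({"word": seg["word"], "slot": "musicKey", "score": 1.0})
        elif has_music_key and "singer" in seg["tags"]:
            slots.append({"word": seg["word"], "slot": "singer", "score": 1.0})
    return slots

def pick_vertical_domain(domain_scores):
    """Central-control style selection: the highest-scoring vertical domain wins."""
    return max(domain_scores, key=domain_scores.get)

print(fuse_slots(segmentation))                          # singer + musicKey slots
print(pick_vertical_domain({"MUSIC_TOPIC": 0.9999393}))  # -> MUSIC_TOPIC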
Step S503: invoking a service corresponding to the voice service to determine a reply text corresponding to the voice data;
The service corresponding to the voice service generates a reply text from the slot information by using natural language generation technology.
Natural Language Generation (NLG) studies how to give a computer human-like expression and writing capabilities, that is, how to automatically generate a piece of high-quality natural language text through a planning process, based on some key information and its machine-internal representation. NLG is a means of converting data in non-language formats into language formats that humans can understand, such as articles and reports, in order to bridge the gap between humans and machines. There are two forms of natural language generation: text-to-text generation (from text to language) and data-to-text generation (from data to language).
Slot-based Natural Language Generation (NLG) is a technique that converts structured data into natural language text. It uses predefined slots to represent the required information and then generates corresponding sentences from these slots. The general procedure of slot-based natural language generation is as follows: 1) Determining the generation target: determine the objective of the generated sentence, for example, answering a user's question or providing some information. 2) Defining slots: define the required slots according to the target. Slots are variables representing information and may include, for example, date, place, person, etc. 3) Filling slots: fill the required slots with the corresponding values from the input structured data. For example, if the goal is to answer a user's question, the slots may include the question topic, keywords, etc., and the key information in the user's question is filled into the corresponding slots. 4) Constructing a sentence template: construct a template for generating the sentence according to the target and the slots. A template may contain static text, slot markers, and other variable parts. 5) Generating the sentence: generate the final natural language text from the template and the filled slots. The text may be generated using a template engine or a rule-based generation algorithm. 6) Optional post-processing: post-process the generated text as required, for example grammar correction, part-of-speech adjustment, or reordering, so that the generated text better conforms to natural language conventions. It should be noted that the slot-based natural language generation process depends on the specific application scenario and requirements, and the implementation of each step may vary from application to application.
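The slot-based generation procedure above can be reduced, for illustration, to the template-filling sketch below. The template strings, intent keys, and slot names are assumptions introduced for this sketch rather than the templates actually used by the service.

REPLY_TEMPLATES = {
    "music_search": "Here are {singer}'s songs for you.",
    "weather_search": "The weather in {city} on {date} is {condition}.",
}

def generate_reply(intent: str, slots: dict) -> str:
    """Steps 3-5 above: fill the predefined slots of a sentence template."""
    template = REPLY_TEMPLATES[intent]       # step 4: the constructed template
    return template.format(**slots)          # steps 3 and 5: fill and generate

print(generate_reply("music_search", {"singer": "LiuXX"}))
# -> "Here are LiuXX's songs for you."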
Step S504: judging whether an association intention associated with the voice service exists or not;
In some embodiments, the step of determining whether there is an intent to associate with the voice service comprises: and judging whether the entity association intention associated with the voice service exists in the entity association map.
As shown in fig. 6, the step of determining whether there is an entity association intention associated with a voice service in an entity association map includes:
Step S601: acquiring slot position information based on voice data;
step S602: judging whether a target triplet exists in the entity association map;
wherein the target triplet includes the slot information;
Knowledge graph is essentially a semantic network that reveals relationships between entities. The triplet is a general representation of the knowledge graph, and has three basic forms: entity 1-relationship-entity 2, entity-attribute value, and entity-tag value.
For example, entity 1-relationship-entity 2 may be music-subject music-movie, entity-attribute value may be person-height value, and entity-tag value may be region-area-Beijing.
Based on the technical accumulation and system architecture of the intelligent voice platform's NLU (Natural Language Understanding) processing system, a knowledge graph of related entities such as movies, music, and poetry is constructed and applied to the association matching of the entities' related services. The reason the embodiments of the present application construct a new knowledge graph instead of directly using the existing NLU knowledge graph is that the NLU knowledge graph is large in scale, covers a wide range, and contains tens of millions of data items, making it unsuitable for direct application.
The embodiments of the present application may screen a partial graph from the knowledge graph to construct the entity association map, or may construct the entity association map directly from the relations between entities. The knowledge graph may or may not include the entity association map.
Each entity in the entity association map appears only once in the entity association map, so the entity association intention associated with the voice service is uniquely determined. The entity association map is used for representing the association relations between entities. The entity association map can be constructed according to current business requirements. The entity association map is composed of entity 1-relation-entity 2, for example, works with the same name but of different types (e.g. a movie and a song): music XXX-source-movie XXX.
It should be noted that the range of the knowledge graph is relatively large, and one entity may appear in multiple triples. The scope to which the entity association graph relates is relatively small, and an entity can only appear within one triplet.
Judging whether the target triplet exists in the entity association graph refers to judging whether the triplet of the slot position information is included in the entity association graph.
For example: the slot information is 'poetry Bo Suanzi', and the target triples exist in the entity association map when the triples of poetry Bo Suanzi-source-song Bo Suanzi exist in the entity association map. And if all the triples in the entity association graph do not comprise the entity of the poetry and the halibut operator, determining that the target triples do not exist in the entity association graph.
If there is a target triplet in the entity association map, step S603 is executed: it is determined that there is an intent to associate with the voice service, and an entity intent to associate with the voice service is determined based on the target triplet.
Entity association intents may be determined from entities of non-slot information in the target triplet.
For example, the slot information is "poetry Bo Suanzi", the voice service is a poetry service, the target triplet is poetry Bo Suanzi-source-song Bo Suanzi, and then it can be determined that the entity association intention associated with the poetry service is a music search intention based on the entity "song Bo Suanzi".
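The determination in Fig. 6 can be pictured with the small sketch below: the entity association map is a list of triples, each entity occurs in at most one triple, and the associated entity found for the slot value yields the entity association intention. The triples shown and the helper name find_entity_association are illustrative assumptions.

ENTITY_ASSOCIATION_GRAPH = [
    # (entity 1, relation, entity 2)
    ("poem Bo Suanzi", "source", "song Bo Suanzi"),
    ("music XXX", "source", "movie XXX"),
]

def find_entity_association(slot_value: str):
    """Return the entity associated with the slot value, or None if no
    target triplet exists (step S604)."""
    for head, _relation, tail in ENTITY_ASSOCIATION_GRAPH:
        if slot_value == head:
            return tail          # e.g. "song Bo Suanzi" -> music search intention
        if slot_value == tail:
            return head
    return None

print(find_entity_association("poem Bo Suanzi"))   # -> "song Bo Suanzi"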
In some embodiments, the slot information may be one or more. When the slot information is one, whether the entity association map comprises the triplet of the slot information can be directly judged. When the number of the slot information is multiple, one slot information can be randomly selected as target slot information, and whether the entity association map comprises the triples of the target slot information is judged.
In some embodiments, when the slot information is a plurality of, counting the number of triples comprising each slot information in the knowledge graph respectively;
selecting the slot information with the largest triplet number as target slot information;
And judging whether the entity association map comprises a triplet of target slot position information.
If the entity association map does not include the triples of the target slot information, the slot information with the largest number of triples is selected from the rest slot information to be the target slot information after the current target slot information is eliminated. If all the slot information is not in the entity association map, it can be determined that there is no entity association intention associated with the voice service.
Illustratively, the voice data is 'play XXX (director name) suspense comedy class (movie type) YYY (movie name)', and the slot information includes 'XXX', 'suspense comedy class' and 'YYY'. The numbers of triples in the knowledge graph that include 'XXX', 'suspense comedy class' and 'YYY' are 6, 8 and 23, respectively. 'YYY' is therefore selected as the target slot information, and it is judged whether the entity association map includes a triplet of 'YYY'.
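One possible reading of this slot-selection rule, expressed as a short sketch (the counts mirror the example above; the helper name is an assumption):

def order_target_slots(triple_counts: dict) -> list:
    """Order candidate slot values by how many knowledge-graph triples contain
    them; the first is tried as target slot information, the rest are fallbacks."""
    return sorted(triple_counts, key=triple_counts.get, reverse=True)

print(order_target_slots({"XXX": 6, "suspense comedy class": 8, "YYY": 23}))
# -> ['YYY', 'suspense comedy class', 'XXX']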
If the target triplet does not exist in the entity association map, step S604 is executed: it is determined that there is no intent to associate with the voice service.
In some embodiments, the step of determining whether there is an intent to associate with the voice service comprises: and judging whether a time sequence association intention associated with the voice service exists in the time sequence association map.
As shown in fig. 7, the step of determining whether there is a timing correlation intention associated with a voice service in the timing correlation map includes:
step S701: judging whether a target triplet exists in the time sequence association map;
Wherein the first entity of the target triplet is a voice service.
And constructing a time sequence association map according to the execution sequence of the business or the intention. The time sequence association map characterizes the association relation between different business or intention time sequences. Wherein, the corresponding association business can be mapped according to the association intention.
For example, after the user inputs a voice command for searching for a movie, the searched movie interface is displayed, and the next step may be to select a video to play, i.e., to input a broadcast control command. The corresponding triplet in the time sequence association map is video-association-broadcast control.
The timing triples are formed as entity 1-relationship-entity 2. Entity 1 (the first entity) performs the sequence before entity 2 (the second entity).
Judging whether the target triplet exists in the time sequence association diagram refers to judging whether the first entity is a triplet of voice service in the time sequence association diagram.
For example, if the voice service is video, and the triplet video-association-broadcast control exists in the time sequence association map, it can be determined that a target triplet exists in the time sequence association map. If the first entity of none of the triples in the time sequence association map is the video entity, it is determined that no target triplet exists in the time sequence association map. If the voice service is broadcast control, video-association-broadcast control is not a target triplet, because broadcast control is the second entity.
If the target triplet exists in the timing correlation map, step S702 is executed: determining that there is a timing relationship intent associated with the voice service, determining the timing relationship intent associated with the voice service based on the target triplet. The timing relationship intent may be determined from a second entity in the target triplet.
For example, if the voice service is a video service and the target triplet is video-association-broadcast control, it can be determined, based on the second entity "broadcast control", that the timing association intention associated with the video service is broadcast control.
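For illustration, a sketch of the lookup in Fig. 7 follows; unlike the entity association map, a triplet only matches when its first entity equals the current voice service, because the first entity precedes the second in execution order. The triples and the helper name are assumptions.

TIMING_ASSOCIATION_GRAPH = [
    # (service executed first, relation, intention that typically follows)
    ("video", "association", "broadcast control"),
    ("air ticket", "association", "weather search"),
]

def timing_association_intentions(voice_service: str) -> list:
    """Return the time sequence association intentions for the given service."""
    return [tail for head, _rel, tail in TIMING_ASSOCIATION_GRAPH
            if head == voice_service]

print(timing_association_intentions("video"))              # -> ['broadcast control']
print(timing_association_intentions("broadcast control"))  # -> [] (second entity only)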
If the target triplet does not exist in the timing correlation map, step S703 is performed: it is determined that there is no timing correlation intention associated with the voice service.
In some embodiments, the target triplet may be one or more. When the target triplet is one, the timing association intention can be directly determined according to the target triplet. When the target triples are multiple, one target triplet can be randomly selected to determine the time sequence association intention.
In some embodiments, when there are multiple target triples, the time sequence association intention with the lowest voice trigger frequency/times can be selected as the target association intention.
In some embodiments, when the target triples are multiple, acquiring the voice trigger frequency/times of the time sequence association intention, and judging whether the voice trigger frequency/times of the time sequence association intention is lower than a first preset threshold;
The voice trigger times and frequency of different intentions are counted according to a specific user's usage of voice services.

Illustratively, the voice trigger times and frequencies of different intentions are shown in Table 1.
TABLE 1
The step of counting the voice trigger times and frequencies of the association intentions includes the following steps:
Carrying out semantic analysis on the historical data in the statistical time through a semantic understanding system to obtain corresponding intention information, wherein the corresponding intention information comprises intention types and trigger times. The intention type is denoted TopicID and the number of triggers is denoted N. The intention information is { "TopicID1": n1, "TopicID2": n2. "TopicIDn": nn }.
In some embodiments, the intent types are weighted according to the importance levels;
{"TopicID1":w1,"TopicID2":w2,...,"TopicIDn":wn}。
Calculating the normalized total times according to the weight:
Nsum = sum(w1*N1, w2*N2, ..., wn*Nn).
according to the number of the intention type triggers, the weight and the total number of times, calculating to obtain the voice trigger frequency of each intention type:
p1=(w1*N1)/Nsum;
p2=(w2*N2)/Nsum;
...
pn=(wn*Nn)/Nsum。
Finally, the voice trigger frequency of each intention type is obtained: { "TopicID1": p1, "TopicID2": p2, ..., "TopicIDn": pn }.
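The weighted statistics above amount to the normalization sketched below; the counts and weights are invented figures used only for illustration.

def trigger_frequencies(counts: dict, weights: dict) -> dict:
    """pi = (wi * Ni) / Nsum, with Nsum = sum(wi * Ni) over all intention types."""
    nsum = sum(weights[t] * n for t, n in counts.items())
    return {t: (weights[t] * n) / nsum for t, n in counts.items()}

counts = {"TopicID1": 120, "TopicID2": 30, "TopicID3": 5}      # assumed trigger counts
weights = {"TopicID1": 1.0, "TopicID2": 1.0, "TopicID3": 2.0}  # assumed importance weights
print(trigger_frequencies(counts, weights))
# -> {'TopicID1': 0.75, 'TopicID2': 0.1875, 'TopicID3': 0.0625}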
In some embodiments, the intended trigger frequency is calculated directly from the number of triggers without setting weights according to the importance level.
And if the voice trigger frequency/times of the time sequence association intention are lower than a first preset threshold value, determining the time sequence association intention as a target association intention.
In some embodiments, if the voice trigger frequency/times of only one time sequence association intention is below the first preset threshold, that time sequence association intention may be determined to be the target association intention.

If the voice trigger frequencies/times of multiple time sequence association intentions are all lower than the first preset threshold, the target association intention may be determined in the order of evaluation, i.e., once the voice trigger frequency/times of one time sequence association intention is determined to be lower than the first preset threshold, that time sequence association intention is determined to be the target association intention, and the trigger frequencies/times of the remaining time sequence association intentions are no longer evaluated. Alternatively, the time sequence association intention with the lowest voice trigger frequency/times may be selected as the target association intention.
In some embodiments, if the voice trigger frequency/number of time sequence association intentions is not lower than the first preset threshold, one time sequence association intention or the time sequence association intention with the lowest voice trigger frequency/number of times can be randomly selected as the target association intention.
By way of example, the time sequence association intentions are playing control, weather search, and music search, with voice trigger frequencies of 5%, 2%, and 1%, respectively. If the first preset threshold is 1.5%, the target association intention is music search. If the first preset threshold is 2.5%, the target association intention may be randomly selected from weather search and music search.
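A minimal sketch, under assumed intention names and data shapes, of selecting the target association intention from several candidates using the first preset threshold: the candidates are judged in order and the first one below the threshold wins; otherwise the lowest-frequency (or a random) candidate is taken.

```python
import random


def pick_target_intention(candidates, frequencies, threshold, fallback="lowest"):
    """candidates  -- association intentions in judging order
    frequencies -- {intention: voice trigger frequency}
    threshold   -- the first preset threshold
    fallback    -- "lowest" or "random", used when no candidate is below the threshold
    """
    for intention in candidates:              # judge in order; stop at the first hit
        if frequencies[intention] < threshold:
            return intention
    if fallback == "lowest":                  # otherwise pick the least-triggered intention
        return min(candidates, key=lambda i: frequencies[i])
    return random.choice(candidates)          # or pick one at random


freqs = {"playing control": 0.05, "weather search": 0.02, "music search": 0.01}
order = ["playing control", "weather search", "music search"]
print(pick_target_intention(order, freqs, 0.015))   # -> "music search"
print(pick_target_intention(order, freqs, 0.025))   # -> "weather search"
```

With the 2.5% threshold this sketch returns the first candidate below the threshold (the ordered-judgment embodiment); the random-selection embodiment corresponds to the "random" fallback.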
In some embodiments, the step of determining whether there is an association intention associated with the voice service includes: judging whether a scene association intention associated with the voice service exists in the scene association map.
As shown in fig. 8, the step of determining whether there is a scene association intention associated with a voice service in a scene association map includes:
Step S801: judging whether a target triplet exists in the scene association map;
Wherein the first entity of the target triplet is the usage scene of the voice service.
Historical service interaction data is analyzed to obtain the interaction frequency or closeness of relationship between services, and the services are scored to obtain the degree of association between them. The scene association map is created according to the degree of association between business scenes and characterizes the association relationship between service usage scenes. The scene knowledge graph is an operational knowledge base established to deeply mine, infer, and empathize with the user's intention.
For example, after a user enters a voice command for an air ticket query, the air ticket service or scene can be determined. Air tickets and weather search are closely related, so the scene association map contains air ticket-association-weather search.
The composition of a scene triplet is scene 1-association-scene 2: the associated scene of scene 1 (the first entity) is scene 2 (the second entity), which may also be denoted scene 1→scene 2. If scene 1 and scene 2 are associated with each other, i.e., the associated scene of scene 2 is also scene 1, this can be denoted scene 1↔scene 2.
Judging whether a target triplet exists in the scene association map means judging whether there is a triplet in the scene association map whose first entity is the usage scene of the voice service.
For example: the voice service is air ticket, and the triplet air ticket-association-weather exists in the scene association map, so it can be determined that a target triplet exists in the scene association map. If no triplet in the scene association map has air ticket as its first entity, it is determined that no target triplet exists in the scene association map. For mutually associated scenes, a triplet can be determined as a target triplet only when the voice service corresponds to its first entity.
If there is a target triplet in the scene association map, step S802 is executed: it is determined that there is a scene association intention associated with the voice service, and the scene association intention associated with the voice service is determined based on the target triplet. Scene association intents may be determined from a second entity in the target triplet.
For example, the voice service usage scenario is a sight, the target triplet is a sight-association-map, and then it may be determined that the scenario-association intent associated with the sight service is a map based on the second entity "map".
If the target triplet does not exist in the scene association map, step S803 is executed: it is determined that there is no scene association intention associated with the voice service.
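The judgment in steps S801 to S803 amounts to a first-entity lookup over the stored triplets. A minimal sketch in Python, with the triplets below assumed purely for illustration:

```python
# Triplets are stored as (first entity, relation, second entity).
SCENE_MAP = [
    ("air ticket", "association", "weather"),
    ("air ticket", "association", "sight"),
    ("sight", "association", "map"),
]


def scene_association_intents(usage_scene):
    """Second entities of all target triplets whose first entity is the usage scene."""
    return [tail for head, _rel, tail in SCENE_MAP if head == usage_scene]


print(scene_association_intents("sight"))     # -> ["map"]: a scene association intention exists
print(scene_association_intents("swimming"))  # -> []: no target triplet in the scene association map
```

The entity association map and time sequence association map can be queried in the same way, with the slot information or voice service as the lookup key.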
In some embodiments, there may be one or more target triplets. When there is one target triplet, the scene association intention can be determined directly from it. When there are multiple target triplets, one of them can be randomly selected to determine the scene association intention.
In some embodiments, when there are multiple target triplets, the scene association intention with the lowest voice trigger frequency/times can be directly selected as the target association intention.
In some embodiments, when there are multiple target triplets, the voice trigger frequency/times of each scene association intention is acquired, and it is judged whether the voice trigger frequency/times of the scene association intention is lower than a first preset threshold.
If the voice trigger frequency/times of a scene association intention is lower than the first preset threshold, that scene association intention is determined as the target association intention.
In some embodiments, if the voice trigger frequency/times of only one scene association intention is below the first preset threshold, that scene association intention may be determined as the target association intention.
If the voice trigger frequencies/times of multiple scene association intentions are all lower than the first preset threshold, the target association intention may be determined in the judging order: once one scene association intention is found to be below the first preset threshold, it is determined as the target association intention, and the remaining scene association intentions need not be judged. Alternatively, the scene association intention with the lowest voice trigger frequency/times may be selected as the target association intention.
In some embodiments, if no scene association intention has a voice trigger frequency/times lower than the first preset threshold, one scene association intention may be randomly selected, or the one with the lowest voice trigger frequency/times may be selected, as the target association intention.
The user inputs the voice "a ticket to Beijing tomorrow", which is recognized and determined to be the ticket query service; the scene is air ticket, and the scene association map includes air ticket-association-weather, air ticket-association-sight, and air ticket-association-map. The scene association intentions are weather query, sight query, and map search, with voice trigger frequencies of 10%, 5%, and 3%, respectively. If the first preset threshold is 4%, the target association intention is map search. If the first preset threshold is 6%, the target association intention may be randomly selected from sight query and map search.
In some embodiments, the entity association graph, the timing association graph, and the scene association graph may be aggregated into one fused association graph. It can be determined whether there is an association intention associated with the voice service in the fusion association map, that is, whether there is slot information or a triplet of the voice service in the fusion association map. If there are multiple associated intents, the target associated intent may be further determined based on the voice trigger frequency/times.
In some embodiments, at least two of the entity association map, the time sequence association map, and the scene association map may be selected, according to actual requirements, for judging whether an association intention associated with the voice service exists. The priority of the association intentions is determined by the judging order. For example: if the entity association map is checked first for an association intention associated with the voice service, the entity association intention has the highest priority. The judging order can be adjusted according to actual requirements to adjust the priority of the association intentions.
As shown in fig. 9, taking the priority order entity-time sequence-scene (from high to low) as an example, the step of determining whether there is an association intention associated with the voice service includes:
Step S901: judging whether the entity association map has entity association intention associated with the voice service or not;
If an entity association intention associated with the voice service exists in the entity association map, step S907 is performed: it is determined that there is an association intention associated with the voice service.
If no entity association intention associated with the voice service exists in the entity association map, step S902 is executed: judging whether a time sequence association intention associated with the voice service exists in the time sequence association map;
If a time sequence association intention associated with the voice service exists in the time sequence association map, step S903 is performed: judging whether the voice trigger frequency/times of the time sequence association intention is lower than a first preset threshold;
If the voice trigger frequency/times of the time sequence association intention is lower than the first preset threshold, step S907 is performed.
If the voice trigger frequency/times of the time sequence association intention is not lower than the first preset threshold, step S904 is executed: judging whether a scene association intention associated with the voice service exists in the scene association map.
If no time sequence association intention associated with the voice service exists in the time sequence association map, step S904 is performed.
If there is a scene association intention associated with the voice service in the scene association map, step S905 is performed: judging whether the voice triggering frequency/times of the scene association intention is lower than a first preset threshold value or not;
If the voice trigger frequency/number of scene association intentions is lower than a first preset threshold, step S907 is performed.
If the voice trigger frequency/times of the scene association intention is not lower than the first preset threshold, step S906 is executed: it is determined that there is no association intention associated with the voice service.
If there is no scene correlation intention associated with the voice service in the scene correlation map, step S906 is performed.
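A minimal sketch, under assumed lookup helpers and data shapes, of the priority chain of fig. 9 (entity map first, then time sequence map, then scene map); the step numbers from the figure are noted in comments, and none of the helper names come from this application.

```python
def find_association_intention(slot_info, voice_service, lookups, frequencies, threshold):
    """lookups     -- {"entity": fn(slot_info), "timing": fn(service), "scene": fn(service)},
                      each returning an association intention or None
    frequencies -- {intention: voice trigger frequency}
    threshold   -- the first preset threshold
    Returns the association intention, or None if there is none (step S906).
    """
    intention = lookups["entity"](slot_info)                        # S901
    if intention:
        return intention                                            # S907
    intention = lookups["timing"](voice_service)                    # S902
    if intention and frequencies.get(intention, 1.0) < threshold:   # S903
        return intention                                            # S907
    intention = lookups["scene"](voice_service)                     # S904
    if intention and frequencies.get(intention, 1.0) < threshold:   # S905
        return intention                                            # S907
    return None                                                     # S906
```

Reordering the three lookups changes the priority of the association intentions, as described above.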
If there is an association intention associated with the voice service, step S505 is performed: determining a guide language text based on the association intention, wherein the guide language text is used for guiding a user to use sentences related to the voice data;
Wherein the association intention is an entity association intention, a time sequence association intention, a scene association intention, or a target association intention.
The voice trigger times and frequencies of different slots are counted according to the specific user's usage of the voice service. The slots include single slots and combined slots.
Illustratively, the voice trigger times and frequencies of single slots are shown in Table 2.
TABLE 2
Illustratively, the voice trigger times and frequencies of combined slots are shown in Table 3.
TABLE 3
The step of counting the voice trigger times and frequencies of single slots includes:
Slot analysis is performed on the historical data within the statistics window through the semantic understanding system to obtain the corresponding slot information, which includes the slot type and the number of triggers. The slot type is denoted slot, and the number of triggers is denoted N. The slot information is {"slot1": N1, "slot2": N2, ..., "slotn": Nn}.
In some embodiments, the slots supported by the intention are each assigned a weight according to their importance level:
{"slot1":w1,"slot2":w2,...,"slotn":wn}。
The normalized total count is calculated according to the weights:
Nsum=sum(w1*N1,w2*N2,...,wn*Nn).
The voice trigger frequency/times of each slot type is calculated according to its trigger count, its weight, and the total count:
p1=(w1*N1)/Nsum;
p2=(w2*N2)/Nsum;
...
pn=(wn*Nn)/Nsum。
Finally, the occurrence probability of each slot type is obtained:
{"slot1":p1,"slot2":p2,...,"slotn":pn}。
In some embodiments, the slot usage probability is calculated directly from the number of occurrences, without setting weights according to importance level.
The algorithm for combined slots is the same as that for single slots.
The step of determining the guide language text based on the association intention includes:
Judging whether a combined slot position exists in the association intention;
if the association intention has a combined slot, judging whether the voice trigger frequency/times of the combined slot is lower than a second preset threshold value;
And if the voice trigger frequency/times of the combined slot is lower than a second preset threshold value, determining the text of the guide language based on the information corresponding to the combined slot.
In some embodiments, if the voice trigger frequency/number of times of the combined slot is not lower than the second preset threshold, the guide text is determined based on the information corresponding to the randomly selected combined slot or the combined slot with the lowest voice trigger frequency/number of times.
In some embodiments, if the voice trigger frequency/times of the combined slot is not lower than the second preset threshold, the step of judging whether the voice trigger frequency/times of a single slot is lower than the second preset threshold is performed.
If the association intention does not have the combined slot, judging whether the voice trigger frequency/times of the single slot is lower than a second preset threshold value;
And if the voice trigger frequency/times of the single slot is lower than a second preset threshold value, determining the text of the guide language based on the information corresponding to the single slot.
And if the voice trigger frequency/times of the single slot is not lower than a second preset threshold value, determining the text of the guide language based on the information corresponding to the randomly selected single slot or the single slot with the lowest voice trigger frequency/times.
In some embodiments, if the voice trigger frequency/times of only one combined slot/single slot is below the second preset threshold, the guide language text is determined based on the information corresponding to that combined slot/single slot. If the voice trigger frequencies/times of multiple combined slots/single slots are lower than the second preset threshold, the slot may be determined in the judging order: once the voice trigger frequency/times of one combined slot/single slot is found to be below the second preset threshold, the guide language text can be determined based on the information corresponding to that slot, and the remaining slot probability judgments need not be performed. The guide language text may also be determined based on the information corresponding to the combined slot/single slot with the lowest voice trigger frequency/times.
Illustratively, the association intention is movie search, in which the combined slots include: [actor name]+[movie name], [director name]+[movie name], and [director name]+[movie name]+[year], with voice trigger frequencies/times of 10%, 6%, and 3%, respectively. The single slots include: [actor name], [director name], [movie name], and [year], with voice trigger frequencies/times of 8%, 1%, 20%, and 3%, respectively. If the second preset threshold is 5%, the guide language text is determined based on [director name]+[movie name]+[year]. If the second preset threshold is 2%, the guide language text is determined based on [director name].
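A minimal sketch of the slot decision above, with assumed (abbreviated) slot names: combined slots are judged first, then single slots, and a slot below the second preset threshold is chosen; otherwise the lowest-frequency (or a random) slot is taken, matching the fallback embodiments described earlier. The example values mirror the movie search illustration.

```python
import random


def pick_slot(combined, single, threshold, fallback="lowest"):
    """combined / single -- {slot description: voice trigger frequency}, in judging order
    threshold         -- the second preset threshold
    """
    for slot, freq in combined.items():   # combined slots are judged first
        if freq < threshold:
            return slot
    for slot, freq in single.items():     # then single slots
        if freq < threshold:
            return slot
    pool = {**combined, **single}         # fallback: lowest-frequency slot, or a random one
    if fallback == "lowest":
        return min(pool, key=pool.get)
    return random.choice(list(pool))


combined = {"[actor]+[movie]": 0.10, "[director]+[movie]": 0.06, "[director]+[movie]+[year]": 0.03}
single = {"[actor]": 0.08, "[director]": 0.01, "[movie]": 0.20, "[year]": 0.03}
print(pick_slot(combined, single, 0.05))   # -> "[director]+[movie]+[year]"
print(pick_slot(combined, single, 0.02))   # -> "[director]"
```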
In some embodiments, the step of determining the guide text based on the information corresponding to the combined slot/single slot comprises:
And obtaining information corresponding to the combined slot position/single slot position from the knowledge graph, and generating a guide language text based on the information through a natural language generation technology.
For example, the user inputs the voice data "XXX", which is determined to be the poem service; the entity association map contains poem XXX-source-movie XXX, so the association intention can be determined as movie search. In the movie search intention, the voice trigger frequency/times of [director name]+[movie name]+[year] in the past month is lower than the second preset threshold, so the combined slot is determined to be [director name]+[movie name]+[year]. The knowledge graph contains XXX-director-little A and XXX-year-2023, so the director and year of XXX are obtained from the knowledge graph to generate the guide language text: "XXX by little A, 2023".
In some embodiments, the step of determining the guide text based on the information corresponding to the combined slot/single slot comprises:
And obtaining information corresponding to the combined slot position/single slot position from the reply text, and generating a guide language text based on the information through a natural language generation technology.
Illustratively, the user inputs the voice data "play a football game", which is determined to be the sports service, and the reply text is "There is an XX football game at 16:50 today". The scene association map contains sports-association-reminder, so the association intention can be determined as reminder. In the reminder intention, there is no combined slot in the past month, and the voice trigger frequency/times of the single slot [time] is lower than the second preset threshold, so the single slot is determined to be [time]. The specific time is obtained from the reply text, and the guide language text is generated: "Remind me to watch the game at 16:50".
In some embodiments, the step of determining the guide text based on the information corresponding to the combined slot/single slot comprises:
and acquiring information corresponding to the combined slot position/single slot position from the storage text corresponding to the slot position, and generating a guide language text based on the information through a natural language generation technology.
For example, the user inputs the voice data "play XXX", which is determined to be the video service; the time sequence association map contains video-association-playing control, so the association intention can be determined as playing control. In the playing control intention, there is no combined slot in the past month, and the voice trigger frequency/times of the single slot [position] is lower than the second preset threshold, so the single slot is determined to be [position]. The stored texts corresponding to [position] are "first", "second", "last", and so on. The text "first" is randomly obtained from the stored texts corresponding to [position], and the guide language text is generated: "Play the first one".
In some embodiments, the step of determining the guide text based on the information corresponding to the combined slot/single slot comprises:
And acquiring information corresponding to the combined slot position/single slot position from the slot position information, and generating a guide language text based on the information through a natural language generation technology.
For example, the user inputs the voice data "query a ticket from Qingdao to Beijing tomorrow", which is determined to be the air ticket service; the scene association map contains air ticket-association-weather, so the association intention can be determined as weather search. In the weather search intention, there is no combined slot in the past month, and the voice trigger frequency/times of the single slot [destination] is lower than the second preset threshold, so the single slot is determined to be [destination]. The destination Beijing is obtained directly from the slot information, and the guide language text is generated: "Beijing weather".
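The four embodiments above all reduce to filling a short template with the chosen slot's value, wherever that value comes from (knowledge graph, reply text, stored texts, or slot information). A minimal sketch with hypothetical template strings and slot names, none of which are taken from this application:

```python
# Guide-language templates keyed by (association intention, slot description).
GUIDE_TEMPLATES = {
    ("weather search", "[destination]"): "{destination} weather",
    ("reminder", "[time]"): "Remind me to watch the game at {time}",
    ("movie search", "[director]+[movie]+[year]"): "{movie} by {director}, {year}",
}


def build_guide_text(intention, slot_key, slot_values):
    """Fill the template for the chosen slot with values obtained from the knowledge
    graph, the reply text, the stored texts, or the slot information."""
    return GUIDE_TEMPLATES[(intention, slot_key)].format(**slot_values)


# e.g. the destination slot extracted from "query a ticket from Qingdao to Beijing tomorrow"
print(build_guide_text("weather search", "[destination]", {"destination": "Beijing"}))
```

A production system would replace the fixed templates with natural language generation, as the embodiments describe.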
In some embodiments, a specific scene association map is enabled under specific conditions. The specific conditions include a specific time period or a specific holiday.
Illustratively, from 9 p.m. to 5 a.m., the triplet audio and video-association-volume control is added to the scene association map. If the scene association intention is determined to be volume control according to the map, a corresponding natural-sentence generation template is configured for the volume control intention, a preset control value is added, and the guide language text is generated through natural language generation technology: at night, to adjust the volume you can say "set the volume to 8".
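A minimal sketch, with assumed map entries and template wording, of enabling a scene association only within a time window and producing the corresponding guide text:

```python
from datetime import datetime

BASE_SCENE_MAP = [("sports", "association", "reminder")]
NIGHT_ONLY = [("audio and video", "association", "volume control")]


def active_scene_map(now=None):
    """Return the scene association map, adding night-only entries between 9 p.m. and 5 a.m."""
    now = now or datetime.now()
    if now.hour >= 21 or now.hour < 5:
        return BASE_SCENE_MAP + NIGHT_ONLY
    return BASE_SCENE_MAP


def volume_guide(control_value=8):
    # Natural-sentence template for the volume control intention (wording is illustrative).
    return f'At night, to adjust the volume you can say "set the volume to {control_value}".'


print(len(active_scene_map(datetime(2024, 1, 1, 23, 0))))  # -> 2: night-only entry included
print(volume_guide())
```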
Step S506: generating a broadcasting text based on the reply text and the guiding language text;
In some embodiments, the reply text and the guide text may be directly spliced to obtain the broadcast text.
In some embodiments, after the reply text and the guide language text are spliced, grammar correction, part-of-speech adjustment, text deletion or rearrangement, and the like are performed on them using natural language generation technology, so as to obtain a broadcast text that better conforms to natural language conventions.
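A minimal sketch of the splicing step; the NLG-based polishing is only indicated by a comment, and the example strings are borrowed from the reminder illustration above:

```python
def build_broadcast_text(reply_text, guide_text):
    """Splice the reply text and the guide text into a broadcast text.
    A real system would additionally apply grammar correction and rearrangement here."""
    return f"{reply_text} {guide_text}".strip()


print(build_broadcast_text("There is an XX football game at 16:50 today.",
                           'You can say "Remind me to watch the game at 16:50".'))
```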
Step S507: synthesizing broadcast voice based on the broadcast text;
The broadcast text is converted into broadcast voice using speech synthesis technology.
In some embodiments, the broadcast voice with the preset tone is synthesized according to the broadcast text and the default tone parameters.
In some embodiments, the broadcast voice with the target tone color is synthesized according to the broadcast text and the target tone color parameter. The target tone parameter may be a voice broadcast tone parameter selected by a user, or a tone parameter generated by collecting audio data recorded and read by the user and a plurality of target texts.
Step S508: the broadcast voice is transmitted to the terminal device 200, so that the terminal device 200 plays the broadcast voice.
In some embodiments, the server 100 transmits the broadcast text to the terminal device 200 to cause the terminal device 200 to display the broadcast text.
In some embodiments, the server 100 transmits the voice text to the terminal device 200 to cause the terminal device 200 to display the voice text.
If there is no association intention associated with the voice service, the guide language text is determined based on the voice service.
A number of guide texts corresponding to each voice service are stored in advance. After the voice service is obtained, one guide text corresponding to the voice service is randomly selected.
Illustratively, the guide texts stored for the swimming service are: "Pay attention to your diet after swimming" and "You can query nearby bathing places". When the user asks whether today is suitable for swimming, the swimming service is determined; no target triplet exists in the entity association map, so one guide text is selected from those of the swimming service.
In some embodiments, as shown in fig. 10, the terminal device 200 receives voice data input by a user and transmits the voice data to the server 100. The semantic service determines the slot information and the voice service according to the voice data and sends the slot information to the service corresponding to the voice service, which generates a reply text based on the slot information. The semantic service sends the slot information and the voice service to the entity association perception module, and the entity association perception module judges whether an entity association intention exists based on the slot information. If the entity association intention exists, the associated slot information is determined based on the entity association intention and sent to the guide language generation module. The guide language generation module generates a guide language based on the associated slot information and sends it to the semantic service; the semantic service generates a broadcast text based on the reply text and the guide language text, synthesizes the broadcast text into broadcast voice, and sends the broadcast voice to the terminal device 200. The terminal device 200 plays the broadcast voice. If no entity association intention exists, the semantic service or the entity association perception module sends the voice service to the time sequence association perception module, which judges whether a time sequence association intention exists based on the voice service. If the time sequence association intention exists, the associated slot information is determined based on the time sequence association intention and sent to the guide language generation module. If no time sequence association intention exists, the voice service is sent by the semantic service or the time sequence association perception module to the scene association perception module, which judges whether a scene association intention exists based on the voice service. If the scene association intention exists, the associated slot information is determined based on the scene association intention and sent to the guide language generation module. If no scene association intention exists, an intention association failure message is sent to the guide language generation module, which generates the guide language text based on the voice service.
The embodiments of the present application combine domain knowledge with an inference mechanism to infer and decide on the business scene. Specifically, after the service type and the slot information are acquired through the natural language understanding system, the service association configuration library (the entity association map, the time sequence association map, and the scene association map) is queried to obtain a set of association intentions, and then the final association intention and slot are determined from that set according to a decision strategy favoring intentions and slots with a low reach (low voice trigger frequency/times). The guide language text is generated according to the information corresponding to the slot.
According to the embodiments of the present application, the guide language text is generated according to the association relationships between services or intentions, so that the user learns, without noticing it, the service phrasings that voice can reach, and is taught how to use voice in a more natural and more readily accepted manner. Intentions with low usage frequency are recommended to the user for triggering, dynamically increasing their promotion rate. Through this secondary recommendation of association intentions, unobtrusive, seamless, and comprehensive promotion of voice services is achieved.
Some embodiments of the present application provide a voice interaction method, the method being applicable to a server configured to: receive voice data input by a user and sent by a terminal device; identify the voice data to determine a voice service; invoke the service corresponding to the voice service to determine a reply text corresponding to the voice data, and, if an association intention associated with the voice service exists, determine a guide language text based on the association intention, wherein the guide language text is used for guiding the user to use sentences related to the voice data; generate a broadcast text based on the reply text and the guide language text; synthesize broadcast voice based on the broadcast text; and send the broadcast voice to the terminal device so that the terminal device plays the broadcast voice. According to the embodiments of the present application, the association intention of the voice service is obtained, and the guide language text is determined according to the association intention and broadcast to the user, which helps the user accurately grasp voice phrasings in a natural way and improves the accuracy with which the user reaches voice services.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the application.
The foregoing description, for purposes of explanation, has been presented in conjunction with specific embodiments. The illustrative discussions above are not intended to be exhaustive or to limit the embodiments to the precise forms disclosed above. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles and the practical application, to thereby enable others skilled in the art to best utilize the embodiments and various embodiments with various modifications as are suited to the particular use contemplated.

Claims (10)

1. A server, configured to:
Receiving voice data input by a user and sent by terminal equipment;
Identifying the voice data to determine a voice service;
Invoking a service corresponding to a voice service to determine a reply text corresponding to the voice data, and determining a guide language text based on the association intention if the association intention associated with the voice service exists, wherein the guide language text is used for guiding a user to use sentences related to the voice data;
Generating a broadcasting text based on the reply text and the guiding language text;
Synthesizing broadcasting voice based on the broadcasting text;
And sending the broadcasting voice to the terminal equipment so that the terminal equipment plays the broadcasting voice.
2. The server of claim 1, wherein the server performs determining a guide text based on the association intent if there is an association intent associated with the voice service, further configured to:
if an entity association intention associated with the voice service exists in the entity association map, determining the guide language text based on the entity association intention.
3. The server of claim 2, wherein the server is configured to:
If the entity association intention associated with the voice service does not exist in the entity association map, judging whether a time sequence association intention associated with the voice service exists in the time sequence association map, wherein the time sequence association map is used for representing the time sequence association relationship between services or intentions;
if the time sequence association intention associated with the voice service exists in the time sequence association map, judging whether the voice triggering frequency/times of the time sequence association intention is lower than a first preset threshold value;
And if the voice trigger frequency/times of the time sequence association intention are lower than a first preset threshold value, determining the guide language text based on the time sequence association intention.
4. A server according to claim 3, characterized in that the server is configured to:
If the time sequence association pattern does not have the time sequence association intention associated with the voice service, judging whether a scene association intention associated with the voice service exists in a scene association pattern, wherein the scene association pattern is used for representing the association relation between service use scenes;
If scene association intention associated with the voice service exists in the scene association map, judging whether voice triggering frequency/times of the scene association intention are lower than a first preset threshold value or not;
And if the voice trigger frequency/times of the scene association intention are lower than a first preset threshold value, determining the guide text based on the scene association intention.
5. The server of claim 4, wherein the server is configured to:
And if the scene association intention associated with the voice service does not exist in the scene association map, determining the guide text based on the voice service.
6. The server of claim 1, wherein the server performs determining a guide text based on the associated intent, further configured to:
Judging whether a combined slot position exists in the association intention;
if the association intention has a combined slot, judging whether the voice trigger frequency/times of the combined slot is lower than a second preset threshold value;
And if the voice trigger frequency/times of the combined slot is lower than a second preset threshold value, determining the text of the guide language based on the information corresponding to the combined slot.
7. The server of claim 6, wherein the server is configured to:
if the association intention does not have a combined slot, or if the voice trigger frequency/times of the combined slot is not lower than a second preset threshold, judging whether the voice trigger frequency/times of a single slot is lower than the second preset threshold;
And if the voice trigger frequency/times of the single slot is lower than a second preset threshold value, determining the text of the guide language based on the information corresponding to the single slot.
8. A terminal device, comprising:
a sound collector configured to collect voice data input by a user;
a communicator configured to communicate data with the server;
An audio output interface configured to play speech;
A controller configured to:
Acquiring voice data input by a user;
Transmitting the voice data to a server;
Receiving broadcasting voice generated by the server based on the voice data;
and controlling the audio output interface to play the broadcasting voice.
9. A voice interaction method applied to a server, comprising:
Receiving voice data input by a user and sent by terminal equipment;
Identifying the voice data to determine a voice service;
Invoking a service corresponding to a voice service to determine a reply text corresponding to the voice data, and determining a guide language text based on the association intention if the association intention associated with the voice service exists, wherein the guide language text is used for guiding a user to use sentences related to the voice data;
Generating a broadcasting text based on the reply text and the guiding language text;
Synthesizing broadcasting voice based on the broadcasting text;
And sending the broadcasting voice to the terminal equipment so that the terminal equipment plays the broadcasting voice.
10. A voice interaction method applied to a terminal device, comprising:
Acquiring voice data input by a user;
Transmitting the voice data to a server;
Receiving broadcasting voice generated by the server based on the voice data;
And controlling the audio output interface to play the broadcasting voice.
CN202311847031.6A 2023-12-29 2023-12-29 Server, terminal equipment and voice interaction method Pending CN118116378A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311847031.6A CN118116378A (en) 2023-12-29 2023-12-29 Server, terminal equipment and voice interaction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311847031.6A CN118116378A (en) 2023-12-29 2023-12-29 Server, terminal equipment and voice interaction method

Publications (1)

Publication Number Publication Date
CN118116378A true CN118116378A (en) 2024-05-31

Family

ID=91217132

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311847031.6A Pending CN118116378A (en) 2023-12-29 2023-12-29 Server, terminal equipment and voice interaction method

Country Status (1)

Country Link
CN (1) CN118116378A (en)

Similar Documents

Publication Publication Date Title
CN110634483B (en) Man-machine interaction method and device, electronic equipment and storage medium
CN108063969B (en) Display apparatus, method of controlling display apparatus, server, and method of controlling server
EP3190512B1 (en) Display device and operating method therefor
EP2919472A1 (en) Display apparatus, method for controlling display apparatus, and interactive system
CN111372109B (en) Intelligent television and information interaction method
KR20140055502A (en) Broadcast receiving apparatus, server and control method thereof
CN111919249A (en) Continuous detection of words and related user experience
US11568875B2 (en) Artificial intelligence apparatus and method for recognizing plurality of wake-up words
CN112163086A (en) Multi-intention recognition method and display device
CN112165627B (en) Information processing method, device, storage medium, terminal and system
US20240055003A1 (en) Automated assistant interaction prediction using fusion of visual and audio input
CN117809641A (en) Terminal equipment and voice interaction method based on query text rewriting
US11575758B1 (en) Session-based device grouping
CN117809649A (en) Display device and semantic analysis method
WO2023040658A1 (en) Speech interaction method and electronic device
CN118116378A (en) Server, terminal equipment and voice interaction method
CN115602167A (en) Display device and voice recognition method
CN112053688B (en) Voice interaction method, interaction equipment and server
CN114999496A (en) Audio transmission method, control equipment and terminal equipment
CN115359796A (en) Digital human voice broadcasting method, device, equipment and storage medium
CN115146652A (en) Display device and semantic understanding method
CN117809659A (en) Server, terminal equipment and voice interaction method
WO2022193735A1 (en) Display device and voice interaction method
CN115396709B (en) Display device, server and wake-up-free voice control method
US20230267934A1 (en) Display apparatus and operating method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination