CN117831541A - Service processing method based on voiceprint recognition, electronic equipment and server - Google Patents


Info

Publication number
CN117831541A
Authority
CN
China
Prior art keywords
voice
information
mode
target
intention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211204664.0A
Other languages
Chinese (zh)
Inventor
代裕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vidaa Netherlands International Holdings BV
Original Assignee
Vidaa Netherlands International Holdings BV
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vidaa Netherlands International Holdings BV filed Critical Vidaa Netherlands International Holdings BV
Priority to CN202211204664.0A
Publication of CN117831541A
Legal status: Pending (Current)

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The embodiment of the application discloses a service processing method based on voiceprint recognition, an electronic device, and a server. The method comprises the following steps: the electronic device collects voice data and performs voiceprint recognition on the voice data by using a voiceprint model to obtain first voiceprint information; the electronic device acquires target grouping information mapped by the first voiceprint information and acquires a target mode bound to the target grouping information; the electronic device sends the target grouping information and the voice data to a server; the electronic device receives a first voice instruction sent by the server, where the first voice instruction is sent by the server upon detecting, according to the target grouping information and the operation intention of the voice data, that the electronic device has the authority to respond to the voice data; and the electronic device, in response to the first voice instruction, executes the service indicated by the first voice instruction in the target mode according to the target grouping information. The embodiment of the application can improve the security of service processing and protect user privacy.

Description

Service processing method based on voiceprint recognition, electronic equipment and server
Technical Field
The present disclosure relates to the field of speech processing technologies, and in particular, to a service processing method based on voiceprint recognition, an electronic device, and a server.
Background
Electronic devices such as smart televisions and smartphones can recognize the identity of the user who inputs the current voice data through voiceprint recognition, and match a corresponding target mode, such as a child mode or an elderly mode, according to that identity. In this way, the electronic device can perform certain business functions while operating in the target mode. However, the mode configuration of such electronic devices is simplistic, their use is insufficiently secure, and user privacy is easily leaked.
Disclosure of Invention
In order to solve the above technical problems, the embodiment of the application provides a service processing method based on voiceprint recognition, an electronic device, and a server, which can improve the security of service processing and protect user privacy.
In a first aspect, an embodiment of the present application provides an electronic device, including:
a sound collector;
a communicator, configured to communicatively connect with a server;
a controller configured to perform:
controlling the sound collector to collect voice data, and performing voiceprint recognition on the voice data by using a voiceprint model to obtain first voiceprint information; acquiring target grouping information mapped by the first voiceprint information, and acquiring a target mode bound to the target grouping information; transmitting the target grouping information and the voice data to the server; receiving a first voice instruction sent by the server, where the first voice instruction is sent by the server upon detecting, according to the target grouping information and the operation intention of the voice data, that the electronic device has the authority to respond to the voice data; and, in response to the first voice instruction, executing the service indicated by the first voice instruction in the target mode according to the target grouping information. In this way, after the user inputs voice data, the electronic device can perform voiceprint recognition, determine that the user belongs to a target grouping according to the mapping relationship between voiceprint information and groupings, and then determine the target mode matching the current user's voice operation according to the binding relationship between groupings and modes. In addition, the server can verify, according to the target grouping information and the voice data, whether the electronic device has the authority to respond to the voice data; if the authority verification succeeds, the server sends the first voice instruction to the electronic device, so that the electronic device executes, in the target mode, the service indicated by the operation intention contained in the first voice instruction.
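As a purely illustrative aid (not part of the claimed subject matter), the following minimal Python sketch outlines this device-side flow; every identifier in it (GROUP_BY_VOICEPRINT, check_permission, and so on) is a hypothetical stand-in rather than a name defined by this application.

```python
# Illustrative sketch of the device-side flow; all names are hypothetical.
GROUP_BY_VOICEPRINT = {"voiceprint_A": "parental_group"}  # set at registration
MODE_BY_GROUP = {"parental_group": "parental_mode"}       # binding relationship

def switch_to(mode: str) -> None:
    print(f"switching to {mode}")                         # stub mode switch

def execute_service(instruction: dict, group: str, mode: str) -> None:
    print(f"executing {instruction} for {group} in {mode}")  # stub executor

def handle_voice(voiceprint_id: str, voice_data: bytes, server) -> None:
    group = GROUP_BY_VOICEPRINT.get(voiceprint_id, "unregistered_group")
    mode = MODE_BY_GROUP.get(group, "guest_mode")
    # the server verifies authority from the grouping and the recognized
    # intention, returning a first voice instruction only on success
    instruction = server.check_permission(group, voice_data)
    if instruction is not None:
        switch_to(mode)
        execute_service(instruction, group, mode)
```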
In some implementations, before controlling the sound collector to collect voice data, the controller is further configured to execute: acquiring grouping configuration data, mode configuration data, and binding relationship data, where the grouping configuration data comprises grouping types; the mode configuration data comprises setting information of the mode options contained in the system and the applications corresponding to at least one mode type; and the binding relationship data comprises binding relationships between grouping types and mode types and binding relationships between grouping types and application data visibility authority; and uploading the grouping configuration data, the mode configuration data, and the binding relationship data to the server. In this way, the electronic device stores the grouping configuration data, the mode configuration data, and the binding relationship data in the cloud, which on one hand reduces the memory consumption of the electronic device, and on the other hand allows the mode configuration to cover the mode options of both the system of the electronic device and each application, so that the mode configuration is more refined, authority control over service execution in different modes is realized, and the user experience is improved.
In some implementations, after acquiring the target mode bound to the target grouping information, the controller is further configured to perform: comparing the current mode with the target mode; if the current mode is inconsistent with the target mode, sending a first data request to the server, where the first data request instructs the server to send the configuration information of the target mode and cluster information to the electronic device, and the cluster information is obtained by the server performing cluster analysis on the mode configuration data uploaded by a plurality of electronic devices; receiving the configuration information of the target mode and the cluster information sent by the server; switching the mode running on the electronic device to the target mode according to the configuration information of the target mode; and, in response to the first voice instruction, executing the service indicated by the first voice instruction according to the target grouping information and the cluster information. In this way, the electronic device decides whether to switch modes according to the consistency between the current mode and the target mode, realizing accurate mode control. In addition, the electronic device can execute the service by using the target grouping information and the cluster information, which enables accurate authority control over service execution and makes the service processing better fit the user's intention preference.
In some implementations, the controller is further configured to perform: if the current mode is consistent with the target mode, sending a second data request to the server, where the second data request instructs the server to send cluster information to the electronic device; receiving the cluster information sent by the server; receiving a second voice instruction sent by the server, where the second voice instruction is sent by the server upon detecting, according to the current grouping information and the operation intention of the voice data, that the electronic device has the authority to respond to the voice data; and, in response to the second voice instruction, executing the service indicated by the second voice instruction in the current mode according to the current grouping information and the cluster information. In this way, when the current mode and the target mode are consistent, the electronic device does not need to switch modes, and the states of the current grouping and the current mode remain unchanged, so that accurate mode control is realized according to the use scenario and the user. In addition, the electronic device can execute the service by using the current grouping information and the cluster information, which enables accurate authority control over service execution and makes the service processing better fit the user's intention preference.
In a second aspect, embodiments of the present application provide a server, including:
a communicator, configured to communicatively connect with an electronic device;
a controller configured to perform: receiving target grouping information and voice data sent by the electronic device, where the target grouping information is the grouping information mapped by the first voiceprint information, obtained after the electronic device performs voiceprint recognition on the voice data; performing semantic analysis on the voice data and recognizing the operation intention of the user's voice; and, if the electronic device has the authority to respond to the voice data, generating a first voice instruction according to the target grouping information and the operation intention of the user's voice, and sending the first voice instruction to the electronic device, where the first voice instruction instructs the electronic device to execute, in the target mode, the service indicated by the first voice instruction according to the target grouping information. In this way, after the user inputs voice data, the electronic device can perform voiceprint recognition and determine that the user belongs to a target grouping according to the mapping relationship between voiceprint information and groupings, so as to match the target mode corresponding to the current user's voice operation. In addition, the server can verify, according to the target grouping information and the voice data, whether the electronic device has the authority to respond to the voice data; if the authority verification succeeds, the server sends the first voice instruction to the electronic device, so that the electronic device executes, in the target mode, the service indicated by the operation intention contained in the first voice instruction.
In some implementations, prior to receiving the target grouping information and the voice data sent by the electronic device, the controller is further configured to perform: receiving grouping configuration data, mode configuration data, and binding relationship data sent by the electronic device, where the grouping configuration data comprises grouping types; the mode configuration data comprises setting information of the mode options contained in the system and the applications corresponding to at least one mode type; and the binding relationship data comprises binding relationships between grouping types and mode types and binding relationships between grouping types and application data visibility authority; and storing the grouping configuration data, the mode configuration data, and the binding relationship data. In this way, the electronic device stores the grouping configuration data, the mode configuration data, and the binding relationship data in the cloud, which on one hand reduces the memory consumption of the electronic device, and on the other hand allows the mode configuration to cover the mode options of both the system of the electronic device and each application, so that the mode configuration is more refined, authority control over service execution in different modes is realized, and the user experience is improved.
In some implementations, the controller is further configured to perform: performing cluster analysis on the mode configuration data uploaded by a plurality of electronic devices by using a clustering model to obtain cluster information, where the cluster information comprises an intention label mapped by each clustering result and an intention preference coefficient corresponding to each intention label; calculating a loss function of the intention recognition model according to the intention recognition model and the intention preference coefficients; and adjusting the operation parameters of the intention recognition model so as to minimize the loss function. In this way, by integrating the mode configuration data uploaded by a plurality of electronic devices, the server can cluster and analyze the possible intention preferences of a large number of users, map an intention label to each clustering result, and set an intention preference coefficient for each intention label. Before intention recognition, the intention preference values are used in advance to guide the training of the intention recognition model, so that the trained intention recognition model better fits the users' intention preferences, the precision of the intention recognition model is improved, and personalized intention recognition is realized without collecting user privacy data.
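As a rough illustration of how such preference coefficients could guide training, the sketch below scales a cross-entropy loss by the target intent's preference coefficient; the weighted cross-entropy form and all names are assumptions, not the application's actual algorithm.

```python
import numpy as np

# Hypothetical cluster output: intent labels with preference coefficients.
INTENT_LABELS = ["play_video", "play_music", "launch_game"]
PREFERENCE = np.array([0.6, 0.3, 0.1])  # intention preference coefficients

def preference_weighted_loss(probs: np.ndarray, target: int) -> float:
    """Cross-entropy scaled by the target intent's preference coefficient,
    so training is steered toward the intents users actually prefer."""
    return float(-PREFERENCE[target] * np.log(probs[target]))

# One training sample: the model assigns 0.7 probability to "play_video".
loss = preference_weighted_loss(np.array([0.7, 0.2, 0.1]), target=0)
# The model parameters would then be adjusted to minimize this loss.
```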
In some implementations, when recognizing the operation intention of the user's voice, the controller is specifically configured to: if the intention recognition model cannot recognize the operation intention of the user's voice, acquire a target label from the intention labels, where the intention preference coefficient corresponding to the target label is greater than or equal to the confidence of the intention recognition model, and determine the intention corresponding to the target label as the operation intention of the user's voice. Or, if the intention recognition model can recognize the operation intention of the user's voice, detect the confidence of the intention recognition model; if the confidence of the intention recognition model is greater than or equal to a confidence threshold, determine the intention output by the intention recognition model as the operation intention of the user's voice; and if the confidence of the intention recognition model is smaller than the confidence threshold, correct the intention output by the intention recognition model according to the intention preference coefficients and the probability of the intention output by the model. In this way, the intention recognition result output by the intention recognition model can be decided through the cluster information and the confidence of the model, and when the confidence of the model is low, the result is corrected by using parameters such as the intention preference coefficients, so that the recognized intention better matches the user's preference and the accuracy of service execution is improved.
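The decision rule described above can be sketched as follows; the threshold value, the correction rule (multiplying the model probabilities by the preference coefficients), and all names are illustrative assumptions.

```python
import numpy as np

CONF_THRESHOLD = 0.8  # hypothetical confidence threshold

def decide_intent(model_probs, model_conf, labels, preference):
    """Decision rule sketched from the description above (names assumed).
    model_probs is None when the model cannot recognize any intent."""
    if model_probs is None:
        # fall back to intent labels whose preference coefficient
        # is greater than or equal to the model confidence
        candidates = [i for i, p in enumerate(preference) if p >= model_conf]
        if not candidates:
            return None
        return labels[max(candidates, key=lambda i: preference[i])]
    if model_conf >= CONF_THRESHOLD:
        return labels[int(np.argmax(model_probs))]  # trust the model output
    # low confidence: correct the output using the preference coefficients
    return labels[int(np.argmax(np.asarray(model_probs) * preference))]
```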
In a third aspect, an embodiment of the present application further provides a service processing method based on voiceprint recognition, where the method includes:
the electronic equipment collects voice data, and voiceprint recognition is carried out on the voice data by using a voiceprint model to obtain first voiceprint information;
the electronic equipment acquires target grouping information mapped by the first voiceprint information and acquires a target mode bound by the target grouping information;
the electronic equipment sends the target grouping information and voice data to a server;
the electronic equipment receives a first voice instruction sent by a server, wherein the first voice instruction is sent by the server when detecting that the electronic equipment has the authority to respond to voice data according to the target grouping information and the operation intention of the voice data;
the electronic device responds to the first voice instruction and executes the service indicated by the first voice instruction according to the target grouping information in the target mode.
In a fourth aspect, an embodiment of the present application further provides a service processing method based on voiceprint recognition, where the method includes:
the method comprises the steps that a server receives target grouping information and voice data sent by electronic equipment, wherein the target grouping information is grouping information which is obtained by the electronic equipment after voiceprint recognition of the voice data and is mapped with first voiceprint information;
the server performs semantic analysis on the voice data and recognizes the operation intention of the user's voice;
if the electronic equipment has the authority to respond to the voice data, the server generates a first voice instruction according to the target grouping information and the operation intention of the voice of the user, and sends the first voice instruction to the electronic equipment, wherein the first voice instruction is used for indicating the electronic equipment to execute the service indicated by the first voice instruction according to the target grouping information in the target mode.
In a fifth aspect, embodiments of the present application also provide a computer storage medium having stored therein program instructions which, when run on a computer, cause the computer to perform the methods involved in the above aspects and their respective implementations.
Drawings
FIG. 1 is an operational scenario diagram of voice service processing according to an embodiment of the present application;
FIG. 2 is a block diagram of a hardware configuration of an electronic device according to an embodiment of the present application;
FIG. 3 is a block diagram of a software architecture configuration of a server and an electronic device according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a voice interaction network architecture according to an embodiment of the present application;
FIG. 5 is a first schematic diagram of a voiceprint rights management page displayed by an electronic device according to an embodiment of the present application;
FIG. 6 is a second schematic diagram of a voiceprint rights management page displayed by an electronic device according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a first registration page displayed by an electronic device according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a second registration page displayed by an electronic device according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a grouping and binding management page displayed by an electronic device according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a mode management page displayed by an electronic device according to an embodiment of the present application;
FIG. 11 is a schematic diagram of a mode-system setting page displayed by an electronic device according to an embodiment of the present application;
FIG. 12 is a flowchart of a service processing method based on voiceprint recognition performed on the electronic device side according to an embodiment of the present application;
FIG. 13 is a schematic diagram of a server implementing label mapping of clustering results based on a questionnaire according to an embodiment of the present application;
FIG. 14 is a schematic diagram of a server identifying voice intent using cluster information according to an embodiment of the present application;
FIG. 15 is a flowchart of interaction between an electronic device and a server according to an embodiment of the present application.
Detailed Description
For purposes of clarity and implementation of the present application, the following provides a clear and complete description of exemplary implementations of the present application with reference to the accompanying drawings in which those implementations are illustrated. It is apparent that the described exemplary implementations are only some, but not all, of the examples of the present application.
It should be noted that the brief description of the terms in the present application is only for convenience in understanding the embodiments described below, and is not intended to limit the embodiments of the present application. Unless otherwise indicated, these terms should be construed in their ordinary and customary meaning.
The terms "first," second, "" third and the like in the description and in the claims and in the above-described figures are used for distinguishing between similar or similar objects or entities and not necessarily for limiting a particular order or sequence, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances.
Fig. 1 is an operation scenario diagram of voice service processing provided in an embodiment of the present application. As shown in fig. 1, in an operation scenario including a server 100 and an electronic device 200, the electronic device 200 illustratively includes a smart television 200a, a mobile terminal 200b, a smart speaker 200c, and the like.
The server 100 and the electronic device 200 in the present application may perform data interaction through various communication manners. The electronic device 200 may establish a communication connection via a local area network (LAN), a wireless local area network (WLAN), or another network. The server 100 may provide the electronic device 200 with semantic parsing and intention recognition results, various business-related data, and the like. For example, the electronic device 200 may exchange information and data with the server 100, receive software program updates, and the like.
The server 100 may be a server providing various services, such as a background server providing support for voice data collected by the electronic device 200. The server 100 may perform semantic analysis, intention recognition, etc. on the received data such as voice, and feed back the processing result (e.g., recognized voice intention) to the electronic device 200. The server 100 may also transmit corresponding service data (e.g., media data, voiceprint rights/mode data, etc.) to the electronic device 200 in response to a service request of the electronic device 200. The server 100 may be a server cluster, or may be a plurality of server clusters, and may include one or more types of servers.
The electronic device 200 may be a hardware device or a software device. When the electronic device 200 is a hardware device, it may be various electronic devices having a sound collection function, including but not limited to a smart television, a smart phone, a smart speaker, a tablet computer, an electronic book reader, a smart watch, a smart game machine, a computer, an AI device, a robot, a smart car terminal, etc. When the electronic device 200 is a software apparatus, at least one software function module/service/model (e.g., a sound collection module, a voiceprint model, a voiceprint rights management module, etc.) may be included, and the software apparatus may be applied to the above-listed hardware electronic device.
The service processing method based on voiceprint recognition provided by the embodiment of the application can be completed through communication interaction between the server 100 and the electronic device 200.
Fig. 2 is a block diagram of a hardware configuration of an electronic device 200 according to an embodiment of the present application. As shown in fig. 2, the electronic device 200 may include at least one of a communicator 210, a detector 220, an external device interface 230, a controller 240, a display 250, an audio output interface 260, a user interface 270, a memory 280, and a power supply. The controller 240 may include a central processor, a video processor, an audio processor, a graphics processor, a RAM, a ROM, and first to nth interfaces for input/output.
The display 250 includes a display screen component for presenting pictures and a driving component for driving image display, and is configured to receive image signals output from the controller 240 and to display video content, image content, menu manipulation interfaces, and user manipulation UI interfaces. The display 250 may be a liquid crystal display, an OLED display, a projection device, or a projection screen.
The communicator 210 is a component for communicating with external devices or the server 100 according to various communication protocol types. For example, the communicator 210 may include at least one of a Wi-Fi module, a Bluetooth module, a wired Ethernet module, another network communication protocol chip or near field communication protocol chip, and an infrared receiver. The electronic device 200 may establish a communication connection with the server 100 through the communicator 210 to enable transmission and reception of control signals and data signals.
The user interface 270 may be used to receive external control signals, such as user operations based on user interface inputs.
The detector 220 may be used to collect signals of the external environment or of interaction with the outside. For example, the detector 220 may include a light receiver, a sensor for acquiring the ambient light intensity. The detector 220 may also include an image collector, such as a camera, for collecting external environmental scenes, user attributes, or user interaction gestures. The detector 220 may also include a sound collector for collecting sounds in the external environment, such as voice commands issued by a user to the electronic device 200.
The sound collector may be a microphone, also called a "mic", which may be used to receive the user's voice and convert the sound signal into an electrical signal. The electronic device 200 may be provided with at least one microphone. In some embodiments, the electronic device 200 may be provided with two microphones, which can implement a noise reduction function in addition to collecting sound signals. The electronic device 200 may also be provided with three, four, or more microphones to enable collection of sound signals, noise reduction, identification of sound sources, directional recording, and the like.
Further, the microphone may be built in the electronic device 200, or the microphone may be connected to the electronic device 200 by a wired or wireless means. Of course, the mounting position of the microphone on the electronic device 200 is not limited in the embodiment of the present application. Alternatively, the microphone may not be included in the hardware of the electronic device 200, i.e., the microphone is not built into the electronic device 200. The electronic device 200 may be coupled to a microphone via some interface (e.g., a USB interface, etc.), and the coupled microphone may be secured to any location on the electronic device 200 via an external mount (e.g., a microphone stand).
The controller 240 controls task execution of the electronic device 200 and responds to the user's operations or voice instructions through various software programs stored in the memory 280. The controller 240 controls the overall operation of the electronic device 200.
The controller 240 may include at least one of a central processing unit (CPU), a video processor, an audio processor, a graphics processing unit (GPU), a random access memory (RAM), a read-only memory (ROM), first to nth interfaces for input/output, a communication bus, and the like.
The electronic device 200 may have different software configurations under different device types and operating systems. Fig. 3 is a software architecture configuration block diagram of a server and an electronic device according to an embodiment of the present application. As shown in fig. 3, the system of the electronic device 200 may be divided into three layers, an application layer 21, a middleware layer 22, and a hardware layer 23, from top to bottom.
The application layer 21 mainly includes the common applications in the electronic device 200 and an application framework (Application Framework). The common applications may include applications developed based on browsers (e.g., HTML5 applications), native applications (Native APPs), speech recognition applications, and the like. The speech recognition application may provide a voice interaction interface and services.
Middleware layer 22 may include middleware for multimedia protocols, system components, and the like. The middleware can use basic services (functions) provided by the system software to connect various parts of the application system or different applications on the network, so that the purposes of resource sharing and function sharing can be achieved.
As shown in fig. 3, the hardware layer 23 may include the communicator 210, the detector 220, the controller 240, and the like shown in fig. 2. The middleware layer 22 may include a detector driver, which may specifically include a microphone driver for transmitting voice data collected by the detector 220 to a voice recognition application. Illustratively, when a voice recognition application in the electronic device 200 is started and the electronic device 200 has established a communication connection with the server 100, the microphone driver is configured to send voice data input by the user and collected by the detector 220 to the voice recognition application, and the voice recognition application sends a query request containing the voice data to the server 100.
The server 100 may include a communication control module 101, an intention recognition module 102, and a data storage module 103. After receiving the query request, the server 100 sends the voice data contained in the query request to the intention recognition module 102. The intention recognition module 102 inputs the voice data into a speech understanding model, which performs semantic analysis on the voice data and recognizes the voice intention of the user.
Fig. 4 is a schematic diagram of a voice interaction network architecture according to an embodiment of the present application. As shown in fig. 4, the electronic device 200 collects voice data input by a user and transmits the voice data to the server 100. The server 100 may be configured with a speech recognition module, a semantic understanding module, a business service module, and the like. The speech recognition module may be deployed with an ASR (Automatic Speech Recognition) service for converting voice data into text; the semantic understanding module may be deployed with an NLU (Natural Language Understanding) service for semantic parsing of the text to identify the intent of the user's voice; and the business service module is used for providing query functions (such as weather query, media asset query, and the like) according to the business service matched with the voice instruction input by the user. In the architecture shown in fig. 4, there may be multiple entity service devices deployed with different business services, and one or more functional services may be integrated in one or more entity service devices.
The following describes a process of handling information input to the electronic device 200 based on the architecture shown in fig. 4, taking as an example a media asset playing sentence input through voice. The process may include:
[ Speech recognition ]
After receiving the media asset playing sentence input through voice, the electronic device may perform noise reduction processing and feature extraction on the audio of the media asset playing sentence, where the noise reduction processing may include removing echo, environmental noise, and the like. The electronic device sends the audio of the media asset playing sentence to the speech recognition module in the server 100, which converts the audio into text.
[ Semantic understanding ]
The semantic understanding module uses a speech understanding model to perform natural language understanding on the recognized candidate text and the associated context information, and parses the text into structured, machine-readable information such as business fields, intentions, and word slots to express the semantics. After the speech understanding model obtains an actionable intent, a confidence score for the intent may be determined. The semantic understanding module selects one or more candidate actionable intents based on the determined intent confidence scores.
[ Business service ]
The semantic understanding module issues a first instruction to the corresponding business service module according to the intent recognition result for the text of the media asset playing sentence. The business service module obtains the target media asset data in response to the first instruction and transmits the target media asset data to the electronic device 200. Thus, after receiving the target media asset data, the electronic device 200 plays the target media asset, thereby completing the voice response and the service execution.
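A condensed sketch of this three-stage pipeline is shown below; asr(), nlu(), and fetch_media() are stubbed stand-ins for the deployed services, not real APIs.

```python
# Condensed sketch of the Fig. 4 pipeline; all functions are stubs.

def asr(audio: bytes) -> str:
    return "play movie A"  # stub: speech recognition converts audio to text

def nlu(text: str) -> dict:
    # stub: semantic parsing into business field, intention, and word slots
    return {"domain": "media", "intent": "play", "slots": {"title": "movie A"}}

def fetch_media(intent: dict) -> str:
    return f"media-data-for-{intent['slots']['title']}"  # stub business service

def handle_media_query(audio: bytes) -> str:
    return fetch_media(nlu(asr(audio)))  # device then plays the returned media
```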
It should be noted that the architecture shown in fig. 4 is only an example, and is not intended to limit the scope of the present application. Other architectures may also be employed to achieve similar functionality in embodiments of the present application, for example: all or part of the above-mentioned processes may be performed by the electronic device, which is not described herein. The structure of the server 100 is not limited in this embodiment, and the server 100 may at least include a communicator and a controller, where the communicator is configured to implement communication connection between the server 100 and the electronic device 200, and the controller is configured to perform related operations, control logic, and service functions in the server 100.
In some application scenarios, the electronic device may set corresponding usage modes for users with different identities or in different groups, so that the mode operation and mode switching of the device can be controlled according to the identity and authority of the current user. Taking family members as an example, the modes can be divided into: a parental mode, a child mode, an elderly mode, a visitor mode, a guest mode, and the like. The mode in which the electronic device operates may be actively switched by the user; for example, if the electronic device is currently operating in the parental mode and child 1 among the family members wants to watch it, the parent may switch the device to the child mode. This causes the following problems:
Problem one: the mode setting is simple, and the mode authority control at the system level and the application level is lacked.
Problem two: building on problem one, there is also a lack of data privacy protection and visibility management within applications while running in each mode, which is a potential use security risk. For example, children and elderly people should not watch horror-type media assets in video applications, but the child mode and the elderly mode have no relevant settings, which may frighten children and elderly people. As another example, surveillance videos of bedrooms in a home generally involve the user's life privacy; if the bedroom surveillance videos are visible to anyone, this can lead to privacy leakage and even raise security risks.
Problem three: the electronic device passively switches modes, and cannot automatically switch modes according to a use scene and a user, so that the user is inconvenient to use.
In order to solve the first and second problems, the electronic device in the embodiment of the present application may set system-level and application-level modes for users with different identities or groups, so as to implement full coverage of mode options and finer mode authority control, thereby improving user experience and use security.
Wherein the system level mode may contain at least one mode option including, but not limited to: theme, font, sound, application list, etc. The application list comprises application types allowed to be displayed by the system level mode.
In one example, the system level mode is set, for example, to:
parental mode: theme (warm), font (moderate), sound (moderate), list of applications (show all applications).
Child mode: themes (eye-protection, lovely), fonts (slightly louder), sounds (moderate), application list (no game applications are shown).
Old man mode: theme (soft), font (big), sound (slightly higher), application list (show video class application, music class application, news class application only).
Guest mode: theme (hot), font (big), sound (slightly high), list of applications (only video class applications, music class applications, game class applications are shown).
Guest mode: theme (concise), font (moderate), sound (moderate), list of applications (only video class applications are shown).
In some implementations, each mode option of the system level mode may include at least one sub-option, such as: the theme mode options may include theme visual styles (e.g., warm, eye-protecting, lovely, etc.), custom themes (themes that the user sets by himself using tools such as a color extractor), etc.; font mode options may include glyphs (e.g., regular, oblique, bolded, etc.), font types (e.g., song Ti, bold, regular script, etc.), font sizes, font colors, etc.; the sound mode options may include tone type (e.g., male, female, child, etc.), volume size, etc. at the time of the voice announcement.
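For illustration, the system-level mode configuration data could take a shape like the following; the keys and values merely restate the examples above and are not prescribed by the application.

```python
# One possible shape of the system-level mode configuration data; the keys
# and values below restate the examples above and are purely illustrative.
SYSTEM_MODES = {
    "parental_mode": {
        "theme": {"style": "warm"},
        "font": {"glyph": "regular", "size": "moderate"},
        "sound": {"tone": "female", "volume": "moderate"},
        "app_list": {"show": "all"},
    },
    "child_mode": {
        "theme": {"style": "eye-protection"},
        "font": {"size": "slightly large"},
        "sound": {"volume": "moderate"},
        "app_list": {"hidden": ["game"]},  # game applications are not shown
    },
}
```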
The electronic device may set a corresponding application-level mode for each installed application. The application level mode may contain at least one mode option including, but not limited to: application theme, application font, content classification, etc. The electronic equipment can also set the visibility of the content classification of each application in different modes, so that users with different identities and groups can request application services according to the authority, browse of invisible content is avoided, and privacy safety and use safety of the users are ensured.
In one example, taking a video application as an example, the content classification of a video application may be divided into: horror, action, fun, food, electronic contests, finance, and the like. The visibility of the content classification of video applications in different groupings/modes is set, for example, as follows:
Parental group/parental mode: all classified videos are visible.
Child group/child mode: videos of horror, action, fun, food, and electronic contests are not visible, and other types of video are visible.
Elderly group/elderly mode: videos of horror, action, make-up, fashion, travel, shopping, and electronic contests are not visible, and other types of video are visible.
Visitor group/visitor mode: videos of horror, sadness, finance, and cars are not visible, and other types of video are visible.
Guest group/guest mode: only the hot-cast video is visible, and other types of video are not visible.
In another example, taking a video surveillance application as an example, the content classification of a video surveillance application may be divided into: living room surveillance video, master bedroom surveillance video, second bedroom 1 surveillance video, second bedroom 2 surveillance video, outdoor surveillance video, and the like. The visibility of the content classification of video surveillance applications in different groupings/modes is set, for example, as follows (see the illustrative sketch after this list):
Parental group/parental mode: all classified surveillance videos are visible.
Child group/child mode: the surveillance videos of the living room, second bedroom 1, and outdoors are visible, and the surveillance videos of the master bedroom and second bedroom 2 are invisible.
Elderly group/elderly mode: the surveillance videos of the living room, second bedroom 1, second bedroom 2, and outdoors are visible, and the surveillance video of the master bedroom is invisible.
Visitor group/visitor mode: the surveillance videos of the living room and outdoors are visible, and the surveillance videos of the master bedroom, second bedroom 1, and second bedroom 2 are invisible.
Guest group/guest mode: only the outdoor surveillance video is visible, and the other surveillance videos are invisible.
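For illustration, the per-application, per-group visibility rules from the two examples above could be represented and checked as follows; the data structure and function names are assumptions.

```python
# Illustrative per-application, per-group content-visibility rules combining
# the two examples above; the structure is assumed for demonstration only.
VISIBILITY = {
    "video_app": {
        "child_group": {"hidden": ["horror", "action", "fun", "food", "e-contest"]},
        "guest_group": {"shown": ["hot-cast"]},  # everything else hidden
    },
    "surveillance_app": {
        "child_group": {"hidden": ["master_bedroom", "second_bedroom_2"]},
        "guest_group": {"shown": ["outdoor"]},
    },
}

def is_visible(app: str, group: str, category: str) -> bool:
    rule = VISIBILITY.get(app, {}).get(group, {})
    if "shown" in rule:                            # whitelist-style rule
        return category in rule["shown"]
    return category not in rule.get("hidden", [])  # blacklist-style rule

# e.g. is_visible("video_app", "child_group", "horror") -> False
```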
In order to solve the third problem, the embodiment of the present application further provides a mode authority management manner based on voiceprint recognition: through interaction between the electronic device and the user, user voiceprint groupings are preset, a mode bound to each user voiceprint grouping is set, voiceprint registration is performed, and the grouping mapped by each registered voiceprint is set. In this way, after the electronic device receives voice data input by the user, it can extract the voiceprint information of the current user (hereinafter referred to as first voiceprint information) from the voice data, identify the user voiceprint grouping mapped by the first voiceprint information as the target grouping, query the target mode corresponding to the target grouping, and switch the electronic device to the target mode.
For family members, user voiceprint groupings can be set according to the user population and identity characteristics. The user voiceprint groupings can include: a parental group, a child group, an elderly group, a visitor group, and an unregistered group. For example, if the family members with registered voiceprints include the father (parent 1), the mother (parent 2), the son (child 1), the grandpa (elderly 1), and the grandma (elderly 2), the voiceprint information of the father and the mother may be stored in the parental group, the voiceprint information of the son may be stored in the child group, and the voiceprint information of the grandpa and the grandma may be stored in the elderly group. The voiceprint information of other persons with registered voiceprints (e.g., an uncle, a friend of the father, etc.) is stored in the visitor group. Users with unregistered voiceprint information are uniformly placed in the unregistered group.
The parental group can be bound to the parental mode, the child group to the child mode, the elderly group to the elderly mode, the visitor group to the visitor mode, and the unregistered group to the guest mode. Each mode specifically includes a system-level mode and an application-level mode.
Fig. 5 and fig. 6 are two schematic diagrams of a voiceprint rights management page displayed by an electronic device according to an embodiment of the present application. The user may launch the voiceprint rights management application or access the voiceprint rights management page through a specified menu path. As shown in fig. 5, the voiceprint rights management page 50 may include a switch button 51, which is used to turn the voiceprint rights management function on or off. In response to the user's operation of turning on the switch button 51, as shown in fig. 6, the controller controls the display to present the switch button 51 in the on state and displays a registration button 52, a grouping and binding management button 53, and a mode management button 54 on the voiceprint rights management page 50. As shown in fig. 5, if the switch button 51 is in the off state, the voiceprint rights management page 50 may not display the registration button 52, the grouping and binding management button 53, and the mode management button 54.
In response to the user clicking the registration button 52, the controller controls the display to display the first registration page 70 shown in fig. 7 and may activate the sound collector (e.g., a microphone). The first registration page 70 displays a prompt for guiding the user to input preset voice information, i.e., the content to be spoken when the user registers a voiceprint, for example, the prompt "please say 'I want to register'". The sound collector sends the voice data collected after the user utters "I want to register" to the controller, and the controller sends the voice data to the voiceprint model.
The voiceprint model can use a voiceprint recognition algorithm to perform operations such as preprocessing and feature extraction on the voice data, thereby obtaining the voiceprint information A registered by user A. The embodiment does not limit the voiceprint model training and voiceprint recognition algorithms; for example, a voiceprint recognition method based on vector quantization may be adopted, which can be implemented with reference to the related art and is not repeated here.
After the voiceprint model extracts voiceprint information A, the controller may control the display to display the second registration page 80 shown in fig. 8, which guides user A to set the grouping type mapped by voiceprint information A and to complete registration. The second registration page 80 may include an identity setting control 81, a group selection control 82, and a confirm registration button 83.
The identity setting control 81 is used to set the identity of user A, who is currently performing voiceprint registration; the identity may be user A's identity among the family members, such as "father" or "mother", or an identity nickname customized by user A, such as "Xiaoming".
The group selection control 82 is used to set the target group required by user A, who is currently performing voiceprint registration; for example, the target group selected by user A in fig. 8 is the "parental group". If the grouping type and the identity are confirmed to be correct, user A may click the confirm registration button 83. In response to the user clicking the confirm registration button 83, the controller controls the memory to store the mapping relationship between voiceprint information A and the "parental group".
In response to the user clicking the confirm registration button 83, if it is found that user A has not set a target group through the group selection control 82, the visitor group can be set as the target group, and the memory is controlled to store the mapping relationship between voiceprint information A and the visitor group.
The electronic device can establish and maintain a voiceprint library, which records the mapping relationships between registered voiceprint information and groupings. The first authority management module is also used to expand, delete, modify, and edit the voiceprint library according to user operations, thereby dynamically updating the voiceprint library.
In response to the user clicking the grouping and binding management button 53, the controller may control the display to display the grouping and binding management page 90 shown in fig. 9. The grouping and binding management page 90 contains the user-created grouping types, including but not limited to the parental group, the child group, the elderly group, the visitor group, and the guest group. Through the grouping and binding management page 90, the user may create new groupings, delete existing groupings (the unregistered group cannot be deleted), and switch the grouping of one or more users' voiceprints. For example, a user identified as "Xiaoming" is located in the guest group, but "Xiaoming" is actually parent 1 among the family members; the user may drag the "Xiaoming" identifier to the parental group to make "Xiaoming" one of the members of the parental group. After the electronic device detects that the user has deleted a grouping or switched a grouping, the voiceprint model can be retrained.
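A minimal illustration of such a voiceprint library and a grouping switch follows; all identifiers are hypothetical.

```python
# Minimal illustration of the voiceprint library: registered voiceprint
# identifiers mapped to groupings, with a group-switch operation.
voiceprint_library = {
    "voiceprint_A": "parental_group",  # e.g. father
    "voiceprint_B": "guest_group",     # e.g. "Xiaoming", initially misplaced
}

def switch_group(voiceprint_id: str, new_group: str) -> None:
    voiceprint_library[voiceprint_id] = new_group
    # per the description above, the voiceprint model can be retrained
    # after a grouping is deleted or switched

switch_group("voiceprint_B", "parental_group")  # drag "Xiaoming" to parents
```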
The user may also set the mode type bound to each grouping type through the grouping and binding management page 90; for example, the user may bind the parental group to the parental mode according to the use requirement, or may instead bind the parental group to the elderly mode. In this way, the electronic device can dynamically manage and update the user voiceprint groupings and the binding relationships between the user voiceprint groupings and the modes.
Fig. 10 is a schematic diagram of a mode management page displayed by an electronic device according to an embodiment of the present application. In response to the user clicking the mode management button 54, the controller may control the display to display the mode management page 110 shown in fig. 10. The mode management page 110 displays a mode list 111 containing the mode types, each of which has a corresponding system setting button 111A and application setting button 111B. In this way, the mode management page 110 can guide the user in setting, for both the system and the applications, the mode options corresponding to each mode.
Fig. 11 is a schematic diagram of a mode-system setting page displayed by an electronic device according to an embodiment of the present application. For example, the controller may jump to the mode-system settings page in response to the user clicking the system settings button 111A corresponding to the parental mode, as shown in fig. 11. In the page, the user can set mode options such as a system theme, a system font, a system sound, an application list and the like corresponding to the parental mode. In this way, the electronic device can dynamically manage and update the mode configuration information. The controller may jump to a mode-application setting page in response to the user clicking the application setting button 111B corresponding to the parental mode, so that the user may set the mode options involved in each application, as well as set the visibility rights of the data or content categories in each application.
Through the above implementations, the electronic device can acquire the grouping configuration data, the mode configuration data, and the binding relationship data. The grouping configuration data includes the grouping type information. The mode configuration data includes: the mode types, the setting information of the mode options related to the system settings and the application settings in each mode, and the in-application data visibility in each mode. The binding relationship data includes the binding relationships between grouping types and mode types, which may specifically include: binding relationships between grouping types and mode-system settings, and binding relationships between grouping types and mode-application settings.
In some implementations, the binding relationship data can further include: binding relationships between grouping types and application data visibility rights. Alternatively, the mode configuration data may include the setting information of the application data visibility in each mode.
The grouping configuration data, the mode configuration data and the binding relationship data can be stored in the electronic device, so that the local storage of the data is realized. The electronic device can upload the grouping configuration data, the mode configuration data and the binding relation data to the server, so that cloud storage of the data is realized. Thus, the basic data of the voiceprint authority management is constructed, and the basic data can be used for subsequent voiceprint recognition and voice processing.
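For illustration, the base data uploaded for cloud storage could be organized as follows; every field name is an assumption rather than a format defined by the application.

```python
# A sketch of the base data uploaded for cloud storage; all field names
# are illustrative assumptions, not a format defined by the application.
base_data = {
    "grouping_config": {
        "types": ["parental", "child", "elderly", "visitor", "unregistered"],
    },
    "mode_config": {
        "child_mode": {
            "system": {"theme": "eye-protection", "font": "slightly large"},
            "apps": {"video_app": {"theme": "lovely"}},
        },
    },
    "binding": {
        "group_to_mode": {"child": "child_mode", "unregistered": "guest_mode"},
        "group_to_visibility": {"child": {"video_app": {"hidden": ["horror"]}}},
    },
}
# device: upload once to the server, reducing local memory consumption
# server: store and return mode configuration / cluster information on request
```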
Fig. 12 is a flowchart of a service processing method based on voiceprint recognition performed on the electronic device side according to an embodiment of the present application. As shown in fig. 12, the method includes:
step S121, receiving voice data input by a user.
Step S122, extracting first voiceprint information according to the voice data, and calculating the similarity between the first voiceprint information and candidate voiceprint information included in the voiceprint library.
The controller may input the voice data to the voiceprint model. The voiceprint model obtains the first voiceprint information corresponding to the user of the currently input voice according to the voice data, compares the first voiceprint information with the M (M ≥ 1) pieces of candidate voiceprint information contained in the voiceprint library, and outputs the similarity k_m between the first voiceprint information and each piece of candidate voiceprint information, where m denotes the serial number of the candidate voiceprint information and 1 ≤ m ≤ M.
Step S123, matching the target packet mapped by the first voiceprint information according to the similarity.
Based on {k_m} (1 ≤ m ≤ M), the controller can find the first target candidate voiceprint information, whose similarity to the first voiceprint information is greater than a preset threshold. If there is more than one piece of first target candidate voiceprint information, the second target candidate voiceprint information with the maximum similarity to the first voiceprint information is found among them. Further, the controller may query the grouping type mapped by the first target candidate voiceprint information or the second target candidate voiceprint information to obtain the target grouping.
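The matching in steps S122-S123 can be sketched as follows, assuming voiceprints are embedding vectors compared by cosine similarity (the application does not specify the similarity measure, and the threshold value is also hypothetical).

```python
import numpy as np

SIM_THRESHOLD = 0.75  # hypothetical preset similarity threshold

def match_target_group(first_vp: np.ndarray, library: dict) -> str:
    """Sketch of steps S122-S123: compute similarity k_m against the M
    candidate voiceprints, keep those above the threshold (first target
    candidates), then take the maximum (second target candidate)."""
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    k = {vp_id: cosine(first_vp, vec) for vp_id, (vec, _) in library.items()}
    first_targets = {i: s for i, s in k.items() if s > SIM_THRESHOLD}
    if not first_targets:
        return "unregistered_group"
    best = max(first_targets, key=first_targets.get)
    return library[best][1]  # grouping type mapped by the matched voiceprint

# library format: {"voiceprint_A": (embedding_vector, "parental_group"), ...}
```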
Step S124, judging whether the grouping information corresponding to the voice data needs to be updated.
The controller may compare whether the current grouping is consistent with the target grouping. If the current grouping is consistent with the target grouping, the grouping information does not need to be updated, and step S1210 is performed; if the current grouping is inconsistent with the target grouping, the grouping needs to be switched and the grouping information updated, and step S125 is performed.
Step S125, updating the grouping information and transmitting the voice data and the updated target grouping information to the server.
Accordingly, on the server side: after receiving the voice data and the updated target grouping information, the server performs semantic analysis on the voice data and recognizes the operation intention of the user's voice. Further, the server detects, according to the target grouping information, whether the electronic device has the voice response authority. If it has the voice response authority, the server sends a first voice instruction containing the operation intention of the user's voice to the electronic device. If it does not have the voice response authority, the server sends a no-permission prompt message to the electronic device.
For example, the server recognizes that the operation intention of the user voice is "play movie A", movie A is a horror-class asset, and queries that the user who input the voice belongs to a child group for which horror-class assets are not visible; the user therefore does not have the right to watch movie A, and the server may send a no-authority prompt message to the electronic device. For another example, the server recognizes that the operation intention of the user voice is "start game B" and queries that the user who input the voice belongs to the parent group, which has authority to operate game B; the server may then send a first voice instruction to the electronic device, where the first voice instruction is used to instruct the electronic device to start the program of game B and execute the related services of game B.
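A brief server-side sketch of this authority check; the concrete rule structure is an assumption (the embodiment only specifies that binding relationship data maps packet types to application data visibility authority), and PACKET_RULES and has_voice_response_authority are hypothetical names:

```python
# hypothetical visibility rules bound to each packet type
PACKET_RULES = {
    "child":  {"blocked_categories": {"horror"}, "blocked_apps": {"game B"}},
    "parent": {"blocked_categories": set(), "blocked_apps": set()},
}

def has_voice_response_authority(packet, intent_category, intent_target):
    """Decide whether to send a voice instruction or a no-authority prompt."""
    rules = PACKET_RULES.get(
        packet, {"blocked_categories": set(), "blocked_apps": set()})
    if intent_category in rules["blocked_categories"]:
        return False  # e.g. horror-class assets invisible to the child group
    return intent_target not in rules["blocked_apps"]
```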
Step S126, query the target pattern of target packet binding.
Step S127, it is determined whether the mode needs to be switched.
The controller may compare whether the current mode is consistent with the target mode. If the current mode is consistent with the target mode, the mode does not need to be switched, and step S1210 is performed. If the current mode is not consistent with the target mode, step S128 is performed.
Step S128, the configuration information of the target mode is obtained, and the current running mode is switched to the target mode.
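A short sketch of the mode-switching decision in steps S127–S128; fetch_config and apply_config are hypothetical stand-ins for querying the configuration information of the target mode and applying it to the system or target application:

```python
def ensure_mode(current_mode, target_mode, fetch_config, apply_config):
    """Steps S127-S128: switch the running mode only when needed."""
    if current_mode == target_mode:
        return current_mode            # no switch required; proceed to S1210
    config = fetch_config(target_mode) # obtain target-mode configuration
    apply_config(config)               # switch the running mode
    return target_mode
```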
In step S129, the first voice command sent by the server is received, and the service indicated by the first voice command is executed according to the target packet information in the target mode.
The first voice command is a command sent by the server after analyzing the voice data and detecting, according to the target grouping information and the operation intention of the user voice, that the electronic device has the authority to respond to the voice data (hereinafter referred to as voice response authority). The first voice command contains the operation intention of the user voice.
In step S1210, a second voice command sent by the server is received, and a service indicated by the second voice command is executed according to the target packet information in the current mode.
The second voice command is a command sent by the server after the voice data are analyzed and the electronic equipment is detected to have voice response authority according to the current grouping information and the operation intention of the user voice, and the second voice command contains the operation intention of the user voice.
The traffic involved in step S129 and step S1210 may be networking traffic or local traffic. The networking service refers to that the electronic device needs to request service data from the server, and execute corresponding services after receiving the service data sent by the server, including application services, for example: a media play service in a video type application, a page loading service of a browser application, and the like. Local services refer to services that an electronic device can independently complete without communicating with a server, such as: local media files (e.g., video, audio, images, etc.) are played.
The embodiment of the application performs voiceprint recognition based on the pre-built voiceprint library, the grouping configuration data, the mode configuration data and the binding relationship data, so as to recognize the target grouping of the user currently inputting the voice and query the target mode bound by the target grouping. In this way, the electronic device can determine whether to update the grouping information and whether to switch modes based on the current grouping, the current mode, the target grouping and the target mode. After analyzing the voice data and recognizing the operation intention of the user voice, the server can further detect whether the electronic device has the authority to respond to that operation intention. If the electronic device has voice response authority, the server sends a first voice instruction or a second voice instruction to the electronic device, so that the electronic device executes the service under the specific grouping and mode states. According to the embodiment of the application, modes are automatically matched and switched according to the user's grouping and usage scenario, without requiring the user to switch modes manually; application data visibility is set at multiple configuration layers such as the system and the application, which protects user privacy, improves usage security, achieves precise and efficient voiceprint authority management and mode control, and provides a better service experience for the user.
In some implementations of embodiments of the present application, the electronic device may upload the mode configuration data to the server. In this way, the server can receive pattern configuration data uploaded by a large number of electronic devices of different models and capabilities, and by clustering the pattern configuration data, analyze preferences of multiple end users for pattern and business intent.
The server may set a cluster model, input pattern configuration data into the cluster model, and train the cluster model. Thus, the clustering model can be used for performing clustering analysis on mode options related to applications and systems in different modes, such as clustering system mode options of system fonts, system topics, system sounds, application lists and the like, and clustering application mode options of application topics, application content classification and the like. The cluster model is also used for outputting a cluster analysis result (hereinafter referred to as cluster information). The server may store the cluster information and a mapping relationship of the cluster information and the pattern type. When the mode configuration data of any electronic equipment in communication connection with the server is updated, the server can use the clustering model to cluster the updated mode configuration data again after receiving the updated mode configuration data, so that the clustering information is updated, and the accuracy of the clustering model and the accuracy of the clustering information are improved. The clustering model is not limited, and may be an operation model such as k-means.
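A minimal sketch of this server-side clustering, assuming mode options are one-hot encoded categorical values and using scikit-learn's k-means (the embodiment names k-means as one possible model but does not prescribe the feature encoding, library, or cluster count):

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder

def cluster_mode_configs(mode_configs, n_clusters=8):
    """Cluster mode configuration data uploaded by many electronic devices.

    mode_configs: list of dicts, e.g.
        {"system_font": "large", "system_theme": "dark",
         "app_theme": "kids", "content_class": "cartoon"}
    Returns the encoder, the fitted model and one cluster label per config.
    """
    keys = sorted(mode_configs[0])
    rows = [[cfg[k] for k in keys] for cfg in mode_configs]
    enc = OneHotEncoder(handle_unknown="ignore")
    X = enc.fit_transform(rows)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = model.fit_predict(X)  # basis of the cluster information
    return enc, model, labels
```

When updated mode configuration data arrives, the same routine can simply be rerun to refresh the cluster information.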
In some exemplary application scenarios, the information input by the user through voice is, for example, "play Blooming flowers and full moon". The media assets that can be associated with "Blooming flowers and full moon" include a movie, an MV, a song audio source, short videos, behind-the-scenes clips, trailers and so on. As a result, when the server recognizes the operation intention of the user voice, it is unclear whether the user wants to play the movie, the MV, the audio source, etc. of "Blooming flowers and full moon", and the electronic device side cannot accurately execute the service. For example, the actual intention of the user is to have the electronic device play the movie "Blooming flowers and full moon", but after semantic analysis the intention indicated by the voice instruction sent to the electronic device by the server is to play the song "Blooming flowers and full moon", so that the intention identified by the server differs from the actual intention of the user. To accurately analyze the user's preferences for modes and intentions, the label mapping in the clustering information may be constructed in the form of a questionnaire.
Fig. 13 is a schematic diagram of a server implementing label mapping of clustering results based on a questionnaire according to an embodiment of the present application. As shown in fig. 13, for any mode option M in the system and the application, cluster analysis is performed on the mode option M, and the clustering result corresponding to the mode option M includes y classes. To map the y classes of clustering results to x alternative preference intents, a questionnaire may be started. The content of the questionnaire may include: descriptive information of the y classes of clustering results and the x mapped alternative preference intents. An alternative preference intent may be a top-x intent, i.e., one of the x intents with the highest likelihood among the various intents. The server may set x = y × w, where w denotes the cluster label mapping coefficient, and the value of w is, for example, 0.5.
With continued reference to fig. 13, after the server generates a questionnaire, the corresponding questionnaire data is pushed to the electronic device. The electronic device controls the display to display the questionnaire according to the questionnaire data, and after the user fills in and submits the questionnaire, the electronic device sends the questionnaire result to the server. The server counts the questionnaire results of the first round fed back by the plurality of electronic devices; if, in more than a preset proportion P of the questionnaire results, the clustering result z is mapped to the intention label g, the mapping relationship between the clustering result z and the intention label g is recorded in the clustering information. The value of P is not limited and may be set to 60%, for example. z denotes the serial number of the clustering result, with 1 ≤ z ≤ y; g denotes the serial number of the alternative preference intention, with 1 ≤ g ≤ x. If the clustering result z maps to more than one intention label, for example the intention labels g1 and g2, intention preference coefficients of the intention labels g1 and g2 may be calculated, and these label coefficients can be used to measure the degree of preference for each intention. For example, when the information input by the user through voice is "play Blooming flowers and full moon", and the query shows that intention label 1 mapped by the media asset "Blooming flowers and full moon" is the movie "Blooming flowers and full moon" with an intention preference coefficient of 0.9, while intention label 2 is the song "Blooming flowers and full moon" with an intention preference coefficient of 0.8, then when the server recognizes the operation intention of the user voice, the intention is preferentially recognized as playing the movie "Blooming flowers and full moon". This solves the problem that the electronic device cannot accurately execute the service due to unclear intention recognition on the server side.
If the proportion of questionnaire results in which the clustering result z is mapped to the intention label g is smaller than the preset proportion P, the server may start a second round of questionnaire investigation. Clustering results that already have mapped intention labels need not be filled into the questionnaire content of the second round. By analogy, the questionnaire can be conducted in multiple rounds in batches to find the intention label mapped by each clustering result, and the mapping relationship between clustering results and intention labels is recorded in the clustering information. In addition, the server may set an upper limit on the number of questionnaire rounds; if the number of executed rounds reaches the upper limit and no intention label has been matched for the clustering result z, the server stops the intention label mapping for the clustering result z.
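The per-round aggregation can be sketched as follows, assuming each response is simply the intention-label id a user chose for clustering result z (map_cluster_to_label and num_alternative_intents are illustrative names):

```python
from collections import Counter

P = 0.6  # preset proportion for accepting a mapping
W = 0.5  # cluster label mapping coefficient, so x = y * w

def num_alternative_intents(y):
    """Number x of alternative preference intents offered for y clusters."""
    return max(1, round(y * W))

def map_cluster_to_label(responses):
    """Aggregate one questionnaire round for a clustering result z.

    responses: intention-label ids chosen by users for cluster z.
    Returns the mapped label g, or None if no label reaches proportion P
    (in which case the next round of questionnaires would be started).
    """
    if not responses:
        return None
    label, votes = Counter(responses).most_common(1)[0]
    return label if votes / len(responses) >= P else None
```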
Fig. 13 takes the intention label as an example; at the time of questionnaire investigation, the server may set various types of labels, including but not limited to character labels, intention labels, interest labels, etc. If multiple labels of the same type exist and the same label is mapped multiple times, relevant parameters such as the label mapping coefficient can be set according to the number of labels in the type dimension.
After the server performs semantic analysis on the voice data, the operation intention of the user voice can be recognized according to the semantic analysis result and the clustering information. The server may set a semantic understanding model: the server inputs the voice data sent by the electronic device into the semantic understanding model, and the semantic understanding model analyzes the voice data so as to recognize and output the operation intention of the user voice.
Fig. 14 is a schematic diagram of a server according to an embodiment of the present application for recognizing a voice intention using cluster information. As shown in fig. 14, the roles of the cluster information may include: the intention recognition model performs the assistance before intention recognition and the correction after intention recognition based on the cluster information.
As shown in fig. 14, the server realizes the assistance before intention recognition according to the clustering information mainly by using the clustering information to train and optimize the intention recognition model, taking the mapping relationship between clustering results and intention labels contained in the clustering information, together with the intention preference coefficients of the intention labels, as parameters for training and optimization. A specific implementation is as follows: the user's voice text is input into an intention recognition model for classification; the intention recognition model may adopt a model such as TextCNN, with model parameters trained on collected training sentences. During training, the key evaluation index of the model is the accuracy of each intention. A cross-entropy loss function may be adopted, and the intention recognition model is optimized by minimizing the loss function. The loss function of the intention recognition model in the embodiment of the application may be set as:
$$\mathrm{loss} = -\sum_{t} \alpha_t \log f\left(y'_t \mid x_t\right) \qquad (1)$$

In formula (1): x is the training sentence (voice text) input by the user, y' is the real intention corresponding to the training sentence, f is the intention recognition model being trained, t is the serial number of the training sentence, and α_t represents the user's intention preference coefficient. The parameters and performance of the intention recognition model are optimized by minimizing the loss function. The purpose of including the user's intention preference coefficient in the loss is that, before intention recognition, training of the intention recognition model is guided in advance by the intention preference coefficients, so that the trained intention recognition model is closer to the user's intention preference; this improves the precision of the intention recognition model and realizes personalized intention recognition without collecting user privacy data.
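A minimal PyTorch sketch of formula (1), assuming the intention recognition model f is a classifier producing logits over intents (the embodiment does not specify a framework; preference_weighted_loss is an illustrative name):

```python
import torch
import torch.nn.functional as F

def preference_weighted_loss(logits, targets, alpha):
    """Cross-entropy loss weighted by intention preference coefficients.

    logits:  (T, num_intents) scores output by the intention model f
    targets: (T,) indices of the real intentions y'_t
    alpha:   (T,) intention preference coefficients alpha_t
    """
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    return (alpha * per_sample).sum()  # minimized during training
```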
As shown in fig. 14, the server may implement the correction after the intention recognition based on the cluster information. The server may determine whether the intention recognition model recognizes the operation intention of the user's voice, and according to the determination result, may include the following two cases:
case one: the intent recognition model does not recognize the operational intent of the user's voice. In this case, if the intention preference coefficient of a certain intention label input into the intention recognition model is greater than or equal to the confidence level of the intention recognition model, the intention label is the target label, and the output result of the intention recognition model is the intention corresponding to the target label.
And a second case: the intention recognition model can recognize the operation intention of the user voice. In the second case, if the confidence level of the intention recognition model is greater than or equal to the confidence level threshold, the recognition result output by the intention recognition model is considered to be reliable, and the server may instruct the electronic device to execute the service matching the intention recognized by the intention recognition model.
In the second case, if the confidence level of the intention recognition model is smaller than the confidence level threshold, the server may correct the intention recognition result according to the intention preference coefficient of the user in combination with the probability of each intention output by the intention recognition model. The correction rule is configured as follows:
First, the corrected probability is calculated according to formula (2):

$$p_i' = \frac{\alpha_i \, p_i}{\sum_j \alpha_j \, p_j} \qquad (2)$$

In formula (2): i denotes the serial number of the current intention, j denotes the serial number of an intention output by the intention recognition model, α_i denotes the intention preference coefficient of the current intention, p_i denotes the probability output by the model for the current intention, α_j denotes the intention preference coefficient corresponding to an intention output by the intention recognition model, p_j denotes the probability of that intention, and p_i' denotes the probability value obtained by correcting p_i.
The server traverses different values of i to calculate p_i', then finds the maximum probability p_max' among the p_i' values and instructs the electronic device to execute the service matching the target intent corresponding to p_max'. By correcting the intention identified by the intention recognition model, the service executed by the electronic device better matches the user's intention preference, which improves the accuracy of service execution and the user experience.
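A compact sketch of the correction in formula (2); the example values echo the "Blooming flowers and full moon" scenario and are illustrative only:

```python
def correct_intents(probs, alpha):
    """Correct the model's intent probabilities per formula (2).

    probs: dict intent -> probability p_i output by the intention model
    alpha: dict intent -> intention preference coefficient alpha_i
    Returns the corrected distribution and the intent with p_max'.
    """
    denom = sum(alpha[i] * p for i, p in probs.items())
    corrected = {i: alpha[i] * p / denom for i, p in probs.items()}
    return corrected, max(corrected, key=corrected.get)

# The movie reading wins after correction even though the song had a
# slightly higher raw probability:
# correct_intents({"movie": 0.45, "song": 0.50},
#                 {"movie": 0.90, "song": 0.80})  -> target "movie"
```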
Fig. 15 is a flowchart illustrating interaction between the electronic device 200 and the server 100 according to an embodiment of the present application. As shown in fig. 15, the electronic device may include a first voice module, a voiceprint model, and a first rights management module. The server may include a second speech module, an intent recognition model, a cluster model, and a second rights management module.
The first voice module is used for realizing the functions of voice print data input (voice print registration), voice data acquisition (service logic is triggered in a voice mode), voice instruction response, voice playing and the like. The voiceprint model is used for extracting voiceprint characteristics, training and processing the voiceprint characteristics, voiceprint similarity comparison and the like, inputting voiceprint data (voiceprint registration) and identifying target groups to which the user belongs through voiceprints so as to determine target modes of target group binding.
The first rights management module is used for: managing the on-off state of the voiceprint authority management function, managing the grouping state, the mode state, the binding state and the like, and obtaining grouping configuration data, mode configuration data and binding relation data; the system is also used for uploading and requesting to query the grouping configuration data, the mode configuration data and the binding relation data to the second right management module by communicating with the second right management module; the method is also used for realizing mode switching, grouping creation, grouping deletion, grouping of user voiceprint switching and other operations, and providing menu page data and the like corresponding to a plurality of management options such as grouping management, binding management, mode management, registration management and the like related in the voiceprint authority management page.
The clustering model is used for carrying out clustering analysis on the mode configuration data to obtain clustering information. The intention recognition model is used for carrying out intention recognition on the voice data and realizing the assistance before intention recognition and the correction after intention recognition according to the clustering information. The second voice module is used for communicating and interacting with the first voice module, so that the server can verify whether the electronic equipment has voice response permission, and send voice instructions or no-permission prompt information to the first voice module according to the permission verification result.
The second authority management module is used for storing grouping configuration data, mode configuration data and binding relation data uploaded by the electronic equipment and storing clustering information output by the clustering model; the method is also used for checking the permission, providing mode option management contained in a system and an application of the electronic equipment, providing application list management based on voiceprint permission, providing preset data of a system level mode and an application level mode and the like.
As shown in fig. 15, the first rights management module acquires the packet configuration data, the mode configuration data, and the binding relationship data, and uploads the packet configuration data, the mode configuration data, and the binding relationship data to the second rights management module for storage. The second rights management module inputs the pattern configuration data to the cluster model. And carrying out cluster analysis on the mode configuration data by the cluster model to obtain cluster information, and sending the cluster information to the second authority management module for storage.
The first rights management module, in response to the operation of user B registering a voiceprint, notifies the voiceprint model to start recording voiceprint information, and the voiceprint model notifies the first voice module to start recording. The first voice module collects voice data A and sends the voice data A to the voiceprint model. The voiceprint model extracts voiceprint features from the voice data A to obtain voiceprint information A. The voiceprint model may send the voiceprint information to the first rights management module, and the first rights management module stores the mapping relationship between the voiceprint information A and the packet to which user B belongs.
The first voice module collects voice data B and sends the voice data B to the voiceprint model. The voiceprint model extracts voiceprints of the voice data B to obtain voiceprint information B, compares the similarity of the voiceprint information B with candidate voiceprint information in a voiceprint library to obtain a comparison result of similarity, and sends the comparison result to the first authority management module. And the first authority management module acquires the target packet mapped by the voiceprint information B according to the comparison result of the similarity.
If the first authority management module detects that the current packet is inconsistent with the target packet, updating packet information (the updated packet is the target packet), and inquiring the target mode of binding the target packet to the second authority management module.
If the first right management module detects that the current mode is inconsistent with the target mode, a first data request can be sent to the second right management module, wherein the first data request is used for indicating the second right management module to send configuration information and cluster information of the target mode to the first right management module.
If the first rights management module detects that the current mode is consistent with the target mode, the mode is not switched, and a second data request can be sent to the second rights management module, wherein the second data request is used for indicating the second rights management module to send the clustering information to the first rights management module.
The first right management module sends the cluster information and the updated target grouping information to the first voice module. The first voice module sends the cluster information, the voice data B and the target grouping information to the second voice module; and the first authority management module can control the mode switching module to switch the running mode of the system or the target application into the target mode according to the configuration information of the target mode.
After the second voice module receives the clustering information, the voice data B and the updated grouping information, it sends the voice data B and the clustering information to the intention recognition model. The intention recognition model recognizes the intention of the voice data B according to the clustering information, and sends the intention recognition result of the voice data B to the second voice module. The second voice module then requests the second rights management module to check whether the electronic device has voice response authority according to the target grouping information and the intention recognition result of the voice data B.
The second authority management module returns an authority judgment result and indicates that the electronic equipment has voice response authority to the second voice module. The second voice module generates and sends a first voice command to the first voice module according to the intention of the voice data B.
The first voice module receives the first voice command and may forward it to the first rights management module. The first rights management module responds to the first voice command and executes the service indicated by the first voice command under the target packet and the target mode according to the target grouping information and the clustering information. The electronic device can thus process services using the grouping information combined with the clustering information: on one hand, service verification is performed through the grouping information, ensuring that the service authority meets the grouping requirements; on the other hand, service data is requested with the clustering information, so that the service data better matches the user's intention preference.
After the first voice module subsequently collects voice data C, the voice data C can likewise undergo voiceprint recognition, intention recognition, authority verification and grouping/mode state management, and the service matching the intention can be executed in response to the instruction corresponding to voice data C.
In order to protect user privacy and improve the efficiency of voiceprint recognition, grouping recognition and mode switching, the voiceprint model is configured in the electronic device. The voiceprint model can also be configured in the server, in which case the electronic device sends the voice data collected by the sound collector to the server, and the server controls the voiceprint model to carry out processing logic such as voiceprint recognition.
In some implementations, the electronic device may continuously receive a plurality of pieces of voice data. For example, the electronic device sequentially receives voice data D and voice data E; if the interval between voice data D and voice data E is less than a time threshold (e.g., 0.5 seconds), this indicates that voice data D and voice data E were input by the same person, since the person inputting voice data does not change within such a short interval. Therefore, the voiceprint model may skip voiceprint recognition and packet recognition for voice data E, i.e., the packet information is not updated, thereby speeding up voice response and service processing.
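A small sketch of this same-speaker shortcut; the 0.5-second window matches the example above, and VoiceSession and recognize are hypothetical names:

```python
import time

SAME_SPEAKER_WINDOW = 0.5  # time threshold in seconds

class VoiceSession:
    """Reuse the last packet for closely spaced utterances, which are
    treated as coming from the same speaker."""

    def __init__(self):
        self.last_time = None
        self.cached_packet = None

    def packet_for(self, voice_data, recognize):
        now = time.monotonic()
        if (self.last_time is not None
                and now - self.last_time < SAME_SPEAKER_WINDOW
                and self.cached_packet is not None):
            packet = self.cached_packet    # skip voiceprint/packet recognition
        else:
            packet = recognize(voice_data) # full voiceprint + packet lookup
        self.last_time, self.cached_packet = now, packet
        return packet
```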
After the voiceprint model obtains the first voiceprint information of the user currently inputting the voice data, similarity comparison is needed between the first voiceprint information and candidate voiceprint information in a voiceprint library. In some implementations, the voiceprint model can adjust a comparison order between the first voiceprint information and the candidate voiceprint information based on big data statistics, usage scenarios, user usage habits, user activities, and the like, so as to improve efficiency of voiceprint recognition and packet recognition.
In some implementations, when the first voiceprint information is compared with candidate voiceprint information in the voiceprint library, if the similarity between the first voiceprint information and the candidate voiceprint information S is detected to be greater than a similarity threshold (e.g., 0.9), the voiceprint model stops the subsequent comparison flow and takes the packet mapped by the candidate voiceprint information S as the target packet. Therefore, through the limitation of the similarity threshold value, the voiceprint model can timely terminate the subsequent comparison flow when the candidate voiceprint information S is found without performing traversal comparison on the first voiceprint information and all the candidate voiceprint information, so that the time consumption of voiceprint recognition and packet recognition is reduced, and the efficiency of voice response and service processing is improved.
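Combining the ordering idea above with the early-stop rule gives a sketch like the following (sim is any similarity function; the names and the 0.9 threshold follow the example):

```python
EARLY_STOP_SIM = 0.9  # similarity threshold that ends the traversal

def match_with_early_stop(first_vp, ordered_candidates, packet_map, sim):
    """Compare candidates in priority order (usage habits, activity, etc.)
    and stop as soon as one exceeds the early-stop threshold."""
    for cid, emb in ordered_candidates:
        if sim(first_vp, emb) > EARLY_STOP_SIM:
            return packet_map.get(cid)  # candidate S found; stop comparing
    return None  # fall back to full traversal / threshold-and-max selection
```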
When the user performs operations such as voiceprint registration, grouping management, mode management, binding management, and the like, page configuration and effects of a user interface presented by the electronic device are not limited to the embodiments of the present application. The implementation of the content (including the model and algorithm used, etc.) for voiceprint recognition, voice processing, intent recognition (including assistance before intent recognition and correction after intent recognition), etc. is not limited to the embodiments of the present application. The software and hardware configurations of the electronic device and the server are also not limited to the embodiments of the present application.
The embodiment of the application also provides a computer storage medium, which can store a program. When the computer storage medium is located in an electronic device or a server, the program, when executed, may include the program steps of the voiceprint-recognition-based service processing method performed by the corresponding device in the above aspects. The computer storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features can be replaced by equivalents, and such modifications and substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. An electronic device, comprising:
a sound collector;
the communicator is used for being in communication connection with the server;
A controller configured to perform:
controlling the sound collector to collect voice data, and performing voiceprint recognition on the voice data by using a voiceprint model to obtain first voiceprint information;
acquiring target grouping information mapped by the first voiceprint information and a target mode bound by the target grouping information;
transmitting the target packet information and the voice data to a server;
receiving a first voice instruction sent by a server, wherein the first voice instruction is sent by the server when detecting that the electronic equipment has the authority to respond to the voice data according to the target grouping information and the operation intention of the voice data;
and responding to the first voice instruction, and executing the service indicated by the first voice instruction according to the target grouping information in the target mode.
2. The electronic device of claim 1, wherein before controlling the sound collector to collect the voice data, the controller is further configured to execute:
acquiring grouping configuration data, mode configuration data and binding relation data; wherein the packet configuration data includes a packet type; the mode configuration data comprises at least one mode type corresponding system and mode option setting information contained in the application; the binding relationship data comprises a binding relationship between a grouping type and a mode type and a binding relationship between the grouping type and the application data visibility authority;
And uploading the grouping configuration data, the mode configuration data and the binding relation data to a server.
3. The electronic device of claim 2, wherein after executing the target mode of acquiring the target packet information binding, the controller is further configured to execute:
comparing the current mode with the target mode;
if the current mode is inconsistent with the target mode, a first data request is sent to a server, wherein the first data request is used for indicating the server to send configuration information and cluster information of the target mode to electronic equipment; the cluster information is obtained by carrying out cluster analysis on mode configuration data uploaded by a plurality of electronic devices by a server;
receiving configuration information and the clustering information of the target mode sent by a server;
switching a mode operated in the electronic equipment into the target mode according to the configuration information of the target mode;
and responding to the first voice instruction, and executing the service indicated by the first voice instruction according to the target grouping information and the clustering information.
4. The electronic device of claim 3, wherein the controller is further configured to perform:
If the current mode is consistent with the target mode, sending a second data request to a server, wherein the second data request is used for indicating the server to send the clustering information to the electronic equipment;
receiving the clustering information sent by a server;
receiving a second voice command sent by a server, wherein the second voice command is sent by the server when detecting that the electronic equipment has the authority to respond to the voice data according to the current grouping information and the operation intention of the voice data;
and responding to the second voice instruction, and executing the service indicated by the second voice instruction according to the current grouping information and the clustering information in the current mode.
5. A server, comprising:
a communicator for communication connection with an electronic device;
a controller configured to perform:
receiving target grouping information and voice data sent by electronic equipment, wherein the target grouping information is grouping information mapped with first voiceprint information, which is obtained after voiceprint recognition is carried out on the voice data by the electronic equipment;
carrying out semantic analysis on the voice data and identifying the operation intention of the voice of the user;
and if the electronic equipment has the authority to respond to the voice data, generating a first voice instruction according to the target grouping information and the operation intention of the voice of the user, and sending the first voice instruction to the electronic equipment, wherein the first voice instruction is used for indicating the electronic equipment to execute the service indicated by the first voice instruction according to the target grouping information in a target mode.
6. The server of claim 5, wherein prior to receiving the target packet information and voice data sent by the electronic device, the controller is further configured to perform:
receiving grouping configuration data, mode configuration data and binding relation data sent by electronic equipment; wherein the packet configuration data includes a packet type; the mode configuration data comprises at least one mode type corresponding system and mode option setting information contained in the application; the binding relationship data comprises a binding relationship between a grouping type and a mode type and a binding relationship between the grouping type and the application data visibility authority;
and storing the grouping configuration data, the mode configuration data and the binding relation data.
7. The server of claim 6, wherein the controller is further configured to perform:
performing cluster analysis on the mode configuration data uploaded by the plurality of electronic devices by using a cluster model to obtain cluster information; the clustering information comprises intention labels mapped by each clustering result and intention preference coefficients corresponding to each intention label;
calculating a loss function of the intention recognition model according to the intention recognition model and the intention preference coefficient;
And adjusting operation parameters of the intention recognition model according to the minimized loss function.
8. The server according to claim 7, wherein the controller recognizes an operation intention of a user voice, and is specifically configured to:
if the intention recognition model fails to recognize the operation intention of the user voice, acquiring a target label from the intention label, wherein the intention preference coefficient corresponding to the target label is greater than or equal to the confidence coefficient of the intention recognition model;
determining the intention corresponding to the target label as the operation intention of the user voice;
or,
if the intention recognition model can recognize the operation intention of the user voice, detecting the confidence degree of the intention recognition model;
if the confidence coefficient of the intention recognition model is greater than or equal to a confidence coefficient threshold value, determining the intention output by the intention recognition model as the operation intention of the user voice;
and if the confidence coefficient of the intention recognition model is smaller than a confidence coefficient threshold, correcting the intention output by the intention recognition model according to the intention preference coefficient and the probability of the intention output by the intention recognition model.
9. A business processing method based on voiceprint recognition is characterized by comprising the following steps:
The electronic equipment collects voice data, and voiceprint recognition is carried out on the voice data by using a voiceprint model to obtain first voiceprint information;
the electronic equipment acquires target grouping information mapped by the first voiceprint information and acquires a target mode bound by the target grouping information;
the electronic equipment sends the target grouping information and the voice data to a server;
the electronic equipment receives a first voice instruction sent by a server, wherein the first voice instruction is sent by the server when detecting that the electronic equipment has the authority to respond to the voice data according to the target grouping information and the operation intention of the voice data;
and the electronic equipment responds to the first voice instruction and executes the service indicated by the first voice instruction according to the target grouping information in the target mode.
10. A business processing method based on voiceprint recognition is characterized by comprising the following steps:
the method comprises the steps that a server receives target grouping information and voice data sent by electronic equipment, wherein the target grouping information is grouping information which is obtained by the electronic equipment after voiceprint recognition of the voice data and is mapped with first voiceprint information;
the server performs semantic analysis on the voice data and recognizes the operation intention of the voice of the user;
If the electronic equipment has the authority to respond to the voice data, the server generates a first voice instruction according to the target grouping information and the operation intention of the user voice, and sends the first voice instruction to the electronic equipment, wherein the first voice instruction is used for indicating the electronic equipment to execute the service indicated by the first voice instruction according to the target grouping information in a target mode.