CN117524231A - Voice person identification method, voice interaction method and device - Google Patents


Info

Publication number
CN117524231A
Authority
CN
China
Prior art keywords
voice
person
identifier
identification
historical
Prior art date
Legal status
Pending
Application number
CN202210909507.3A
Other languages
Chinese (zh)
Inventor
卞腾
左伟国
Current Assignee
Qingdao Haier Technology Co Ltd
Haier Smart Home Co Ltd
Original Assignee
Qingdao Haier Technology Co Ltd
Haier Smart Home Co Ltd
Priority date
Filing date
Publication date
Application filed by Qingdao Haier Technology Co Ltd, Haier Smart Home Co Ltd filed Critical Qingdao Haier Technology Co Ltd
Priority to CN202210909507.3A
Publication of CN117524231A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 17/04: Training, enrolment or model building
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 12/00: Data switching networks
    • H04L 12/28: Data switching networks characterised by path configuration, e.g. LAN [Local Area Networks] or WAN [Wide Area Networks]
    • H04L 12/2803: Home automation networks
    • H04L 12/2816: Controlling appliance services of a home automation network by calling their functionalities
    • H04L 12/282: Controlling appliance services of a home automation network by calling their functionalities based on user interaction within the home

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Automation & Control Theory (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application discloses a voice person identification method, a voice interaction method and a device, relating to the technical field of smart homes. The voice person identification method comprises: determining a family identifier and a voice to be recognized; if the state of the family identifier is the person identification state, inputting the voice to be recognized into the person identification model corresponding to the family identifier, and obtaining a first speaker identifier output by the person identification model; otherwise, outputting a default identifier, storing the voice to be recognized into the historical voice set corresponding to the family identifier, and updating the state of the family identifier to the person identification state after the person identification model is trained based on a second speaker identifier and the historical voice set. In this way, a speaker identifier is generated for each family user, without the user's perception, during normal voice interaction with the smart home, so that personalized recommendation can be made according to the speaker identifier, tedious enrollment operations are avoided, and the user experience is improved.

Description

Voice person identification method, voice interaction method and device
Technical Field
The application relates to the technical field of smart homes, and in particular to a voice person identification method, a voice interaction method and a voice interaction device.
Background
With the rapid development of smart homes, a user binds smart home devices (such as air conditioners, refrigerators, smart speakers, gas stoves and washing machines) to a family unit in a terminal APP, and controls them by voice through a speaker or the APP as the main control entrance, achieving whole-house intelligent control. At present, smart home device control is relatively complete, but personalized recommendation and experience of information content (news, encyclopedia knowledge, entertainment, catering and the like) remain weak.
Existing smart home personalized recommendation is mainly realized by setting up family members, recording each member's voiceprint features in advance, collecting data such as each member's operation instructions and content request results, and generating user portraits. However, realizing personalized recommendation through pre-recorded voiceprints is cumbersome to operate, and the user experience is poor.
Disclosure of Invention
The application provides a voice person identification method, a voice interaction method and a voice interaction device, which are used to solve the defect in the prior art that personalized recommendation based on pre-recorded voiceprints requires cumbersome operations and results in poor user experience.
The application provides a voice person identification method, which comprises the following steps:
determining a family identifier and a voice to be recognized;
if the state of the family identifier is the person identification state, inputting the voice to be recognized into the person identification model corresponding to the family identifier, and obtaining a first speaker identifier output by the person identification model; otherwise, outputting a default identifier, storing the voice to be recognized into the historical voice set corresponding to the family identifier, and updating the state of the family identifier to the person identification state after the person identification model is trained based on a second speaker identifier and the historical voice set; the second speaker identifier is determined based on a voiceprint clustering result of the historical voice set.
According to the voice person identification method provided by the application, training the person identification model based on the second speaker identifier and the historical voice set comprises:
performing voiceprint clustering on the historical voices in the historical voice set to obtain clusters;
if the number of historical voices corresponding to each cluster is greater than a set number, determining, based on each cluster, the second speaker identifier corresponding to that cluster;
and forming sample pairs from the second speaker identifier corresponding to each cluster and the historical voices corresponding to that cluster, and training an initial person identification model based on the sample pairs to obtain the person identification model.
According to the voice person identification method provided by the application, performing voiceprint clustering on the historical voices in the historical voice set to obtain clusters comprises:
acquiring the number of family members through the family identifier;
and performing voiceprint clustering on the historical voices in the historical voice set based on the number of family members to obtain the clusters.
According to the voice person identification method provided by the application, after obtaining the first speaker identifier output by the person identification model, the method further comprises:
forming a sample pair from the first speaker identifier and the voice to be recognized, training the person identification model, and updating the parameters of the person identification model.
The application also provides a voice interaction method, which comprises the following steps:
receiving a family identifier and a voice to be recognized sent by a terminal;
in a person identification scenario, determining the current speaker identifier by applying any of the above voice person identification methods based on the family identifier and the voice to be recognized, and determining the current user portrait based on the current speaker identifier;
executing the interaction command corresponding to the voice to be recognized based on the current user portrait;
and in a non-person-identification scenario, executing the interaction command corresponding to the voice to be recognized based on a default user portrait.
According to the voice interaction method provided by the application, determining the current user portrait based on the current speaker identifier comprises:
if the current speaker identifier is the default identifier, taking the default user portrait as the current user portrait; otherwise, determining the current user portrait by applying a mapping between speaker identifiers and user portraits based on the current speaker identifier.
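The branch just described amounts to a dictionary lookup with a default fallback. A minimal sketch follows; the names `DEFAULT_ID`, `DEFAULT_PORTRAIT` and `portrait_map`, and the portrait structure, are illustrative assumptions, not terms from the application:

```python
# Hypothetical sketch: resolve a speaker identifier to a user portrait.
DEFAULT_ID = "default"
DEFAULT_PORTRAIT = {"speaker": DEFAULT_ID, "preferences": []}

def current_portrait(speaker_id, portrait_map):
    """Use the default portrait for the default identifier; otherwise
    look the speaker up in the speaker-to-portrait mapping."""
    if speaker_id == DEFAULT_ID:
        return DEFAULT_PORTRAIT
    return portrait_map.get(speaker_id, DEFAULT_PORTRAIT)
```

Falling back to the default portrait for an unknown identifier keeps the interaction usable even when identification fails.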
The application also provides a voice person identification device, comprising:
a determining module, configured to determine the family identifier and the voice to be recognized;
a person identification module, configured to: if the state of the family identifier is the person identification state, input the voice to be recognized into the person identification model corresponding to the family identifier so that the person identification model outputs a first speaker identifier; otherwise, output a default identifier, store the voice to be recognized into the historical voice set corresponding to the family identifier, and update the state of the family identifier to the person identification state after the person identification model is trained based on a second speaker identifier and the historical voice set; the second speaker identifier being determined based on a voiceprint clustering result of the historical voice set.
The application also provides a voice interaction device, comprising:
a receiving module, configured to receive the family identifier and the voice to be recognized sent by the terminal;
a person identification module, configured to, in a person identification scenario, determine the current speaker identifier by applying any of the above voice person identification methods based on the family identifier and the voice to be recognized, and determine the current user portrait based on the current speaker identifier;
a first execution module, configured to execute the interaction command corresponding to the voice to be recognized based on the current user portrait;
and a second execution module, configured to, in a non-person-identification scenario, execute the interaction command corresponding to the voice to be recognized based on the default user portrait.
The application also provides an electronic device comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor is configured to execute the computer program to implement any of the above voice person identification methods or any of the above voice interaction methods.
The application also provides a computer readable storage medium comprising a stored program, wherein the program, when executed, implements any of the above voice person identification methods or any of the above voice interaction methods.
The present application also provides a computer program product comprising a computer program which, when executed by a processor, implements any of the above voice person identification methods or any of the above voice interaction methods.
According to the voice person identification method, voice interaction method and device provided by the application, the historical voice set corresponding to the family identifier accumulates past voices to be recognized during voice interaction; the person identification model is trained on those historical voices and the second speaker identifiers determined from them, and then provides the first speaker identifier of the voice to be recognized for subsequent voice interaction. In this way, a speaker identifier is generated for each family user, without the user's perception, during normal voice interaction with the smart home, so that personalized recommendation can be made according to the speaker identifier; generating speaker identifiers by recording voiceprint features in advance is avoided, tedious operations are eliminated, and the user experience is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below; other drawings can be obtained from these drawings by those skilled in the art without inventive effort.
FIG. 1 is a schematic diagram of a hardware environment of an interaction method of a smart device according to an embodiment of the present application;
FIG. 2 is a first schematic flow chart of the voice person identification method provided by the present application;
FIG. 3 is a schematic flow chart of the person identification model training method provided by the present application;
FIG. 4 is a schematic flow chart of the voiceprint clustering method provided by the present application;
FIG. 5 is a schematic flow chart of the voice interaction method provided by the present application;
FIG. 6 is a second schematic flow chart of the voice person identification method provided by the present application;
FIG. 7 is a schematic structural diagram of the voice person identification device provided by the present application;
FIG. 8 is a schematic structural diagram of a voice interaction device provided by the present application;
fig. 9 is a schematic structural diagram of an electronic device provided in the present application.
Detailed Description
In order to make the solution of the present application better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by one of ordinary skill in the art based on these embodiments without inventive effort shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description of the present application and the above-described figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to an aspect of the embodiments of the present application, a voice person identification method or a voice interaction method is provided. The voice person identification method or voice interaction method is widely applied to whole-house intelligent digital control application scenarios such as Smart Home, smart home device ecology, and smart family (Intelligence House) ecology. Alternatively, in the present embodiment, the above voice person identification method or voice interaction method may be applied to a hardware environment constituted by the terminal device 102 and the server 104 as shown in fig. 1. As shown in fig. 1, the server 104 is connected to the terminal device 102 through a network and may be used to provide services (such as application services) for the terminal or a client installed on it; a database may be set on the server or independent of it to provide data storage services for the server 104, and cloud computing and/or edge computing services may be configured on the server or independent of it to provide data computing services for the server 104.
The network may include, but is not limited to, at least one of: wired network, wireless network. The wired network may include, but is not limited to, at least one of: a wide area network, a metropolitan area network, a local area network, and the wireless network may include, but is not limited to, at least one of: WIFI (Wireless Fidelity ), bluetooth. The terminal device 102 may not be limited to a PC, a mobile phone, a tablet computer, an intelligent air conditioner, an intelligent smoke machine, an intelligent refrigerator, an intelligent oven, an intelligent cooking range, an intelligent washing machine, an intelligent water heater, an intelligent washing device, an intelligent dish washer, an intelligent projection device, an intelligent television, an intelligent clothes hanger, an intelligent curtain, an intelligent video, an intelligent socket, an intelligent sound box, an intelligent fresh air device, an intelligent kitchen and toilet device, an intelligent bathroom device, an intelligent sweeping robot, an intelligent window cleaning robot, an intelligent mopping robot, an intelligent air purifying device, an intelligent steam box, an intelligent microwave oven, an intelligent kitchen appliance, an intelligent purifier, an intelligent water dispenser, an intelligent door lock, and the like.
Existing personalized recommendation is mainly realized by setting up family members, recording each member's voiceprint features in advance, collecting data such as each member's operation instructions and content request results, and generating user portraits. However, realizing personalized recommendation through pre-recorded voiceprints requires the user to record voiceprint features through multi-step operations, which is cumbersome, unfriendly to the user, and results in poor user experience.
Therefore, how to collect users' voiceprint features imperceptibly so as to realize personalized recommendation is a technical problem to be solved by those skilled in the art.
Aiming at the above technical problem, the embodiment of the application provides a voice person identification method. Fig. 2 is a schematic flow chart of the voice person identification method provided by the present application. As shown in fig. 2, the method includes:
step 210, determining a family identifier and a voice to be recognized;
it should be noted that the family identifier may be the user identifier corresponding to the user logged into the terminal application, where one user identifier corresponds to one family; or families may be created in the application by the logged-in user, where one user identifier may correspond to multiple family identifiers. The voice to be recognized may be acquired when the terminal application is woken by voice, acquired by monitoring voice in real time when the voice contains a wake word or an interaction command, or obtained by reading a voice file; this is not limited in the embodiments of the present application. The voice to be recognized contains the voice of only one user.
Step 220, if the state of the family identifier is the person identification state, inputting the voice to be recognized into the person identification model corresponding to the family identifier, and obtaining a first speaker identifier output by the person identification model; otherwise, outputting a default identifier, storing the voice to be recognized into the historical voice set corresponding to the family identifier, and updating the state of the family identifier to the person identification state after the person identification model is trained based on the second speaker identifier and the historical voice set; the second speaker identifier is determined based on a voiceprint clustering result of the historical voice set.
Considering that when a user performs voice interaction with smart home devices through a terminal application, the terminal can collect the audio of the user's voice, and since each person's voiceprint features differ, the voices to be recognized of different users can be collected over many interactions. These accumulated voices can then serve as training samples for the person identification model, so that a mapping between users' voiceprints and speaker identities is generated without the users' perception.
Specifically, when the state of the family identifier is the member-voice accumulation state, a default identifier is output and the voice to be recognized is stored into the historical voice set corresponding to the family identifier; a second speaker identifier is generated from the voiceprint clustering result of the historical voice set, the person identification model corresponding to the family identifier is trained from the second speaker identifier and the historical voices in the set, and the state of the family identifier is then updated to the person identification state. When the state of the family identifier is the person identification state, the voice to be recognized is input into the person identification model corresponding to the family identifier, and the first speaker identifier output by the model is obtained. There may be multiple second speaker identifiers, and the first speaker identifier is one of them.
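As a rough illustration of this accumulate-then-identify flow, the following sketch routes each incoming voice according to the family identifier's state. All names are assumptions for illustration; the application does not prescribe an implementation:

```python
ACCUMULATING = "accumulating"   # still collecting member voices
IDENTIFYING = "identifying"     # person identification model is trained

def handle_voice(family_id, voice, status, models, history):
    """If the family's model is ready, return its speaker identifier;
    otherwise store the voice in the family's historical voice set and
    return the default identifier."""
    if status.get(family_id, ACCUMULATING) == IDENTIFYING:
        return models[family_id](voice)   # model maps voice -> speaker id
    history.setdefault(family_id, []).append(voice)
    return "default"
```

Once enough voices accumulate and the model is trained elsewhere, flipping the family's status to `IDENTIFYING` is the only change needed.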
It should be noted that the state of the family identifier may be stored in memory, in a configuration file, or in a database, which is not limited in the embodiments of the present application. The historical voice set corresponding to the family identifier stores the past voices to be recognized; it may be stored as voice files under a path corresponding to the family identifier, or cached in memory, which is likewise not limited. Preferably, the state of the family identifier is stored in a database, the historical voice set is stored as voice files under a path corresponding to the family identifier, and the path information of those voice files is stored in the database, keyed by family identifier.
In addition, the person identification model corresponding to the family identifier may be trained from the second speaker identifiers and the historical voices in either of two ways: voiceprint clustering may be performed once the number of historical voices in the set reaches a specified number, the second speaker identifiers generated from each cluster in the clustering result, and the initial person identification model then trained on the second speaker identifiers and the historical voices; or the historical voices may be clustered first, and once the number of historical voices in each cluster reaches the specified number, the second speaker identifiers are generated from the clusters and the initial person identification model is trained on them and the historical voices.
On the other hand, if in practical application the person identification model receives the voice of a user not seen during training, the model can output the default identifier. For example, if the person identification model is trained on the accumulated historical voices of three family members A, B and C under a family identifier, the model outputs the default identifier when recognizing the voice of a person D.
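One common way to realize this open-set behaviour (an assumption here; the application does not specify a model form) is to score the voiceprint embedding against each known speaker's centroid and fall back to the default identifier when no similarity clears a threshold:

```python
import math

def identify(embedding, centroids, threshold=0.7):
    """Return the speaker id of the most cosine-similar centroid, or
    "default" when no centroid exceeds the threshold (unseen speaker)."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(y * y for y in b)))
    best_id, best_sim = "default", threshold
    for spk_id, c in centroids.items():
        sim = cos(embedding, c)
        if sim > best_sim:
            best_id, best_sim = spk_id, sim
    return best_id
```

The threshold value 0.7 is an arbitrary illustration; in practice it would be tuned on held-out voices.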
According to the voice person identification method provided by the embodiment of the application, the historical voice set corresponding to the family identifier accumulates past voices to be recognized during voice interaction; the person identification model is trained on those historical voices and the second speaker identifiers determined from them, and then provides the first speaker identifier of the voice to be recognized for subsequent voice interaction. A speaker identifier is thus generated for each family user, without the user's perception, during normal voice interaction with the smart home, so that personalized recommendation can be made according to the speaker identifier; recording voiceprint features in advance is avoided, tedious operations are eliminated, and the user experience is improved.
Based on the above embodiments, fig. 3 is a schematic flow chart of the person identification model training method provided by the present application. As shown in fig. 3, training the person identification model based on the second speaker identifier and the historical voice set includes:
Step 310, performing voiceprint clustering on the historical voices in the historical voice set to obtain clusters;
step 320, if the number of historical voices corresponding to each cluster is greater than the set number, determining, based on each cluster, the second speaker identifier corresponding to that cluster;
and step 330, forming sample pairs from the second speaker identifier corresponding to each cluster and the historical voices corresponding to that cluster, and training the initial person identification model based on the sample pairs to obtain the person identification model.
Considering that voice interaction may be used mostly by only some family members, whose voices to be recognized therefore dominate the set, judging whether the initial person identification model can be trained directly from the total count in the historical voice set would leave the voice samples unbalanced across family members, making the training effect unsatisfactory and reducing identification accuracy. Therefore, the embodiment of the application first clusters the historical voices in the historical voice set and checks whether the historical voices in each cluster meet the training condition, so that the samples are more balanced, improving the training effect and the identification accuracy of the person identification model.
Specifically, voiceprint clustering is performed on the historical voices in the historical voice set to obtain clusters; when the number of historical voices corresponding to each cluster is greater than the set number, the second speaker identifier corresponding to each cluster is generated from that cluster, sample pairs are formed from the second speaker identifier corresponding to each cluster and the historical voices corresponding to that cluster, and the initial person identification model is then trained on the sample pairs to obtain the person identification model.
It should be noted that when the number of historical voices corresponding to any cluster is smaller than the set number, voices to be recognized continue to accumulate in the historical voice set until every cluster exceeds the set number, and only then is the initial person identification model trained. The set number may be preset in a configuration file or hard-coded in the application, or dynamically set by the user; this is not limited in the embodiments of the present application. Forming sample pairs from the second speaker identifier corresponding to each cluster and the historical voices corresponding to that cluster may specifically mean determining, from the clusters, the second speaker identifier corresponding to each historical voice in the set and pairing them accordingly, where the second speaker identifier in a sample pair serves as the sample label and the historical voice serves as the sample.
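The per-cluster sufficiency check and sample-pair construction described above can be sketched as follows; the names are illustrative, with `min_count` standing in for the set number:

```python
def build_sample_pairs(clusters, min_count):
    """clusters maps each second speaker identifier to its historical
    voices. Return (voice, speaker_id) training pairs only when every
    cluster holds more than min_count voices; otherwise return None to
    signal that voices should keep accumulating."""
    if any(len(voices) <= min_count for voices in clusters.values()):
        return None
    return [(voice, spk) for spk, voices in clusters.items()
            for voice in voices]
```

Returning `None` rather than a partial set enforces the balance condition: no cluster trains until all clusters are ready.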
Based on the above embodiments, fig. 4 is a schematic flow chart of the voiceprint clustering method provided by the present application. As shown in fig. 4, performing voiceprint clustering on the historical voices accumulated under the family identifier includes:
step 410, acquiring the number of family members through the family identifier;
step 420, performing voiceprint clustering on the historical voices in the historical voice set based on the number of family members to obtain the clusters.
Considering that knowing the required number of clusters before clustering improves clustering accuracy, the embodiment of the application acquires the number of family members from the family identifier and then performs voiceprint clustering on the historical voices in the corresponding historical voice set according to that number to obtain the clustering result.
It should be noted that, the number of family members may be set by a user when creating a family identifier or when modifying family identifier information, or may be set by a voice interaction manner, which is not limited in the embodiment of the present application.
According to the voice person-identifying method provided by the embodiment of the present application, voiceprint clustering is performed on the historical voices accumulated under a home identifier using the preset number of family members corresponding to that identifier. This improves the accuracy of the clustering result and, in turn, the identification accuracy of the person-identifying model, making personalized recommendation more accurate and further improving the user experience.
Based on the above embodiment, after the first speaker identifier output by the person-identifying model is obtained, the method further includes: forming a sample pair from the first speaker identifier and the voice to be recognized, training the person-identifying model, and updating the parameters of the person-identifying model.
In order to enable a user to experience personalized recommendation as soon as possible, the number of historical voices in the historical voice set corresponding to the home identifier is kept small during the initial training of the person-identifying model, so the accuracy of the initially trained model may be insufficient. Continuing to train the model with sample pairs formed from its own outputs therefore improves its identification accuracy.
Specifically, the first speaker identifier output by the person-identifying model and the voice to be recognized are formed into a sample pair; the person-identifying model is trained with this sample pair and its model parameters are updated. The first speaker identifier in the sample pair serves as the sample label, and the voice to be recognized serves as the sample.
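The retraining step can be sketched as below. The `model` object with `predict`/`train` methods is a hypothetical stand-in, since the embodiments do not prescribe a concrete model interface; the point is only that the model's own output is reused as the sample label.

```python
# Hedged sketch: continuing to train the person-identifying model with its own
# output. The (first speaker identifier, voice) pair produced at inference
# time is reused as a pseudo-labelled training sample.

def update_with_prediction(model, voice):
    speaker_id = model.predict(voice)  # first speaker identifier
    sample_pair = (speaker_id, voice)  # label = identifier, sample = voice
    model.train([sample_pair])         # update the model parameters
    return speaker_id
```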
It should be noted that the person-identifying model may be trained each time it outputs a first speaker identifier, or the first speaker identifiers output over multiple runs may be accumulated to form a sample pair set with which the model is then trained.
In addition, whether the person-identifying model needs further training can be determined through a policy configuration. When the policy indicates that further training is needed, a sample pair is formed from the first speaker identifier and the voice to be recognized, the person-identifying model is trained with this sample pair, and its model parameters are updated; otherwise, no further training is performed. The policy may be configured by the user in the terminal application, or changed automatically once the loss of the person-identifying model converges within a preset threshold range, which is not limited in the embodiments of the present application.
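A minimal sketch of such a policy check might look as follows, assuming the policy consists of a user-configured switch plus automatic loss-convergence detection; the `eps` threshold and the two-sample convergence test are illustrative assumptions.

```python
def needs_further_training(user_policy_enabled, recent_losses, eps=1e-3):
    """Policy sketch: train further unless the user disabled retraining or the
    model loss has already converged within a preset threshold range."""
    if not user_policy_enabled:
        return False
    if len(recent_losses) >= 2 and abs(recent_losses[-1] - recent_losses[-2]) < eps:
        return False  # loss converged; stop retraining automatically
    return True
```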
Fig. 5 is a flow chart of a voice interaction method provided in the present application. As shown in fig. 5, the method includes:
step 510, receiving a home identifier and a voice to be recognized sent by a terminal;
step 520, in the person-identifying scene, based on the family identifier and the voice to be identified, the voice person-identifying method provided in any of the above embodiments is applied to determine the current speaker identifier, and based on the current speaker identifier, the current user portrait is determined;
step 530, executing the interactive command corresponding to the voice to be recognized based on the current user portrait;
Step 540, in the non-human-distinguished scene, executing the interactive command corresponding to the voice to be recognized based on the default user portrait.
Specifically, after the home identifier activated on the current terminal and the user's voice to be recognized collected by the terminal are received, the scene corresponding to the current home identifier is determined. In a person-identifying scene, the voice person-identifying method provided by any of the above embodiments is applied: the current speaker identifier is determined from the received home identifier and the voice to be recognized, the current user portrait is obtained from the current speaker identifier, interactive-command recognition is then performed on the voice to be recognized, and the corresponding command operation is carried out in combination with the user characteristics in the user portrait. In a non-person-identifying scene, i.e., when the speaker's identity does not need to be identified, interactive-command recognition is performed directly on the voice to be recognized, and the command operation is carried out in the standard mode with the default user portrait.
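The scene dispatch just described can be sketched as below; the `identify`, `portraits`, and `execute` callables are hypothetical placeholders for the voice person-identifying method, the portrait store, and the command executor.

```python
DEFAULT_PORTRAIT = {"user": "default"}  # standard-mode portrait (assumed shape)

def handle_interaction(home_id, voice, person_scene, identify, portraits, execute):
    """Dispatch between the person-identifying and non-person-identifying scenes."""
    if person_scene:
        speaker_id = identify(home_id, voice)  # voice person-identifying method
        portrait = portraits.get(speaker_id, DEFAULT_PORTRAIT)
    else:
        portrait = DEFAULT_PORTRAIT            # standard mode, no identification
    return execute(voice, portrait)            # interactive command with portrait
```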
It should be noted that, after the person-identifying model is obtained through training, the user portrait of the second speaker identifier corresponding to each cluster may be constructed from the interactive commands or interactive content of the historical voices corresponding to that cluster; alternatively, the user portrait may be constructed or refined during application of the model, from the first speaker identifier it outputs and the interactive commands or content of the voice to be recognized; or it may be constructed actively, for example: in a person-identifying scene, after a voice interaction, the user actively sets a nickname or selects favorite items by voice. The embodiments of the present application are not limited in this respect. The default user portrait corresponds to interaction in the standard interaction mode. Preferably, the user portrait, the speaker identifier corresponding to it, and the mapping relationship between them are stored in a database.
According to the voice interaction method provided by the embodiment of the present application, in a person-identifying scene, the voice person-identifying method provided by any of the above embodiments is invoked to obtain the current speaker identifier, and voice interaction is carried out according to the user portrait corresponding to that identifier. In this way, the speaker identifier corresponding to each user in a household is generated without perception during normal voice interaction with the smart home, personalized recommendation is achieved by constructing user portraits from speaker identifiers, the need to generate a user's speaker identifier by recording voiceprint features in advance is avoided, cumbersome operations are eliminated, and the user experience is improved.
Based on the above embodiment, determining the current user representation based on the current speaker identification in step 520 includes:
if the current speaker identifier is the default identifier, the default user portrait is used as the current user portrait; otherwise, based on the current speaker identifier, the mapping relationship between speaker identifiers and user portraits is applied to determine the current user portrait.
It should be noted that, considering that a person not belonging to the home identifier may interact by voice with the intelligent terminal, so that the person-identifying model cannot identify the speaker and outputs the default identifier, the embodiment of the present application first judges the speaker identifier before determining the current user portrait. If the current speaker identifier is the default identifier, the default user portrait is used as the current user portrait; otherwise, the current user portrait corresponding to the current speaker identifier is obtained by applying the mapping relationship between speaker identifiers and user portraits recorded in the database.
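A sketch of this default-identifier check, assuming the mapping is available as a dictionary read from the database; the identifier value and the portrait fields are illustrative assumptions.

```python
DEFAULT_IDENTIFIER = "default"                       # assumed default identifier value
DEFAULT_PORTRAIT = {"nickname": None, "preferences": []}

def current_portrait(speaker_id, portrait_map):
    """portrait_map: speaker identifier -> user portrait mapping (e.g. from the database)."""
    if speaker_id == DEFAULT_IDENTIFIER:
        return DEFAULT_PORTRAIT                      # unknown speaker: standard mode
    return portrait_map[speaker_id]                  # known speaker: personalized portrait
```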
FIG. 6 is a second flowchart of a voice recognition method provided in the present application. As shown in fig. 6, the method includes:
In step 610, the user starts the smart-home control application (app) through the terminal, logs in to a personal account, and sets the scene to the person-identifying scene (enabling the person-identifying scene requires the user's authorization); the user then wakes up the terminal (a sound box or a household appliance with voice interaction) and initiates voice interaction normally.
In step 620, the AI cloud service decodes and recognizes the voice to be recognized and determines whether the terminal MAC has enabled the person-identifying scene. If so, the flow proceeds to step 630; if not, the voice person-identifying procedure is not entered and the voice interaction flow in the standard mode is executed.
In step 630, the AI cloud service determines whether the status of the home identifier is the person-identifying status (this may be determined by checking whether a speaker identifier, and hence a person-identifying model, exists under the home identifier; their presence indicates the person-identifying status). If the status is the person-identifying status, the flow proceeds to step 650; otherwise, it proceeds to step 640.
Step 640: store the decoding result of the voice to be recognized in the historical voice set corresponding to the home identifier (the voice files of the historical voice set are stored under a path corresponding to the home identifier, and the path information of the voice files is stored in a database), cluster the voice to be recognized together with the historical voices in that set, and record the clustering result in the database, so as to determine whether the number of historical voices in each cluster has reached the set number (for example, for a family of 10 members, each person accumulates at least 10 valid audio clips). If the set number has not been reached, the voice interaction flow in the standard mode is performed; once it is reached, the flow proceeds to step 660.
Step 650: call the person-identifying model to identify the voice to be recognized and obtain the first speaker identifier corresponding to it, then judge whether the first speaker identifier has a nickname. If it does, the reply in the voice interaction result addresses the user by nickname (for example, if the nickname corresponding to the first speaker identifier is Zhang San and the user asks about the weather, the returned result is "Zhang San, today's weather is clear and the temperature is 25 ℃"). If there is no nickname, a multi-round dialogue is opened asking the user "What would you like me to call you?"; after the user replies by voice, speech recognition is performed to obtain the user's nickname, which is then associated with the first speaker identifier and stored in the database for use in the next voiceprint recognition.
Step 660: invoke a voiceprint clustering algorithm to perform voiceprint clustering on the historical voices in the historical voice set, obtaining a plurality of second speaker identifiers (a family has multiple members, so the voices may converge into multiple clusters); train a person-identifying model from the second speaker identifiers and the historical voice set; and store the model in the database on a per-family basis, for use when the model identifies voices to be recognized.
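As an illustration of step 660's training result, a person-identifying model could be as simple as one centroid voiceprint per second speaker identifier, identifying a new voice by nearest centroid and falling back to the default identifier when nothing is close enough. This nearest-centroid model is an assumption for illustration; the patent does not prescribe a model architecture.

```python
import numpy as np

class NearestCentroidIdentifier:
    """Minimal stand-in for a person-identifying model trained from clusters."""

    def __init__(self, threshold=1.0):
        self.centroids = {}        # second speaker identifier -> centroid embedding
        self.threshold = threshold # beyond this distance, output the default identifier

    def train(self, sample_pairs):
        """sample_pairs: (second speaker identifier, voiceprint embedding) pairs."""
        by_id = {}
        for speaker_id, emb in sample_pairs:
            by_id.setdefault(speaker_id, []).append(np.asarray(emb, dtype=float))
        for speaker_id, embs in by_id.items():
            self.centroids[speaker_id] = np.mean(embs, axis=0)

    def identify(self, emb):
        """Return the nearest speaker identifier, or "default" if none is close."""
        emb = np.asarray(emb, dtype=float)
        best_id, best_d = "default", self.threshold
        for speaker_id, c in self.centroids.items():
            d = np.linalg.norm(emb - c)
            if d < best_d:
                best_id, best_d = speaker_id, d
        return best_id
```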
Step 670: steps 610 to 660 above complete the basic flow of voice person identification. On this basis, related policies can be flexibly set to adjust it dynamically: when the policy requires that the person-identifying model be trained continuously, the voice accumulation operation continues, the first speaker identifier output by the model and the voice to be recognized are applied to the model, and its parameters are updated, improving the accuracy of subsequent identification.
The voice person-identifying device and the voice interaction device provided by the present application are described below. The voice person-identifying device described below and the voice person-identifying method described above may be referred to correspondingly, as may the voice interaction device described below and the voice interaction method described above.
Fig. 7 is a schematic structural diagram of a voice recognition device provided by the present application. As shown in fig. 7, the voice recognition device includes: a determination module 710 and a human identification module 720.
Wherein,
a determining module 710, configured to determine a home identifier and a voice to be recognized;
the person-identifying module 720 is configured to, if the status of the home identifier is the person-identifying status, input the voice to be recognized into the person-identifying model corresponding to the home identifier, and obtain the first speaker identifier output by the person-identifying model; otherwise, output a default identifier, store the voice to be recognized in the historical voice set corresponding to the home identifier, and update the status of the home identifier to the person-identifying status after the person-identifying model is obtained by training based on the second speaker identifier and the historical voice set; the second speaker identifier is determined based on the voiceprint clustering result of the historical voice set.
In the voice person-identifying device provided by the embodiment of the present application, the determining module determines the home identifier and the voice to be recognized; if the status of the home identifier is the person-identifying status, the person-identifying module inputs the voice to be recognized into the person-identifying model corresponding to the home identifier and obtains the first speaker identifier output by the model; otherwise, it outputs the default identifier, stores the voice to be recognized in the historical voice set corresponding to the home identifier, and updates the status of the home identifier to the person-identifying status after the person-identifying model is obtained by training based on the second speaker identifiers and the historical voice set, the second speaker identifiers being determined based on the voiceprint clustering result of the historical voice set. In this way, the speaker identifier corresponding to each user in the household is generated without perception during normal voice interaction with the smart home, personalized recommendation is achieved according to the speaker identifiers, the need to generate a user's speaker identifier by recording voiceprint features in advance is avoided, cumbersome operations are eliminated, and the user experience is improved.
Based on any of the above embodiments, the person identifying module 720 includes:
the clustering sub-module is used for carrying out voiceprint clustering on the historical voices in the historical voice set to obtain clustering clusters;
The identifier generation sub-module is used for determining a second speaker identifier corresponding to each cluster based on each cluster if the number of the historical voices corresponding to each cluster is larger than the set number;
and the model training sub-module is used for forming a sample pair by the second speaker identification corresponding to each cluster and the historical voice corresponding to each cluster, and training the initial person-distinguishing model based on the sample pair to obtain the person-distinguishing model.
Based on any of the above embodiments, the clustering sub-module includes:
the number determination submodule is used for obtaining the number of family members through family identification;
and the voiceprint clustering sub-module is used for carrying out voiceprint clustering on the historical voices in the historical voice set based on the number of family members to obtain clustering clusters.
Based on any of the above embodiments, the person-identifying module 720 further includes a model retraining sub-module, configured to form a sample pair from the first speaker identifier and the voice to be recognized, train the person-identifying model, and update the parameters of the person-identifying model.
Fig. 8 is a schematic structural diagram of a voice interaction device provided in the present application. As shown in fig. 8, the voice interaction apparatus includes: a receiving module 810, a person identifying module 820, a first executing module 830 and a second executing module 840.
Wherein,
a receiving module 810, configured to receive a home identifier and a voice to be recognized sent by a terminal;
the person identifying module 820 is configured to determine, in a person identifying scenario, a current speaker identifier based on the home identifier and the voice to be identified, by applying the voice identifying method provided in any one of the above embodiments, and determine a current user portrait based on the current speaker identifier;
the first execution module 830 is configured to execute an interaction command corresponding to the voice to be recognized based on the current user portrait;
the second execution module 840 is configured to execute an interactive command corresponding to the voice to be recognized based on the default user portrait in the non-human-recognition scenario.
In the voice interaction device provided by the embodiment of the present application, the receiving module receives the home identifier and the voice to be recognized sent by the terminal; in a person-identifying scene, the person-identifying module determines the current speaker identifier based on the home identifier and the voice to be recognized by applying the voice person-identifying method provided by any of the above embodiments, and determines the current user portrait based on the current speaker identifier; the first execution module executes the interactive command corresponding to the voice to be recognized based on the current user portrait; and, in a non-person-identifying scene, the second execution module executes the interactive command corresponding to the voice to be recognized based on the default user portrait. In this way, the speaker identifier corresponding to each user in the household is generated without perception during normal voice interaction with the smart home, personalized recommendation is achieved by constructing user portraits from speaker identifiers, the need to generate a user's speaker identifier by recording voiceprint features in advance is avoided, cumbersome operations are eliminated, and the user experience is improved.
Based on any of the above embodiments, the person-identifying module 820 includes: a user portrait acquisition sub-module, configured to take the default user portrait as the current user portrait if the current speaker identifier is the default identifier; and otherwise, based on the current speaker identifier, apply the mapping relationship between speaker identifiers and user portraits to determine the current user portrait.
Fig. 9 illustrates a schematic physical diagram of an electronic device. As shown in fig. 9, the electronic device may include: a processor 910, a communication interface (Communications Interface) 920, a memory 930, and a communication bus 940, wherein the processor 910, the communication interface 920, and the memory 930 communicate with each other via the communication bus 940. The processor 910 may invoke logic instructions in the memory 930 to perform the voice person-identifying method or the voice interaction method. The voice person-identifying method includes: determining a home identifier and a voice to be recognized; if the status of the home identifier is the person-identifying status, inputting the voice to be recognized into the person-identifying model corresponding to the home identifier, and obtaining the first speaker identifier output by the person-identifying model; otherwise, outputting a default identifier, storing the voice to be recognized in the historical voice set corresponding to the home identifier, and updating the status of the home identifier to the person-identifying status after the person-identifying model is obtained by training based on the second speaker identifiers and the historical voice set; the second speaker identifiers are determined based on the voiceprint clustering result of the historical voice set.
The voice interaction method includes: receiving a home identifier and a voice to be recognized sent by a terminal; in a person-identifying scene, determining the speaker identifier of the voice to be recognized based on the home identifier and the voice to be recognized by applying the voice person-identifying method provided by any of the above embodiments, and determining a user portrait based on the speaker identifier; executing the interactive command corresponding to the voice to be recognized based on the user portrait; and, in a non-person-identifying scene, executing the interactive command corresponding to the voice to be recognized based on the default user portrait.
Further, the logic instructions in the memory 930 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In another aspect, the present application further provides a computer program product, the computer program product including a computer program, where the computer program can be stored on a computer readable storage medium, and when the computer program is executed by a processor, the computer can execute the voice person-identifying method or the voice interaction method provided above. The voice person-identifying method includes: determining a home identifier and a voice to be recognized; if the status of the home identifier is the person-identifying status, inputting the voice to be recognized into the person-identifying model corresponding to the home identifier, and obtaining the first speaker identifier output by the person-identifying model; otherwise, outputting a default identifier, storing the voice to be recognized in the historical voice set corresponding to the home identifier, and updating the status of the home identifier to the person-identifying status after the person-identifying model is obtained by training based on the second speaker identifiers and the historical voice set; the second speaker identifiers are determined based on the voiceprint clustering result of the historical voice set. The voice interaction method includes: receiving a home identifier and a voice to be recognized sent by a terminal; in a person-identifying scene, determining the speaker identifier of the voice to be recognized based on the home identifier and the voice to be recognized by applying the voice person-identifying method provided by any of the above embodiments, and determining a user portrait based on the speaker identifier; executing the interactive command corresponding to the voice to be recognized based on the user portrait; and, in a non-person-identifying scene, executing the interactive command corresponding to the voice to be recognized based on the default user portrait.
In still another aspect, the present application further provides a computer readable storage medium, the computer readable storage medium including a stored program, where the program, when executed, performs the voice person-identifying method or the voice interaction method provided above. The voice person-identifying method includes: determining a home identifier and a voice to be recognized; if the status of the home identifier is the person-identifying status, inputting the voice to be recognized into the person-identifying model corresponding to the home identifier, and obtaining the first speaker identifier output by the person-identifying model; otherwise, outputting a default identifier, storing the voice to be recognized in the historical voice set corresponding to the home identifier, and updating the status of the home identifier to the person-identifying status after the person-identifying model is obtained by training based on the second speaker identifiers and the historical voice set; the second speaker identifiers are determined based on the voiceprint clustering result of the historical voice set. The voice interaction method includes: receiving a home identifier and a voice to be recognized sent by a terminal; in a person-identifying scene, determining the speaker identifier of the voice to be recognized based on the home identifier and the voice to be recognized by applying the voice person-identifying method provided by any of the above embodiments, and determining a user portrait based on the speaker identifier; executing the interactive command corresponding to the voice to be recognized based on the user portrait; and, in a non-person-identifying scene, executing the interactive command corresponding to the voice to be recognized based on the default user portrait.
The apparatus embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting thereof; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (10)

1. A method for voice recognition, comprising:
determining a family identifier and a voice to be recognized;
if the status of the family identifier is a person-identifying status, inputting the voice to be recognized into a person-identifying model corresponding to the family identifier, and obtaining a first speaker identifier output by the person-identifying model; otherwise, outputting a default identifier, storing the voice to be recognized into a historical voice set corresponding to the family identifier, and updating the status of the family identifier to the person-identifying status after the person-identifying model is obtained by training based on a second speaker identifier and the historical voice set; wherein the second speaker identifier is determined based on a voiceprint clustering result of the historical voice set.
2. The method for identifying a person by using voice according to claim 1, wherein the training to obtain the person-identifying model based on the second speaker identification and the historical voice set comprises:
voiceprint clustering is carried out on the historical voices in the historical voice set to obtain clustering clusters;
if the number of the historical voices corresponding to each cluster is larger than the set number, determining a second speaker identifier corresponding to each cluster based on each cluster;
and forming a sample pair by the second speaker identification corresponding to each cluster and the historical voice corresponding to each cluster, and training an initial person identification model based on the sample pair to obtain the person identification model.
3. The method for identifying people from voice according to claim 2, wherein the clustering the historical voices in the historical voice set to obtain clusters includes:
acquiring the number of family members through the family identification;
and clustering the historical voices in the historical voice set based on the number of family members to obtain each cluster.
4. A method for recognizing a person by speech according to any one of claims 1 to 3, wherein after the obtaining the person-recognition model outputs the first speaker identification, the method further comprises:
And forming a sample pair by the first speaker identifier and the voice to be recognized, training the person identification model, and updating parameters of the person identification model.
5. A voice interaction method, comprising:
receiving a family identifier and a voice to be recognized sent by a terminal;
in a person identification scene, determining a current speaker identifier based on the family identifier and the voice to be recognized by applying the voice person identification method according to any one of claims 1 to 4, and determining a current user portrait based on the current speaker identifier;
executing an interaction command corresponding to the voice to be recognized based on the current user portrait;
and in a non-person-identification scene, executing the interaction command corresponding to the voice to be recognized based on a default user portrait.
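The two-branch flow of claim 5 (person identification scene vs. not) amounts to a simple dispatch before command execution. The sketch below is an assumption-laden illustration: `identify` stands in for the claimed voice person identification method, and the portrait tables and command encoding are invented.

```python
# Illustrative sketch of claim 5: in a person identification scene, resolve
# the current speaker and use that user's portrait; otherwise fall back to
# the default user portrait before executing the interaction command.
DEFAULT_PORTRAIT = {"user": "default"}
PORTRAITS = {"alice": {"user": "alice"}}

def handle_voice(family_id, voice, person_id_scene, identify):
    if person_id_scene:
        speaker_id = identify(family_id, voice)      # claims 1-4
        portrait = PORTRAITS.get(speaker_id, DEFAULT_PORTRAIT)
    else:
        portrait = DEFAULT_PORTRAIT
    # execute the interaction command personalized by the chosen portrait
    return {"portrait": portrait, "command": f"exec:{voice}"}

res = handle_voice("f1", "turn on AC", True, lambda f, v: "alice")
res_default = handle_voice("f1", "turn on AC", False, lambda f, v: "alice")
```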
6. The voice interaction method according to claim 5, wherein the determining the current user portrait based on the current speaker identifier comprises:
if the current speaker identifier is a default identifier, taking the default user portrait as the current user portrait; otherwise, determining the current user portrait based on the current speaker identifier by applying a mapping relationship between speaker identifiers and user portraits.
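The default-identifier check in claim 6 is a lookup with a guaranteed fallback. A minimal sketch, with the mapping table, identifier values, and portrait fields all assumed for illustration:

```python
# Sketch of claim 6: a default identifier short-circuits to the default
# user portrait; any other identifier goes through the mapping relationship
# between speaker identifiers and user portraits.
DEFAULT_PORTRAIT = {"name": "default", "preferences": {}}

PORTRAITS = {  # mapping: speaker identifier -> user portrait
    "alice": {"name": "alice", "preferences": {"temp": 26}},
}

def resolve_portrait(speaker_id, default_id="default"):
    if speaker_id == default_id:
        return DEFAULT_PORTRAIT
    # unknown identifiers also fall back, so downstream code always gets
    # a usable portrait
    return PORTRAITS.get(speaker_id, DEFAULT_PORTRAIT)
```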
7. A voice person identification device, comprising:
a determining module, configured to determine a family identifier and a voice to be recognized;
a person identification module, configured to: if the state of the family identifier is a person-identification state, input the voice to be recognized into a person identification model corresponding to the family identifier, so that the person identification model outputs a first speaker identifier; otherwise, output a default identifier, store the voice to be recognized into a historical voice set corresponding to the family identifier, and, after the person identification model is trained based on a second speaker identifier and the historical voice set, update the state of the family identifier to the person-identification state; wherein the second speaker identifier is determined based on a voiceprint clustering result of the historical voice set.
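The state machine of the person identification module in claim 7 can be sketched as routing on whether a trained model exists for the family identifier. All names and the in-memory stores below are assumptions; a real device would persist the historical voice set and trigger the clustering-and-training pipeline asynchronously.

```python
# Hedged sketch of claim 7: if the family's model is trained, identify the
# speaker; otherwise return the default identifier and buffer the voice
# into the family's historical voice set for later clustering and training.
DEFAULT_ID = "default"

class PersonIdentificationModule:
    def __init__(self):
        self.models = {}   # family_id -> trained model (a callable here)
        self.history = {}  # family_id -> buffered historical voices

    def identify(self, family_id, voice):
        model = self.models.get(family_id)
        if model is not None:            # state: person-identification
            return model(voice)          # first speaker identifier
        # state: not yet trained -> buffer and fall back
        self.history.setdefault(family_id, []).append(voice)
        return DEFAULT_ID

mod = PersonIdentificationModule()
sid = mod.identify("family_1", "utt_1")     # no model yet -> default id
mod.models["family_1"] = lambda v: "alice"  # training finished (simulated)
sid2 = mod.identify("family_1", "utt_2")    # now resolves a speaker
```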
8. A voice interaction device, comprising:
a receiving module, configured to receive a family identifier and a voice to be recognized sent by a terminal;
a person identification module, configured to, in a person identification scene, determine a current speaker identifier based on the family identifier and the voice to be recognized by applying the voice person identification method according to any one of claims 1 to 4, and determine a current user portrait based on the current speaker identifier;
a first execution module, configured to execute an interaction command corresponding to the voice to be recognized based on the current user portrait;
and a second execution module, configured to, in a non-person-identification scene, execute the interaction command corresponding to the voice to be recognized based on a default user portrait.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored program, wherein, when the program runs, the method according to any one of claims 1 to 6 is performed.
10. An electronic device, comprising a memory and a processor, characterized in that a computer program is stored in the memory, and the processor is configured to execute, by means of the computer program, the method according to any one of claims 1 to 6.
Priority Applications (1)

Application Number: CN202210909507.3A; Priority/Filing Date: 2022-07-29; Title: Voice person identification method, voice interaction method and device

Publications (1)

Publication Number: CN117524231A; Publication Date: 2024-02-06

Family ID: 89750001

Country Status (1): CN — CN117524231A, Pending


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination