CN115171702A - Digital twin voiceprint feature processing method, storage medium and electronic device

Info

Publication number: CN115171702A
Application number: CN202210603562.XA
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: voice, voice data, group, target, pieces
Legal status: Pending (assumed; not a legal conclusion)
Inventors: 邓邱伟, 朱文博, 王迪, 张丽
Current and original assignees (as listed): Qingdao Haier Technology Co Ltd; Qingdao Haier Intelligent Home Appliance Technology Co Ltd; Haier Smart Home Co Ltd
Application filed by Qingdao Haier Technology Co Ltd, Qingdao Haier Intelligent Home Appliance Technology Co Ltd and Haier Smart Home Co Ltd
Priority: CN202210603562.XA
Publication: CN115171702A


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification
    • G10L 17/06: Decision making techniques; Pattern matching strategies
    • G10L 17/22: Interactive procedures; Man-machine interfaces
    • G10L 21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0272: Voice signal separating
    • G10L 21/028: Voice signal separating using properties of sound source


Abstract

The application discloses a digital twin voiceprint feature processing method, a storage medium and an electronic device in the technical field of smart homes. The method comprises the following steps: acquiring a plurality of pieces of voice data to be processed, where each piece of voice data was collected by one intelligent device in a group of intelligent devices and no matched object information could be identified from its voiceprint features; performing sound source separation processing on the plurality of pieces of voice data to obtain at least one target voice group, where each target voice group includes at least two pieces of voice data belonging to one sound source among the plurality of pieces; and performing a voiceprint feature extraction operation on the voice data in each target voice group to obtain at least one voiceprint feature, where each extracted voiceprint feature awaits association with corresponding object information. The method solves the related-art problem that low accuracy of enrolled voiceprint features leads to poor accuracy of user identity recognition.

Description

Digital twin voiceprint feature processing method, storage medium and electronic device
Technical Field
The application relates to the technical field of smart homes, in particular to a digital twin voiceprint feature processing method, a storage medium and an electronic device.
Background
In an intelligent voice dialogue system, a user's identity can be recognized based on voiceprint features the user has registered, and corresponding services can then be provided based on the recognized identity. To register voiceprint features, a user records them through some device (e.g., a mobile phone). After registration, the user's identity can be recognized on subsequent use by comparing new audio against the registered voiceprint features.
However, in the above processing method, recording a voiceprint feature requires the user to speak according to prompted text while a sound pickup unit on the device collects the voice data from which the voiceprint feature is extracted. Because the collected voice data comes from a single utterance, it is easily affected by the user's current state, so the accuracy of the enrolled voiceprint feature is low, which in turn degrades the accuracy of user identity recognition in subsequent use.
Therefore, voiceprint feature processing in the related art suffers from poor user identity recognition accuracy caused by the low accuracy of enrolled voiceprint features.
Disclosure of Invention
The embodiments of the present application provide a digital twin voiceprint feature processing method, a storage medium and an electronic device, to at least solve the related-art problem that low accuracy of enrolled voiceprint features leads to poor accuracy of user identity recognition.
According to an aspect of an embodiment of the present application, there is provided a digital twin voiceprint feature processing method, including: acquiring a plurality of pieces of voice data to be processed, where each piece of voice data was collected by one intelligent device in a group of intelligent devices and no matched object information could be identified from its voiceprint features; performing sound source separation processing on the plurality of pieces of voice data to obtain at least one target voice group, where each target voice group includes at least two pieces of voice data belonging to one sound source among the plurality of pieces; and performing a voiceprint feature extraction operation on the voice data in each target voice group to obtain at least one voiceprint feature, where each of the at least one voiceprint feature awaits association with corresponding object information.
In an exemplary embodiment, the performing sound source separation processing on the pieces of voice data to obtain at least one target voice group includes: performing sound source separation processing on the plurality of pieces of voice data to obtain at least one initial voice group, where each initial voice group includes voice data belonging to one sound source among the plurality of pieces; and determining, among the at least one initial voice group, the voice groups containing two or more pieces of voice data as the at least one target voice group.
In an exemplary embodiment, the performing sound source separation processing on the plurality of pieces of voice data to obtain at least one initial voice group includes: performing a voice merging operation on the plurality of pieces of voice data to obtain merged voice data; inputting the merged voice data into a sound source recognition model to obtain a sound source recognition result output by the sound source recognition model, where the sound source recognition result indicates the sound source to which each piece of voice data belongs; and grouping the voice data indicated by the sound source recognition result as belonging to the same sound source into an initial voice group, thereby obtaining the at least one initial voice group.
In an exemplary embodiment, after the sound source separation processing is performed on the pieces of voice data to obtain at least one target voice group, the method further includes: performing clustering processing on the plurality of pieces of voice data to obtain at least one reference voice group, where each reference voice group includes the voice data belonging to one cluster obtained by the clustering; and updating the at least one target voice group using the at least one reference voice group to obtain the updated at least one target voice group.
In an exemplary embodiment, the performing clustering processing on the pieces of voice data to obtain at least one reference voice group includes: performing dimension reduction processing on each piece of voice data to obtain the dimension-reduced voice data; and clustering the plurality of pieces of voice data according to the distances between them to obtain the at least one reference voice group.
In an exemplary embodiment, the updating the at least one target voice group using the at least one reference voice group to obtain the updated at least one target voice group includes performing the following operations on each target voice group, each target voice group being the current voice group while the steps are performed: determining a matching voice group in the at least one reference voice group that matches the current voice group, where the ratio of the number of pieces of voice data shared by the matching voice group and the current voice group to the total number of pieces of voice data in the current voice group is greater than or equal to a target ratio threshold; in the case that the matching voice group and the current voice group share two or more pieces of voice data, removing the voice data in the current voice group that does not belong to the matching voice group to obtain an updated current voice group; and in the case that the matching voice group and the current voice group share fewer than two pieces of voice data, removing the current voice group.
In an exemplary embodiment, before the acquiring the pieces of voice data to be processed, the method further includes: receiving first voice data sent by a first intelligent device in the group of intelligent devices, where the first voice data is voice data for which no matching enrolled voiceprint feature has been identified; and determining the first voice data as a piece of voice data to be processed in the case that the first voice data satisfies a voice screening condition, where the voice screening condition includes at least one of the following: the number of words recognized from the first voice data is greater than or equal to a target word count, and the signal-to-noise ratio of the first voice data is greater than or equal to a preset signal-to-noise ratio.
In an exemplary embodiment, after the performing the voiceprint feature extraction operation on the voice data in each target voice group to obtain at least one voiceprint feature, the method further includes: receiving second voice data sent by a second intelligent device in the group of intelligent devices, where the second voice data is voice data uttered by a use object of the second intelligent device; in the case that the voiceprint features of the second voice data match a target voiceprint feature among the at least one voiceprint feature, acquiring object information of the use object through the second intelligent device; and saving the correspondence between the object information of the use object and the target voiceprint feature.
According to another aspect of the embodiments of the present application, there is also provided a digital twin voiceprint feature processing apparatus, including: a first acquiring unit, configured to acquire a plurality of pieces of voice data to be processed, where each piece of voice data was collected by one intelligent device in a group of intelligent devices and no matched object information could be identified from its voiceprint features; a first executing unit, configured to perform sound source separation processing on the plurality of pieces of voice data to obtain at least one target voice group, where each target voice group includes at least two pieces of voice data belonging to one sound source among the plurality of pieces; and a second executing unit, configured to perform a voiceprint feature extraction operation on the voice data in each target voice group to obtain at least one voiceprint feature, where each of the at least one voiceprint feature awaits association with corresponding object information.
In one exemplary embodiment, the first executing unit includes: a first executing module, configured to perform sound source separation processing on the plurality of pieces of voice data to obtain at least one initial voice group, where each initial voice group includes voice data belonging to one sound source; and a determining module, configured to determine, among the at least one initial voice group, the voice groups containing two or more pieces of voice data as the at least one target voice group.
In one exemplary embodiment, the first executing module includes: an executing submodule, configured to perform a voice merging operation on the plurality of pieces of voice data to obtain merged voice data; an input submodule, configured to input the merged voice data into a sound source recognition model to obtain a sound source recognition result output by the sound source recognition model, where the sound source recognition result indicates the sound source to which each piece of voice data belongs; and a determining submodule, configured to group the voice data indicated by the sound source recognition result as belonging to the same sound source into an initial voice group, thereby obtaining the at least one initial voice group.
In one exemplary embodiment, the apparatus further includes: a third executing unit, configured to perform, after the sound source separation processing is performed on the plurality of pieces of voice data to obtain at least one target voice group, clustering processing on the plurality of pieces of voice data to obtain at least one reference voice group, where each reference voice group includes the voice data belonging to one cluster obtained by the clustering; and an updating unit, configured to update the at least one target voice group using the at least one reference voice group to obtain the updated at least one target voice group.
In one exemplary embodiment, the third executing unit includes: a processing module, configured to perform dimension reduction processing on each piece of voice data to obtain the dimension-reduced voice data; and a clustering module, configured to cluster the plurality of pieces of voice data according to the distances between them to obtain the at least one reference voice group.
In one exemplary embodiment, the updating unit includes: a second executing module, configured to perform the following operations on each target voice group to obtain the updated at least one target voice group, each target voice group being the current voice group while the steps are performed: determining a matching voice group in the at least one reference voice group that matches the current voice group, where the ratio of the number of pieces of voice data shared by the matching voice group and the current voice group to the total number of pieces of voice data in the current voice group is greater than or equal to a target ratio threshold; in the case that the matching voice group and the current voice group share two or more pieces of voice data, removing the voice data in the current voice group that does not belong to the matching voice group to obtain an updated current voice group; and in the case that the matching voice group and the current voice group share fewer than two pieces of voice data, removing the current voice group.
In one exemplary embodiment, the apparatus further includes: a first receiving unit, configured to receive, before the plurality of pieces of voice data to be processed are acquired, first voice data sent by a first intelligent device in the group of intelligent devices, where the first voice data is voice data for which no matching enrolled voiceprint feature has been identified; and a determining unit, configured to determine the first voice data as a piece of voice data to be processed in the case that the first voice data satisfies a voice screening condition, where the voice screening condition includes at least one of the following: the number of words recognized from the first voice data is greater than or equal to a target word count, and the signal-to-noise ratio of the first voice data is greater than or equal to a preset signal-to-noise ratio.
In one exemplary embodiment, the apparatus further includes: a second receiving unit, configured to receive, after the voiceprint feature extraction operation is performed on the voice data in each target voice group to obtain at least one voiceprint feature, second voice data sent by a second intelligent device in the group of intelligent devices, where the second voice data is voice data uttered by a use object of the second intelligent device; a second acquiring unit, configured to acquire, through the second intelligent device, object information of the use object in the case that the voiceprint features of the second voice data match a target voiceprint feature among the at least one voiceprint feature; and a saving unit, configured to save the correspondence between the object information of the use object and the target voiceprint feature.
According to another aspect of the embodiments of the present application, there is also provided a computer-readable storage medium in which a computer program is stored, where the computer program is configured to execute the above digital twin voiceprint feature processing method when run.
According to another aspect of the embodiments of the present application, there is also provided an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the above digital twin voiceprint feature processing method by means of the computer program.
In the embodiments of the present application, sound source separation processing is performed on the collected pieces of voice data, and feature extraction is then performed on the voice data in the separated voice groups. Specifically, a plurality of pieces of voice data to be processed are acquired, where each piece was collected by one intelligent device in a group of intelligent devices and no matched object information could be identified from its voiceprint features; sound source separation processing is performed on the plurality of pieces of voice data to obtain at least one target voice group, where each target voice group includes at least two pieces of voice data belonging to one sound source; and a voiceprint feature extraction operation is performed on the voice data in each target voice group to obtain at least one voiceprint feature awaiting association with corresponding object information. Because different pieces of voice data from the same sound source reflect different states of a user, extracting voiceprint features from several such pieces improves how accurately the extracted features represent the corresponding user. After the corresponding object information is entered, performing user identity recognition based on voiceprint features extracted in this way achieves the technical effect of improving recognition accuracy, thereby solving the related-art problem that low accuracy of enrolled voiceprint features leads to poor user identity recognition.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to describe the embodiments or the prior art are briefly introduced below; those skilled in the art can obtain other drawings from these drawings without inventive effort.
FIG. 1 is a schematic diagram of a hardware environment for an alternative digital twin voiceprint feature processing method according to an embodiment of the present application;
FIG. 2 is a schematic flow diagram of an alternative digital twin voiceprint feature processing method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an alternative digital twin voiceprint feature processing method according to an embodiment of the present application;
FIG. 4 is a block diagram of an alternative digital twin voiceprint feature processing apparatus according to an embodiment of the present application;
fig. 5 is a block diagram of an alternative electronic device according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to an aspect of an embodiment of the present application, a digital twin voiceprint feature processing method is provided. The method is widely applicable to whole-house intelligent digital control scenarios such as the smart family (Smart Home), smart home, smart home device ecosystems, and intelligent house (Intelligent House) ecosystems. Optionally, in this embodiment, the method may be applied to a hardware environment formed by the terminal device 102 and the server 104 shown in fig. 1. As shown in fig. 1, the server 104 is connected to the terminal device 102 through a network and may be configured to provide services (e.g., application services) for the terminal or for a client installed on the terminal. A database may be set up on the server or independently of it to provide data storage services for the server 104, and cloud computing and/or edge computing services may be configured on the server or independently of it to provide data computation services for the server 104.
The network may include, but is not limited to, at least one of: a wired network, a wireless network. The wired network may include, but is not limited to, at least one of: a wide area network, a metropolitan area network, a local area network. The wireless network may include, but is not limited to, at least one of: WIFI (Wireless Fidelity), Bluetooth. The terminal device 102 may be, but is not limited to, a PC, a mobile phone, a tablet computer, a smart air conditioner, a smart range hood, a smart refrigerator, a smart oven, a smart cooktop, a smart washing machine, a smart water heater, a smart washing device, a smart dishwasher, a smart projection device, a smart TV, a smart clothes hanger, smart curtains, smart audio-visual equipment, a smart socket, a smart sound system, a smart speaker, a smart fresh-air device, smart kitchen and bathroom equipment, a smart bathroom device, a smart floor-sweeping robot, a smart window-cleaning robot, a smart mopping robot, a smart air purification device, a smart steamer, a smart microwave oven, a smart kitchen appliance, a smart purifier, a smart water dispenser, a smart lock, etc.
The digital twin voiceprint feature processing method according to the embodiments of the present application may be executed by the server 104, by the terminal device 102, or by the server 104 and the terminal device 102 together. When the terminal device 102 executes the method, it may also do so through a client installed on it.
Taking the server 104 to execute the digital twin voiceprint feature processing method in the present embodiment as an example, fig. 2 is a schematic flow chart of an optional digital twin voiceprint feature processing method according to the present embodiment, and as shown in fig. 2, the flow chart of the method may include the following steps:
step S202, a plurality of pieces of voice data to be processed are obtained, wherein each piece of voice data is the voice data which is collected by one intelligent device in a group of intelligent devices and does not identify matched object information through voiceprint characteristics.
The digital twin voiceprint feature processing method in this embodiment may be applied to a scenario where a voiceprint feature in voice data is processed, where the voice data may be voice data acquired by an intelligent voice system running on an intelligent device (the voice data acquired by the intelligent voice system through a sound pickup component on the intelligent device, and the sound pickup component may be a microphone array, etc.), and the acquired voice data may include voice data of multiple voice objects, or may be voice data of a single voice object, which is not limited herein. Here, the digital twin is a full life cycle process of integrating multidisciplinary, multi-physical quantity, multi-scale and multi-probability simulation processes by fully utilizing data such as physical models, sensor updates and operation histories, and completing mapping in a virtual space so as to reflect corresponding entity equipment. Correspondingly, the digital twin voiceprint feature processing method in the embodiment can be applied to an intelligent device or a server with a digital twin function (for example, running a digital twin system).
Optionally, the voice data may be collected without the voice object perceiving it; that is, daily voice interaction data of the voice object can be collected without requiring the voice object to separately perform a voice enrollment operation. This embodiment is described taking imperceptible voice data collection as an example.
In intelligent voice dialogue systems, voiceprints are increasingly recognized and applied, but mostly on the basis of registered voiceprints. Traditional registered voiceprints are the most widely used on existing intelligent devices: generally, a user registers a voiceprint through some device, such as a mobile phone, and when the user subsequently uses the device again, features are compared against the registered audio to determine whether the speaker is the same person. At present, voiceprint feature processing based on non-registered voiceprints generally relies on clustering to obtain feature results; when the clustering result is incorrect, the subsequent classification result easily goes wrong, so non-registered voiceprints perform relatively poorly and are not in general use.
Moreover, users increasingly want to obtain services provided by smart home devices, such as voiceprint recognition, in an imperceptible manner. Accordingly, the digital twin voiceprint feature processing method provided in this embodiment can optimize imperceptible, non-registered voiceprint recognition for multiple persons in a household. Before voiceprint recognition is performed, a plurality of pieces of voice data to be processed can be obtained through the intelligent devices, where each piece of voice data was collected by one intelligent device in the group of intelligent devices and no matched object information could be identified from its voiceprint features.
The set of smart devices may be smart devices bound under the same account (e.g., a home account), for example, smart home devices carrying a smart voice system, for example, smart speakers, smart televisions, and the like, and may also be other smart devices, for example, smart screens, which is not limited herein. The object information may be object information of the voice object, such as an object nickname, an object age, an object gender, an object preference, and the like, which is not limited herein.
Optionally, after a certain intelligent device in the group of intelligent devices is awakened (or even while it is not awakened), collection of a detected voice signal may be started to obtain a piece of voice data. Voiceprint feature recognition is then performed on the collected voice data to determine whether its voiceprint features match the voiceprint features of any existing object information; if so, the piece of voice data has matching object information, and otherwise it is determined to have none. Voice data without matching object information may be saved as accumulated voice data. When the accumulated voice data exceeds a preset number, subsequent processing can be performed on it; the voice data may be stored in the form of voice files, also called audio, audio data or audio files.
For example, since too little accumulated audio may give a poor recognition effect, a certain number of audio files may be accumulated in advance as a basis for judgment; for example, this number may be set to 6.
Step S204, performing sound source separation processing on the multiple pieces of voice data to obtain at least one target voice group, where each target voice group includes at least two pieces of voice data belonging to one sound source in the multiple pieces of voice data.
After the multiple pieces of voice data to be processed are obtained, the server may perform sound source separation processing on them and determine which pieces belong to the same sound source, thereby obtaining at least one target voice group, where each target voice group includes at least two pieces of voice data belonging to one sound source. A voice group containing only one piece of voice data may be removed, or retained as accumulated voice data for subsequent processing; this is likewise not limited here.
For example, the accumulated voice data may comprise 6 pieces, denoted voice data 1 through voice data 6. Before sound source separation is performed on the 6 pieces, the pieces collected in different time periods may be merged to ensure the separation effect, and a sound source separation algorithm interface is called on the merged voice data to obtain a grouping of the voice data in which the pieces in each group share the same sound source; for example, voice group one contains voice data 1, 2 and 4, and voice group two contains voice data 3, 5 and 6.
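As a concrete illustration, the Python sketch below merges the accumulated utterances and groups them by the labels returned from a separation interface. The function `separate_sources` and its signature are assumptions standing in for the "sound source separation algorithm interface" mentioned above, not part of the original disclosure.

```python
# Illustrative sketch of the merge-then-separate step described above.
from collections import defaultdict

import numpy as np


def group_by_sound_source(utterances, separate_sources):
    """utterances: list of 1-D numpy waveforms, one per collected utterance.
    separate_sources: callable taking the merged waveform and the utterance
    boundaries, returning one sound source label per utterance.
    Returns {source_label: [utterance indices]}."""
    # Merge the utterances collected in different time periods into a single
    # waveform, remembering where each utterance starts and ends.
    boundaries, offset = [], 0
    for u in utterances:
        boundaries.append((offset, offset + len(u)))
        offset += len(u)
    merged = np.concatenate(utterances)

    labels = separate_sources(merged, boundaries)  # one label per utterance

    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)
    return dict(groups)  # e.g. {0: [0, 1, 3], 1: [2, 4, 5]}
```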
Step S206, voice print feature extraction operation is carried out on the voice data in each target voice group to obtain at least one voice print feature, wherein the at least one voice print feature is the voice print feature of the corresponding object information to be determined.
In this embodiment, after the target voice groups are obtained, a voiceprint feature extraction operation may be performed on the voice data in each target voice group, so as to obtain a voiceprint feature of the object information to be determined, so that when a subsequent user performs voice interaction, voiceprint recognition of the user is achieved.
When extracting voiceprint features, a feature extraction model can be used to extract the voiceprint features in the voice data. The feature extraction model may be constructed based on the volume, intonation and timbre of the voice data, or on other feature dimensions, which is not limited here. Through the feature extraction model, the feature values of the three dimensions corresponding to a piece of voice data can be output, thereby determining its voiceprint features.
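For intuition, a crude stand-in for the volume / intonation / timbre dimensions could be computed as below. This is a minimal sketch with assumed proxies (RMS energy for volume, median F0 for intonation, spectral centroid for timbre); the patent's actual feature extraction model is not disclosed at this level of detail.

```python
# Illustrative three-dimensional feature vector; not the patent's model.
import librosa
import numpy as np


def simple_voice_features(path, sr=16000):
    y, _ = librosa.load(path, sr=sr)
    volume = float(np.mean(librosa.feature.rms(y=y)))      # loudness proxy
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)          # pitch track
    intonation = float(np.median(f0))                      # intonation proxy
    timbre = float(np.mean(librosa.feature.spectral_centroid(y=y, sr=sr)))
    return np.array([volume, intonation, timbre])
```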
Through the above steps S202 to S206, a plurality of pieces of voice data to be processed are acquired, where each piece was collected by one intelligent device in a group of intelligent devices and no matched object information could be identified from its voiceprint features; sound source separation processing is performed on the plurality of pieces of voice data to obtain at least one target voice group, where each target voice group includes at least two pieces of voice data belonging to one sound source; and a voiceprint feature extraction operation is performed on the voice data in each target voice group to obtain at least one voiceprint feature awaiting association with corresponding object information. This solves the related-art problem that low accuracy of enrolled voiceprint features leads to poor user identity recognition, and improves the accuracy of user identity recognition.
In an exemplary embodiment, performing a sound source separation process on a plurality of pieces of speech data to obtain at least one target speech group includes:
s11, sound source separation processing is carried out on a plurality of pieces of voice data to obtain at least one initial voice group, wherein each initial voice group comprises voice data belonging to one sound source in the plurality of pieces of voice data;
s12, determining the voice groups with the number of the voice data more than or equal to two in at least one initial voice group as at least one target voice group.
In this embodiment, when the sound source separation processing is performed on a plurality of pieces of voice data, the sound source separated voice groups are initial voice groups, each of which includes one or more pieces of voice data belonging to one sound source among the plurality of pieces of voice data. If only one piece of voice data is included in one initial voice group, the confidence coefficient of the initial voice group is considered to be low, the initial voice group can be discarded, and the voice groups with the number of the voice data more than or equal to two in the initial voice group are reserved, so that the target voice group can be determined.
For example, the number of pieces of accumulated voice data may be 6, and each piece of accumulated voice data is represented by voice data 1, voice data 2, voice data 3, voice data 4, voice data 5, and voice data 6. After sound source separation, the resulting groupings are as follows: initial speech set one contains speech data 1 and 2, initial speech set two contains speech data 3, 5 and 6, and initial speech set three contains speech data 4. In this case, the initial voice group three may be discarded, and the initial voice group one and the initial voice group two may be used as the target voice group.
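A minimal sketch of this size filter follows; the dictionary shape of the groups is an assumption carried over from the earlier grouping sketch.

```python
# Step S12: keep only the initial groups with at least two utterances;
# single-utterance groups are treated as low-confidence and dropped.
def select_target_groups(initial_groups, min_size=2):
    return {label: members
            for label, members in initial_groups.items()
            if len(members) >= min_size}

# e.g. {0: [0, 1], 1: [2, 4, 5], 2: [3]} -> {0: [0, 1], 1: [2, 4, 5]}
```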
Through this embodiment, by screening out the voice groups containing fewer than two pieces of voice data, adverse effects on voiceprint feature extraction caused by a group containing too little voice data are avoided, improving the accuracy of voiceprint feature processing.
In one exemplary embodiment, performing a sound source separation process on a plurality of pieces of speech data to obtain at least one initial speech group includes:
s21, performing voice merging operation on the voice data to obtain merged voice data;
s22, inputting the merged voice data into a sound source recognition model to obtain a sound source recognition result output by the sound source recognition model, wherein the sound source recognition result is used for indicating a sound source to which each piece of voice data belongs;
and S23, determining the voice data which belongs to the same sound source and is indicated by the sound source identification result as an initial voice group to obtain at least one initial voice group.
In this embodiment, in order to ensure the accuracy of sound source separation, a voice merging operation may be performed on the plurality of pieces of voice data. The merged voice data is input into the sound source recognition model to obtain the sound source recognition result it outputs, and the voice data indicated by that result as belonging to the same sound source is grouped into an initial voice group, thereby obtaining at least one initial voice group.
The sound source separation is mainly realized as follows: the 6 audio files are first merged; after merging, an attractor-based end-to-end sound source separation algorithm judges whether the sounds in different time periods come from the same sound source, and the counted result is returned.
For example, as shown in fig. 3, the sound source recognition model is a transformer model, following End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors (EEND-EDA). The transformer encoder is an end-to-end self-attention model, and its output embedding layer feeds the encoder-decoder based attractor module. The attractors determine the number of speakers in the data, yielding the final estimated speaker count. The model scores each speaker, and a speaker is accepted if the score exceeds a specified threshold; the threshold is determined empirically and adjusted for different channel devices, and cannot be completely unified.
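A minimal sketch of that final thresholding step, assuming an EEND-EDA style decoder has produced one existence probability per attractor (speaker); the variable names and the 0.5 default are illustrative, and as noted above the threshold is tuned per channel/device.

```python
import numpy as np


def active_speakers(attractor_probs, threshold=0.5):
    """Return indices of attractors whose score exceeds the threshold;
    their count is the estimated number of speakers in the merged audio."""
    probs = np.asarray(attractor_probs)
    return np.flatnonzero(probs >= threshold)

# active_speakers([0.97, 0.88, 0.12]) -> array([0, 1]), i.e. two speakers
```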
Alternatively, the loss function corresponding to the sound source recognition model may be the binary cross-entropy loss function, as shown in equation (1):

$$\mathcal{L} = -\frac{1}{TS}\sum_{t=1}^{T}\sum_{s=1}^{S}\left[\, y_{t,s}\log\hat{y}_{t,s} + (1-y_{t,s})\log\bigl(1-\hat{y}_{t,s}\bigr)\right] \tag{1}$$

where $\hat{y}_{t,s}$ is the probability predicted by the model that the sample is a positive case, and $y_{t,s}$ is the sample label, taking the value 1 if the sample is a positive case and 0 otherwise.
Through this embodiment, sound source separation amplifies the audio of the same object, so the obtained result is more credible, and the sound source recognition model improves the effect of non-registered voiceprint recognition.
In an exemplary embodiment, after performing the sound source separation process on the plurality of pieces of speech data to obtain at least one target speech group, the method further includes:
s31, clustering multiple pieces of voice data to obtain at least one reference voice group, wherein each reference voice group comprises voice data of the same cluster obtained by clustering in the multiple pieces of voice data;
and S32, updating the at least one target voice group by using the at least one reference voice group to obtain at least one updated target voice group.
If the voice groups obtained from sound source separation contain inaccurate groupings, the voiceprint features extracted from the voice data in those groups cannot accurately represent a user, and identity recognition using those features is therefore inaccurate. To at least partially solve this problem, this embodiment combines sound source separation with voice clustering and uses the clustering result to update the sound source separation result, for example by discarding disputed voice data, thereby improving the accuracy of voiceprint feature extraction.
the server may perform clustering processing on the plurality of pieces of voice data to obtain a plurality of clusters, each cluster may include at least one piece of voice data, thereby obtaining at least one reference voice group, and each reference voice group in the at least one reference voice group may include at least one piece of voice data belonging to the same cluster obtained by clustering among the plurality of pieces of voice data.
The at least one target speech group may be updated using the clustered at least one reference speech group to obtain an updated at least one target speech group. The update operation of the voice group may be: and eliminating disputed voice data in each target voice group, wherein the disputed voice data refers to voice data with inconsistent sound source separation and distance results.
Optionally, after obtaining the at least one reference speech group, the at least one reference speech group may be updated by using the at least one target speech group to obtain the updated at least one reference speech group, and the voiceprint feature extraction is performed on the speech data in each reference speech group. It is also considered that the target voice group is obtained by clustering plural pieces of voice data, and the reference voice group is obtained by performing sound source separation processing on plural pieces of voice data, and other operations are similar to those in the foregoing embodiment. In addition, the order of performing the sound source separation processing on the plurality of pieces of voice data and performing the clustering processing on the plurality of pieces of voice data may be arbitrary, which is not limited in this embodiment.
For example, the method can be based on a t-SNE (t-distributed Stochastic Neighbor Embedding) algorithm, which mainly uses KL (Kullback-Leibler) divergence to perform a loss function iteration, so that points close to each other become closer, and points farther away from each other become farther, thereby solving the congestion problem in the clustering process. the corresponding loss function for t-SNE is shown in equation (2):
$$C = \sum_{i=1}^{n} \mathrm{KL}(P_i \,\|\, Q_i) = \sum_{i=1}^{n}\sum_{j} p_{j|i} \log\frac{p_{j|i}}{q_{j|i}} \tag{2}$$

where $p_{j|i}$ denotes the similarity between data points i and j in the high-dimensional space, $q_{j|i}$ correspondingly denotes their similarity in the low-dimensional space, and C sums the divergence distances over all n points.
When the clustering algorithm interface is called to process the pieces of voice data, two reference voice groups may be obtained: reference voice group one (containing voice data 1 and 2) corresponds to cluster one, and reference voice group two (containing voice data 3, 4 and 6) corresponds to cluster two. If the target voice groups obtained through the sound source separation algorithm interface are target voice group one (containing voice data 1, 2 and 4) and target voice group two (containing voice data 3, 5 and 6), the target voice groups may be updated using the reference voice groups, and the updated target voice groups are: target voice group one (containing voice data 1 and 2) and target voice group two (containing voice data 3 and 6).
In this non-registered mode, a user's audio files are accumulated, and a core feature result for each cluster is obtained by combining the sound source separation and clustering algorithms; that core feature result serves as the characteristic feature of a particular speaker. When other audio to be classified arrives, it is compared against this feature to judge whether it comes from that speaker.
Through this embodiment, determining the grouping of the voice data by combining sound source separation with voice clustering improves the accuracy of voiceprint feature extraction.
In one exemplary embodiment, performing a clustering process on a plurality of pieces of speech data to obtain at least one reference speech group comprises:
s41, performing dimensionality reduction on each piece of voice data to obtain each piece of voice data subjected to dimensionality reduction;
and S42, clustering the plurality of pieces of voice data according to the distance between each piece of voice data to obtain at least one reference voice group.
In this embodiment, when clustering the pieces of voice data, dimension reduction may first be performed on each piece to obtain the dimension-reduced voice data, and a clustering algorithm interface is then called to cluster the pieces, obtaining at least one reference voice group.
Illustratively, the t-SNE clustering algorithm can be adopted. t-SNE is an unsupervised clustering algorithm that reduces the dimensionality of high-dimensional data while updating the reduced data so that the mutual distances between points remain close to those in the original data; it therefore classifies samples well after dimension reduction. The clustering is implemented mainly by inputting the pieces of voice data, for example 6 sentences, into the clustering algorithm at the same time and receiving the clustered result (the at least one reference voice group) returned by the algorithm interface.
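A minimal sketch of steps S41/S42 with scikit-learn follows. The choice of DBSCAN for the distance-based clustering step is an assumption; the text specifies only that clustering proceeds according to the distances between the pieces of voice data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.manifold import TSNE


def cluster_utterances(features, perplexity=2.0, eps=3.0):
    """features: (n_utterances, n_dims) array of per-utterance features."""
    # perplexity must be below the sample count; with ~6 accumulated
    # utterances a small value is required.
    low_dim = TSNE(n_components=2, perplexity=perplexity,
                   init="random", random_state=0).fit_transform(features)
    labels = DBSCAN(eps=eps, min_samples=2).fit_predict(low_dim)
    groups = {}
    for idx, label in enumerate(labels):
        if label != -1:  # -1 marks noise points that joined no cluster
            groups.setdefault(label, []).append(idx)
    return groups  # {cluster label: [utterance indices]}
```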
Through this embodiment, performing dimension reduction on the voice data achieves a good classification effect and improves the accuracy of clustering the voice data.
In an exemplary embodiment, updating the at least one target speech group with the at least one reference speech group to obtain an updated at least one target speech group comprises:
s51, executing the following operations on each target voice group to obtain at least one updated target voice group, wherein each target voice group is the current voice group when the following steps are executed:
determining a matching voice group matched with the current voice group in at least one reference voice group, wherein the ratio of the number of the same voice data contained in the matching voice group and the current voice group to the total number of the voice data in the current voice group is greater than or equal to a target ratio threshold value;
under the condition that the number of the same voice data contained in the matching voice group and the current voice group is more than or equal to two, removing the voice data which do not belong to the matching voice group in the current voice group to obtain an updated current voice group;
in the case where the number of the same voice data included in the matching voice group and the current voice group is less than two, the current voice group is removed.
In this embodiment, when updating a target voice group, a matching voice group in the at least one reference voice group is determined, where the ratio of the number of pieces of voice data shared by the matching voice group and the current voice group to the total number of pieces in the current voice group is greater than or equal to a target ratio threshold (e.g., 50%). If the matching voice group and the current voice group share two or more pieces of voice data, the voice data in the current voice group that does not belong to the matching voice group is removed, yielding the updated current voice group; if they share fewer than two pieces, the current voice group is removed.
That is, the matching voice group is determined based on the number of matched pieces of voice data; the voice data common to the matching voice group and the current voice group is retained and the rest rejected. Current voice groups that share fewer than two pieces of voice data with their matching voice group are rejected entirely, and the remaining current voice groups are used as the target voice groups, as shown in the sketch below.
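The sketch implements the update rule just described; both inputs are lists of sets of utterance indices, and `ratio` is the target ratio threshold.

```python
def update_target_groups(target_groups, reference_groups, ratio=0.5):
    updated = []
    for current in target_groups:
        # The matching reference group is the one sharing the most utterances.
        match = max(reference_groups, key=lambda ref: len(ref & current),
                    default=set())
        common = match & current
        if len(common) / len(current) < ratio:
            continue  # no reference group matches: discard as disputed
        if len(common) >= 2:
            updated.append(common)  # keep only the agreed-upon utterances
        # fewer than two shared utterances: the whole group is removed
    return updated

# With separation groups {1,2,4}, {3,5,6} and cluster groups {1,2}, {3,4,6}:
# update_target_groups([{1,2,4}, {3,5,6}], [{1,2}, {3,4,6}]) -> [{1,2}, {3,6}]
# which matches the worked example given earlier.
```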
Through this embodiment, combining the sound source separation result with the clustering result and discarding disputed audio makes the screened features better match the audio features that actually need to be classified for the user, ensuring the accuracy of the voiceprint recognition result.
In an exemplary embodiment, before the obtaining of the pieces of voice data to be processed, the method further includes:
s61, receiving first voice data sent by a first intelligent device in a group of intelligent devices, wherein the first voice data is voice data of which matched object information is not identified through voiceprint features;
s62, under the condition that the first voice data meets a voice screening condition, determining the first voice data as a piece of voice data to be processed, wherein the voice screening condition comprises at least one of the following conditions: the number of words recognized from the first voice data is greater than or equal to the target number of words, and the signal-to-noise ratio of the first voice data is greater than or equal to a preset signal-to-noise ratio.
In this embodiment, the acquired pieces of voice data all satisfy a voice screening condition, which is the condition used to screen the accumulated voice data and may include, but is not limited to, at least one of the following: the number of words recognized from the first voice data is greater than or equal to the target word count, and the signal-to-noise ratio of the first voice data is greater than or equal to a preset signal-to-noise ratio.
For a first intelligent device in a group of intelligent devices, a user may send a voice control instruction (or a voice interaction instruction) to the first intelligent device, the first intelligent device may collect first voice data carrying the voice control instruction and send the first voice data to a server, and the server may receive the first voice data sent by the first intelligent device.
After receiving the first voice data, the server may extract voiceprint features of the first voice data and match the voiceprint features with existing voiceprint features on the server. The existing voiceprint features may be voiceprint features in a voiceprint feature library, which may include the entered voiceprint features, i.e., the voiceprint features into which corresponding object information has been entered. At this time, the existing voiceprint features on the server do not include a voiceprint feature for which corresponding object information is not entered.
If a matching voiceprint feature exists among the existing voiceprint features and has corresponding object information (i.e., matched object information), the first voice data can be processed based on that object information, e.g., the intent of the first voice data is identified and responded to; when interaction with the voice object of the first intelligent device is needed, an interactive sentence is generated according to the matched object information and broadcast to the voice object through TTS (Text To Speech).
If there is no matching voiceprint feature for the existing voiceprint feature, it may be determined that object information matching the first speech data is not recognized through the voiceprint feature. In this case, it may be determined whether the first voice data satisfies a voice screening condition, for example, whether the number of recognized words in the first voice data is greater than or equal to a target word number, and whether a signal-to-noise ratio of the first voice data is greater than or equal to a preset signal-to-noise ratio. If the voice screening condition is satisfied, the first voice data may be determined as a piece of voice data to be processed.
Optionally, if the existing voiceprint features on the server include voiceprint features for which corresponding object information has not been entered, the voiceprint feature of the first voice data may be compared with those features, and it may be determined from the comparison result whether there is a voiceprint feature, without entered object information, that matches the voiceprint feature of the first voice data. If there is not, whether the first voice data satisfies the voice screening condition may be judged in a similar manner. If there is, object information of the voice object of the first intelligent device may be acquired through the first intelligent device, and the correspondence between the acquired object information and the matched voiceprint feature (which previously had no entered object information) may be saved.
For example, when using a smart home system, a user may interact by voice with the smart home devices in use, and the voice data generated during interaction may be temporarily retained in the cloud on a per-family basis (i.e., the voice file or audio file is saved). For each voice interaction, the cloud evaluates the number of words in the recognition result and the signal-to-noise ratio of the voice data, and filters the voice data accordingly: if the signal-to-noise ratio is negative, or fewer than 4 words are recognized, the cloud may delete the voice data.
Through this embodiment, voice data whose recognized word count reaches the word-count threshold and whose signal-to-noise ratio reaches the preset signal-to-noise ratio is selected as the accumulated voice data, which can improve the accuracy of voiceprint feature extraction.
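As an illustrative, non-limiting sketch of the screening condition described above, a cloud-side filter may be organized along the following lines. The data structure and the threshold values TARGET_WORD_COUNT and MIN_SNR_DB are assumptions for illustration only (chosen to match the "negative SNR or fewer than 4 words" example); the embodiments do not prescribe a specific implementation.

```python
# Sketch of the voice screening condition; threshold values are assumptions.
from dataclasses import dataclass

TARGET_WORD_COUNT = 4  # target number of words (assumed value)
MIN_SNR_DB = 0.0       # preset signal-to-noise ratio (assumed value)

@dataclass
class VoiceData:
    audio: bytes          # raw audio collected by the smart device
    recognized_text: str  # speech recognition result for the utterance
    snr_db: float         # estimated signal-to-noise ratio in dB

def satisfies_screening_condition(voice: VoiceData) -> bool:
    """Keep only utterances long and clean enough for voiceprint accumulation."""
    word_count = len(voice.recognized_text.split())
    return word_count >= TARGET_WORD_COUNT and voice.snr_db >= MIN_SNR_DB

def accumulate(pending: list[VoiceData], incoming: VoiceData) -> None:
    # Unmatched voice data is retained as "to be processed" only if it passes
    # screening; otherwise the cloud may simply delete it.
    if satisfies_screening_condition(incoming):
        pending.append(incoming)
```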
In an exemplary embodiment, after performing a voiceprint feature extraction operation on the voice data in each target voice group to obtain at least one voiceprint feature, the method further comprises:
S71, receiving second voice data sent by a second intelligent device in the group of intelligent devices, wherein the second voice data is voice data sent by a use object of the second intelligent device;
S72, under the condition that the voiceprint feature of the second voice data matches a target voiceprint feature in the at least one voiceprint feature, acquiring object information of the use object through the second intelligent device;
and S73, storing the object information of the use object in correspondence with the target voiceprint feature.
Each voiceprint feature can be stored in a database so that it can be compared with the voiceprint feature of voice data collected by any intelligent device in the group of intelligent devices; based on the comparison result, it is determined whether object information corresponding to a successfully compared voiceprint feature needs to be acquired.
In this embodiment, the server may receive second voice data sent by a second smart device in the group of smart devices, where the second voice data is voice data sent by a use object of the second smart device. After receiving the second voice data, the server may extract voiceprint features of the second voice data and match the voiceprint features with existing voiceprint features on the server. The manner in which the second intelligent device collects and sends the second voice data and the manner in which the server performs voiceprint feature matching are similar to those in the foregoing embodiment, and details are not repeated here.
If the existing voiceprint features contain a matching voiceprint feature that has corresponding object information, or if they contain no matching voiceprint feature, the second voice data may be processed in a manner similar to that in the foregoing embodiment, which has already been described and is not repeated here.
If the existing voiceprint features contain a matching voiceprint feature that has no corresponding object information, for example, when the voiceprint feature of the second voice data matches a target voiceprint feature in the at least one voiceprint feature, the object information of the use object can be acquired through the second intelligent device, so that a personalized voice interaction service can be provided for the use object. Optionally, the second intelligent device may generate guidance information for setting object information, for example guiding the use object to set a nickname; age, preferences, and the like may also be set.
If the object information of the use object is acquired, the object information and the target voiceprint feature can be saved with a correspondence between them. When the second voice data, or subsequent voice data of the use object, is then processed, the intent of that voice data is recognized, and the voice data is responded to based on the recognized intent.
For example, when a user speaks, the audio is compared with the accumulated audio features described above; when the speaker is considered to be the same person, the user can be guided to set a nickname. The system may also ask whether the speaker is that user, and when the answer confirms it, the voiceprint enrollment process is completed. When the speaker is not considered to be the same person, audio can be accumulated again, and the voiceprint feature extraction process is repeated.
Through this embodiment, collected voice data is compared by matching voiceprint features; when a matching voiceprint feature is recognized but the matching feature has no corresponding object information, the user is guided to enter object information. This improves the timeliness and convenience of acquiring object information, allows users to enroll voiceprint features without perceiving the process, and improves the user experience.
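Steps S71 to S73 may be sketched as follows. The similarity measure, the matching threshold, and the device.ask_for_object_info prompt interface are all hypothetical stand-ins, since the embodiments do not specify them.

```python
# Sketch of steps S71-S73: match the voiceprint feature of second voice data
# against accumulated features and, on a match that lacks object information,
# guide the use object to enter it. All interfaces here are assumptions.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def handle_second_voice_data(voiceprint: np.ndarray,
                             pending_features: dict[int, np.ndarray],
                             object_info_store: dict[int, dict],
                             device,
                             threshold: float = 0.8) -> None:
    for feature_id, feature in pending_features.items():
        if cosine_similarity(voiceprint, feature) >= threshold:
            if feature_id not in object_info_store:
                # Guide the use object to set a nickname, age, preferences, etc.
                info = device.ask_for_object_info("Please set a nickname")
                if info:
                    # Save the correspondence between the object information
                    # and the target voiceprint feature.
                    object_info_store[feature_id] = info
            return
    # No match: the audio may be accumulated again and the voiceprint
    # feature extraction process repeated.
```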
It should be noted that for simplicity of description, the above-mentioned embodiments of the method are described as a series of acts, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., a ROM (Read-Only Memory)/RAM (Random Access Memory), a magnetic disk, an optical disk) and includes several instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the methods according to the embodiments of the present application.
According to another aspect of the embodiment of the application, a digital twin voiceprint feature processing device for implementing the digital twin voiceprint feature processing method is further provided. Fig. 4 is a block diagram of an optional digital twin voiceprint feature processing apparatus according to an embodiment of the present disclosure, and as shown in fig. 4, the apparatus may include:
a first obtaining unit 402, configured to obtain multiple pieces of voice data to be processed, where each piece of voice data is voice data that is collected by one intelligent device in a group of intelligent devices and matched object information is not identified through voiceprint features;
a first executing unit 404, connected to the first obtaining unit 402, configured to perform sound source separation processing on the multiple pieces of voice data to obtain at least one target voice group, where each target voice group includes at least two pieces of voice data belonging to a sound source in the multiple pieces of voice data;
and the second execution unit 406 is connected to the first execution unit 404, and is configured to perform a voiceprint feature extraction operation on the voice data in each target voice group to obtain at least one voiceprint feature, where the at least one voiceprint feature is a voiceprint feature of the corresponding object information to be determined.
It should be noted that the first obtaining unit 402 in this embodiment may be configured to execute the step S202, the first executing unit 404 in this embodiment may be configured to execute the step S204, and the second executing unit 406 in this embodiment may be configured to execute the step S206.
Through the above modules, multiple pieces of voice data to be processed are acquired, where each piece of voice data is voice data that is collected by one intelligent device in a group of intelligent devices and for which matched object information is not identified through voiceprint features; sound source separation processing is performed on the multiple pieces of voice data to obtain at least one target voice group, where each target voice group includes at least two pieces of voice data belonging to one sound source; and a voiceprint feature extraction operation is performed on the voice data in each target voice group to obtain at least one voiceprint feature whose corresponding object information is to be determined. This solves the problem in the related art that low accuracy of entered voiceprint features leads to poor accuracy of user identity recognition, and improves the accuracy of user identity recognition.
In one exemplary embodiment, the first execution unit includes:
the system comprises a first execution module, a second execution module and a third execution module, wherein the first execution module is used for executing sound source separation processing on a plurality of pieces of voice data to obtain at least one initial voice group, and each initial voice group comprises voice data belonging to one sound source in the plurality of pieces of voice data;
and the determining module is used for determining, in the at least one initial voice group, the voice groups whose number of pieces of voice data is greater than or equal to two as the at least one target voice group.
In one exemplary embodiment, the first execution module includes:
the execution submodule is used for executing voice combination operation on a plurality of pieces of voice data to obtain combined voice data;
the input submodule is used for inputting the merged voice data into the sound source recognition model to obtain a sound source recognition result output by the sound source recognition model, wherein the sound source recognition result is used for indicating a sound source to which each piece of voice data belongs;
and the determining submodule is used for determining the voice data which belongs to the same sound source and is indicated by the sound source identification result into an initial voice group to obtain at least one initial voice group.
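As a non-limiting sketch of the modules above, the merge, recognize, and group steps may be pictured as follows; the SoundSourceModel interface is a hypothetical stand-in for whatever sound source recognition model is used, which the embodiments do not name.

```python
# Sketch of the merge -> recognize -> group pipeline; the SoundSourceModel
# interface is an assumption, not an API specified by the embodiments.
from collections import defaultdict
from typing import Protocol

class SoundSourceModel(Protocol):
    def identify(self, merged_audio: list[bytes]) -> list[int]:
        """Return one sound source label per piece of voice data."""
        ...

def build_initial_groups(pieces: list[bytes],
                         model: SoundSourceModel) -> list[list[bytes]]:
    merged = list(pieces)            # stand-in for the voice merging operation
    labels = model.identify(merged)  # sound source recognition result
    groups: dict[int, list[bytes]] = defaultdict(list)
    for piece, label in zip(pieces, labels):
        groups[label].append(piece)  # same sound source -> same initial group
    return list(groups.values())

def select_target_groups(initial_groups: list[list[bytes]]) -> list[list[bytes]]:
    # Only initial groups with at least two pieces of voice data become
    # target groups.
    return [group for group in initial_groups if len(group) >= 2]
```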
In an exemplary embodiment, the apparatus further includes:
a third execution unit, configured to, after the sound source separation processing is performed on the multiple pieces of voice data to obtain the at least one target voice group, perform clustering processing on the multiple pieces of voice data to obtain at least one reference voice group, where each reference voice group includes voice data belonging to the same cluster obtained by clustering among the multiple pieces of voice data;
and the updating unit is used for updating the at least one target voice group by using the at least one reference voice group to obtain at least one updated target voice group.
In one exemplary embodiment, the third execution unit includes:
the processing module is used for carrying out dimensionality reduction processing on each piece of voice data to obtain each piece of voice data subjected to dimensionality reduction;
and the clustering module is used for clustering the plurality of pieces of voice data according to the distance between each piece of voice data to obtain at least one reference voice group.
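One possible reading of the processing module and the clustering module is sketched below; PCA and agglomerative clustering are illustrative stand-ins, since the embodiments do not fix a dimensionality reduction or clustering algorithm, and the parameter values are assumptions.

```python
# Sketch of dimensionality reduction followed by distance-based clustering;
# PCA and agglomerative clustering are assumed algorithm choices.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA

def build_reference_groups(embeddings: np.ndarray,
                           n_components: int = 32,
                           distance_threshold: float = 1.0) -> list[np.ndarray]:
    """embeddings: one feature vector per piece of voice data, shape (n, d)."""
    n_components = min(n_components, *embeddings.shape)
    reduced = PCA(n_components=n_components).fit_transform(embeddings)
    clustering = AgglomerativeClustering(n_clusters=None,
                                         distance_threshold=distance_threshold)
    labels = clustering.fit_predict(reduced)
    # Each reference group holds the indices of the voice data in one cluster.
    return [np.flatnonzero(labels == k) for k in np.unique(labels)]
```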
In one exemplary embodiment, the update unit includes:
a second execution module, configured to perform the following operations on each target speech group to obtain at least one updated target speech group, where each target speech group is a current speech group when the following steps are performed:
determining a matching voice group matched with the current voice group in at least one reference voice group, wherein the ratio of the number of the same voice data contained in the matching voice group and the current voice group to the total number of the voice data in the current voice group is greater than or equal to a target ratio threshold;
under the condition that the number of the same voice data contained in the matching voice group and the current voice group is more than or equal to two, removing the voice data which do not belong to the matching voice group in the current voice group to obtain an updated current voice group;
in the case where the number of the same voice data included in the matching voice group and the current voice group is less than two, the current voice group is removed.
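Viewed over sets of voice data indices, the update rule above may be sketched as follows; the target ratio threshold value of 0.5 is purely illustrative, and the treatment of a group with no matching reference group (removal) is an assumption.

```python
# Sketch of updating target groups against reference groups, with voice data
# represented by indices; the ratio threshold value is an assumption.
def update_target_groups(target_groups: list[set[int]],
                         reference_groups: list[set[int]],
                         ratio_threshold: float = 0.5) -> list[set[int]]:
    updated: list[set[int]] = []
    for current in target_groups:
        # A matching group shares at least ratio_threshold of the current
        # group's voice data.
        match = next((ref for ref in reference_groups
                      if len(ref & current) / len(current) >= ratio_threshold),
                     None)
        if match is None:
            continue  # no matching reference group: treat the group as removed
        overlap = match & current
        if len(overlap) >= 2:
            updated.append(overlap)  # drop data not in the matching group
        # Fewer than two shared pieces: the current voice group is removed.
    return updated
```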
In an exemplary embodiment, the apparatus further includes:
the first receiving unit is used for receiving first voice data sent by a first intelligent device in a group of intelligent devices before a plurality of pieces of voice data to be processed are obtained, wherein the first voice data are voice data which are not identified and matched and have recorded voiceprint characteristics;
the determining unit is used for determining the first voice data as a piece of voice data to be processed under the condition that the first voice data meets voice screening conditions, wherein the voice screening conditions comprise at least one of the following conditions: the number of words recognized from the first voice data is greater than or equal to the target number of words, and the signal-to-noise ratio of the first voice data is greater than or equal to a preset signal-to-noise ratio.
In an exemplary embodiment, the apparatus further comprises:
the second receiving unit is used for receiving second voice data sent by a second intelligent device in a group of intelligent devices after voiceprint feature extraction operation is carried out on the voice data in each target voice group to obtain at least one voiceprint feature, wherein the second voice data are voice data sent by a use object of the second intelligent device;
a second obtaining unit, configured to obtain, through the second intelligent device, object information of the use object when the voiceprint feature of the second voice data matches a target voiceprint feature of the at least one voiceprint feature;
and the storage unit is used for storing the object information of the use object with the corresponding relation and the target voiceprint characteristics.
It should be noted here that the above modules implement the same examples and application scenarios as the corresponding steps, but are not limited to the disclosure of the above embodiments. It should also be noted that the above modules, as a part of the apparatus, may run in a hardware environment as shown in fig. 1, and may be implemented by software or by hardware, where the hardware environment includes a network environment.
According to still another aspect of an embodiment of the present application, there is also provided a storage medium. Optionally, in this embodiment, the storage medium may be used to store program code for executing any one of the digital twin voiceprint feature processing methods described in the embodiments of the present application.
Optionally, in this embodiment, the storage medium may be located on at least one of a plurality of network devices in a network shown in the above embodiment.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the steps of:
S1, acquiring a plurality of pieces of voice data to be processed, wherein each piece of voice data is voice data which is acquired by one intelligent device in a group of intelligent devices and matched object information is not identified through voiceprint characteristics;
S2, sound source separation processing is carried out on the plurality of pieces of voice data to obtain at least one target voice group, wherein each target voice group comprises at least two pieces of voice data belonging to one sound source in the plurality of pieces of voice data;
and S3, performing voiceprint feature extraction operation on the voice data in each target voice group to obtain at least one voiceprint feature, wherein the at least one voiceprint feature is the voiceprint feature of the corresponding object information to be determined.
Optionally, the specific example in this embodiment may refer to the example described in the above embodiment, which is not described again in this embodiment.
Optionally, in this embodiment, the storage medium may include, but is not limited to: a USB flash drive, a ROM, a RAM, a removable hard disk, a magnetic disk, an optical disk, and the like.
According to still another aspect of the embodiments of the present application, there is also provided an electronic device for implementing the above digital twin voiceprint feature processing method, which may be a server, a terminal, or a combination thereof.
Fig. 5 is a block diagram of an alternative electronic device according to an embodiment of the present application. As shown in fig. 5, the electronic device includes a processor 502, a communication interface 504, a memory 506, and a communication bus 508, where the processor 502, the communication interface 504, and the memory 506 communicate with each other via the communication bus 508, and where:
a memory 506 for storing a computer program;
the processor 502, when executing the computer program stored in the memory 506, implements the following steps:
S1, acquiring a plurality of pieces of voice data to be processed, wherein each piece of voice data is voice data which is acquired by one intelligent device in a group of intelligent devices and matched object information is not identified through voiceprint characteristics;
S2, sound source separation processing is carried out on the plurality of pieces of voice data to obtain at least one target voice group, wherein each target voice group comprises at least two pieces of voice data belonging to one sound source in the plurality of pieces of voice data;
and S3, performing voiceprint feature extraction operation on the voice data in each target voice group to obtain at least one voiceprint feature, wherein the at least one voiceprint feature is the voiceprint feature of the corresponding object information to be determined.
Alternatively, the communication bus may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 5, but this is not intended to represent only one bus or type of bus. The communication interface is used for communication between the electronic device and other equipment.
The memory may include RAM, and may also include non-volatile memory (non-volatile memory), such as at least one disk memory. Alternatively, the memory may be at least one memory device located remotely from the processor.
As an example, the memory 506 may include, but is not limited to, the first obtaining unit 402, the first executing unit 404, and the second executing unit 406 in the digital twin voiceprint feature processing apparatus. In addition, but not limited to, other module units in the digital twin voiceprint feature processing apparatus may also be included, and are not described in detail in this example.
The processor may be a general-purpose processor, and may include but is not limited to: a CPU (Central Processing Unit), an NP (Network Processor), and the like; it may also be a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments, and this embodiment is not described herein again.
It can be understood by those skilled in the art that the structure shown in fig. 5 is only illustrative, and the device implementing the digital twin voiceprint feature processing method may be a terminal device, such as a smartphone (e.g., an Android phone, an iOS phone), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), or a PAD. Fig. 5 does not limit the structure of the electronic device; for example, the electronic device may also include more or fewer components (e.g., a network interface, a display device) than shown in fig. 5, or have a different configuration from that shown in fig. 5.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disk, ROM, RAM, magnetic or optical disk, and the like.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
The integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in the above computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including instructions for causing one or more computer devices (which may be personal computers, servers, network devices, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application.
In the above embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, and may also be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution provided in the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or at least two units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.
The foregoing is only a preferred embodiment of the present application and it should be noted that, as will be apparent to those skilled in the art, numerous modifications and adaptations can be made without departing from the principles of the present application and such modifications and adaptations are intended to be considered within the scope of the present application.

Claims (10)

1. A digital twin voiceprint feature processing method is characterized by comprising the following steps:
acquiring a plurality of pieces of voice data to be processed, wherein each piece of voice data is acquired by one intelligent device in a group of intelligent devices and matched object information is not identified through voiceprint features;
performing sound source separation processing on the plurality of pieces of voice data to obtain at least one target voice group, wherein each target voice group comprises at least two pieces of voice data belonging to one sound source in the plurality of pieces of voice data;
and performing voiceprint feature extraction operation on the voice data in each target voice group to obtain at least one voiceprint feature, wherein the at least one voiceprint feature is the voiceprint feature of the corresponding object information to be determined.
2. The method according to claim 1, wherein said performing a sound source separation process on the plurality of pieces of speech data to obtain at least one target speech group comprises:
performing sound source separation processing on the plurality of pieces of voice data to obtain at least one initial voice group, wherein each initial voice group comprises voice data belonging to one sound source in the plurality of pieces of voice data;
and determining, in the at least one initial voice group, the voice groups whose number of pieces of voice data is greater than or equal to two as the at least one target voice group.
3. The method of claim 2, wherein said performing a sound source separation process on said plurality of pieces of speech data to obtain at least one initial speech group comprises:
performing a voice merging operation on the plurality of pieces of voice data to obtain merged voice data;
inputting the merged voice data into a sound source recognition model to obtain a sound source recognition result output by the sound source recognition model, wherein the sound source recognition result is used for indicating a sound source to which each piece of voice data belongs;
and determining the voice data which belongs to the same sound source and is indicated by the sound source identification result as an initial voice group to obtain at least one initial voice group.
4. The method according to claim 1, wherein after said performing a sound source separation process on said plurality of pieces of speech data to obtain at least one target speech group, said method further comprises:
performing clustering processing on the plurality of pieces of voice data to obtain at least one reference voice group, wherein each reference voice group comprises voice data which belong to the same cluster obtained by clustering in the plurality of pieces of voice data;
and updating the at least one target voice group by using the at least one reference voice group to obtain the at least one updated target voice group.
5. The method of claim 4, wherein the clustering the plurality of pieces of speech data to obtain at least one reference speech group comprises:
performing dimensionality reduction processing on each piece of voice data to obtain each piece of voice data subjected to dimensionality reduction;
and clustering the plurality of pieces of voice data according to the distance between each piece of voice data to obtain the at least one reference voice group.
6. The method of claim 4, wherein said updating the at least one target speech group using the at least one reference speech group to obtain the updated at least one target speech group comprises:
executing the following operations on each target voice group to obtain the updated at least one target voice group, wherein each target voice group is the current voice group when the following steps are executed:
determining a matching voice group which is matched with the current voice group in the at least one reference voice group, wherein the ratio of the number of the same voice data contained in the matching voice group and the current voice group to the total number of the voice data in the current voice group is greater than or equal to a target ratio threshold;
under the condition that the number of the same voice data contained in the matching voice group and the current voice group is more than or equal to two, removing the voice data which do not belong to the matching voice group in the current voice group to obtain an updated current voice group;
and removing the current voice group when the number of the same voice data contained in the matching voice group and the current voice group is less than two.
7. The method according to claim 1, wherein before the obtaining of the pieces of speech data to be processed, the method further comprises:
receiving first voice data sent by a first intelligent device in the group of intelligent devices, wherein the first voice data is voice data of which matched object information is not identified through voiceprint features;
determining the first voice data as a piece of voice data to be processed under the condition that the first voice data meets voice screening conditions, wherein the voice screening conditions comprise at least one of the following conditions: the number of words recognized from the first voice data is larger than or equal to a target number of words, and the signal-to-noise ratio of the first voice data is larger than or equal to a preset signal-to-noise ratio.
8. The method according to any one of claims 1 to 7, wherein after said performing a voiceprint feature extraction operation on the speech data in each target speech group, resulting in at least one voiceprint feature, the method further comprises:
receiving second voice data sent by a second intelligent device in the group of intelligent devices, wherein the second voice data is voice data sent by a use object of the second intelligent device;
under the condition that the voiceprint features of the second voice data are matched with target voiceprint features in the at least one voiceprint feature, acquiring object information of the using object through the second intelligent device;
and saving the object information of the use object with the corresponding relation and the target voiceprint characteristics.
9. A computer-readable storage medium, comprising a stored program, wherein the program when executed performs the method of any of claims 1 to 8.
10. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method of any of claims 1 to 8 by means of the computer program.
CN202210603562.XA 2022-05-30 2022-05-30 Digital twin voiceprint feature processing method, storage medium and electronic device Pending CN115171702A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210603562.XA CN115171702A (en) 2022-05-30 2022-05-30 Digital twin voiceprint feature processing method, storage medium and electronic device

Publications (1)

Publication Number Publication Date
CN115171702A true CN115171702A (en) 2022-10-11

Family

ID=83483584

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210603562.XA Pending CN115171702A (en) 2022-05-30 2022-05-30 Digital twin voiceprint feature processing method, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN115171702A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1462428A (en) * 2001-03-30 2003-12-17 索尼公司 Sound processing apparatus
KR20040078460A (en) * 2003-03-04 2004-09-10 삼성전자주식회사 Method of removing the inferior speech synthesis units to improve naturalness of the synthetic speech
CN109741754A (en) * 2018-12-10 2019-05-10 上海思创华信信息技术有限公司 A kind of conference voice recognition methods and system, storage medium and terminal
CN111179940A (en) * 2018-11-12 2020-05-19 阿里巴巴集团控股有限公司 Voice recognition method and device and computing equipment
CN113299276A (en) * 2021-05-25 2021-08-24 北京捷通华声科技股份有限公司 Multi-person multi-language identification and translation method and device
CN114333852A (en) * 2022-01-07 2022-04-12 厦门快商通科技股份有限公司 Multi-speaker voice and human voice separation method, terminal device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination