CN116312503A - Voice data recognition method and device, chip and electronic equipment

Voice data recognition method and device, chip and electronic equipment

Info

Publication number
CN116312503A
CN116312503A (application CN202310155884.7A)
Authority
CN
China
Prior art keywords
voice
data
voice data
target user
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310155884.7A
Other languages
Chinese (zh)
Inventor
杨毅 (Yang Yi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zeku Technology Shanghai Corp Ltd
Original Assignee
Zeku Technology Shanghai Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zeku Technology Shanghai Corp Ltd filed Critical Zeku Technology Shanghai Corp Ltd
Priority to CN202310155884.7A priority Critical patent/CN116312503A/en
Publication of CN116312503A publication Critical patent/CN116312503A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0631 Creating reference templates; Clustering
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command
    • G10L17/00 Speaker identification or verification
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/20 Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G10L25/21 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application discloses a voice data recognition method, which includes the following steps: performing speaker identification extraction on original voice data to obtain the voice identifier of a target user of the original voice data; when the voice identifier of the target user is registered locally, inputting the original voice data and the voice identifier of the target user into a pre-trained speaker voice extraction model to obtain target voice data corresponding to the target user; and performing voice recognition on the target voice data to obtain a recognition result of the target voice data. The embodiment of the application also provides a voice data recognition device, a chip and an electronic device.

Description

Voice data recognition method and device, chip and electronic equipment
Technical Field
The present disclosure relates to voice data recognition technologies, and in particular, to a method, an apparatus, a chip, and an electronic device for recognizing voice data.
Background
As the level of technology improves, a variety of advanced technologies have made everyday life more intelligent, particularly in voice wake-up scenarios such as smart speakers and the voice assistants of terminal devices.
In the related art, wake-up mainly relies on recognizing keywords in given voice data: when there is no external voice, the device remains in a low-power listening state, and when the received voice data is a wake-up phrase, the device wakes up and can then perform interactive work. However, such voice wake-up only maintains a good wake-up rate in a clean environment, and the speaker's voice has to be sufficiently clean.
At present, keyword-based voice wake-up achieves a good wake-up rate in a clean and stable environment, but the false wake-up rate increases greatly when the environment is very noisy or several speakers are present; similarly, after a terminal device is woken up by voice, voice dialogue is hard to sustain and the recognition rate decreases in a noisy or multi-speaker environment. It can therefore be seen that the recognition results obtained by existing voice data recognition methods suffer from low accuracy.
Disclosure of Invention
The embodiments of the application provide a voice data recognition method, a voice data recognition device, a chip and an electronic device, which can improve the accuracy of the recognition result obtained by the voice data recognition method.
The technical solutions of the application are implemented as follows:
in a first aspect, an embodiment of the present application provides a method for identifying voice data, including:
extracting speaker identification from original voice data to obtain a voice identification of a target user of the original voice data;
under the condition that the voice identifier of the target user is registered locally, inputting the original voice data and the voice identifier of the target user into a pre-trained speaker voice extraction model to obtain target voice data corresponding to the target user;
and carrying out voice recognition on the target voice data to obtain a recognition result of the target voice data.
In a second aspect, an embodiment of the present application provides a device for recognizing voice data, including:
the first extraction module is used for extracting speaker identification from the original voice data to obtain the voice identification of a target user of the original voice data;
the second extraction module is used for inputting the original voice data and the voice identifier of the target user into a pre-trained speaker voice extraction model under the condition that the voice identifier of the target user is registered locally, so as to obtain target voice data corresponding to the target user;
and the recognition module is used for carrying out voice recognition on the target voice data to obtain a recognition result of the target voice data.
In a third aspect, an embodiment of the present application provides a chip, including: a processor, configured to call and run a computer program from a memory, so that a device on which the chip is mounted performs the voice data recognition method according to one or more of the above embodiments.
In a fourth aspect, embodiments of the present application provide an electronic device, including:
a processor and a storage medium storing instructions executable by the processor, which when executed by the processor, cause the electronic device to perform the method of recognizing speech data according to one or more embodiments described above.
In a fifth aspect, embodiments of the present application provide a computer storage medium storing executable instructions that, when executed by one or more processors, perform the method for recognizing voice data according to one or more embodiments described above.
The embodiments of the application provide a voice data recognition method, a voice data recognition device, a chip and an electronic device. The voice data recognition method includes: performing speaker identification extraction on original voice data to obtain the voice identifier of a target user of the original voice data; when the voice identifier of the target user is registered locally, inputting the original voice data and the voice identifier of the target user into a pre-trained speaker voice extraction model to obtain target voice data corresponding to the target user; and performing voice recognition on the target voice data to obtain a recognition result of the target voice data. That is, in this embodiment, speaker identification extraction is first performed on the original voice data before voice recognition, so that the voice identifier of the target user of the original voice data can be obtained; after the voice identifier of the target user is obtained, the original voice data and the voice identifier of the target user are input into the trained speaker voice extraction model only when the voice identifier of the target user is registered locally, so that the voice data corresponding to the target user, i.e. the target voice data, is obtained, and voice recognition is then performed on the target voice data. In this way, the voice data of the target user is extracted from the original voice data through speaker identification extraction and the trained speaker voice extraction model, which reduces the influence of environmental noise and other speakers and improves the accuracy of the recognition result.
Drawings
Fig. 1 is a flow chart of an alternative method for recognizing voice data according to an embodiment of the present application;
FIG. 2 is a flow chart of a method for recognizing voice data in the related art;
fig. 3 is a flowchart of an example one of an alternative voice data recognition method according to an embodiment of the present application;
fig. 4 is a flowchart illustrating an example two of an alternative voice data recognition method according to an embodiment of the present application;
fig. 5 is a flowchart illustrating an example three of an alternative voice data recognition method according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an alternative voice data recognition device according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an alternative chip according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an alternative electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
An embodiment of the present application provides a method for identifying voice data, and fig. 1 is a schematic flow chart of an alternative method for identifying voice data provided in the embodiment of the present application, as shown in fig. 1, the method for identifying voice data may include:
S101: extracting speaker identification from the original voice data to obtain the voice identification of a target user of the original voice data;
At present, when an electronic device collects voice data and recognizes it, the collected voice data is generally not the pure voice data of a single user; it usually also contains environmental sound, and/or the voices of several users including the user, i.e. it is mixed voice data. In the related art, when such mixed voice data is recognized, the accuracy of the recognition result is poor because of heavy noise in the environment or the presence of several speakers.
In order to improve the accuracy of the recognition result of voice data, the embodiment of the application provides a voice data recognition method that performs speaker identification extraction on the acquired original voice data, where the original voice data may be pure voice data of the target speaker, voice data containing the target speaker and environmental noise, or voice data containing the target speaker and other speakers.
Here, the original voice data may be collected by the electronic device through its receiver, or may be sent by another electronic device; this is not specifically limited in the embodiment of the present application.
After the original voice data is obtained, speaker identification extraction is performed on it, so that an aggregate embedding vector can be extracted. This vector is called the speaker identification (Speaker Identification, Speaker ID), i.e. the voice identifier in the embodiments of the present application, and is used to distinguish the voices of different speakers/users. Therefore, by performing speaker identification extraction on the original voice data, the voice identifier of the target user of the original voice data can be obtained; the voice identifier of the target user identifies the voice of the target user.
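For illustration only, the following minimal sketch shows one way such an aggregate embedding vector could be computed from a recording; the `SpeakerEncoder` architecture, the mel-spectrogram front end (`mel_frontend`) and all layer sizes are assumptions introduced for this sketch and are not specified by this application.

```python
# Illustrative sketch: deriving an aggregate speaker embedding ("Speaker ID") from audio.
# The encoder architecture and feature front end are assumptions, not part of this application.
import numpy as np
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    def __init__(self, n_mels=40, emb_dim=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, 256, num_layers=2, batch_first=True)
        self.proj = nn.Linear(256, emb_dim)

    def forward(self, mel):                    # mel: (batch, frames, n_mels)
        out, _ = self.rnn(mel)
        emb = self.proj(out).mean(dim=1)       # average over frames -> aggregate vector
        return nn.functional.normalize(emb, dim=-1)

def extract_speaker_id(waveform: np.ndarray, encoder: SpeakerEncoder, mel_frontend) -> np.ndarray:
    """Return the voice identifier (embedding) of the speaker in one utterance."""
    mel = mel_frontend(waveform)               # hypothetical mel-spectrogram extractor
    with torch.no_grad():
        emb = encoder(torch.as_tensor(mel, dtype=torch.float32)[None])
    return emb.squeeze(0).numpy()
```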
S102: under the condition that the voice identification of the target user is registered locally, inputting the original voice data and the voice identification of the target user into a pre-trained speaker voice extraction model to obtain target voice data corresponding to the target user;
After the voice identifier of the target user is extracted in S101, it is further necessary to determine whether that voice identifier has already been registered locally. After it is determined that the voice identifier of the target user has been registered locally, the original voice data and the voice identifier of the target user are input into the pre-trained speaker voice extraction model, so that the target voice data corresponding to the voice identifier of the target user can be output.
When the voice identifier of the target user is not registered locally, the pre-trained speaker voice extraction model cannot process the original voice data. If further recognition is still required, the voice identifier of the target user first needs to be registered locally, after which the original voice data and the voice identifier of the target user are input into the pre-trained speaker voice extraction model to obtain the target voice data corresponding to the target user; if no further recognition is required, the original voice data is not processed.
In this way, the target voice data corresponding to the target user, i.e. the pure voice data of the target user, can be extracted from the original voice data; this voice data contains only the voice of the target user and does not contain environmental noise, the voices of other speakers, and so on.
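As a sketch only, the check-then-extract flow of S102 could be organized as below; the `registry` store, the `se_model` call signature and the cosine-similarity threshold are hypothetical details introduced for the example, not requirements of this application.

```python
# Sketch of S102: run the speaker voice extraction model only when the extracted
# Speaker ID matches a locally registered voice identifier (assumed threshold test).
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def extract_target_voice(raw_audio, speaker_id, registry, se_model, threshold=0.75):
    """registry: dict mapping user name -> locally registered voice identifier."""
    registered = any(cosine(speaker_id, emb) >= threshold for emb in registry.values())
    if not registered:
        return None                             # unregistered: raw audio is not processed
    return se_model(raw_audio, speaker_id)      # mixed audio + voice identifier -> target voice data
```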
S103: and performing voice recognition on the target voice data to obtain a recognition result of the target voice data.
After the pure voice data of the target user, i.e. the target voice data, has been extracted in S102, voice recognition needs to be performed on it to obtain the recognition result: the target voice data is converted into text so that the text meaning it expresses is known. Any suitable speech-to-text algorithm may be used here; for example, a neural network model may be used to perform voice recognition on the target voice data, so that the recognition result of the target voice data is obtained.
Because the recognition result is obtained by performing voice recognition on the pure voice data of the target user, the accuracy of the recognition result is improved compared with performing voice recognition directly on the original voice data.
After the recognition result is obtained, the text of the voice data is known, and a response may then be made according to that text; the response may be waking up the device, outputting voice, or outputting text.
For the target user, in order to register the voice identifier of the target user locally in advance, in an optional embodiment, the method further includes:
acquiring voice data of a target user;
extracting speaker identification from voice data of a target user to obtain a voice identification of the target user;
the voice identification of the target user is registered locally.
It can be understood that the voice data of the target user is obtained first; the target user may speak a passage of pure voice so that the electronic device receives the target user's voice data. Speaker identification extraction is then performed on the voice data of the target user to obtain the voice identifier of the target user. If the target user is registering for the first time, the voice identifier of the target user has not yet been registered locally, so the voice identifier of the target user is registered locally, which completes the local registration of the target user.
The above voice data of the target user may be a single piece of acquired voice data or a fusion of several acquired pieces of voice data; this is not specifically limited in the embodiment of the present application.
For fusion of acquired pieces of voice data, in an alternative embodiment, acquiring voice data of a target user includes:
acquiring first voice data of a target user and second voice data of the target user;
and carrying out weighted summation on the first voice data and the second voice data to obtain the voice data of the target user.
It can be understood that the first voice data of the target user and the second voice data of the target user are obtained; since both are voice data of the target user, in order to obtain accurate pure voice data of the target user, the first voice data and the second voice data may be weighted and summed, and the result of the weighted summation is determined as the voice data of the target user.
The weight of the first voice data and the weight of the second voice data in the weighted summation may be preset or set in real time; this is not specifically limited in the embodiment of the present application.
It should be noted that, in the embodiment of the present application, three or more pieces of voice data of the target user may be obtained, and then the three or more pieces of voice data are weighted and summed, so as to obtain the voice data of the target user.
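A minimal sketch of this weighted fusion follows; the 16 kHz placeholder recordings and the 0.6/0.4 weights are only example values (the weights may equally be preset or set in real time).

```python
# Weighted summation of two (or more) recordings of the target user into one
# piece of voice data for registration. Weights and lengths here are illustrative.
import numpy as np

def fuse_voice_data(clips, weights=None):
    clips = [np.asarray(c, dtype=np.float32) for c in clips]
    n = min(len(c) for c in clips)                    # align lengths before summing
    weights = weights or [1.0 / len(clips)] * len(clips)
    return sum(w * c[:n] for w, c in zip(weights, clips))

first_voice_data = np.random.randn(16000)             # placeholder 1 s recordings at 16 kHz
second_voice_data = np.random.randn(16000)
voice_data_of_target_user = fuse_voice_data([first_voice_data, second_voice_data], [0.6, 0.4])
```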
In addition, in an alternative embodiment, for a case where the sound of the target user changes, the method further includes:
acquiring voice data of a target user;
extracting speaker identification from voice data of a target user to obtain a current voice identification of the target user;
when the voice identification of the target user is registered to the local, deleting the voice identification of the target user registered to the local, and registering the current voice identification of the target user to the local.
It can be appreciated that, after registration is completed, the target user may need to re-register the voice identifier because the voice has changed, for example when the voice changes due to illness or during a period of voice change. In order to keep the locally registered voice identifier consistent with the current voice identifier of the target user, the target user may re-register the current voice identifier locally.
The voice data of the target user is acquired first; here the voice data of the target user may be a single piece of acquired voice data, or voice data obtained by fusing several pieces of voice data, which is not specifically limited in the embodiment of the present application.
After the voice data of the target user is obtained, consistent with the initial registration, speaker identification extraction is performed on the voice data of the target user to obtain the current voice identifier of the target user. It is then determined whether the target user is registered locally; if so, the locally registered voice identifier of the target user is deleted and the current voice identifier of the target user is registered locally, which completes the re-registration of the target user. A sketch of such a local registry is given below.
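The local registration and re-registration described above could be kept in a simple store such as the hypothetical one below; the in-memory dictionary stands in for whatever local storage the device actually uses.

```python
# Hypothetical local registry of voice identifiers: first registration stores the
# identifier; re-registration deletes the old identifier and stores the current one.
class LocalVoiceIdRegistry:
    def __init__(self):
        self._ids = {}                        # user name -> voice identifier (embedding)

    def is_registered(self, user):
        return user in self._ids

    def register(self, user, voice_id):
        self._ids[user] = voice_id            # first-time registration

    def reregister(self, user, current_voice_id):
        if user in self._ids:                 # already registered: delete the old identifier
            del self._ids[user]
        self._ids[user] = current_voice_id    # register the current voice identifier
```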
In addition, in order to obtain the pre-trained speaker voice extraction model, in an optional embodiment, the method further includes:
acquiring a training data set from the acquired sample data set;
inputting the training data set into a preset speaker voice extraction model for training to obtain a trained speaker voice extraction model;
and determining the pre-trained speaker voice extraction model based on the trained speaker voice extraction model.
That is, a sample data set is obtained through collection, where the sample data set consists of related data of mixed voice and the voice data of the user corresponding to that related data, and the related data of the mixed voice includes the mixed voice data and the voice identifier of a user in the mixed voice data. A training data set, i.e. several groups of data in the sample data set, is obtained from the sample data set and is input into a preset speaker voice extraction model for training, so that a trained speaker voice extraction model can be obtained; finally, the pre-trained speaker voice extraction model is determined based on the trained speaker voice extraction model.
In this way, through the above training, a trained speaker voice extraction model can be obtained and stored; it is used to extract the target voice data on which voice recognition is then performed, so that the recognition result is obtained and its accuracy can be improved. A sketch of the training step follows.
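Purely as a sketch of the training step, the loop might look like the following; the application does not fix the network, loss or optimizer, so the MSE loss between extracted and clean speech and the Adam optimizer used here are assumptions.

```python
# Illustrative training loop for a speaker voice extraction model: each training
# sample pairs (mixed voice data, voice identifier) with the user's clean speech.
import torch

def train_se_model(se_model, training_set, epochs=10, lr=1e-3):
    opt = torch.optim.Adam(se_model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()                        # assumed loss for this sketch
    for _ in range(epochs):
        for mixed, speaker_id, clean in training_set:
            opt.zero_grad()
            estimate = se_model(mixed, speaker_id)      # extract the target user's voice
            loss = loss_fn(estimate, clean)
            loss.backward()
            opt.step()
    return se_model
```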
To ensure the accuracy of the pre-trained speaker voice extraction model, in an optional embodiment, determining the pre-trained speaker voice extraction model based on the trained speaker voice extraction model includes:
Obtaining a validation data set from the sample data set;
inputting the related data of the mixed voice in the verification data set into a trained speaker voice extraction model to obtain voice data of a user corresponding to the related data of the mixed voice;
when the obtained voice data of the user corresponding to the related data of the mixed voice is the same as the voice data of the user corresponding to the related data of the mixed voice in the verification data set, determining the trained speaker voice extraction model as the pre-trained speaker voice extraction model;
when the obtained voice data of the user corresponding to the related data of the mixed voice is different from the voice data of the user corresponding to the related data of the mixed voice in the verification data set, re-selecting a training data set from the sample data set, inputting the training data set into the trained speaker voice extraction model for training so as to obtain the trained speaker voice extraction model again, and returning to the step of determining the pre-trained speaker voice extraction model based on the trained speaker voice extraction model.
It can be understood that the verification data set is obtained from the collected sample data set, and the related data of the mixed voice in the verification data set is input into the trained speaker voice extraction model, so that the voice data of the user corresponding to the related data of the mixed voice can be obtained.
In order to verify whether the accuracy of the trained speaker voice extraction model meets the requirement, the obtained voice data of the user corresponding to the related data of the mixed voice is compared with the voice data of the user corresponding to the related data of the mixed voice in the verification data set, and it is determined whether the two are the same. If they are the same, the accuracy of the trained speaker voice extraction model meets the requirement and the verification passes, so the trained speaker voice extraction model is directly determined as the pre-trained speaker voice extraction model. If they are different, the accuracy of the trained speaker voice extraction model does not meet the requirement and the verification fails; a training data set is then reselected from the sample data set, and the trained speaker voice extraction model is trained further with the reselected training data set until the verification passes and the pre-trained speaker voice extraction model is obtained, as sketched below.
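The verify-or-retrain decision can be sketched as below; the application only requires the extracted speech to match the reference, so the MSE threshold used as the pass criterion and the `split` helper are assumptions for this sketch.

```python
# Sketch of validation: accept the trained model when its outputs match the reference
# clean speech on the verification data set, otherwise reselect training data and retrain.
import torch

def validate_or_retrain(se_model, sample_data_set, split, train_fn, tol=1e-3):
    while True:
        _, val_set = split(sample_data_set)              # hypothetical train/validation split
        passed = all(
            torch.nn.functional.mse_loss(se_model(mixed, sid), clean).item() < tol
            for mixed, sid, clean in val_set
        )
        if passed:
            return se_model                              # kept as the pre-trained extraction model
        train_set, _ = split(sample_data_set)            # reselect a training data set
        se_model = train_fn(se_model, train_set)         # continue training on it
```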
In an alternative embodiment, for the case that the voice identifier of the target user is not registered locally, the method further includes:
And when the voice identification of the target user is not registered locally, generating prompt information.
That is, when the voice identifier of the target user is not registered locally, prompt information may be generated; the prompt information is used to prompt that the voice identifier of the target user is unregistered, so as to prompt the target user. A registration entry for the voice identifier may also be displayed, so that the user can enter the registration interface through the registration entry and register the voice identifier.
After the target voice data is extracted, voice recognition needs to be performed on it to obtain the recognition result. To that end, in an optional embodiment, performing voice recognition on the target voice data to obtain the recognition result of the target voice data includes:
extracting features of the target voice data to obtain voiceprint features of the target voice data;
inputting voiceprint features into a trained voice recognition model to obtain a recognition result;
when the identification result exists in a preset word stock, determining that the identification result is correct;
and when the recognition result does not exist in the preset word stock, returning to execute feature extraction on the target voice data to obtain voiceprint features of the target voice data.
It can be understood that feature extraction is first performed on the target voice data to obtain voiceprint features of the target voice data, where the voiceprint features include one or more of the following: frequency-domain features, power spectrum features and logarithmic power spectrum features; the extracted voiceprint features may be some or all of the voiceprint features, which is not limited in the embodiment of the present application.
After the voiceprint features are obtained, they are input into a trained voice recognition model to obtain the recognition result, which is the text corresponding to the target voice data. It is then determined whether the keywords in the text corresponding to the target voice data exist in a preset word stock: if they do, the recognition result is correct; if they do not, the recognition result is wrong, and the process returns to the step of performing feature extraction on the target voice data to obtain its voiceprint features, until the obtained recognition result exists in the preset word stock. A sketch of this step is given below.
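A compact sketch of this recognize-and-match step follows; `asr_model`, `extract_voiceprint`, the lexicon entries and the retry cap are placeholders added for illustration (the application itself simply loops back to feature extraction until a match is found).

```python
# Sketch of S103: extract voiceprint features, run the trained recognition model,
# and accept the result only if its keywords appear in the preset word stock.
KEYWORD_LEXICON = {"hello device", "turn on the light"}      # illustrative entries

def recognize(target_voice_data, asr_model, extract_voiceprint, max_retries=3):
    for _ in range(max_retries):
        feats = extract_voiceprint(target_voice_data)    # e.g. log-power spectrum features
        text = asr_model(feats)                          # words corresponding to the audio
        if any(keyword in text for keyword in KEYWORD_LEXICON):
            return text                                  # recognition result is correct
    return None                                          # no match found in the word stock
```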
The following illustrates, by way of example, a method of recognizing speech data as described in one or more of the embodiments described above.
Fig. 2 is a schematic flowchart of a voice data recognition method in the related art. As shown in Fig. 2, a voice signal (equivalent to the above voice data) is input to an electronic device to realize voice signal input. After receiving the voice signal, the electronic device performs preprocessing and feature extraction, which include framing, windowing and other operations, and obtains the voice features of the voice signal. The voice features are input into a trained recognition model (equivalent to the above voice recognition model) for voice recognition (Recognize), and the words corresponding to the voice signal are obtained and matched against the words in a preset word stock to realize template matching: when the words exist in the word stock, the matching is determined to be successful and the recognition result for the input voice signal is output; when the words do not exist in the word stock, the matching is determined to have failed, and the voice signal is preprocessed and feature-extracted again. In this way, recognition of the voice signal is achieved.
In addition, with respect to Fig. 2, the recognition model is obtained by training (Train) a dynamic neural network (Dynamic Neural Network, DNN) model: the training set prepared for the recognition model is input into it, a loss function (Loss) is calculated (Calculate) from the output value and the true value of the model, the network parameters of the recognition model are updated by minimizing the loss function, the trained recognition model is obtained, and the model is saved (Save Model).
Then, in order to improve the recognition effect, the present embodiment adds a voice extraction module (corresponding to the above speaker voice extraction model) to process the voice signal before it enters the recognition model, thereby improving the accuracy of voice recognition. Specifically, this includes the following steps:
It is assumed that the target speaker (equivalent to the above target user) has completed an offline registration process, which uses a speaker recognition model to generate an aggregate embedding vector, the Speaker ID, representing the speaker's voice characteristics. The voice signal of the target speaker is first extracted, through a speaker extraction (Speaker Extraction, SE) model, from the collected voice signal containing environmental sound and interference from other voices, and the extracted signal is input into the subsequent recognition model to complete the whole wake-up and recognition process. After SE model extraction, the speech of the target speaker is enhanced, so a higher voice wake-up rate and recognition rate can be achieved, and the solution can be mounted on a variety of terminals.
Fig. 3 is a flowchart of example one of an alternative voice data recognition method provided in the embodiment of the present application. As shown in Fig. 3, this example mainly comprises three modules of processing: the first is registration of the Speaker ID of the target speaker; the second is voice extraction of the target speaker based on the SE model; the third is template matching of the extracted voice signal; finally, the recognition output for the voice signal is obtained through judgment.
Module 1, registration of the Speaker ID: in the registration & extraction (SE) identity-information stage, the user first needs to record voice on the device. The Speaker ID of the target speaker is generated by performing Speaker ID extraction on the recorded voice signal and is stored locally, which completes the registration of the target speaker's Speaker ID. After being prompted that registration is complete, the target speaker may also choose to update the Speaker ID when the current tone and voicing state differ considerably from the previously recorded state, for example because of illness or voice change.
Module 2, voice extraction of the target speaker based on the SE model: Fig. 4 is a flowchart of example two of an alternative voice data recognition method provided in the embodiment of the present application. As shown in Fig. 4, this module is a training diagram of the SE model: the target voice signal of the target speaker is preprocessed and feature-extracted to generate the Speaker ID of the target speaker; the Speaker ID, the clean voice of the target speaker and the voice mixed from several people are then input into a neural network together for training, a suitable loss function is selected, and the neural network model (Neural Network Model, NN Model) of the SE model is finally generated.
Fig. 5 is a flowchart of example three of an alternative voice data recognition method provided in the embodiment of the present application. As shown in Fig. 5, this part extracts the voice of the target speaker from the mixed voice through the SE model. It contains a detection module: when the Speaker ID of the target speaker is found to have been registered in advance, the NN model generated in Fig. 4 performs inference together with the mixed voice signal, the frequency-domain masking information obtained from the inference is applied to the frequency-domain representation of the mixture through a fully connected layer (Fully Connected Layer, FC layer) to obtain a separated voice signal (corresponding to the above target voice data), and the separated voice signal is used for keyword recognition and continuous voice recognition; this step serves the subsequent semantic recognition of the separated target signal. If the user is not registered, no inference is performed. The masking step is sketched below.
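The masking step in this flow can be illustrated with a short STFT sketch; the `nn_model` call signature, the 512-sample frames and the mask range are assumptions for the example rather than details fixed by this application.

```python
# Illustrative frequency-domain masking: the mask inferred by the NN model for the
# registered speaker is applied to the mixture spectrum and inverted back to audio.
import numpy as np
from scipy.signal import stft, istft

def separate_target(mixed, speaker_id, nn_model, fs=16000, nperseg=512):
    _, _, spec = stft(mixed, fs=fs, nperseg=nperseg)      # spectrum of the mixed signal
    mask = nn_model(np.abs(spec), speaker_id)             # assumed mask in [0, 1] per bin
    _, separated = istft(mask * spec, fs=fs, nperseg=nperseg)
    return separated                                      # estimate of the target voice data
```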
Module 3, template matching of the extracted voice signal: the recognition of the SE-processed voice signal is consistent with the recognition of the input voice signal in Fig. 2 and is not repeated here. Through this whole workflow, voice wake-up and voice recognition enhanced by target speaker information are completed.
It should be noted that the technical solution in this example may be extended in the following ways: adding multilingual training can improve the robustness of the model, and adopting a better network can improve the voice extraction performance of the model. The extraction of voiceprint features covers frequency-domain features of the voice signal based on the short-time Fourier transform (STFT), including but not limited to power spectrum features and logarithmic power spectrum features, for example as follows.
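For the STFT-based features listed above, a minimal computation might look like this; the 25 ms frame length and 10 ms hop are example values only.

```python
# Example computation of the listed voiceprint features from an STFT:
# frequency-domain magnitude, power spectrum and logarithmic power spectrum.
import numpy as np
from scipy.signal import stft

def voiceprint_features(audio, fs=16000, nperseg=400, noverlap=240):
    _, _, spec = stft(audio, fs=fs, nperseg=nperseg, noverlap=noverlap)
    magnitude = np.abs(spec)               # frequency-domain feature
    power = magnitude ** 2                 # power spectrum feature
    log_power = np.log(power + 1e-10)      # logarithmic power spectrum feature
    return magnitude, power, log_power
```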
In this embodiment, voice extraction is combined with voice wake-up and recognition: the Speaker ID of the target speaker can be updated in time according to the voiceprint characteristics of the target speaker, and the voice of the target speaker can be extracted from environmental noise and voice interference by the SE model, so a higher voice wake-up rate and recognition rate are achieved. The embodiment is not limited to registering a single user; several speakers can be registered and separated, which greatly improves the user experience.
A voice extraction module is added to process the voice signal before it enters the recognition model, which improves the accuracy of voice recognition; the voice signal of the target speaker is extracted by the SE model from the collected voice signal containing environmental sound and interference from other voices, and the extracted signal is input into the subsequent recognition model to complete the whole wake-up and recognition process.
After SE model extraction, the speech of the target speaker is enhanced, so a higher voice wake-up rate and recognition rate can be achieved, and the solution can be mounted on a variety of terminals.
Therefore, with this embodiment, the voice signal of the target speaker can be enhanced even under heavy environmental noise, which improves the wake-up rate and voice recognition rate of the device; the extracted voice of the target speaker is stored in the voice library, so the system adapts well to changes in the speaker's emotional and physiological voicing state, further improving the user experience of voice wake-up devices.
The embodiment of the application provides a voice data recognition method, including: performing speaker identification extraction on original voice data to obtain the voice identifier of a target user of the original voice data; when the voice identifier of the target user is registered locally, inputting the original voice data and the voice identifier of the target user into a pre-trained speaker voice extraction model to obtain target voice data corresponding to the target user; and performing voice recognition on the target voice data to obtain a recognition result of the target voice data. That is, in this embodiment, speaker identification extraction is first performed on the original voice data before voice recognition, so that the voice identifier of the target user of the original voice data can be obtained; after the voice identifier of the target user is obtained, the original voice data and the voice identifier of the target user are input into the trained speaker voice extraction model only when the voice identifier of the target user is registered locally, so that the voice data corresponding to the target user, i.e. the target voice data, is obtained, and voice recognition is then performed on the target voice data. In this way, the voice data of the target user is extracted from the original voice data through speaker identification extraction and the trained speaker voice extraction model, which reduces the influence of environmental noise and other speakers and improves the accuracy of the recognition result.
Based on the same inventive concept, an embodiment of the present application provides a voice data recognition device, and fig. 6 is a schematic structural diagram of an alternative voice data recognition device provided in the embodiment of the present application, as shown in fig. 6, the voice data recognition device may include:
a first extraction module 61, configured to perform speaker identification extraction on original voice data to obtain the voice identifier of a target user of the original voice data;
the second extraction module 62 is configured to input, in a case where the voice identifier of the target user is registered locally, the original voice data and the voice identifier of the target user into a pre-trained speaker voice extraction model, to obtain target voice data corresponding to the target user;
the recognition module 63 is configured to perform voice recognition on the target voice data, so as to obtain a recognition result of the target voice data.
In an alternative embodiment, the device is further adapted to:
acquiring voice data of a target user;
extracting speaker identification from voice data of a target user to obtain a voice identification of the target user;
the voice identification of the target user is registered locally.
In an alternative embodiment, the device acquires voice data of the target user, including:
Acquiring first voice data of a target user and second voice data of the target user;
and carrying out weighted summation on the first voice data and the second voice data to obtain the voice data of the target user.
In an alternative embodiment, the device is further adapted to:
acquiring voice data of a target user;
extracting speaker identification from voice data of a target user to obtain a current voice identification of the target user;
when the voice identification of the target user is registered to the local, deleting the voice identification of the target user registered to the local, and registering the current voice identification of the target user to the local.
In an alternative embodiment, the device is further adapted to:
acquiring a training data set from the acquired sample data set; wherein the sample dataset is: the relevant data of the mixed voice and the voice data of the user corresponding to the relevant data of the mixed voice, wherein the relevant data of the mixed voice comprises: mixing voice data and voice identification of a user in the mixed voice data;
inputting the training data set into a preset speaker voice extraction model for training to obtain a trained speaker voice extraction model;
and determining the pre-trained speaker voice extraction model based on the trained speaker voice extraction model.
In an optional embodiment, the device determining the pre-trained speaker voice extraction model based on the trained speaker voice extraction model includes:
obtaining a validation data set from the sample data set;
inputting the related data of the mixed voice in the verification data set into the trained speaker voice extraction model to obtain the voice data of the user corresponding to the related data of the mixed voice;
when the obtained voice data of the user corresponding to the related data of the mixed voice is the same as the voice data of the user corresponding to the related data of the mixed voice in the verification data set, determining the trained speaker voice extraction model as the pre-trained speaker voice extraction model;
when the obtained voice data of the user corresponding to the related data of the mixed voice is different from the voice data of the user corresponding to the related data of the mixed voice in the verification data set, re-selecting a training data set from the sample data set, inputting the training data set into the trained speaker voice extraction model for training so as to obtain the trained speaker voice extraction model again, and returning to the step of determining the pre-trained speaker voice extraction model based on the trained speaker voice extraction model.
In an alternative embodiment, the device is further adapted to:
when the voice identifier of the target user is not registered locally, generating prompt information; the prompt information is used for prompting that the voice identifier of the target user is unregistered.
In an alternative embodiment, the identification module 63 is specifically configured to:
extracting features of the target voice data to obtain voiceprint features of the target voice data;
inputting voiceprint features into a trained voice recognition model to obtain a recognition result;
when the identification result exists in a preset word stock, determining that the identification result is correct;
and when the recognition result does not exist in the preset word stock, returning to execute feature extraction on the target voice data to obtain voiceprint features of the target voice data.
In an alternative embodiment, the voiceprint features include one or more of the following:
frequency domain characteristics, power spectrum characteristics, logarithmic power spectrum characteristics.
In practical applications, the first extraction module 61, the second extraction module 62 and the recognition module 63 may be implemented by a processor located in the voice data recognition device, specifically a central processing unit (Central Processing Unit, CPU), a microprocessor (Microprocessor Unit, MPU), a digital signal processor (Digital Signal Processor, DSP), a field programmable gate array (Field Programmable Gate Array, FPGA), or the like.
An embodiment of the present application provides a chip. Fig. 7 is a schematic structural diagram of an alternative chip provided in the embodiment of the present application; as shown in Fig. 7, the chip 700 includes: a processor 71, configured to call and run a computer program from a memory 72, so that a device on which the chip 700 is mounted performs the voice data recognition method described in one or more of the above embodiments.
Optionally, as shown in Fig. 7, the chip 700 may also include the memory 72, from which the processor 71 may call and run a computer program to implement the method in the embodiments of the present application. The memory 72 may be a separate device independent of the processor 71, or may be integrated in the processor 71. Optionally, the chip 700 may further include an input interface 73; the processor 71 may control the input interface 73 to communicate with other devices or chips, and in particular may acquire information or data sent by other devices or chips. Optionally, the chip 700 may further include an output interface 74; the processor 71 may control the output interface 74 to communicate with other devices or chips, and in particular may output information or data to other devices or chips.
An embodiment of the present application provides an electronic device. Fig. 8 is a schematic structural diagram of an alternative electronic device provided in the embodiment of the present application; as shown in Fig. 8, the electronic device 800 includes: a processor 81 and a storage medium 82 storing instructions executable by the processor; the instructions, when executed by the processor, cause the electronic device 800 to perform the voice data recognition method described in one or more of the above embodiments.
In practical use, the components in the electronic device are coupled together via a communication bus 83. It can be understood that the communication bus 83 is used to enable connection and communication between these components. Besides a data bus, the communication bus 83 also includes a power bus, a control bus and a status signal bus; however, for clarity of illustration, the various buses are all labeled as the communication bus 83 in Fig. 8.
Embodiments of the present application provide a computer storage medium storing executable instructions that, when executed by one or more processors, perform the method for recognizing voice data according to one or more embodiments described above.
The computer-readable storage medium may be a magnetic random access memory (FRAM), a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a flash memory, a magnetic surface memory, an optical disc, or a compact disc read-only memory (CD-ROM).
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the scope of the present application.

Claims (13)

1. A method for recognizing voice data, comprising:
extracting speaker identification from original voice data to obtain a voice identification of a target user of the original voice data;
Under the condition that the voice identifier of the target user is registered locally, inputting the original voice data and the voice identifier of the target user into a pre-trained speaker voice extraction model to obtain target voice data corresponding to the target user;
and carrying out voice recognition on the target voice data to obtain a recognition result of the target voice data.
2. The method according to claim 1, wherein the method further comprises:
acquiring voice data of the target user;
extracting speaker identification from the voice data of the target user to obtain the voice identification of the target user;
registering the voice identification of the target user to the local.
3. The method of claim 2, wherein the obtaining the voice data of the target user comprises:
acquiring first voice data of the target user and second voice data of the target user;
and carrying out weighted summation on the first voice data and the second voice data to obtain the voice data of the target user.
4. The method according to claim 1, wherein the method further comprises:
Acquiring voice data of the target user;
extracting speaker identification from the voice data of the target user to obtain the current voice identification of the target user;
when the voice identifier of the target user is registered to the local, deleting the voice identifier of the target user registered to the local, and registering the current voice identifier of the target user to the local.
5. The method according to claim 1, wherein the method further comprises:
acquiring a training data set from the acquired sample data set; wherein the sample dataset is: the method comprises the steps of mixing relevant data of voice and voice data of a user corresponding to the relevant data of the mixed voice, wherein the relevant data of the mixed voice comprises the following steps: mixing voice data and voice identification of a user in the mixed voice data;
inputting the training data set into a preset speaker voice extraction model for training to obtain a trained speaker voice extraction model;
and determining the pre-trained speaker voice extraction model based on the trained speaker voice extraction model.
6. The method of claim 5, wherein the determining the pre-trained speaker voice extraction model based on the trained speaker voice extraction model comprises:
Obtaining a validation data set from the sample data set;
inputting the related data of the mixed voice in the verification data set into the trained speaker voice extraction model to obtain voice data of a user corresponding to the related data of the mixed voice;
when the obtained voice data of the user corresponding to the related data of the mixed voice is the same as the voice data of the user corresponding to the related data of the mixed voice in the verification data set, determining the trained speaker voice extraction model as the pre-trained speaker voice extraction model;
when the obtained voice data of the user corresponding to the related data of the mixed voice is different from the voice data of the user corresponding to the related data of the mixed voice in the verification data set, re-selecting a training data set from the sample data set, inputting the training data set into the trained speaker voice extraction model for training so as to obtain the trained speaker voice extraction model again, and returning to the step of determining the pre-trained speaker voice extraction model based on the trained speaker voice extraction model.
7. The method according to claim 1, wherein the method further comprises:
when the voice identifier of the target user is not registered locally, generating prompt information; the prompt information is used for prompting that the voice identifier of the target user is unregistered.
8. The method according to claim 1, wherein performing the voice recognition on the target voice data to obtain a recognition result of the target voice data includes:
extracting features of the target voice data to obtain voiceprint features of the target voice data;
inputting the voiceprint features into a trained voice recognition model to obtain a recognition result;
when the identification result exists in a preset word stock, determining that the identification result is correct;
and when the recognition result does not exist in the preset word stock, returning to execute the feature extraction of the target voice data to obtain voiceprint features of the target voice data.
9. The method of claim 8, wherein the voiceprint features comprise one or more of:
frequency domain characteristics, power spectrum characteristics, logarithmic power spectrum characteristics.
10. A voice data recognition apparatus, comprising:
The first extraction module is used for performing speaker identification extraction on the original voice data to obtain the voice identifier of the target user of the original voice data;
the second extraction module is used for inputting the original voice data and the voice identification of the target user into a pre-trained speaker voice extraction model under the condition that the voice identification of the target user is registered locally, so as to obtain target voice data corresponding to the target user;
and the recognition module is used for carrying out voice recognition on the target voice data to obtain a recognition result of the target voice data.
11. A chip, comprising: a processor for calling and running a computer program from a memory, so that a device on which the chip is mounted performs the recognition method of voice data according to any one of claims 1 to 9.
12. An electronic device, comprising:
a processor and a storage medium storing instructions executable by the processor, which when executed by the processor, cause the electronic device to perform the method of recognizing speech data according to any one of claims 1 to 9.
13. A computer storage medium storing executable instructions which, when executed by one or more processors, perform the method of recognizing speech data according to any one of claims 1 to 9.
CN202310155884.7A 2023-02-22 2023-02-22 Voice data recognition method and device, chip and electronic equipment Pending CN116312503A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310155884.7A CN116312503A (en) 2023-02-22 2023-02-22 Voice data recognition method and device, chip and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310155884.7A CN116312503A (en) 2023-02-22 2023-02-22 Voice data recognition method and device, chip and electronic equipment

Publications (1)

Publication Number Publication Date
CN116312503A true CN116312503A (en) 2023-06-23

Family

ID=86791643

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310155884.7A Pending CN116312503A (en) 2023-02-22 2023-02-22 Voice data recognition method and device, chip and electronic equipment

Country Status (1)

Country Link
CN (1) CN116312503A (en)

Similar Documents

Publication Publication Date Title
CN108320733B (en) Voice data processing method and device, storage medium and electronic equipment
CN111508474B (en) Voice interruption method, electronic equipment and storage device
Aloufi et al. Emotionless: Privacy-preserving speech analysis for voice assistants
CN107977183A (en) voice interactive method, device and equipment
CN107731233A (en) A kind of method for recognizing sound-groove based on RNN
CN113168832A (en) Alternating response generation
Prasad et al. Intelligent chatbot for lab security and automation
CN108899033B (en) Method and device for determining speaker characteristics
CN111210829A (en) Speech recognition method, apparatus, system, device and computer readable storage medium
CN113330511A (en) Voice recognition method, voice recognition device, storage medium and electronic equipment
US20240029739A1 (en) Sensitive data control
CN111583936A (en) Intelligent voice elevator control method and device
CN110136726A (en) A kind of estimation method, device, system and the storage medium of voice gender
CN105679323B (en) A kind of number discovery method and system
CN115171731A (en) Emotion category determination method, device and equipment and readable storage medium
CN109065026B (en) Recording control method and device
CN114708869A (en) Voice interaction method and device and electric appliance
CN113112992B (en) Voice recognition method and device, storage medium and server
CN108989551B (en) Position prompting method and device, storage medium and electronic equipment
CN109064720B (en) Position prompting method and device, storage medium and electronic equipment
CN116312503A (en) Voice data recognition method and device, chip and electronic equipment
CN113724693B (en) Voice judging method and device, electronic equipment and storage medium
KR20200070783A (en) Method for controlling alarm of user terminal and method for determining alarm off mission of server
CN112466287B (en) Voice segmentation method, device and computer readable storage medium
CN112037772B (en) Response obligation detection method, system and device based on multiple modes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination