CN113409774A - Voice recognition method and device and electronic equipment - Google Patents

Voice recognition method and device and electronic equipment

Publication number
CN113409774A
Authority
CN
China
Prior art keywords
voice
voice recognition
target
speech recognition
recognition model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110817296.6A
Other languages
Chinese (zh)
Inventor
曾亮
常乐
涂贤玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing SoundAI Technology Co Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing SoundAI Technology Co Ltd
Priority to CN202110817296.6A
Publication of CN113409774A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/065 Adaptation
    • G10L15/07 Adaptation to the speaker
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0631 Creating reference templates; Clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The present disclosure provides a voice recognition method, a voice recognition device, and an electronic device. The method comprises: extracting a target voiceprint feature of the voice to be recognized; acquiring, from a plurality of pre-trained voice recognition models, a target voice recognition model corresponding to the target voiceprint feature, wherein the plurality of voice recognition models respectively correspond to a plurality of geographic areas; and performing voice recognition on the voice to be recognized based on the target voice recognition model to obtain a voice recognition result. Embodiments of the disclosure can improve voice recognition performance.

Description

Voice recognition method and device and electronic equipment
Technical Field
The invention relates to the field of artificial intelligence, in particular to a voice recognition method and device and electronic equipment.
Background
With economic and technological development, devices are becoming increasingly intelligent, and voice recognition is used ever more widely in smart devices. Voice recognition converts speech into text. At present, because speakers differ in how well they have mastered Mandarin, the speech of speakers from different geographic areas can differ considerably, so a general-purpose voice recognition model performs poorly.
Disclosure of Invention
Embodiments of the present disclosure provide a voice recognition method, a voice recognition device, and an electronic device, so as to solve the prior-art problem that voice recognition with a general-purpose voice recognition model performs poorly.
In order to solve the technical problem, the invention is realized as follows:
In a first aspect, an embodiment of the present disclosure provides a speech recognition method, where the method includes:
extracting a target voiceprint feature of the voice to be recognized;
acquiring a target voice recognition model corresponding to the target voiceprint feature from a plurality of pre-trained voice recognition models, wherein the plurality of voice recognition models respectively correspond to a plurality of geographic areas;
and performing voice recognition on the voice to be recognized based on the target voice recognition model to obtain a voice recognition result.
In a second aspect, an embodiment of the present disclosure provides a speech recognition apparatus, including:
a first extraction module, configured to extract a target voiceprint feature of the voice to be recognized;
a first obtaining module, configured to obtain a target voice recognition model corresponding to the target voiceprint feature from a plurality of pre-trained voice recognition models, where the plurality of voice recognition models correspond to a plurality of geographic areas, respectively;
and a first recognition module, configured to perform voice recognition on the voice to be recognized based on the target voice recognition model to obtain a voice recognition result.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: memory, a processor and a program stored on the memory and executable on the processor, which program, when executed by the processor, carries out the steps in the speech recognition method according to the first aspect.
In a fourth aspect, the present disclosure provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps in the speech recognition method according to the first aspect.
In the embodiments of the present disclosure, a target voiceprint feature of the voice to be recognized is extracted; a target voice recognition model corresponding to the target voiceprint feature is acquired from a plurality of pre-trained voice recognition models, wherein the plurality of voice recognition models respectively correspond to a plurality of geographic areas; and voice recognition is performed on the voice to be recognized based on the target voice recognition model to obtain a voice recognition result. Because the voice to be recognized is recognized with the target voice recognition model that corresponds to the target voiceprint feature, the speech of speakers from different geographic areas can be recognized with the voice recognition model corresponding to their geographic area, which can improve the voice recognition effect.
Drawings
To illustrate the technical solutions of the embodiments of the present disclosure more clearly, the drawings used in describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present disclosure; those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a flowchart of a speech recognition method provided by an embodiment of the present disclosure;
Fig. 2 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present disclosure;
Fig. 3 is a second schematic structural diagram of a speech recognition apparatus according to an embodiment of the present disclosure;
Fig. 4 is a third schematic structural diagram of a speech recognition apparatus according to an embodiment of the present disclosure;
Fig. 5 is a fourth schematic structural diagram of a speech recognition apparatus according to an embodiment of the present disclosure;
Fig. 6 is a fifth schematic structural diagram of a speech recognition apparatus according to an embodiment of the present disclosure;
Fig. 7 is a sixth schematic structural diagram of a speech recognition apparatus according to an embodiment of the present disclosure;
Fig. 8 is a schematic structural diagram of an electronic device provided in an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, not all, embodiments of the present disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the disclosed embodiment, the electronic device includes, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted mobile terminal, a wearable device, a pedometer, and the like.
Referring to Fig. 1, which is a flowchart of a speech recognition method provided by an embodiment of the present disclosure, the method includes the following steps:
Step 101, extracting a target voiceprint feature of the voice to be recognized.
The voiceprint feature may be a sound-wave spectrum carrying speech information and may be represented as a feature sequence. Taking the application of the voice recognition method to a conference scene as an example, the voice to be recognized may be the speech of a conference participant, and the target voiceprint feature may be the voiceprint feature of any conference participant.
Step 102, obtaining a target voice recognition model corresponding to the target voiceprint feature from a plurality of pre-trained voice recognition models, wherein the plurality of voice recognition models respectively correspond to a plurality of geographic areas.
Each of the plurality of speech recognition models may correspond to one or more of the plurality of geographic regions; for example, to improve the speech recognition effect, the speech recognition models may correspond to the geographic regions one to one. The speech recognition model may include a convolutional neural network, a recurrent neural network, a long short-term memory (LSTM) network, or the like; any network structure that can be used for speech recognition may serve as the network structure of the speech recognition model.
In addition, the plurality of pre-trained speech recognition models may be trained as follows: voice samples corresponding to each of the plurality of geographic areas are obtained; the voice samples corresponding to each geographic area are then input into the voice recognition model corresponding to that geographic area to train that model.
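The per-area training described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the sample layout, the region labels, and the `toy_trainer` placeholder (which stands in for fitting a real recognizer) are all assumptions introduced here.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical sample layout: (region label, feature vector, transcript).
Sample = Tuple[str, List[float], str]


def group_by_region(samples: Iterable[Sample]) -> Dict[str, List[Sample]]:
    """Bucket voice samples by the geographic area they belong to."""
    buckets: Dict[str, List[Sample]] = defaultdict(list)
    for sample in samples:
        buckets[sample[0]].append(sample)
    return dict(buckets)


def train_region_models(
    samples: Iterable[Sample],
    train_fn: Callable[[List[Sample]], object],
) -> Dict[str, object]:
    """Train one model per region, using only that region's samples."""
    return {
        region: train_fn(subset)
        for region, subset in group_by_region(samples).items()
    }


def toy_trainer(subset: List[Sample]) -> dict:
    """Placeholder for a real trainer (e.g. one fitting a CNN or LSTM)."""
    return {"num_samples": len(subset)}


samples = [
    ("wu", [0.1, 0.2], "text a"),
    ("xiang", [0.3, 0.4], "text b"),
    ("wu", [0.5, 0.6], "text c"),
]
region_models = train_region_models(samples, toy_trainer)
```

Passing the trainer in as a function keeps the grouping logic independent of whichever network structure is chosen for the recognizer.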
It should be noted that a correspondence between the voiceprint features of a plurality of objects and the plurality of speech recognition models may be stored, and the target speech recognition model corresponding to the target voiceprint feature may be acquired from the plurality of pre-trained speech recognition models according to that stored correspondence. For example, a record table may store, for each of the plurality of objects, the correspondence among identity information, a voiceprint feature identifier, and a speech recognition model identifier, where the voiceprint feature identifier identifies a voiceprint feature and the speech recognition model identifier identifies a speech recognition model. Looking up the voiceprint feature identifier of the target voiceprint feature in the record table yields the corresponding speech recognition model identifier, and thus the target speech recognition model.
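A record table of this kind might look like the sketch below. The field names, the identifier formats (`vp-001`, `asr-wu`), and the sample rows are hypothetical; the source only states that identity information, a voiceprint feature identifier, and a model identifier are stored together.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Record:
    """One row of the record table (field names are illustrative)."""
    identity: str
    voiceprint_id: str
    model_id: str


RECORD_TABLE: List[Record] = [
    Record("participant A", "vp-001", "asr-wu"),
    Record("participant B", "vp-002", "asr-xiang"),
]


def find_model_id(table: List[Record], voiceprint_id: str) -> Optional[str]:
    """Return the model identifier stored for a voiceprint identifier."""
    for record in table:
        if record.voiceprint_id == voiceprint_id:
            return record.model_id
    return None  # no stored correspondence for this voiceprint
```

For example, `find_model_id(RECORD_TABLE, "vp-002")` returns `"asr-xiang"`, while an unknown identifier returns `None`.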
In addition, the obtaining process of the correspondence between the voiceprint features corresponding to the plurality of objects and the plurality of speech recognition models may be as follows: respectively collecting voice samples of a plurality of objects; respectively extracting the voiceprint characteristics and the acoustic characteristics of the voice sample of each object in the plurality of objects; respectively inputting the acoustic features corresponding to each object into a pre-trained regional speech recognition model, wherein the regional speech recognition model is used for recognizing the geographic region to which the speech belongs; and determining the corresponding relation between the voiceprint features corresponding to the plurality of objects and the plurality of voice recognition models based on the output result of the regional voice recognition model.
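The enrollment flow above can be sketched as a single loop. The feature extractors and the regional classifier are passed in as functions because the source does not specify them; the toy stand-ins, region labels, and model identifiers below are assumptions made purely so the sketch runs.

```python
from typing import Callable, Dict, List, Tuple

Audio = List[float]
Voiceprint = Tuple[float, ...]


def build_correspondence(
    enrollment: Dict[str, Audio],
    extract_voiceprint: Callable[[Audio], Voiceprint],
    extract_acoustic: Callable[[Audio], List[float]],
    classify_region: Callable[[List[float]], str],
    region_to_model: Dict[str, str],
) -> Dict[Voiceprint, str]:
    """Map each object's voiceprint to the model of its recognized region."""
    correspondence: Dict[Voiceprint, str] = {}
    for _speaker, audio in enrollment.items():
        voiceprint = extract_voiceprint(audio)
        region = classify_region(extract_acoustic(audio))
        correspondence[voiceprint] = region_to_model[region]
    return correspondence


# Toy stand-ins (hypothetical) for the real extractors and classifier.
def toy_voiceprint(audio: Audio) -> Voiceprint:
    return tuple(audio)


def toy_acoustic(audio: Audio) -> List[float]:
    return audio


def toy_region(features: List[float]) -> str:
    return "wu" if features[0] > 0.5 else "xiang"


enrollment = {"speaker A": [0.9, 0.1], "speaker B": [0.2, 0.8]}
table = build_correspondence(
    enrollment, toy_voiceprint, toy_acoustic, toy_region,
    {"wu": "asr-wu", "xiang": "asr-xiang"},
)
```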
Taking the application of the voice recognition method to the voice transcription device as an example, the voice transcription device may execute a process of acquiring correspondence between the voiceprint features corresponding to the plurality of objects and the plurality of voice recognition models; or other electronic equipment may execute the process of obtaining the correspondence between the voiceprint features corresponding to the plurality of objects and the plurality of speech recognition models, and send the correspondence between the voiceprint features corresponding to the plurality of objects and the plurality of speech recognition models to the speech transcription equipment.
Step 103, performing voice recognition on the voice to be recognized based on the target voice recognition model to obtain a voice recognition result.
The voice recognition result may be the text content corresponding to the voice to be recognized; the target voice recognition model performs voice recognition on the voice to be recognized and converts it into text content.
It should be noted that Chinese has seven major dialect groups: Mandarin (the official dialect), Wu, Gan, Xiang, Min, Yue (Cantonese), and Hakka. Each dialect group can be further divided into several sub-dialect areas; Mandarin, for example, can be divided into Northeastern, Northwestern, Southwestern, and other varieties. When learning Mandarin, people from a dialect area usually retain certain pronunciation defects, such as failing to distinguish front and back nasal finals or flat-tongue and retroflex initials. People from the same dialect area often share the same problems when speaking Mandarin. In the embodiments of the present disclosure, the speech of speakers from different geographic areas can be recognized with the voice recognition model corresponding to their geographic area, which can improve the voice recognition effect.
Take as an example a voice transcription device that uses the voice recognition method of the embodiments of the present disclosure to generate conference record text. Before the conference, a plurality of voice recognition models in one-to-one correspondence with a plurality of dialect areas may be established, voice samples of the conference participants are collected, and each participant's voiceprint feature and the voice recognition model that best matches it are determined from those samples. During the conference, the speaker's voice is collected, the speaker's identity is confirmed from the voiceprint feature, and the speech is recognized with the voice recognition model matched to that voiceprint feature. Each speaker is thus recognized with the model that best matches his or her pronunciation characteristics, which can improve the transcription device's recognition accuracy for users whose speech has dialect characteristics or pronunciation defects.
In addition, the voice recognition method of the embodiments of the present disclosure can also be applied to a voice transcription device that generates record text for complaints, suggestions, or reservations. Such a device may be used in other places where the other party needs to be identified, such as a public service hall, a school, or a hospital reservation hall. A plurality of voice recognition models in one-to-one correspondence with a plurality of dialect areas may be established, voice samples of people from each area are collected, and the voiceprint features of those people and the voice recognition models that best match them are determined from the samples. While the device collects a user's complaint, suggestion, or reservation, the speaker's voice is collected, the speaker's voiceprint feature is extracted from it, and the speech is recognized with the voice recognition model matched to that voiceprint feature. Each speaker is thus recognized with the model that best matches his or her pronunciation characteristics, which can improve the transcription device's recognition accuracy for users whose speech has dialect characteristics or pronunciation defects.
Obtaining the target speech recognition model corresponding to the target voiceprint feature from the plurality of pre-trained speech recognition models may include: in the case that a voiceprint feature matching the target voiceprint feature exists among a plurality of prestored voiceprint features, acquiring the target speech recognition model corresponding to the target voiceprint feature from the plurality of pre-trained speech recognition models, where the prestored voiceprint features correspond to the speech recognition models. The method may further include: in the case that no prestored voiceprint feature matches the target voiceprint feature, performing speech recognition on the speech to be recognized with a preset speech recognition model to obtain the speech recognition result. Alternatively, in that case, the acoustic features of the speech to be recognized may be input into a pre-trained regional speech recognition model, which recognizes the geographic region to which the speech belongs; the speech recognition model corresponding to the target voiceprint feature is determined from that region, speech recognition is performed on the speech to be recognized with that model to obtain the speech recognition result, and the target voiceprint feature is added to the prestored voiceprint features. Because the regional speech recognition model identifies the geographic region to which the speech to be recognized belongs, the speech recognition model corresponding to that speech can be determined and recorded against the target voiceprint feature.
In the embodiments of the present disclosure, a target voiceprint feature of the voice to be recognized is extracted; a target voice recognition model corresponding to the target voiceprint feature is acquired from a plurality of pre-trained voice recognition models, wherein the plurality of voice recognition models respectively correspond to a plurality of geographic areas; and voice recognition is performed on the voice to be recognized based on the target voice recognition model to obtain a voice recognition result. Because the voice to be recognized is recognized with the target voice recognition model that corresponds to the target voiceprint feature, the speech of speakers from different geographic areas can be recognized with the voice recognition model corresponding to their geographic area, which can improve the voice recognition effect.
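The branching just described (matched voiceprint, fallback to the preset model, or fallback to the regional model with enrollment of the new voiceprint) can be sketched as below. The function name `recognize`, the string-based toy recognizers, and the exact-match helper are hypothetical; here the regional model is assumed, for brevity, to return a model identifier directly.

```python
from typing import Callable, Dict, Optional

Recognizer = Callable[[str], str]


def recognize(
    audio: str,
    target_vp: str,
    stored: Dict[str, str],              # prestored voiceprint -> model id
    models: Dict[str, Recognizer],       # model id -> recognizer
    general_model: Recognizer,           # the preset (general) model
    find_match: Callable[[str, Dict[str, str]], Optional[str]],
    region_model_of: Optional[Callable[[str], str]] = None,
) -> str:
    matched = find_match(target_vp, stored)
    if matched is not None:
        # A prestored voiceprint matches: use its region-specific model.
        return models[stored[matched]](audio)
    if region_model_of is None:
        # First fallback: recognize with the preset model.
        return general_model(audio)
    # Second fallback: let the regional model pick a recognizer, then
    # record the new voiceprint so later utterances match directly.
    model_id = region_model_of(audio)
    stored[target_vp] = model_id
    return models[model_id](audio)


def make_model(tag: str) -> Recognizer:
    """Toy recognizer that just tags its input (illustration only)."""
    def model(audio: str) -> str:
        return f"{tag}:{audio}"
    return model


def exact_match(target_vp: str, stored: Dict[str, str]) -> Optional[str]:
    return target_vp if target_vp in stored else None


models = {"asr-wu": make_model("wu"), "asr-xiang": make_model("xiang")}
general = make_model("general")
stored = {"vp-1": "asr-wu"}
```

Keeping the matching rule (`find_match`) separate from the branching makes it easy to swap exact matching for a similarity threshold.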
Optionally, the obtaining a target voice recognition model corresponding to the target voiceprint feature from a plurality of pre-trained voice recognition models includes:
under the condition that a voiceprint feature matched with the target voiceprint feature exists in a plurality of prestored voiceprint features, acquiring a target voice recognition model corresponding to the target voiceprint feature from a plurality of pre-trained voice recognition models, wherein the voiceprint features correspond to the voice recognition models;
the method further comprises the following steps:
and under the condition that the voiceprint features matched with the target voiceprint features do not exist in the prestored voiceprint features, performing voice recognition on the voice to be recognized by adopting a preset voice recognition model to obtain the voice recognition result.
The correspondence between the voiceprint features corresponding to the plurality of objects and the plurality of speech recognition models may be stored, and the plurality of voiceprint features stored in advance may be the voiceprint features corresponding to the plurality of objects. A speech recognition model of the plurality of speech recognition models may correspond to a voiceprint feature of an object or objects. The target voice recognition model corresponding to the target voiceprint feature may be a voice recognition model corresponding to a voiceprint feature matched with the target voiceprint feature among a plurality of prestored voiceprint features. The preset voice recognition model can be a general voice recognition model, and the preset voice recognition model can be obtained by training voice samples corresponding to a plurality of geographic areas. For example, the speech recognition model corresponding to each geographic area may be trained by using the speech sample corresponding to each geographic area, and the preset speech recognition model may be trained by using the speech samples corresponding to a plurality of geographic areas.
In addition, the target speech recognition model may be a speech recognition model corresponding to a voiceprint feature that matches the target voiceprint feature. The voiceprint feature matched with the target voiceprint feature may be a voiceprint feature same as the target voiceprint feature, or a voiceprint feature whose similarity to the target voiceprint feature is greater than a preset similarity, where the preset similarity may be 90%, 95%, or 98%, and the like, which is not limited in this embodiment.
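One way to realize the similarity test is sketched below. The source does not name a similarity measure; cosine similarity over fixed-length voiceprint vectors is an assumption made here, as are the function names and the default threshold of 0.95 (one of the example values mentioned above).

```python
import math
from typing import Dict, List, Optional


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_match(
    target: List[float],
    stored: Dict[str, List[float]],
    threshold: float = 0.95,
) -> Optional[str]:
    """Return the id of the most similar stored voiceprint whose
    similarity is strictly greater than the threshold, else None."""
    best_id, best_score = None, threshold
    for voiceprint_id, vector in stored.items():
        score = cosine_similarity(target, vector)
        if score > best_score:
            best_id, best_score = voiceprint_id, score
    return best_id


stored_voiceprints = {"vp-001": [1.0, 0.0], "vp-002": [0.0, 1.0]}
```

A vector close to a stored one, such as `[0.99, 0.05]`, matches `vp-001`; a vector halfway between the two stored vectors falls below the threshold and yields `None`, which would trigger the preset-model fallback.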
In this embodiment, when the target voiceprint feature exists in a plurality of previously stored voiceprint features, a target speech recognition model corresponding to the target voiceprint feature is acquired from a plurality of previously trained speech recognition models; and under the condition that the target voiceprint characteristics do not exist in the prestored voiceprint characteristics, performing voice recognition on the voice to be recognized by adopting a preset voice recognition model. In this way, for different voiceprint features, the corresponding speech recognition model can be used for performing speech recognition specifically, and the speech recognition effect can be further improved.
Optionally, before the obtaining the target speech recognition model corresponding to the target voiceprint feature from the plurality of pre-trained speech recognition models, the method further includes:
obtaining voice samples corresponding to the plurality of geographic areas respectively;
respectively inputting the voice sample corresponding to each geographic area in the plurality of geographic areas into the voice recognition model corresponding to each geographic area, and training the voice recognition model corresponding to each geographic area;
the plurality of speech recognition models are speech recognition models corresponding to the plurality of geographic areas respectively.
In addition, the plurality of geographic areas may be divided by dialect area: a geographic area corresponding to Mandarin (the official dialect), and geographic areas corresponding to the Wu, Gan, Xiang, Min, Yue (Cantonese), and Hakka dialects. To improve the speech recognition effect, the areas may be further subdivided; for example, the geographic area corresponding to the Xiang dialect may be divided into Changde, Yueyang, Loudi, and so on. Speech features may be extracted from the speech samples and input into the speech recognition model; these features may include acoustic features such as mel-frequency cepstral coefficients (MFCCs), or other speech features.
In this embodiment, the voice sample corresponding to each of the plurality of geographic areas is input into the voice recognition model corresponding to each of the geographic areas, the voice recognition model corresponding to each of the geographic areas is trained, and the trained voice recognition model corresponding to each of the geographic areas can perform voice recognition on the voice of the geographic area better, so that a better voice recognition effect can be obtained.
Optionally, before extracting the target voiceprint feature of the speech to be recognized, the method further includes:
respectively collecting voice samples of a plurality of objects;
respectively extracting the voiceprint characteristics and the acoustic characteristics of the voice sample of each object in the plurality of objects;
respectively inputting the acoustic features corresponding to each object into a pre-trained regional speech recognition model, wherein the regional speech recognition model is used for recognizing the geographic region to which the speech belongs;
and determining the corresponding relation between the voiceprint features corresponding to the plurality of objects and the plurality of voice recognition models based on the output result of the regional voice recognition model.
The plurality of objects may be a plurality of speakers; in a conference scene, for example, they may be the conference participants. Collecting the voice samples of the plurality of objects may consist of collecting a voice sample of each object reading a preset text aloud. The preset text may include characters or words that bring out the pronunciation characteristics of dialects; illustratively, it may include "four", "yes", and the like. The acoustic features may include phoneme features, pronunciation attributes, and the like, and may be used to identify the region to which the speaker's dialect belongs.
In addition, the regional speech recognition model may include a convolutional neural network, a recurrent neural network, a long short-term memory (LSTM) network, or the like; any network structure that can be used to recognize the geographic region to which speech belongs may serve as the network structure of the regional speech recognition model.
It should be noted that the geographic region to which the voice of each object belongs can be determined through the output result of the regional voice recognition model, so that the voice recognition model matched with the voice of each object can be determined, and the corresponding relationship between the voiceprint feature corresponding to each object and the voice recognition model can be determined. According to the corresponding relation between the voiceprint features corresponding to the objects and the voice recognition models, the target voice recognition model corresponding to the target voiceprint feature can be obtained from the pre-trained voice recognition models.
In this embodiment, the acoustic features corresponding to each object are input into a pre-trained regional speech recognition model, and the correspondence between the voiceprint features corresponding to the plurality of objects and the plurality of speech recognition models is determined based on the output result of the regional speech recognition model, so that the target speech recognition model corresponding to the target voiceprint features can be quickly and accurately recognized by the regional speech recognition model.
Optionally, the region speech recognition model includes a first region speech recognition submodel and a second region speech recognition submodel;
the respectively inputting the acoustic features corresponding to each object into a pre-trained regional speech recognition model includes:
respectively inputting the acoustic features corresponding to each object into the first region voice recognition submodel, and determining a first-level geographic region to which the voice belongs;
inputting the acoustic features corresponding to each object into a second regional voice recognition submodel corresponding to the first-level geographic region respectively;
the determining, based on the output result of the region speech recognition model, a corresponding relationship between the voiceprint features corresponding to the plurality of objects and the plurality of pre-trained speech recognition models includes:
and determining the corresponding relation between the voiceprint features corresponding to the objects and the voice recognition models based on the output result of the second regional voice recognition submodel corresponding to the first-level geographic region.
The regional speech recognition model may include one first regional speech recognition submodel and a plurality of second regional speech recognition submodels. The first regional speech recognition submodel can be used to recognize the first-level geographic region to which speech belongs. Each first-level geographic region can correspond to one second regional speech recognition submodel, which can be used to recognize the second-level geographic region to which the speech belongs. The plurality of speech recognition models may correspond one-to-one to a plurality of second-level geographic regions. For example, the first-level geographic region may be a province and the second-level geographic region may be a city.
In this embodiment, the first-level geographic region to which speech belongs is determined by the first regional speech recognition submodel, and the correspondence between the voiceprint features corresponding to the plurality of objects and the plurality of speech recognition models is determined by the second regional speech recognition submodel corresponding to that first-level geographic region. Using two levels of models allows this correspondence to be determined more accurately. In addition, the speech recognition model for each dialect region can be trained specifically on the dialect of that region, so that the trained speech recognition models achieve a better recognition effect.
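The two-level lookup can be illustrated with a short sketch. Everything named here is an assumption made for illustration: the function `classify_two_level`, the region labels, and the rule-based stand-ins for the trained first and second regional submodels.

```python
from typing import Callable, Dict, Sequence, Tuple

Features = Sequence[float]
Classifier = Callable[[Features], str]


def classify_two_level(
    features: Features,
    first_level: Classifier,
    second_level_by_region: Dict[str, Classifier],
) -> Tuple[str, str]:
    """Return (first-level region, second-level region) for one object."""
    province = first_level(features)
    # Only the second-level submodel belonging to the detected province is run.
    city = second_level_by_region[province](features)
    return province, city


# Toy stand-ins for trained submodels (hypothetical rules).
def first_level(features: Features) -> str:
    return "province_x" if features[0] < 0.5 else "province_y"


second_level = {
    "province_x": lambda f: "city_x1" if f[1] < 0.5 else "city_x2",
    "province_y": lambda f: "city_y1" if f[1] < 0.5 else "city_y2",
}

# One speech recognition model per second-level region (assumed names).
model_by_city = {c: f"asr_{c}" for c in ("city_x1", "city_x2", "city_y1", "city_y2")}

province, city = classify_two_level([0.2, 0.9], first_level, second_level)
target_model = model_by_city[city]
print(province, city, target_model)
```

Dispatching to a single second-level submodel keeps each submodel's label set small, which is one plausible reason a two-level arrangement could be more accurate than a single flat classifier.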
Optionally, the respectively acquiring voice samples of a plurality of objects includes:
respectively collecting voice samples and user identity information of a plurality of objects;
after the extracting the voiceprint features and the acoustic features of the voice sample of each object in the plurality of objects respectively, the method further includes:
and determining the corresponding relation between the voiceprint characteristics corresponding to the plurality of objects and the user identity information.
The user identity information may include a name, an employee number, an identification number, or the like. Based on the correspondence between the voiceprint features corresponding to the plurality of objects and the user identity information, the user identity information corresponding to the target voiceprint feature can be obtained, so that the identity of the speaker can be determined quickly.
In this embodiment, by determining the correspondence between the voiceprint features corresponding to the plurality of objects and the user identity information, the identity of a speaker can be identified from the speaker's voiceprint feature.
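As a rough sketch of how such a correspondence might be queried, the example below matches a target voiceprint against enrolled voiceprints and returns the associated identity record. The use of cosine similarity, the threshold of 0.8, and all names and data are assumptions for illustration; the text above does not specify a matching method.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify_speaker(
    target: Sequence[float],
    enrolled: Dict[str, Sequence[float]],
    identities: Dict[str, dict],
    threshold: float = 0.8,  # assumed value, would need tuning in practice
) -> Optional[dict]:
    """Return the identity record of the best match, or None if none is close enough."""
    best_id, best_score = None, threshold
    for voiceprint_id, vector in enrolled.items():
        score = cosine_similarity(target, vector)
        if score >= best_score:
            best_id, best_score = voiceprint_id, score
    return identities[best_id] if best_id is not None else None


# Hypothetical enrollment data.
enrolled = {"vp_001": [1.0, 0.0, 0.0], "vp_002": [0.0, 1.0, 0.0]}
identities = {
    "vp_001": {"name": "Speaker A", "employee_no": "E001"},
    "vp_002": {"name": "Speaker B", "employee_no": "E002"},
}

known = identify_speaker([0.9, 0.1, 0.0], enrolled, identities)
unknown = identify_speaker([0.0, 0.0, 1.0], enrolled, identities)
print(known, unknown)
```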
Optionally, after extracting the target voiceprint feature of the speech to be recognized, the method further includes:
acquiring user identity information corresponding to the target voiceprint feature;
after the voice recognition is performed on the voice to be recognized based on the target voice recognition model, the method further comprises:
and storing the corresponding relation between the user identity information corresponding to the target voiceprint characteristics and the voice recognition result.
In this embodiment, the correspondence between the user identity information corresponding to the target voiceprint feature and the voice recognition result is stored, so that a conference record text can be generated automatically during a conference, providing a high degree of automation.
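One way to picture the stored correspondence is as an ordered list of (speaker, text) pairs that can be rendered as a conference record. The classes `Utterance` and `MeetingRecord` and the output format below are hypothetical; the text above does not prescribe a storage structure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Utterance:
    speaker: str  # user identity information, e.g. a name
    text: str     # voice recognition result for this utterance


@dataclass
class MeetingRecord:
    utterances: List[Utterance] = field(default_factory=list)

    def store(self, speaker: str, text: str) -> None:
        """Store one identity/result pair in speaking order."""
        self.utterances.append(Utterance(speaker, text))

    def render(self) -> str:
        """Render the stored pairs as a plain-text conference record."""
        return "\n".join(f"{u.speaker}: {u.text}" for u in self.utterances)


record = MeetingRecord()
record.store("Speaker A", "Let us begin.")
record.store("Speaker B", "Agreed.")
minutes = record.render()
print(minutes)
```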
Referring to fig. 2, fig. 2 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present disclosure, and as shown in fig. 2, the speech recognition apparatus 200 includes:
a first extraction module 201, configured to extract a target voiceprint feature of a speech to be recognized;
a first obtaining module 202, configured to obtain a target speech recognition model corresponding to the target voiceprint feature from a plurality of pre-trained speech recognition models, where the plurality of speech recognition models respectively correspond to a plurality of geographic areas;
and the first recognition module 203 is configured to perform voice recognition on the voice to be recognized based on the target voice recognition model to obtain a voice recognition result.
Optionally, the first obtaining module 202 is specifically configured to:
under the condition that a voiceprint feature matched with the target voiceprint feature exists in a plurality of prestored voiceprint features, acquiring a target voice recognition model corresponding to the target voiceprint feature from a plurality of pre-trained voice recognition models, wherein the voiceprint features correspond to the voice recognition models;
as shown in fig. 3, the apparatus 200 further includes:
a second recognition module 204, configured to perform voice recognition on the voice to be recognized by using a preset voice recognition model under the condition that a voiceprint feature matching the target voiceprint feature does not exist in the prestored voiceprint features, so as to obtain the voice recognition result.
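The division of work between the first obtaining module 202, the first recognition module 203 and the second recognition module 204 might look like the sketch below. It assumes that voiceprint matching has already produced either a matched voiceprint identifier or `None`; the function `recognize` and the stub recognizers are hypothetical placeholders for trained models.

```python
from typing import Callable, Dict, Optional

# A recognizer takes raw audio bytes and returns text (assumed interface).
Recognizer = Callable[[bytes], str]


def recognize(
    audio: bytes,
    matched_voiceprint_id: Optional[str],
    model_by_voiceprint: Dict[str, Recognizer],
    preset_model: Recognizer,
) -> str:
    """Use the region-specific model on a match, else fall back to the preset model."""
    model = None
    if matched_voiceprint_id is not None:
        model = model_by_voiceprint.get(matched_voiceprint_id)
    if model is None:
        model = preset_model  # no matching prestored voiceprint feature
    return model(audio)


# Stub recognizers standing in for trained speech recognition models.
def region_a_model(audio: bytes) -> str:
    return "[region_a] transcript"


def preset(audio: bytes) -> str:
    return "[preset] transcript"


models = {"vp_001": region_a_model}
matched_result = recognize(b"\x00\x01", "vp_001", models, preset)
fallback_result = recognize(b"\x00\x01", None, models, preset)
print(matched_result, fallback_result)
```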
Optionally, as shown in fig. 4, the apparatus 200 further includes:
a second obtaining module 205, configured to obtain a voice sample corresponding to each of the multiple geographic areas;
a training module 206, configured to input a speech sample corresponding to each of the multiple geographic areas into a speech recognition model corresponding to each of the geographic areas, and train the speech recognition model corresponding to each of the geographic areas;
the plurality of speech recognition models are speech recognition models corresponding to the plurality of geographic areas respectively.
Optionally, as shown in fig. 5, the apparatus 200 further includes:
an acquisition module 207, configured to acquire voice samples of a plurality of objects respectively;
a second extraction module 208, configured to extract a voiceprint feature and an acoustic feature of the voice sample of each object in the plurality of objects respectively;
an input module 209, configured to input the acoustic features corresponding to each object into a pre-trained regional speech recognition model, where the regional speech recognition model is used to recognize a geographic region to which speech belongs;
a first determining module 210, configured to determine, based on an output result of the region speech recognition model, a correspondence between voiceprint features corresponding to the multiple objects and the multiple speech recognition models.
Optionally, the region speech recognition model includes a first region speech recognition submodel and a second region speech recognition submodel;
the input module 209 is specifically configured to:
respectively inputting the acoustic features corresponding to each object into the first region voice recognition submodel, and determining a first-level geographic region to which the voice belongs;
inputting the acoustic features corresponding to each object into a second regional voice recognition submodel corresponding to the first-level geographic region respectively;
the first determining module 210 is specifically configured to:
and determining the corresponding relation between the voiceprint features corresponding to the objects and the voice recognition models based on the output result of the second regional voice recognition submodel corresponding to the first-level geographic region.
Optionally, the acquisition module 207 is specifically configured to:
respectively collecting voice samples and user identity information of a plurality of objects;
as shown in fig. 6, the apparatus 200 further includes:
a second determining module 211, configured to determine a correspondence between voiceprint features corresponding to the multiple objects and the user identity information.
Optionally, as shown in fig. 7, the apparatus 200 further includes:
a third obtaining module 212, configured to obtain user identity information corresponding to the target voiceprint feature;
a storage module 213, configured to store a corresponding relationship between the user identity information corresponding to the target voiceprint feature and the voice recognition result.
The speech recognition device can implement each process implemented in the method embodiment of fig. 1, and can achieve the same technical effect, and is not described here again to avoid repetition.
As shown in fig. 8, an embodiment of the present invention further provides an electronic device 300, including: a memory 302, a processor 301, and a program stored in the memory 302 and capable of running on the processor 301, where the program, when executed by the processor 301, implements the processes of the foregoing speech recognition method embodiment, and can achieve the same technical effects, and in order to avoid repetition, the details are not repeated here.
The embodiment of the present disclosure further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements each process of the foregoing speech recognition method embodiment, and can achieve the same technical effect, and in order to avoid repetition, the details are not repeated here. The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present disclosure.
While the disclosed embodiments have been described in connection with the appended drawings, the present invention is not limited to the specific embodiments described above, which are intended to be illustrative rather than limiting, and it will be appreciated by those of ordinary skill in the art that, in light of the teachings of the present invention, many modifications may be made without departing from the spirit and scope of the invention as set forth in the appended claims.

Claims (10)

1. A method of speech recognition, the method comprising:
extracting target voiceprint characteristics of the voice to be recognized;
acquiring a target voice recognition model corresponding to the target voiceprint feature from a plurality of pre-trained voice recognition models, wherein the plurality of voice recognition models respectively correspond to a plurality of geographic areas;
and carrying out voice recognition on the voice to be recognized based on the target voice recognition model to obtain a voice recognition result.
2. The method according to claim 1, wherein the obtaining a target speech recognition model corresponding to the target voiceprint feature from a plurality of pre-trained speech recognition models comprises:
under the condition that a voiceprint feature matched with the target voiceprint feature exists in a plurality of prestored voiceprint features, acquiring a target voice recognition model corresponding to the target voiceprint feature from a plurality of pre-trained voice recognition models, wherein the voiceprint features correspond to the voice recognition models;
the method further comprises the following steps:
and under the condition that the voiceprint features matched with the target voiceprint features do not exist in the prestored voiceprint features, performing voice recognition on the voice to be recognized by adopting a preset voice recognition model to obtain the voice recognition result.
3. The method of claim 1, wherein before obtaining the target speech recognition model corresponding to the target voiceprint feature from the pre-trained plurality of speech recognition models, the method further comprises:
obtaining voice samples corresponding to the plurality of geographic areas respectively;
respectively inputting the voice sample corresponding to each geographic area in the plurality of geographic areas into the voice recognition model corresponding to each geographic area, and training the voice recognition model corresponding to each geographic area;
the plurality of speech recognition models are speech recognition models corresponding to the plurality of geographic areas respectively.
4. The method according to claim 1, wherein before extracting the target voiceprint feature of the speech to be recognized, the method further comprises:
respectively collecting voice samples of a plurality of objects;
respectively extracting the voiceprint characteristics and the acoustic characteristics of the voice sample of each object in the plurality of objects;
respectively inputting the acoustic features corresponding to each object into a pre-trained regional speech recognition model, wherein the regional speech recognition model is used for recognizing the geographic region to which the speech belongs;
and determining the corresponding relation between the voiceprint features corresponding to the plurality of objects and the plurality of voice recognition models based on the output result of the regional voice recognition model.
5. The method of claim 4, wherein the regional speech recognition model comprises a first regional speech recognition submodel and a second regional speech recognition submodel;
the respectively inputting the acoustic features corresponding to each object into a pre-trained regional speech recognition model includes:
respectively inputting the acoustic features corresponding to each object into the first region voice recognition submodel, and determining a first-level geographic region to which the voice belongs;
inputting the acoustic features corresponding to each object into a second regional voice recognition submodel corresponding to the first-level geographic region respectively;
the determining, based on the output result of the region speech recognition model, a corresponding relationship between the voiceprint features corresponding to the plurality of objects and the plurality of pre-trained speech recognition models includes:
and determining the corresponding relation between the voiceprint features corresponding to the objects and the voice recognition models based on the output result of the second regional voice recognition submodel corresponding to the first-level geographic region.
6. The method of claim 4, wherein the respectively collecting voice samples of a plurality of objects comprises:
respectively collecting voice samples and user identity information of a plurality of objects;
after the extracting the voiceprint features and the acoustic features of the voice sample of each object in the plurality of objects respectively, the method further includes:
and determining the corresponding relation between the voiceprint characteristics corresponding to the plurality of objects and the user identity information.
7. The method according to claim 6, wherein after extracting the target voiceprint feature of the speech to be recognized, the method further comprises:
acquiring user identity information corresponding to the target voiceprint feature;
after the voice recognition is performed on the voice to be recognized based on the target voice recognition model, the method further comprises:
and storing the corresponding relation between the user identity information corresponding to the target voiceprint characteristics and the voice recognition result.
8. A speech recognition apparatus, characterized in that the speech recognition apparatus comprises:
the first extraction module is used for extracting the target voiceprint characteristics of the voice to be recognized;
a first obtaining module, configured to obtain a target voice recognition model corresponding to the target voiceprint feature from a plurality of pre-trained voice recognition models, where the plurality of voice recognition models correspond to a plurality of geographic areas, respectively;
and the first recognition module is used for carrying out voice recognition on the voice to be recognized based on the target voice recognition model to obtain a voice recognition result.
9. An electronic device, comprising: memory, processor and program stored on the memory and executable on the processor, which when executed by the processor implements the steps in the speech recognition method according to any of claims 1 to 7.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the speech recognition method as claimed in any one of the claims 1 to 7.
CN202110817296.6A 2021-07-20 2021-07-20 Voice recognition method and device and electronic equipment Pending CN113409774A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110817296.6A CN113409774A (en) 2021-07-20 2021-07-20 Voice recognition method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN113409774A true CN113409774A (en) 2021-09-17

Family

ID=77687031

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110817296.6A Pending CN113409774A (en) 2021-07-20 2021-07-20 Voice recognition method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN113409774A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110150270A1 (en) * 2009-12-22 2011-06-23 Carpenter Michael D Postal processing including voice training
CN105930035A (en) * 2016-05-05 2016-09-07 北京小米移动软件有限公司 Interface background display method and apparatus
CN109545218A (en) * 2019-01-08 2019-03-29 广东小天才科技有限公司 A kind of audio recognition method and system
CN110335612A (en) * 2019-07-11 2019-10-15 招商局金融科技有限公司 Minutes generation method, device and storage medium based on speech recognition
CN111785275A (en) * 2020-06-30 2020-10-16 北京捷通华声科技股份有限公司 Voice recognition method and device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115831094A (en) * 2022-11-08 2023-03-21 北京数美时代科技有限公司 Multilingual voice recognition method, system, storage medium and electronic equipment
CN115831094B (en) * 2022-11-08 2023-08-15 北京数美时代科技有限公司 Multilingual voice recognition method, multilingual voice recognition system, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination