CN113592559B - Method and device for establishing accent recognition model, storage medium and electronic equipment


Info

Publication number
CN113592559B
CN113592559B (application CN202110888963.XA)
Authority
CN
China
Prior art keywords
accent
city
model
area
recognition model
Prior art date
Legal status
Active
Application number
CN202110888963.XA
Other languages
Chinese (zh)
Other versions
CN113592559A (en)
Inventor
徐延广
颜瑞
解传栋
Current Assignee
Seashell Housing Beijing Technology Co Ltd
Original Assignee
Seashell Housing Beijing Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Seashell Housing Beijing Technology Co Ltd filed Critical Seashell Housing Beijing Technology Co Ltd
Priority to CN202110888963.XA priority Critical patent/CN113592559B/en
Publication of CN113592559A publication Critical patent/CN113592559A/en
Application granted granted Critical
Publication of CN113592559B publication Critical patent/CN113592559B/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00: Commerce
    • G06Q30/02: Marketing; Price estimation or determination; Fundraising
    • G06Q30/0281: Customer communication at a business location, e.g. providing product or service information, consulting
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/044: Recurrent networks, e.g. Hopfield networks
    • G06N3/08: Learning methods
    • G06Q50/00: Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10: Services
    • G06Q50/16: Real estate
    • G06Q50/167: Closing

Abstract

The embodiment of the invention provides a method and device for establishing an accent recognition model, a storage medium, and electronic equipment. The method comprises the following steps: training a single city accent recognition model for each first city by adopting that city's user voice training sample set, to obtain the single city accent recognition model of each first city; respectively inputting the user voice test sample set of each second city into the single city accent recognition model of each first city; for each second city, selecting, from among the single city accent recognition models of all first cities, the model whose recognition rate is the highest and greater than a first threshold, and dividing the first city corresponding to the selected model and the second city into the same accent region; and, for each accent region, training the regional accent recognition model of the accent region by adopting the user voice training sample sets of all cities in the accent region, to obtain the regional accent recognition model of the accent region. The embodiment of the invention improves the recognition rate of accent recognition.

Description

Method and device for establishing accent recognition model, storage medium and electronic equipment
Technical Field
The embodiment of the invention relates to a method and device for establishing an accent recognition model, a readable storage medium, and electronic equipment.
Background
At present, in housing transactions, VR (Virtual Reality) technology is used to show houses to customers online, for convenience. In this scenario, the broker guides the customer through the listing as the VR view changes, and during the showing the broker and the customer converse, that is, both of them speak. Since customers may come from different regions, there may be situations where the broker cannot fully understand the customer's speech.
At present, some models that recognize speech as text have appeared; inputting the speech of a user from any region into such a model yields the corresponding text. However, because the accent differences between some regions are large, the recognition rate cannot be guaranteed.
Disclosure of Invention
The embodiment of the invention provides an accent recognition model establishing method and device, a readable storage medium and electronic equipment, so as to improve the recognition rate of accent recognition.
The technical scheme of the embodiment of the invention is realized as follows:
a method for establishing a mouth sound identification model comprises the following steps:
acquiring a user voice training sample set of each first city in a first city set;
respectively training a single city accent recognition model of a corresponding first city by adopting a user voice training sample set of each first city in a first city set to obtain the single city accent recognition model of each first city;
respectively inputting the user voice test sample set of each second city in the second city set into the single city accent recognition model of each first city in the first city set, and calculating the recognition rate of each first city's model for each second city; for each second city, selecting, from among the single city accent recognition models of all first cities, the model whose recognition rate is the highest and greater than a first threshold, and dividing the first city corresponding to the selected model and the second city into the same accent region; wherein the second city set comprises the first city set;
and, for each accent region, training the regional accent recognition model of the accent region by adopting the user voice training sample sets of all cities in the accent region, to obtain the regional accent recognition model of the accent region.
After obtaining the regional accent recognition model of the accent region, the method further includes:
receiving the voice of the first user, determining an accent area corresponding to the city where the first user is located, and inputting the voice of the first user into the accent recognition model of the area of the accent area to obtain a text corresponding to the voice of the first user.
After the first city and the second city corresponding to the selected model are divided into the same accent region and before the user speech training sample sets of all cities in the accent region are adopted to train the regional accent recognition model of the accent region for each accent region, the method further includes:
selecting at least two accent regions from all accent regions each time for combination, until all combinations are obtained; taking each combination as an accent region set; for each accent region set, training the region set accent recognition model of the accent region set by adopting the user voice training sample sets of all cities in the accent region set, to obtain the region set accent recognition model of the accent region set; testing the region set accent recognition model of the accent region set and the single city accent recognition models corresponding to the multiple cities, using the user voice test sample sets of the multiple cities in the accent region set; and, if the test result shows that the difference between the recognition rate of the region set accent recognition model and the recognition rate of each city's single city accent recognition model is within a preset range, determining that all accent regions in the accent region set are fused into a new accent region.
The loss function adopted by the single city accent recognition model, the regional accent recognition model, and the region set accent recognition model is a weighted sum of a cross-entropy loss function and a discriminative loss function;
the structure of the single city accent recognition model, the regional accent recognition model, and the region set accent recognition model is a combination of a time-delay neural network and a long short-term memory network.
After obtaining the regional accent recognition model of the accent region, the method further includes:
for each accent area, adopting user voice training sample sets of all cities in the accent area, and training the regional accent distinguishing model of the accent area on the basis of the regional accent recognition model of the accent area to obtain the regional accent distinguishing model of the accent area; the structure of the regional accent distinguishing model is the same as that of the regional accent recognition model, and the loss function adopted by the regional accent distinguishing model is as follows: a discriminative loss function;
receiving the voice of the first user, determining an accent area corresponding to the city where the first user is located, inputting the voice of the first user into the accent distinguishing model of the area of the accent area, and obtaining a text corresponding to the voice of the first user.
After obtaining the regional accent recognition model of the accent region, the method further includes:
for each accent region, dividing the user speech training sample set of all cities in the accent region into a plurality of training sample subsets, sequentially adopting each training sample subset to train the regional accent distinguishing model of the accent region, obtaining the regional accent distinguishing model of the accent region, wherein,
the regional accent distinguishing model corresponding to the first training sample subset is trained on the basis of the regional accent recognition model of the accent region;
the regional accent distinguishing model corresponding to each subsequent training sample subset is trained on the basis of the regional accent distinguishing model corresponding to the previous training sample subset;
the structure of the regional accent distinguishing model is the same as that of the regional accent recognition model;
the loss function adopted by the regional accent distinguishing model is as follows: a discriminative loss function.
After obtaining the regional accent distinguishing model of the accent region, the method further includes:
acquiring a user voice test sample set of a third city; wherein the third city is not included in any accent regions that have been divided;
and respectively inputting the user voice test sample set of the third city into the regional accent distinguishing model of each accent region for testing, selecting the model with the highest recognition rate and the recognition rate larger than a second threshold value, and fusing the third city into the accent region corresponding to the selected model.
After obtaining the regional accent distinguishing model of the accent region, the method further includes:
if the user voice training sample set of any accent area is detected to be updated, training the area accent recognition model of the accent area by adopting the updated user voice training sample set of the accent area on the basis of the universal accent recognition model to obtain the updated area accent recognition model of the accent area;
and training the regional accent distinguishing model of the accent region by adopting the updated user voice training sample set of the accent region on the basis of the updated regional accent recognition model of the accent region to obtain the updated regional accent distinguishing model of the accent region.
The training of the single city accent recognition model of the corresponding first city by respectively adopting the user voice training sample set of each first city in the first city set comprises the following steps:
respectively adopting a user voice training sample set of each first city in the first city set to train a single city accent recognition model of the corresponding first city on the basis of the obtained general accent recognition model;
wherein the general accent recognition model is obtained by the following processes:
and training the universal accent recognition model by adopting the user voice training sample sets of all cities to obtain the universal accent recognition model.
The first threshold is the recognition rate obtained by inputting the user voice test sample set of the second city into a universal accent recognition model for testing and calculating the recognition rate from the test result.
After obtaining the regional accent recognition model of the accent region, the method further includes:
and if the user voice training sample set of any accent area is detected to be updated, training the area accent recognition model of the accent area by adopting the updated user voice training sample set of the accent area on the basis of the universal accent recognition model to obtain the updated area accent recognition model of the accent area.
A computer program product comprising a computer program or instructions which, when executed by a processor, carry out the steps of the accent recognition model building method as claimed in any one of the preceding claims.
An apparatus for establishing an accent recognition model, the apparatus comprising:
the single city accent recognition model establishing module, used for acquiring a user voice training sample set of each first city in the first city set, and for respectively training the single city accent recognition model of the corresponding first city by adopting the user voice training sample set of each first city in the first city set, to obtain the single city accent recognition model of each first city;
the region division module, used for respectively inputting the user voice test sample set of each second city in the second city set into the single city accent recognition model of each first city in the first city set, and for calculating the recognition rate of each first city's model for each second city; for each second city, selecting, from among the single city accent recognition models of all first cities, the model whose recognition rate is the highest and greater than a first threshold, and dividing the first city corresponding to the selected model and the second city into the same accent region; wherein the second city set comprises the first city set;
and the regional accent recognition model establishing module, used for training, for each accent region, the regional accent recognition model of the accent region by adopting the user voice training sample sets of all cities in the accent region, to obtain the regional accent recognition model of the accent region.
A non-transitory computer readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the steps of the accent recognition model building method of any one of the above.
An electronic device comprising the non-transitory computer-readable storage medium described above, and a processor having access to the non-transitory computer-readable storage medium.
In the embodiment of the invention, the single city accent recognition model of each city is established, and then a plurality of cities are fused into one accent area according to the recognition rate of the user voice of each city on the single city accent recognition models of other cities, so that the recognition rate of accent recognition is improved, the number of the area accent recognition models is reduced as much as possible, and the occupation of computing resources during model operation is reduced.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive labor.
Fig. 1 is a flowchart of a method for establishing an accent recognition model according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for establishing an accent recognition model according to another embodiment of the present invention;
fig. 3 is a schematic structural diagram of an accent recognition model building apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprising" and "having," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements explicitly listed, but may include other steps or elements not explicitly listed or inherent to such process, method, article, or apparatus.
The technical solution of the present invention will be described in detail with specific examples. Several of the following embodiments may be combined with each other and some details of the same or similar concepts or processes may not be repeated in some embodiments.
The embodiment of the invention provides a method for establishing an accent recognition model, which comprises: acquiring a user voice training sample set of each first city in a first city set; respectively training the single city accent recognition model of the corresponding first city by adopting the user voice training sample set of each first city in the first city set, to obtain the single city accent recognition model of each first city; respectively inputting the user voice test sample set of each second city in the second city set into the single city accent recognition model of each first city in the first city set, and calculating the recognition rate of each first city's model for each second city; for each second city, selecting, from among the single city accent recognition models of all first cities, the model whose recognition rate is the highest and greater than a first threshold, and dividing the first city corresponding to the selected model and the second city into the same accent region, wherein the second city set comprises the first city set; and, for each accent region, training the regional accent recognition model of the accent region by adopting the user voice training sample sets of all cities in the accent region, to obtain the regional accent recognition model of the accent region. The embodiment of the invention first establishes the single city accent recognition model of each city, and then fuses a plurality of cities into one accent region according to the recognition rate of each city's user voice on the single city accent recognition models of other cities; this not only improves the recognition rate of accent recognition, but also reduces the number of regional accent recognition models as much as possible, reducing the occupation of computing resources when the models run.
Fig. 1 is a flowchart of a method for establishing an accent recognition model according to an embodiment of the present invention, which includes the following steps:
step 101: and acquiring a user voice training sample set of each first city in the first city set.
The first city set comprises a plurality of first cities, and different first cities respectively represent different cities.
Step 102: respectively training the single city accent recognition model of the corresponding first city by adopting the user voice training sample set of each first city in the first city set, to obtain the single city accent recognition model of each first city.
Step 103: respectively inputting the user voice test sample set of each second city in the second city set into the single city accent recognition model of each first city in the first city set, and calculating the recognition rate of each first city's model for each second city; for each second city, selecting, from among the single city accent recognition models of all first cities, the model whose recognition rate is the highest and greater than a first threshold, and dividing the first city corresponding to the selected model and the second city into the same accent region; wherein the second city set comprises the first city set.
The second city set comprises a plurality of second cities, and different second cities respectively represent different cities.
Step 104: for each accent region, training the regional accent recognition model of the accent region by adopting the user voice training sample sets of all cities in the accent region, to obtain the regional accent recognition model of the accent region.
In the embodiment of the invention, the single city accent recognition model of each city is established, and then a plurality of cities are fused into one accent area according to the recognition rate of the user voice of each city on the single city accent recognition models of other cities, so that the recognition rate of accent recognition is improved, the number of the area accent recognition models is reduced as much as possible, and the occupation of computing resources during model operation is reduced.
After obtaining the regional accent recognition models of a plurality of accent regions, the models can be used for accent recognition, which is specifically as follows:
in an alternative embodiment, step 104 is followed by further comprising: receiving the voice of the first user, determining an accent area corresponding to the city where the first user is located, and inputting the voice of the first user into the accent recognition model of the area of the accent area to obtain a text corresponding to the voice of the first user.
Because too many regional accent recognition models consume a large amount of computing resources when the models run, the following optimization scheme is provided to further reduce the number of regional accent recognition models:
after the city is divided into accent regions in step 103 and before step 104 is executed, the method further includes:
selecting at least two accent regions from all the accent regions obtained in step 103 each time for combination, until all combinations are obtained; taking each combination as an accent region set; for each accent region set, training the region set accent recognition model of the accent region set by adopting the user voice training sample sets of all cities in the accent region set, to obtain the region set accent recognition model of the accent region set; testing the region set accent recognition model of the accent region set and the single city accent recognition models corresponding to the cities, using the user voice test sample sets of the cities in the accent region set; and, if the test result shows that the difference between the recognition rate of the region set accent recognition model and the recognition rate of each city's single city accent recognition model is within a preset range, determining that all accent regions in the accent region set are fused into a new accent region.
Wherein, at least two accent regions are selected from all the accent regions obtained in step 103 each time to be combined until all the combinations are obtained. Examples are as follows:
all accent regions are assumed to include accent regions A, B, C and D; the resulting combinations are: AB, AC, AD, BC, BD, CD, ABC, ABD, ACD, BCD, ABCD. In the above embodiment, by selectively fusing the accent regions obtained in step 103, the number of regional accent recognition models is reduced without significantly reducing the accent recognition rate, thereby reducing the occupation of computing resources when the models run.
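As a minimal sketch, this enumeration can be written directly with Python's itertools; the region labels below are illustrative:

```python
from itertools import combinations

def all_region_combinations(regions):
    """Yield every combination of at least two accent regions; for
    regions A, B, C, D this gives AB, AC, ..., BCD, ABCD as above."""
    for size in range(2, len(regions) + 1):
        yield from combinations(regions, size)

print(["".join(c) for c in all_region_combinations("ABCD")])
# ['AB', 'AC', 'AD', 'BC', 'BD', 'CD', 'ABC', 'ABD', 'ACD', 'BCD', 'ABCD']
```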
In an alternative embodiment, the loss function used by the single city accent recognition model, the regional accent recognition model, and the region set accent recognition model is a weighted sum of the cross-entropy loss function and the discriminative loss function. The weight values can be determined through multiple experiments.
In an alternative embodiment, the single city accent recognition model, the regional accent recognition model, and the regional collective accent recognition model are structured as a combination of a Time Delay Neural Network (TDNN) and a Long Short-Term Memory Network (LSTM). For example, the structures of the single city accent recognition model, the region accent recognition model and the region set accent recognition model are all as follows: 7-layer TDNN + 3-layer LSTM networks.
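As an illustrative sketch only, such a model could be assembled in PyTorch, with each TDNN layer realized as a dilated 1-D convolution over time. The hidden sizes, kernel widths, and dilations below are assumptions: the patent specifies only "7-layer TDNN + 3-layer LSTM", 80-dimensional input features, and 3296 output acoustic states:

```python
import torch.nn as nn

class TDNNLayer(nn.Module):
    """One TDNN layer: a dilated 1-D convolution over time plus ReLU."""
    def __init__(self, in_dim, out_dim, kernel=3, dilation=1):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size=kernel, dilation=dilation)
        self.relu = nn.ReLU()

    def forward(self, x):  # x: (batch, channels, time)
        return self.relu(self.conv(x))

class AccentAcousticModel(nn.Module):
    """Sketch of the 7-layer TDNN + 3-layer LSTM acoustic model:
    80-dim FBank frames in, logits over 3296 acoustic states out."""
    def __init__(self, feat_dim=80, hidden=512, num_states=3296):
        super().__init__()
        dilations = [1, 1, 2, 2, 3, 3, 3]  # 7 TDNN layers (assumed dilations)
        dims = [feat_dim] + [hidden] * 7
        self.tdnn = nn.Sequential(*[
            TDNNLayer(dims[i], dims[i + 1], dilation=d)
            for i, d in enumerate(dilations)
        ])
        self.lstm = nn.LSTM(hidden, hidden, num_layers=3, batch_first=True)
        self.out = nn.Linear(hidden, num_states)

    def forward(self, feats):  # feats: (batch, time, 80)
        x = self.tdnn(feats.transpose(1, 2))   # -> (batch, hidden, time')
        x, _ = self.lstm(x.transpose(1, 2))    # -> (batch, time', hidden)
        return self.out(x)                     # per-frame acoustic-state logits
```

During training, per the loss described above, a cross-entropy term over these per-frame state logits would be combined with a discriminative term by weighted summation, with the weights tuned experimentally.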
In order to further improve the recognition rate of accent recognition, the invention provides the following optimization scheme:
in an alternative embodiment, after obtaining the regional accent recognition models of the individual accent regions through step 104, the method further includes:
and for each accent area, training the regional accent distinguishing model of the accent area by adopting the user voice training sample sets of all cities in the accent area on the basis of the regional accent recognition model of the accent area to obtain the regional accent distinguishing model of the accent area.
The user voice training sample sets of all cities in the accent region that participate in training refer to the user voice training sample sets of all cities in the accent region available as of the start of training.
The structure of the regional accent distinguishing model is the same as that of the regional accent recognition model;
the loss function adopted by the regional accent distinguishing model is as follows: a discriminative loss function.
In an alternative embodiment, the input of the single city accent recognition model, the region set accent recognition model, and the regional accent distinguishing model is an 80-dimensional FBank (Filter Bank) feature extracted from the user's voice; preferably, the FBank feature can be normalized before being input into the model;
the output of the single city accent recognition model, the region set accent recognition model, and the regional accent distinguishing model is a 3296-dimensional acoustic state.
In practical application, the output of the single city accent recognition model, the region set accent recognition model, and the regional accent distinguishing model is a triphone state sequence. The triphone state sequence can be converted into a monophone state sequence through the context-dependency model, phonemes can be converted into corresponding words through a dictionary file, and individual words can be strung into a word sequence through the language model, thereby obtaining the finally recognized text sequence. Comparing, character by character, the text sequence finally recognized by the model with the correct text sequence corresponding to the input user voice yields the recognition rate of the model.
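A small sketch of this character-level scoring follows, treating the recognition rate as 1 minus the character error rate computed from edit distance; that definition is an assumption consistent with the character-by-character comparison described above:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution
    return dp[-1]

def recognition_rate(pairs):
    """pairs: iterable of (correct_text, recognized_text) tuples."""
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total = sum(len(ref) for ref, _ in pairs)
    return 1.0 - errors / total
```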
One definition of the discriminative loss function is as follows:

$$F(\theta)=\sum_{m=1}^{M}\log\frac{P(o^{m}\mid s^{m};\theta)^{k}\,P(w^{m})}{\sum_{w}\sum_{s}P(o^{m}\mid s;\theta)^{k}\,P(w)}$$

where θ represents the model parameters: weights and biases; o^m represents the feature sequence of the m-th speech sample input to the model; w^m represents the correct text sequence of the m-th speech sample; s^m is the forcibly aligned acoustic state sequence output by the model for the m-th speech sample; M is the total number of speech samples input to the model; s in the denominator represents all possible acoustic state sequences the model may output; w in the denominator represents all possible text sequences decodable from those acoustic state sequences; and k is a preset acoustic scaling coefficient.
P(o^m | s^m; θ) represents the probability, with model parameters θ and model input o^m, that the model output is s^m; P(w^m) represents the probability that the acoustic state sequence output by the model decodes into w^m; and P(w) represents the probability that the acoustic state sequence output by the model decodes into w.
In an alternative embodiment, when the number of samples in the user speech training sample sets of all cities in an accent region is large, the user speech training sample sets of all cities in the accent region are divided into a plurality of training sample subsets, and the regional accent classification model of the accent region is obtained by sequentially training the regional accent classification model of the accent region with each training sample subset, wherein,
the regional accent distinguishing model corresponding to the first training sample subset is trained on the basis of the regional accent recognition model of the accent region;
the regional accent distinguishing model corresponding to each subsequent training sample subset is trained on the basis of the regional accent distinguishing model corresponding to the previous training sample subset.
For example: dividing a user voice training sample set of all cities in an accent area into 3 sample subsets, training a regional accent distinguishing model of the accent area by using the training sample subset 1 on the basis of the regional accent recognition model of the accent area to obtain the regional accent distinguishing model 1 of the accent area; then training the regional accent distinguishing model of the accent region by adopting the training sample subset 2 on the basis of the regional accent distinguishing model 1 of the accent region to obtain the regional accent distinguishing model 2 of the accent region; and then training the regional accent distinguishing model of the accent region by adopting the training sample subset 3 on the basis of the regional accent distinguishing model 2 of the accent region to obtain the regional accent distinguishing model 3 of the accent region.
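A sketch of this sequential fine-tuning follows; `fine_tune` is a hypothetical placeholder for one training pass with the discriminative loss, since the patent does not specify the training loop itself:

```python
def split_into(samples, n):
    """Split a sample list into n roughly equal, consecutive subsets."""
    k = (len(samples) + n - 1) // n
    return [samples[i:i + k] for i in range(0, len(samples), k)]

def train_distinguishing_model(regional_recognition_model, samples, n_subsets=3):
    """Round 1 starts from the regional accent recognition model;
    each later round starts from the previous round's distinguishing model."""
    model = regional_recognition_model
    for subset in split_into(samples, n_subsets):
        # fine_tune is a hypothetical helper: one training pass over
        # `subset` using only the discriminative loss.
        model = fine_tune(model, subset, loss="discriminative")
    return model
```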
After the regional accent distinguishing models of the accent regions are obtained, when the voice of the first user is received, the accent region corresponding to the city where the first user is located is determined, the voice of the first user is input into the regional accent distinguishing models of the accent regions, and the text corresponding to the voice of the first user is obtained.
In an optional embodiment, after obtaining the regional accent differentiation model of each accent region, if it is found that any third city has not been classified into any accent region, then: and respectively inputting the user voice test sample set of the third city into the regional accent distinguishing model of each accent region for testing, selecting the model with the highest recognition rate and the recognition rate larger than a second threshold value, and fusing the third city into the accent region corresponding to the model.
Consider that, in practical applications, there will usually already be a universal accent recognition model, obtained by the following process: training the universal accent recognition model by adopting the user voice training sample sets of all cities; the structure of the universal accent recognition model is the same as that of the single city accent recognition model. In that case, in order to improve the recognition rate of the single city accent recognition model, the following optimization processing can be carried out:
in an optional embodiment, in step 102, the training of the single city accent recognition model of the corresponding first city by using the user speech training sample set of each first city in the first city set includes:
and respectively training the single city accent recognition model of the corresponding first city by adopting the user voice training sample set of each first city in the first city set on the basis of the obtained general accent recognition model.
And the first threshold in step 103 is the recognition rate obtained by inputting the user voice test sample set of the second city into the universal accent recognition model for testing and calculating the recognition rate from the test result.
In an optional embodiment, after the third city is merged into the accent region corresponding to the model, the method further includes:
and if the user voice training sample set of the third city is detected to be updated, adding the updated user voice training sample set into the user voice training sample set of the corresponding accent area of the third city.
When the user voice training sample set of any accent area is detected to be updated, training the area accent recognition model of the accent area by adopting the updated user voice training sample set of the accent area on the basis of the universal accent recognition model to obtain the updated area accent recognition model of the accent area; then, the updated user voice training sample set of the accent area is adopted, and on the basis of the updated regional accent recognition model of the accent area, the regional accent distinguishing model of the accent area is continuously trained, so that the updated regional accent distinguishing model of the accent area is obtained.
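The two-stage dependency of this update flow can be sketched as follows; `fine_tune` is the same hypothetical helper as above, and the `region` attributes are assumptions for illustration:

```python
def on_region_samples_updated(region, universal_model):
    """When a region's training sample set is updated: (1) retrain the
    regional accent recognition model from the universal model, then
    (2) retrain the regional accent distinguishing model from the
    updated recognition model."""
    samples = region.training_samples  # the updated sample set
    region.recognition_model = fine_tune(
        universal_model, samples, loss="cross_entropy+discriminative")
    region.distinguishing_model = fine_tune(
        region.recognition_model, samples, loss="discriminative")
```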
Fig. 2 is a flowchart of a method for establishing an accent recognition model according to another embodiment of the present invention, which includes the following steps:
step 201: and acquiring a user voice training sample set of each first city in the first city set.
Step 202: on the basis of the general accent recognition model, the user voice training sample set of each first city in the first city set is adopted to train the single-city accent recognition model of the corresponding first city, and the single-city accent recognition model of each first city is obtained.
The general accent recognition model is obtained by the following processes: and training the universal accent recognition model by adopting the user voice training sample sets of all cities to obtain the universal accent recognition model.
The structure of the single city accent recognition model is the same as that of the general accent recognition model.
On the basis of the general accent recognition model, the recognition rate of the single city accent recognition model obtained through training is higher.
In an alternative embodiment, the input of the single city accent recognition model is an 80-dimensional FBank (Filter Bank) feature extracted from the user's voice; preferably, the FBank feature can be normalized before being input into the model. The output of the single city accent recognition model is a 3296-dimensional acoustic state.
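A sketch of this feature pipeline, assuming torchaudio's Kaldi-compatible FBank extraction and simple per-utterance mean/variance normalization (the patent says only that the feature "can be normalized", so the normalization choice is an assumption):

```python
import torchaudio

def extract_fbank(wav_path):
    """Extract 80-dim FBank features and apply per-utterance
    mean/variance normalization before feeding the model."""
    waveform, sample_rate = torchaudio.load(wav_path)
    feats = torchaudio.compliance.kaldi.fbank(
        waveform, num_mel_bins=80, sample_frequency=sample_rate)
    return (feats - feats.mean(dim=0)) / (feats.std(dim=0) + 1e-8)
```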
Step 203: respectively inputting the user voice test sample set of each second city in the second city set into the single city accent recognition model of each first city in the first city set, and calculating the recognition rate of each first city's model for each second city; for each second city, selecting, from among the single city accent recognition models of all first cities, the model whose recognition rate is the highest and greater than the universal accent recognition model's recognition rate for the second city, and dividing the first city corresponding to the selected model and the second city into the same accent region; wherein the second city set comprises the first city set.
For example: if 10 cities exist in the first city set, respectively obtaining a single city accent recognition model for each city, and obtaining 10 single city accent recognition models in total;
setting any one second city in the second city set as a city A, respectively inputting a user voice test sample set of the city A into 10 single-city accent recognition models for testing, respectively calculating the recognition rate of each single-city accent recognition model to the city A according to a test result, selecting the highest recognition rate, and setting the highest recognition rate as alpha;
inputting the user voice test sample set of the city A into the universal accent recognition model for testing, and obtaining the recognition rate beta of the universal accent recognition model to the city A according to the test result;
then, if α > β, city A and the first city corresponding to the model achieving α are classified into the same accent region.
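The partition rule of this example can be expressed compactly; `model_recognition_rate` is a hypothetical helper that decodes a city's test samples with a model and scores them character by character, e.g. with the recognition-rate sketch given earlier:

```python
def assign_accent_region(second_city, test_samples, single_city_models,
                         universal_model, regions):
    """Divide a second city into the accent region of the best first city.

    single_city_models: dict mapping first city -> its single-city model.
    regions: dict mapping a representative first city -> set of cities
    in its accent region.
    """
    beta = model_recognition_rate(universal_model, test_samples)
    best_city = max(single_city_models,
                    key=lambda c: model_recognition_rate(single_city_models[c],
                                                         test_samples))
    alpha = model_recognition_rate(single_city_models[best_city], test_samples)
    if alpha > beta:  # recognition rate exceeds the first threshold
        regions.setdefault(best_city, {best_city}).add(second_city)
```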
Step 204: selecting at least two accent regions from all the accent regions obtained in step 203 each time for combination; taking each combination as an accent region set; for each accent region set, training the region set accent recognition model of the accent region set by adopting the user voice training sample sets of all cities in the accent region set, to obtain the region set accent recognition model of the accent region set; and testing the region set accent recognition model and the single city accent recognition models corresponding to the cities, using the user voice test sample sets of the cities in the accent region set. If the test result shows that the difference between the recognition rate of the region set accent recognition model and the recognition rate of each city's single city accent recognition model is within a preset range, all accent regions in the accent region set are determined to be fused into a new accent region. For example: if an accent region set X contains 3 accent regions and those 3 accent regions cover 8 cities, the region set accent recognition model of the accent region set X is trained by adopting the user voice training sample sets of the 8 cities, obtaining the region set accent recognition model of the accent region set X;
then, testing the accent recognition model of the regional set of the accent regional set by adopting the user voice test sample set of the 8 cities, and setting the recognition rate as a;
then, aiming at each city in the 8 cities, respectively adopting the user voice test sample set of each city to test the single city accent recognition model of the city, and respectively obtaining the recognition rate of the single city accent recognition model of each city to the city, thereby obtaining 8 recognition rates in total;
if the differences between a and each of the 8 recognition rates are all within a preset range, it is determined that the 3 accent regions can be further fused into a new accent region X.
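The fusion decision for a candidate region set can be sketched the same way (`model_recognition_rate` as before; `epsilon` stands in for the preset range):

```python
def should_fuse(region_set_model, city_test_sets, single_city_models, epsilon):
    """Decide whether the accent regions in a candidate set can be fused.

    city_test_sets: dict mapping city -> that city's speech test samples.
    Rate `a` is computed over the pooled test sets of all cities (the 8
    cities in the example above) and must be within epsilon of every
    city's own single-city model rate.
    """
    pooled = [s for samples in city_test_sets.values() for s in samples]
    a = model_recognition_rate(region_set_model, pooled)
    return all(
        abs(a - model_recognition_rate(single_city_models[city], samples)) <= epsilon
        for city, samples in city_test_sets.items())
```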
The structure and the input characteristics of the regional collective accent recognition model are the same as those of the single city accent recognition model.
Step 205: for each accent region, training the regional accent recognition model of the accent region by adopting the user voice training sample sets of all cities in the accent region, to obtain the regional accent recognition model of the accent region.
The user voice training sample sets of all cities in the accent region that participate in training refer to the user voice training sample sets of all cities in the accent region available as of the start of training.
Step 206: for each accent region, on the basis of the regional accent recognition model of the accent region, training the regional accent distinguishing model of the accent region by adopting the user voice training sample sets of all cities in the accent region, to obtain the regional accent distinguishing model of the accent region.
The user voice training sample sets of all cities in the accent region that participate in training refer to the user voice training sample sets of all cities in the accent region available as of the start of training.
The structure and the input characteristics of the regional accent distinguishing model are the same as those of the regional accent recognition model, and the loss function adopted by the regional accent distinguishing model is as follows: a discriminative loss function.
Step 207: if a third city is found not to have been divided into any accent region, then: respectively inputting the user voice test sample set of the third city into the regional accent distinguishing model of each accent region for testing, selecting the model whose recognition rate is the highest and greater than a second threshold, and fusing the third city into the accent region corresponding to that model.
Step 208: when the user voice training sample set of any accent area is detected to be updated, training the area accent recognition model of the accent area by adopting the updated user voice training sample set of the accent area on the basis of the universal accent recognition model to obtain the updated area accent recognition model of the accent area; then, the updated user voice training sample set of the accent area is adopted, and on the basis of the updated regional accent recognition model of the accent area, the regional accent distinguishing model of the accent area is continuously trained, so that the updated regional accent distinguishing model of the accent area is obtained.
Fig. 3 is a schematic structural diagram of an accent recognition model building apparatus provided in an embodiment of the present invention, and the apparatus mainly includes:
the single city oral sound recognition model establishing module 31 is used for acquiring a user voice training sample set of each first city in the first city set; and respectively training the single city accent recognition model of the corresponding first city by adopting the user voice training sample set of each first city in the first city set to obtain the single city accent recognition model of each first city.
The region dividing module 32 is configured to respectively input, for the single city accent recognition model of each first city obtained by the single city accent recognition model establishing module 31, the user voice test sample set of each second city in the second city set into the single city accent recognition model of each first city in the first city set, and to calculate the recognition rate of each first city's model for each second city; and, for each second city, to select, from among the single city accent recognition models of all first cities, the model whose recognition rate is the highest and greater than a first threshold, and to divide the first city corresponding to the selected model and the second city into the same accent region; wherein the second city set comprises the first city set.
The regional accent recognition model establishing module 33 is configured to train, according to each accent region obtained by the region division module 32, the regional accent recognition model of the accent region by using the user speech training sample sets of all cities in the accent region, so as to obtain the regional accent recognition model of the accent region.
In an optional embodiment, the apparatus further comprises: and the accent recognition module is used for receiving the voice of the first user, determining an accent area corresponding to the city where the first user is located, and inputting the voice of the first user into the area accent recognition model of the accent area to obtain a text corresponding to the voice of the first user.
In an optional embodiment, before the regional accent recognition model establishing module 33 trains the regional accent recognition model of each accent region using the user voice training sample sets of all cities in the accent region, the apparatus is further configured to: select at least two accent regions from all accent regions obtained by the region dividing module 32 for combination each time, until all combinations are obtained; take each combination as an accent region set; for each accent region set, train the region set accent recognition model of the accent region set by adopting the user voice training sample sets of all cities in the accent region set, to obtain the region set accent recognition model of the accent region set; test the region set accent recognition model and the single city accent recognition models corresponding to the cities, using the user voice test sample sets of the cities in the accent region set; and, if the test result shows that the difference between the recognition rate of the region set accent recognition model and the recognition rate of each city's single city accent recognition model is within a preset range, determine that all accent regions in the accent region set are fused into a new accent region.
In an optional embodiment, the structures of the single city accent recognition model, the regional accent recognition model, and the region set accent recognition model are combinations of TDNN and LSTM networks, for example, 7-layer TDNN + 3-layer LSTM networks;
the loss functions adopted by the single city accent recognition model, the region accent recognition model and the region set accent recognition model are as follows: a weighted sum of a cross-entropy loss function and a discriminative loss function.
In an optional embodiment, after the module 33 for establishing a regional accent recognition model obtains the regional accent recognition model of the accent region, the method further includes: for each accent area, adopting user voice training sample sets of all cities in the accent area, and training the regional accent distinguishing model of the accent area on the basis of the regional accent recognition model of the accent area to obtain the regional accent distinguishing model of the accent area; the structure of the regional accent distinguishing model is the same as that of the regional accent recognition model; the loss function adopted by the regional accent distinguishing model is as follows: a discriminative loss function.
In an optional embodiment, after the regional accent recognition model building module 33 obtains the regional accent recognition model of the accent region, the method further includes: for each accent region, dividing user speech training sample sets of all cities in the accent region into a plurality of training sample subsets, and sequentially adopting each training sample subset to train the regional accent distinguishing model of the accent region to obtain the regional accent distinguishing model of the accent region, wherein: the regional accent distinguishing model corresponding to the first training sample subset is trained on the basis of the regional accent recognition model of the accent region; the regional accent distinguishing model corresponding to each subsequent training sample subset is trained on the basis of the regional accent distinguishing model corresponding to the previous training sample subset; the structure of the regional accent distinguishing model is the same as that of the regional accent recognition model; the loss function adopted by the regional accent distinguishing model is as follows: a discriminative loss function.
In an optional embodiment, after the regional accent recognition model building module 33 obtains the regional accent distinguishing model of the accent region, the method further includes: receiving the voice of the first user, determining an accent area corresponding to the city where the first user is located, inputting the voice of the first user into the accent distinguishing model of the area of the accent area, and obtaining a text corresponding to the voice of the first user.
In an optional embodiment, after the regional accent recognition model building module 33 obtains the regional accent distinguishing model of the accent region, the method further includes: acquiring a user voice test sample set of a third city; wherein the third city is not included in any accent regions that have been divided; and respectively inputting the user voice test sample set of the third city into the regional accent distinguishing model of each accent region for testing, selecting the model with the highest recognition rate and the recognition rate larger than a second threshold value, and fusing the third city into the accent region corresponding to the selected model.
In an optional embodiment, the single city accent recognition model building module 31 separately trains the single city accent recognition model of the first city corresponding to the user speech training sample set of each first city in the first city set, including: respectively adopting a user voice training sample set of each first city in the first city set to train a single city accent recognition model of the corresponding first city on the basis of the obtained general accent recognition model; wherein the general accent recognition model is obtained by the following processes: and training the universal accent recognition model by adopting the user voice training sample sets of all cities to obtain the universal accent recognition model. Wherein the first threshold is: and inputting the user voice test sample set of the second city into a universal accent recognition model for testing, and calculating the obtained recognition rate according to the test result.
In an optional embodiment, after the module 33 for establishing a regional accent recognition model obtains the regional accent recognition model of the accent region, the method further includes: if the user voice training sample set of any accent area is detected to be updated, training the area accent recognition model of the accent area by adopting the updated user voice training sample set of the accent area on the basis of the universal accent recognition model to obtain the updated area accent recognition model of the accent area; then, the updated user speech training sample set of the accent region is adopted, and on the basis of the updated regional accent recognition model of the accent region, the regional accent distinguishing model of the accent region is continuously trained, so that the updated regional accent distinguishing model of the accent region is obtained.
The method was applied to a user voice acquisition platform, and 35 cities were finally divided into four accent regions: a Guangdong region (19 cities), a Northeast region (7 cities), a Shandong region (7 cities), and a Sichuan region (2 cities);
after a 35 city user voice test sample set is used for testing, the weighted average recognition rate of the embodiment of the invention is found to be 94.11%, and the weighted average recognition rate of the existing universal accent recognition model is 93.04%;
if the regional accent recognition models of all the accent regions are adopted to respectively test the user voice test sample sets of the corresponding regions, the following results are found: the weighted average recognition rate is improved by 0.51 percent relative to the universal accent recognition rate;
if the regional accent distinguishing models of all the accent regions are adopted to respectively test the user voice test sample sets of the corresponding regions, the following results are found: the weighted average recognition rate is improved by 0.56 percent relative to the universal accent recognition rate.
In addition, in the experiments, when the regional accent distinguishing model was trained on the basis of the regional accent recognition model, if the data volume of the training samples was less than 1000 hours, all training samples were used directly in a single round of regional accent distinguishing model training; if the data volume was 1000 hours or more, all training samples could be randomly divided into several batches, each batch being trained on the basis of the regional accent distinguishing model obtained from the previous batch. With three batches, the recognition rate of the resulting regional accent distinguishing model was highest.
The present application further provides a computer program product, which includes a computer program or instructions, and when the computer program or instructions is executed by a processor, the steps of the method for establishing an accent recognition model according to any one of the above method embodiments are implemented.
Embodiments of the present application also provide a computer-readable storage medium storing instructions which, when executed by a processor, perform the steps in the accent recognition model establishing method described above. In practical applications, the computer-readable medium may be included in each device/apparatus/system of the above embodiments, or may exist separately without being assembled into the device/apparatus/system.
According to embodiments disclosed herein, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example and without limitation: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing, without limiting the scope of the present disclosure. In the embodiments disclosed herein, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
As shown in fig. 4, an embodiment of the present invention further provides an electronic device. As shown in fig. 4, it shows a schematic structural diagram of an electronic device according to an embodiment of the present invention, specifically:
the electronic device may include a processor 41 of one or more processing cores, memory 42 of one or more computer-readable storage media, and a computer program stored on the memory and executable on the processor. The above-described accent recognition model building method may be implemented when the program of the memory 42 is executed.
Specifically, in practical applications, the electronic device may further include a power supply 43, an input/output unit 44, and the like. Those skilled in the art will appreciate that the structure shown in fig. 4 does not limit the electronic device, which may include more or fewer components than shown, combine certain components, or arrange the components differently. Wherein:
the processor 41 is the control center of the electronic device: it connects the various parts of the entire electronic device through various interfaces and lines, and performs the various functions of the electronic device and processes data by running or executing the software programs and/or modules stored in the memory 42 and calling the data stored in the memory 42, thereby monitoring the electronic device as a whole.
The memory 42 may be used to store software programs and modules, i.e., the computer-readable storage media described above. The processor 41 executes various functional applications and data processing by running the software programs and modules stored in the memory 42. The memory 42 may mainly include a program storage area and a data storage area: the program storage area may store an operating system, an application program required for at least one function, and the like; the data storage area may store data created according to the use of the electronic device, and the like. Further, the memory 42 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 42 may also include a memory controller to provide the processor 41 with access to the memory 42.
The electronic device further comprises a power supply 43 for supplying power to each component. The power supply 43 may be logically connected to the processor 41 through a power management system, so that charging, discharging, power consumption management and the like are handled through the power management system. The power supply 43 may also include one or more DC or AC power sources, a recharging system, power failure detection circuitry, a power converter or inverter, a power status indicator, and other such components.
The electronic device may also include an input/output unit 44, which may be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control. The input/output unit 44 may also be used to display information input by or provided to the user, as well as various graphical user interfaces, which may be composed of graphics, text, icons, video, and any combination thereof.
The flowchart and block diagrams in the figures of the present application illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments disclosed herein. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments and/or claims of the present disclosure may be combined and/or coupled in various ways, even if such combinations or couplings are not explicitly recited in the present application. All such combinations and/or couplings fall within the scope of the present disclosure without departing from the spirit and teachings of the present application.
The principles and embodiments of the present invention are explained herein using specific examples, which are provided only to help understand the method and its core idea, and are not intended to limit the present application. It will be appreciated by those skilled in the art that changes may be made to the specific embodiments and the scope of application without departing from the principles, spirit and scope of the invention; all such modifications, equivalents and improvements are intended to fall within the protection scope of the claims.

Claims (12)

1. A method for establishing an accent recognition model, characterized by comprising the following steps:
acquiring a user voice training sample set of each first city in a first city set;
training a single city accent recognition model of the corresponding first city using the user voice training sample set of each first city in the first city set, respectively, to obtain the single city accent recognition model of each first city;
inputting the user voice test sample set of each second city in a second city set into the single city accent recognition model of each first city in the first city set, respectively, and calculating the recognition rate of each first city's model for each second city; for each second city, selecting, from among the single city accent recognition models of all the first cities, the model with the highest recognition rate, that recognition rate being greater than a first threshold, and classifying the first city corresponding to the selected model and the second city into the same accent region; wherein the second city set comprises the first city set;
and for each accent region, training a regional accent recognition model of the accent region using the user voice training sample sets of all cities in the accent region, to obtain the regional accent recognition model of the accent region.
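(For illustration only, and not as part of the claims: a minimal Python sketch of the grouping procedure of claim 1 follows, in which train_model, recognition_rate and pool are hypothetical helpers standing in for model training, test-set evaluation and sample pooling.)

```python
def group_cities_into_accent_regions(first_cities, second_cities,
                                     train_sets, test_sets, first_threshold):
    """Sketch of claim 1: train one model per first city, then attach each
    second city to the first city whose model recognizes it best."""
    # Step 1: one single city accent recognition model per first city.
    city_models = {c: train_model(train_sets[c]) for c in first_cities}

    # Step 2: score every first-city model on every second city's test set.
    regions = {}   # representative first city -> set of member cities
    for s in second_cities:
        rates = {c: recognition_rate(city_models[c], test_sets[s])
                 for c in first_cities}
        best = max(rates, key=rates.get)
        # Group the pair only when the best rate clears the first threshold.
        if rates[best] > first_threshold:
            regions.setdefault(best, {best}).add(s)

    # Step 3: one regional model per accent region, trained on pooled data.
    return {rep: train_model(pool(train_sets[c] for c in members))
            for rep, members in regions.items()}
```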
2. The method of claim 1, wherein after obtaining the regional accent recognition model for the accent region, the method further comprises:
receiving the voice of the first user, determining the accent region corresponding to the city where the first user is located, and inputting the voice of the first user into the regional accent recognition model of the accent region to obtain a text corresponding to the voice of the first user.
3. The method of claim 1, wherein, after the classifying of the first city and the second city corresponding to the selected model into the same accent region, and before the training, for each accent region, of the regional accent recognition model of the accent region with the user voice training sample sets of all cities in the accent region, the method further comprises:
selecting at least two accent regions from all accent regions each time for combination, until all combinations are obtained, and taking each combination as an accent region set; training a region-set accent recognition model of the accent region set using the user voice training sample sets of all cities in the accent region set, to obtain the region-set accent recognition model of the accent region set; testing the region-set accent recognition model of the accent region set and the single city accent recognition models corresponding to multiple cities in the accent region set, respectively, using the user voice test sample sets of those cities; and if the test result shows that the difference between the recognition rate of the region-set accent recognition model and the recognition rate of the single city accent recognition model of each city is within a preset range, fusing all accent regions in the accent region set into a new accent region.
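(Again purely as an illustration, and not as part of the claims: a sketch of the fusion test of claim 3, where train_model, recognition_rate, pool and the single_city_models mapping are hypothetical helpers, and the preset range is a free parameter.)

```python
from itertools import combinations

def find_fusable_region_sets(regions, train_sets, test_sets,
                             single_city_models, preset_range):
    """Sketch of claim 3: try every combination of at least two accent regions
    and report those whose pooled model stays close to each city's own model."""
    fusable = []
    for k in range(2, len(regions) + 1):
        for combo in combinations(regions.keys(), k):
            cities = [c for r in combo for c in regions[r]]
            set_model = train_model(pool(train_sets[c] for c in cities))
            # Fuse only if the region-set model is within the preset range of
            # every member city's single city model on that city's test set.
            if all(abs(recognition_rate(set_model, test_sets[c])
                       - recognition_rate(single_city_models[c], test_sets[c]))
                   <= preset_range
                   for c in cities):
                fusable.append(combo)
    return fusable
```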
4. The method according to claim 1 or 3, wherein the loss function used by the single city accent recognition model, the regional accent recognition model and the region-set accent recognition model is: a weighted sum of a cross-entropy loss function and a discriminative loss function;
the structure of the single city accent recognition model, the regional accent recognition model and the region-set accent recognition model is: a combination of a time-delay neural network (TDNN) and a long short-term memory network (LSTM).
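(For concreteness only: the sketch below shows one plausible PyTorch realization of the structure and loss of claim 4. The layer sizes, temporal contexts, the particular discriminative criterion, and the weight w are illustrative assumptions; the claim fixes only the TDNN plus LSTM combination and the weighted summation.)

```python
import torch
import torch.nn as nn

class TdnnLstmModel(nn.Module):
    """Illustrative TDNN + LSTM acoustic model (all sizes are assumptions)."""
    def __init__(self, feat_dim=40, hidden=512, num_targets=4000):
        super().__init__()
        # TDNN layers: 1-D convolutions over time with growing temporal context.
        self.tdnn = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=5), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2), nn.ReLU(),
        )
        self.lstm = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, num_targets)

    def forward(self, feats):                  # feats: (batch, time, feat_dim)
        x = self.tdnn(feats.transpose(1, 2))   # -> (batch, hidden, time')
        x, _ = self.lstm(x.transpose(1, 2))    # -> (batch, time', hidden)
        return self.out(x)                     # frame-level logits

def weighted_loss(ce_loss, discriminative_loss, w=0.5):
    """Weighted summation of the two loss terms, as recited in claim 4;
    w is a hypothetical weight, and the discriminative term may be any
    discriminative criterion computed elsewhere."""
    return w * ce_loss + (1.0 - w) * discriminative_loss
```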
5. The method of claim 4, wherein after obtaining the regional accent recognition model for the accent region, the method further comprises:
for each accent region, using the user voice training sample sets of all cities in the accent region, training the regional accent distinguishing model of the accent region on the basis of the regional accent recognition model of the accent region, to obtain the regional accent distinguishing model of the accent region; wherein the structure of the regional accent distinguishing model is the same as that of the regional accent recognition model, and the loss function used by the regional accent distinguishing model is: a discriminative loss function;
receiving the voice of the first user, determining the accent region corresponding to the city where the first user is located, and inputting the voice of the first user into the regional accent distinguishing model of the accent region, to obtain a text corresponding to the voice of the first user.
6. The method of claim 4, wherein after obtaining the regional accent recognition model for the accent region, the method further comprises:
for each accent region, dividing the user voice training sample sets of all cities in the accent region into a plurality of training sample subsets, and training the regional accent distinguishing model of the accent region using each training sample subset in turn, to obtain the regional accent distinguishing model of the accent region, wherein:
the regional accent distinguishing model corresponding to the first training sample subset is trained on the basis of the regional accent recognition model of the accent region;
the regional accent distinguishing model corresponding to each subsequent training sample subset is trained on the basis of the regional accent distinguishing model corresponding to the previous training sample subset;
the structure of the regional accent distinguishing model is the same as that of the regional accent recognition model;
the loss function adopted by the regional accent distinguishing model is as follows: a discriminative loss function.
7. The method according to claim 5 or 6, wherein after obtaining the regional accent distinguishing model of the accent region, the method further comprises:
acquiring a user voice test sample set of a third city; wherein the third city is not included in any accent region that has been divided;
and inputting the user voice test sample set of the third city into the regional accent distinguishing model of each accent region, respectively, for testing; selecting the model with the highest recognition rate, that recognition rate being greater than a second threshold; and fusing the third city into the accent region corresponding to the selected model.
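(As an illustrative sketch of claim 7 only, with recognition_rate again a hypothetical helper, the routing of an as-yet undivided city could look like this.)

```python
def assign_third_city(third_city_test_set, region_models, second_threshold):
    """Route a city not yet in any accent region to the region whose regional
    accent distinguishing model recognizes it best, if above the threshold."""
    rates = {region: recognition_rate(model, third_city_test_set)
             for region, model in region_models.items()}
    best = max(rates, key=rates.get)
    # Fuse the city into the best region only if its rate clears the threshold.
    return best if rates[best] > second_threshold else None
```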
8. The method according to claim 5 or 6, wherein after obtaining the regional accent differentiation model of the accent region, the method further comprises:
if it is detected that the user voice training sample set of any accent region has been updated, training the regional accent recognition model of the accent region, on the basis of the universal accent recognition model, using the updated user voice training sample set of the accent region, to obtain an updated regional accent recognition model of the accent region;
and training the regional accent distinguishing model of the accent region, on the basis of the updated regional accent recognition model of the accent region, using the updated user voice training sample set of the accent region, to obtain an updated regional accent distinguishing model of the accent region.
9. The method of claim 1, wherein the training of a single city accent recognition model of the corresponding first city using the user voice training sample set of each first city in the first city set comprises:
training the single city accent recognition model of the corresponding first city, on the basis of a previously obtained universal accent recognition model, using the user voice training sample set of each first city in the first city set, respectively;
wherein the universal accent recognition model is obtained as follows:
training with the user voice training sample sets of all cities to obtain the universal accent recognition model.
10. The method of claim 9, wherein the first threshold is:
the recognition rate obtained by inputting the user voice test sample set of the second city into the universal accent recognition model for testing, and calculating the recognition rate according to the test result.
11. The method according to claim 9 or 10, wherein after obtaining the regional accent recognition model of the accent region, the method further comprises:
if it is detected that the user voice training sample set of any accent region has been updated, training the regional accent recognition model of the accent region, on the basis of the universal accent recognition model, using the updated user voice training sample set of the accent region, to obtain an updated regional accent recognition model of the accent region.
12. A computer program product comprising a computer program or instructions which, when executed by a processor, implement the steps of the method for establishing an accent recognition model according to any one of claims 1 to 10.
CN202110888963.XA 2021-08-03 2021-08-03 Method and device for establishing accent recognition model, storage medium and electronic equipment Active CN113592559B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110888963.XA CN113592559B (en) 2021-08-03 2021-08-03 Method and device for establishing accent recognition model, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN113592559A CN113592559A (en) 2021-11-02
CN113592559B true CN113592559B (en) 2022-06-07

Family

ID=78254678

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
TA01: Transfer of patent application right
  Effective date of registration: 20220114
  Address after: 100085 Floor 101 102-1, No. 35 Building, No. 2 Hospital, Xierqi West Road, Haidian District, Beijing
  Applicant after: Seashell Housing (Beijing) Technology Co.,Ltd.
  Address before: 101399 Room 24, 62 Farm Road, Erjie Village, Yangzhen, Shunyi District, Beijing
  Applicant before: Beijing fangjianghu Technology Co.,Ltd.
GR01: Patent grant