CN116312487A - Language model training method and device, and conversational speech recognition method and device


Info

Publication number
CN116312487A
CN116312487A (application CN202111531961.1A)
Authority
CN
China
Prior art keywords
corpus
training
interest
voice
language model
Prior art date
Legal status
Pending
Application number
CN202111531961.1A
Other languages
Chinese (zh)
Inventor
杨麒弘
周绍钧
唐俊杰
Current Assignee
Alibaba Innovation Co
Original Assignee
Alibaba Singapore Holdings Pte Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Singapore Holdings Pte Ltd
Priority to CN202111531961.1A
Publication of CN116312487A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the disclosure relate to a language model training method and apparatus and a conversational speech recognition method and apparatus. In at least one embodiment of the present disclosure, points of interest describing feature elements are collected, where a feature element may be any location in the real world, such as a building, a shop, a restaurant, or a bus stop; that is, a point of interest expresses a real-world location. Therefore, by modifying the place where an event occurs in a training corpus with a point of interest, a new corpus containing the point of interest can be generated without changing the meaning of the corpus; compared with a general-purpose corpus, the new corpus contains dialogue elements (points of interest) that frequently occur in scenarios such as ride-hailing. A language model is then trained on the new corpus containing the points of interest, so that the language model improves the accuracy of conversational speech recognition in special scenarios, for example the recognition of points of interest in conversational speech in a ride-hailing scenario.

Description

Language model training method and device, and conversational speech recognition method and device
Technical Field
The embodiments of the present disclosure relate to the technical field of shared mobility, and in particular to a language model training method and device and a conversational speech recognition method and device.
Background
With the development of automatic speech recognition (ASR) technology, the accuracy of speech recognition has improved. Training the acoustic model and the language model used in ASR depends on a large amount of training corpus, and current training corpora are mainly obtained by labeling general-purpose corpora (such as customer-service dialogues and recordings made by annotators).
However, general-purpose corpora contain little information about special scenarios. For example, real-world feature elements often appear in travel-scenario dialogues, but existing general-purpose corpora contain few such elements, where a feature element may be a building, a shop, a scenic spot, and the like on a real-world map. As a result, current ASR technology suffers from low speech recognition accuracy in special scenarios, for example low recognition accuracy for the feature elements that appear in travel-scenario dialogues.
In addition, optimization iterations of the acoustic models and language models used in ASR also require a large number of training samples, and manually labeling high-quality training samples is costly.
The above description of how the problem was discovered is merely intended to aid understanding of the technical solution of the present disclosure, and does not constitute an admission that the above is prior art.
Disclosure of Invention
To solve at least one problem in the prior art, at least one embodiment of the present disclosure provides a language model training method and apparatus and a conversational speech recognition method and apparatus.
In a first aspect, an embodiment of the present disclosure proposes a language model training method, where the language model is used for recognizing conversational speech, the method including:
collecting, in advance, one or more points of interest describing feature elements;
acquiring a first training corpus for training the language model, where the first training corpus includes an event and a place where the event occurs;
selecting a point of interest from the collected one or more points of interest, and modifying the place where the event occurs in the first training corpus based on the selected point of interest to obtain a second training corpus;
and training the language model based at least on the second training corpus.
In some embodiments, collecting in advance one or more points of interest describing feature elements includes:
acquiring one or more segments of original speech and/or one or more original corpora;
recognizing the one or more segments of original speech to obtain original corpora corresponding to the original speech;
and performing text analysis on the original corpora to determine the one or more points of interest.
In some embodiments, acquiring a first training corpus for training the language model includes:
selecting the first training corpus from the one or more original corpora; and/or
acquiring a corpus that includes an event and a place where the event occurs as the first training corpus.
In some embodiments, selecting a point of interest from the collected one or more points of interest includes:
selecting the point of interest from the collected one or more points of interest by random sampling.
In some embodiments, modifying the place where the event occurs in the first training corpus based on the selected point of interest to obtain the second training corpus includes:
replacing the place where the event occurs in the first training corpus with the selected point of interest to obtain the second training corpus.
In some embodiments, training the language model based at least on the second training corpus includes: training the language model based on the second training corpus and the first training corpus.
In some embodiments, after the one or more points of interest describing feature elements are collected in advance, the language model training method further includes:
constructing a point-of-interest vocabulary based on the collected points of interest;
correspondingly, selecting a point of interest from the collected one or more points of interest includes: selecting the point of interest from the point-of-interest vocabulary.
In a second aspect, an embodiment of the present disclosure further proposes a conversational speech recognition method, including:
acquiring conversational speech, and performing channel segmentation on the conversational speech to obtain first-channel speech and second-channel speech;
performing speech feature extraction on the first-channel speech and the second-channel speech to obtain first-channel speech features and second-channel speech features;
recognizing the first-channel speech features and the second-channel speech features based on an acoustic model, and outputting a text probability distribution;
and correcting the text probability distribution based on a language model and outputting the text corresponding to the conversational speech, where the language model is trained by the method according to any embodiment of the first aspect.
In a third aspect, an embodiment of the present disclosure further provides a language model training apparatus, where the language model is used for recognizing conversational speech, the apparatus including:
a collecting unit, configured to collect in advance one or more points of interest describing feature elements;
an acquisition unit, configured to acquire a first training corpus for training the language model, where the first training corpus includes an event and a place where the event occurs;
a corpus processing unit, configured to select a point of interest from the collected one or more points of interest and modify the place where the event occurs in the first training corpus based on the selected point of interest to obtain a second training corpus;
and a training unit, configured to train the language model based at least on the second training corpus.
In a fourth aspect, an embodiment of the present disclosure further proposes a conversational speech recognition apparatus, including:
a speech processing unit, configured to acquire conversational speech and perform channel segmentation on it to obtain first-channel speech and second-channel speech;
a feature extraction unit, configured to extract speech features of the first-channel speech and the second-channel speech to obtain first-channel speech features and second-channel speech features;
a speech recognition unit, configured to recognize the first-channel speech features and the second-channel speech features based on an acoustic model and output a text probability distribution;
and a text correction unit, configured to correct the text probability distribution based on a language model and output the text corresponding to the conversational speech, where the language model is trained by the method according to any embodiment of the first aspect.
In a fifth aspect, embodiments of the present disclosure further provide an electronic device, including: a processor and a memory; the processor is configured to perform the steps of the method according to any of the embodiments of the first aspect or the steps of the method according to the second aspect by invoking a program or instruction stored in the memory.
In a sixth aspect, the disclosed embodiments also propose a non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores a program or instructions that cause a computer to perform the steps of the method according to any of the embodiments of the first aspect or the steps of the method according to the second aspect.
In a seventh aspect, embodiments of the present disclosure also propose a computer program product, wherein the computer program product comprises a computer program stored in a non-transitory computer readable storage medium, at least one processor of the computer reading and executing the computer program from the storage medium, such that the computer performs the steps of the method according to any of the embodiments of the first aspect or the steps of the method according to the second aspect.
It can be seen that, in at least one embodiment of the present disclosure, points of interest describing feature elements are collected, where a feature element may be any location in the real world (such as a building, a shop, a restaurant, or a bus stop); that is, a point of interest expresses a real-world location. Therefore, by modifying the place where an event occurs in a training corpus with a point of interest, a new corpus containing the point of interest can be generated without changing the meaning of the corpus; compared with a general-purpose corpus, the new corpus contains dialogue elements (points of interest) that frequently occur in scenarios such as ride-hailing. A language model trained on the new corpus containing the points of interest can therefore improve the accuracy of conversational speech recognition in special scenarios, for example the recognition of points of interest in conversational speech in a ride-hailing scenario.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings used in the embodiments or in the description of the prior art are briefly described below. It is apparent that the drawings in the following description are only some embodiments of the present disclosure, and other drawings can be derived from these drawings by those of ordinary skill in the art without creative effort.
FIG. 1 is an exemplary application scenario diagram of a shared trip provided by an embodiment of the present disclosure;
FIG. 2 is a block diagram of a training apparatus for language models provided by an embodiment of the present disclosure;
FIG. 3 is a block diagram of a conversational speech recognition apparatus provided by embodiments of the disclosure;
FIG. 4 is an exemplary block diagram of an electronic device provided by an embodiment of the present disclosure;
FIG. 5 is a flow chart of a method of training a language model provided by an embodiment of the present disclosure;
FIG. 6 is a flow chart of a conversational speech recognition method provided by embodiments of the present disclosure;
FIG. 7 is a flow chart of generating a training corpus provided by an embodiment of the present disclosure;
FIG. 8 is an application scenario diagram of generating a training corpus based on FIG. 7;
fig. 9 is an application scenario diagram of conversational speech recognition provided by an embodiment of the disclosure.
Detailed Description
In order that the above-recited objects, features and advantages of the present disclosure may be more clearly understood, a more particular description of the disclosure will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. It is to be understood that the described embodiments are some, but not all, of the embodiments of the present disclosure. The specific embodiments described herein are to be considered in an illustrative rather than a restrictive sense. All other embodiments derived by a person of ordinary skill in the art based on the described embodiments of the present disclosure fall within the scope of the present disclosure.
It should be noted that in this document, relational terms such as "first" and "second" and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The embodiments of the present disclosure provide a language model training method and apparatus and a conversational speech recognition method and apparatus, where the language model is used for recognizing conversational speech, and the conversational speech may be conversational speech in a shared mobility scenario or in other application scenarios. In at least one embodiment of the present disclosure, points of interest describing feature elements are collected, where a feature element may be any location in the real world (such as a building, a shop, a restaurant, or a bus stop); that is, a point of interest expresses a real-world location. Therefore, by modifying the place where an event occurs in a training corpus with a point of interest, a new corpus containing the point of interest can be generated without changing the meaning of the corpus; compared with a general-purpose corpus, the new corpus contains dialogue elements (points of interest) that frequently occur in scenarios such as ride-hailing. A language model is then trained on the new corpus, so that the language model improves the accuracy of conversational speech recognition in special scenarios, for example the recognition of points of interest in conversational speech in a ride-hailing scenario. It should be understood that the application scenarios described here are merely some examples or embodiments of the present disclosure, and those of ordinary skill in the art can apply the present disclosure to other shared mobility scenarios and the like without creative effort.
Fig. 1 is an exemplary application scenario diagram of shared mobility provided by an embodiment of the present disclosure. As shown in fig. 1, the scene includes: at least one passenger side 11, at least one driver side 12, and a server side 14, where the passenger side 11 and the driver side 12 exchange data with the server side 14 through a network 13.
The passenger side 11 can be understood as an electronic device on which passenger-side travel-service software is installed, for use by passengers. The electronic device may be a portable mobile device such as a smartphone or a tablet computer. In some embodiments, a passenger may enter a starting location (i.e., a starting point) and a destination location (i.e., an end point) in the ride-hailing interface provided by the passenger side 11, and may then tap a control in the interface for requesting the ride-hailing service to initiate a ride-hailing request. The control may take any form, for example a submit control displaying the text "hail a ride", so that the passenger intuitively knows what the control does. In some embodiments, in response to a trigger operation on the control, such as a tap, the passenger side 11 generates a ride-hailing service request message containing the starting location and the destination location, and then sends the message to the server side 14 through the network 13, thereby implementing the ride-hailing request.
The server side 14 can be understood as a background server of the travel service platform. The background server may be a single server or a server group, and the server group may be centralized or distributed. The server side 14 is configured at least to generate ride-hailing orders and distribute them to at least one driver side 12. In some embodiments, the server side 14 receives the ride-hailing service request message sent by the passenger side 11 through the network 13 and extracts the starting location and the destination location contained in the message, thereby generating a ride-hailing order containing both. In some embodiments, the server side 14 may distribute the ride-hailing order to at least one driver side 12 using different order-distribution strategies; for example, the order is pushed to all drivers within a preset range of the starting location, and any driver who receives it may grab the order. In some embodiments, after the driver side 12 accepts an order, the server side 14 may share relevant data of the passenger side 11 and the driver side 12 under the same order, such as positioning data, so that each can know the other's status.
The driver side 12 can be understood as an electronic device on which driver-side travel-service software is installed, for use by drivers. The driver side 12 can receive a ride-hailing order from the server side 14 via the network 13 and display it on the user interface provided by the driver side 12, so that the driver can perform preset order operations such as grabbing, accepting, or rejecting the order. In some embodiments, in response to the driver's order-acceptance operation, the driver side 12 sends an order-acceptance message to the server side 14 over the network 13, so that the server side 14 can monitor the behavior of the passenger side 11 and the driver side 12 under the same order.
Fig. 2 is a block diagram of a language model training device 20 provided in an embodiment of the present disclosure. The language model is used for recognizing conversational speech, which may be conversational speech in a shared mobility scenario or in other application scenarios. In some embodiments, the training device 20 may be implemented as part of the server side 14 in fig. 1. In some embodiments, the training device 20 may be implemented as a device independent of the server side 14, and it may be a software device, a hardware device, or a combination of the two. For example, the training device 20 is a software device running on an operating system, and the electronic hardware system on which it runs is a hardware system supporting the operation of that operating system.
As shown in fig. 2, the language model training device 20 may be divided into a plurality of units, for example: a collecting unit 21, an acquisition unit 22, a corpus processing unit 23, a training unit 24, and other units assisting language model training, such as a data storage unit for storing data during training.
A collecting unit 21 is configured to collect one or more points of interest (POIs). A point of interest can be understood as an item of data in a map that describes a feature element in the real world, where a feature element may be an object usable for navigation, such as an entity, a place name, or a location in the real world, for example a building, a shop, a scenic spot, or a bus station on the map. The map may be a high-definition map used by autonomous or intelligent-driving vehicles, or a standard-definition map provided by a travel service platform; compared with the standard-definition map, the high-definition map describes richer feature elements with higher coordinate accuracy.
In some embodiments, the collecting unit 21 may collect the one or more points of interest describing feature elements in advance; that is, collection is preparation done before training the language model, and the language model is trained after the points of interest have been collected. In some embodiments, after collecting the points of interest, the collecting unit 21 may construct a point-of-interest vocabulary based on them, so that the points of interest are managed by maintaining the vocabulary, where maintenance at least includes deleting and adding points of interest. When the number of points of interest is large, this improves management efficiency and also facilitates retrieval and reuse of the points of interest.
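By way of illustration only, a point-of-interest vocabulary supporting the maintenance operations described above might be sketched in Python as follows; this sketch is not part of the patent, and all names are hypothetical:

```python
# Minimal sketch of a point-of-interest (POI) vocabulary; illustrative only.
class PoiVocabulary:
    def __init__(self) -> None:
        self._pois: set[str] = set()

    def add(self, poi: str) -> None:
        """Add a newly collected point of interest."""
        self._pois.add(poi.strip())

    def delete(self, poi: str) -> None:
        """Drop an obsolete point of interest."""
        self._pois.discard(poi)

    def __contains__(self, poi: str) -> bool:
        # set membership keeps lookup cheap, supporting retrieval and reuse
        return poi in self._pois

    def all(self) -> list[str]:
        return sorted(self._pois)

vocab = PoiVocabulary()
vocab.add("Wanda Plaza")
vocab.add("Carrefour")
print("Wanda Plaza" in vocab)  # True
```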
In some embodiments, the collecting unit 21 may collect original corpora in a variety of ways. An original corpus, i.e., raw language material, can be used to train the language model. In some embodiments, the collecting unit 21 may obtain one or more original corpora, for example through social media or news web pages that are related to feature elements, such as travel-related or traffic-related pages.
In some embodiments, the collecting unit 21 may obtain one or more segments of original speech and recognize them to obtain the original corpora corresponding to the speech. The speech can be acquired in various ways, for example by receiving manually entered speech or by capturing speech with a speech acquisition device. For example, the collecting unit 21 may capture original speech broadcast in real time by a vehicle radio and recognize it to obtain the corresponding original corpus.
In some embodiments, the collecting unit 21 may perform text analysis on the obtained original corpora (both the directly obtained corpora and the corpora obtained by recognizing original speech) to determine points of interest. In some embodiments, the collecting unit 21 performs text analysis on directly obtained corpora (e.g., corpora obtained through social media or news web pages) to determine one or more points of interest. In some embodiments, the collecting unit 21 performs text analysis on the corpora corresponding to original speech to determine one or more points of interest.
In some embodiments, the collecting unit 21 may perform text analysis on the original corpora in a variety of ways, including, for example, but not limited to, one or more of the following: word segmentation, named entity recognition, and template matching. These approaches can be freely combined to extract the points of interest and improve extraction accuracy. For example, the collecting unit 21 may analyze an original corpus using one or more of word segmentation, named entity recognition, and template matching, and mark the feature elements in the corpus, thereby obtaining points of interest describing those elements. Word segmentation, named entity recognition, and template matching are all mature technologies and are not described again.
For example, the collecting unit 21 obtains from social media the original corpus "Today I went to Restaurant A with friends for a meal; the service was excellent. Thumbs up." The collecting unit 21 then uses named entity recognition to identify "Restaurant A", which follows "went to" in the corpus, as a point of interest.
For another example, the collecting unit 21 captures original speech broadcast in real time and recognizes it, obtaining the corresponding corpus "At 9:30, a traffic accident occurred at Bridge B, and traffic is slow." The collecting unit 21 then uses template matching to extract "Bridge B", which follows "9:30" in the corpus, as a point of interest.
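As a rough, hypothetical illustration of the template-matching idea (the patent's corpora are Chinese; the English templates, names, and regular expressions below are stand-ins invented for this sketch):

```python
import re

# Hypothetical templates mirroring the two examples above; illustrative only.
TEMPLATES = [
    re.compile(r"went to (?P<poi>[A-Z][\w ]*?) (?:with|for)"),
    re.compile(r"accident occurred at (?P<poi>[A-Z][\w ]*?)[,.]"),
]

def extract_pois(corpus: str) -> list[str]:
    """Return the spans matched as points of interest by any template."""
    pois = []
    for template in TEMPLATES:
        for match in template.finditer(corpus):
            pois.append(match.group("poi"))
    return pois

print(extract_pois("At 9:30, a traffic accident occurred at Bridge B, and traffic is slow."))
# ['Bridge B']
```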
The acquisition unit 22 is configured to acquire a first training corpus for training the language model, where the first training corpus includes an event and a place where the event occurs. In some embodiments, the acquisition unit 22 may acquire the first training corpus in a variety of ways, which can be freely combined to collect the first training corpus while ensuring the diversity of the training corpus.
For example, the acquisition unit 22 may select the first training corpus from one or more original corpora obtained by the collecting unit 21 through social media or news web pages.
For another example, the acquisition unit 22 may select the first training corpus from the original corpora corresponding to original speech, i.e., the corpora obtained by the collecting unit 21 by recognizing the original speech it collected.
As yet another example, the acquisition unit 22 may acquire a corpus that includes an event and the place where the event occurs as the first training corpus; for example, it may receive a manually input corpus of this kind and use it as the first training corpus.
The corpus processing unit 23 is configured to select a point of interest and then augment the first training corpus based on the selected point of interest to obtain a second training corpus. The second training corpus is the corpus used directly for language model training, that is, it is input into the language model to train it. Augmenting the first training corpus can be understood as increasing the number of training samples through some processing based on the first training corpus. The corpus processing unit 23 thus needs no additional manual labeling to generate the second training corpus, which reduces the cost of generating it.
In some embodiments, the corpus processing unit 23 may select points of interest from the one or more points of interest collected by the collecting unit 21. Further, if the collecting unit 21 has constructed a point-of-interest vocabulary from the collected points of interest, the corpus processing unit 23 may select points of interest from the vocabulary. For example, the corpus processing unit 23 may select one point of interest from the collected points of interest by random sampling, e.g., by randomly sampling over all points of interest in the vocabulary; by sampling multiple times, multiple points of interest can be obtained.
In some embodiments, the corpus processing unit 23 may modify the place where the event occurs in the first training corpus based on the selected point of interest to obtain the second training corpus. In some embodiments, the corpus processing unit 23 replaces the place where the event occurs in the first training corpus with the selected point of interest. Different selected points of interest therefore generate different second training corpora, and each generated corpus is a valid corpus that effectively contains a point of interest. In this way, the corpus processing unit 23 needs no additional manual labeling, reducing the generation cost while ensuring the diversity and validity of the training corpus.
In some embodiments, the flowchart of generating a training corpus shown in fig. 7 can be explained in terms of the corpus processing unit 23: the corpus processing unit 23 selects a point of interest from the point-of-interest vocabulary 701 and analyzes the first training corpus 702, where the analysis includes, for example, but is not limited to, one or a combination of word segmentation, named entity recognition, and template matching, yielding the first event place 703 in the first training corpus 702. The corpus processing unit 23 then replaces the first event place 703 with the selected point of interest, obtaining the second event place 704. Finally, the corpus processing unit 23 combines the second event place 704 with the first training corpus 702 to obtain the second training corpus 705, which no longer contains the first event place 703.
In some embodiments, FIG. 8 is an application scenario diagram of generating a training corpus based on fig. 7. In fig. 8, the point-of-interest vocabulary 801 contains a plurality of points of interest: a grilled-fish restaurant, Wanda Plaza, Carrefour, and so on. The corpus processing unit 23 selects the point of interest "Wanda Plaza" from the point-of-interest vocabulary 801.
The first training corpus 802 is "Hello driver, I am at the Xizhimen bus stop." The corpus processing unit 23 analyzes the first training corpus 802 and determines that the first event place 803 in it is "the Xizhimen bus stop".
The corpus processing unit 23 replaces the first event place 803 with the selected point of interest "Wanda Plaza", obtaining the second event place 804, "Wanda Plaza".
The corpus processing unit 23 combines the second event place 804 with the first training corpus 802 to obtain the second training corpus 805: "Hello driver, I am at Wanda Plaza."
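A minimal sketch of this replacement-based augmentation, assuming the event place has already been located by text analysis; the function and names are illustrative, not from the patent:

```python
import random

def augment(first_corpus: str, event_place: str, pois: list[str],
            rng: random.Random) -> str:
    """Replace the located event place with a randomly sampled POI."""
    poi = rng.choice(pois)                        # random sampling over the vocabulary
    return first_corpus.replace(event_place, poi)

rng = random.Random(42)
second_corpus = augment("Hello driver, I am at the Xizhimen bus stop.",
                        "the Xizhimen bus stop",
                        ["Wanda Plaza", "Carrefour"], rng)
print(second_corpus)  # e.g. "Hello driver, I am at Wanda Plaza."
```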
A training unit 24 is configured to train the language model based at least on the second training corpus. The language model may be a statistical language model, such as an n-gram model, or a neural network language model, such as a feedforward or recurrent neural network language model. In this embodiment, the output of the language model is text. It should be noted that the embodiments of the present disclosure only define the input of language model training as the second training corpus and the output as text; the training process itself is not limited and may follow the prior art.
In some embodiments, considering that not all of the first training corpus is processed by the corpus processing unit 23 and the unprocessed first training corpus may itself contain points of interest, the training unit 24 trains the language model based on both the second training corpus and the first training corpus. Training on both corpora not only satisfies the recognition of points of interest in special scenarios but also satisfies the recognition of text in general scenarios.
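For concreteness, a toy bigram model trained on the union of the two corpora might look as follows; the patent does not prescribe any particular n-gram implementation, so everything here is an assumption:

```python
from collections import Counter

def train_bigram_lm(corpora: list[str]):
    """Train an add-one-smoothed bigram model on whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpora:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])              # context counts
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams) + 1

    def prob(prev: str, word: str) -> float:
        # add-one smoothing keeps unseen bigrams at a non-zero probability
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

    return prob

# first training corpus plus a second corpus generated by POI replacement
lm = train_bigram_lm([
    "hello driver I am at the Xizhimen bus stop",
    "hello driver I am at Wanda Plaza",
])
print(lm("at", "Wanda"))  # higher than the probability of an unseen continuation
```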
In some embodiments, the division of the units in the language model training device 20 is only a logical function division; in actual implementation there may be other divisions, for example, multiple units may be implemented as one unit, and one unit may be divided into multiple sub-units. It can be understood that each unit or sub-unit can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints of the solution. Those skilled in the art may implement the described functionality in different ways for each particular application.
Fig. 3 is a block diagram of a conversational speech recognition device 30 provided by embodiments of the disclosure.
The conversational speech may be conversational speech in a shared mobility scenario or in other application scenarios. In some embodiments, the conversational speech recognition device 30 may be implemented as part of the server side 14 in fig. 1. As shown in fig. 3, the device 30 may be divided into a plurality of units, for example: a speech processing unit 31, a feature extraction unit 32, a speech recognition unit 33, a text correction unit 34, and other units assisting conversational speech recognition, such as a data storage unit for storing data during recognition.
The speech processing unit 31 may acquire conversational speech, for example a conversation between a driver and a passenger. The conversational speech may be collected by at least one of the passenger side 11 and the driver side 12 in fig. 1 and transmitted to the speech processing unit 31.
In some embodiments, the speech processing unit 31 may perform channel segmentation on the acquired conversational speech to obtain first-channel speech and second-channel speech. The first channel is the driver's channel and the second channel is the passenger's channel, or vice versa. In this embodiment, because the conversation is recorded in two channels, one carrying the passenger's speech and the other the driver's, and the mixture cannot be recognized directly, the speech processing unit 31 separates the driver's and passenger's speech by channel segmentation in preparation for the subsequent recognition.
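A minimal sketch of the channel-segmentation step, assuming a two-channel WAV recording and the third-party soundfile library; the file names and channel assignment are illustrative:

```python
import soundfile as sf  # assumed dependency; any WAV reader would do

audio, sample_rate = sf.read("dialogue.wav")  # stereo: shape (frames, 2)
first_channel = audio[:, 0]    # e.g. the driver's channel
second_channel = audio[:, 1]   # e.g. the passenger's channel
sf.write("driver.wav", first_channel, sample_rate)
sf.write("passenger.wav", second_channel, sample_rate)
```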
The feature extraction unit 32 is configured to perform speech feature extraction on the first-channel speech and the second-channel speech to obtain first-channel speech features and second-channel speech features. In some embodiments, the feature extraction unit 32 may first perform noise reduction on the two channels and then extract features from the noise-reduced speech. Noise reduction is a common technique in the speech recognition field and is not described here.
In some embodiments, the speech features extracted by the feature extraction unit 32 may be any of the following: LPC (linear prediction coefficient) features, PLP (perceptual linear predictive) features, MFCC (Mel-frequency cepstral coefficient) features, and the like. Feature extraction is a mature technique in the speech recognition field and is not described again.
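For illustration, MFCC features for one channel could be extracted as follows, assuming the third-party librosa library; the sampling rate and number of coefficients are common defaults, not values specified by this disclosure:

```python
import librosa  # assumed dependency

speech, sr = librosa.load("driver.wav", sr=16000, mono=True)
mfcc = librosa.feature.mfcc(y=speech, sr=sr, n_mfcc=13)  # shape: (13, n_frames)
```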
The speech recognition unit 33 is configured to recognize the first-channel speech features and the second-channel speech features based on an acoustic model and output a text probability distribution. The acoustic model may be an existing model, for example one based on deep learning, and is not described in detail.
For example, the passenger's speech is "at Xizhimen" (in the original Chinese, a five-character utterance). Based on the acoustic model, the speech recognition unit 33 may output a text probability distribution over candidate characters for each position, where the numbers represent probabilities (in the original Chinese, the candidates at each position are homophones, which do not survive translation):
First character: two candidates with probabilities 0.7 and 0.3;
Second character: four candidates (homophones of "xi") with probabilities 0.2, 0.3, 0.3, and 0.2;
Third character: two candidates with probabilities 0.4 and 0.6;
Fourth character: two candidates with probabilities 0.4 and 0.3;
Fifth character: two candidates with probabilities 0.35 and 0.3.
The text correction unit 34 is configured to correct the text probability distribution based on a language model and output the text corresponding to the conversational speech, where the language model is obtained by training according to any embodiment of the language model training method described above. In this embodiment, based on the corrected text probability distribution, the language model may output the text with the highest joint probability as the text corresponding to the conversational speech.
For example, the passenger's speech is "at Xizhimen", but in the text probability distribution output by the speech recognition unit 33 the joint probability of "at Xizhimen" is not the maximum. The text correction unit 34 therefore corrects the text probability distribution based on the language model so that the joint probability of "at Xizhimen" becomes the maximum, and outputs "at Xizhimen" as the text corresponding to the passenger's speech.
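A toy sketch of this correction step, in which a made-up bigram language model (in practice, the output of a trainer such as the train_bigram_lm sketch above) re-ranks the per-character acoustic distributions by joint probability; the distributions below are invented, loosely modeled on the 在西直门 ("at Xizhimen") example:

```python
import math
from itertools import product

# Invented per-character acoustic distributions; candidates are homophones.
acoustic = [
    {"在": 0.7, "这": 0.3},
    {"西": 0.2, "膝": 0.3, "析": 0.3, "硒": 0.2},  # homophones of "xi"
    {"直": 0.4, "值": 0.6},                        # homophones of "zhi"
]

def lm_prob(prev: str, char: str) -> float:
    # stand-in bigram scores; POI-augmented training would boost these pairs
    favored = {("在", "西"): 0.5, ("西", "直"): 0.6}
    return favored.get((prev, char), 0.05)

best_text, best_score = None, -math.inf
for candidate in product(*(d.keys() for d in acoustic)):
    score = sum(math.log(acoustic[i][c]) for i, c in enumerate(candidate))
    score += sum(math.log(lm_prob(p, c))
                 for p, c in zip(("<s>",) + candidate, candidate))
    if score > best_score:
        best_text, best_score = "".join(candidate), score

print(best_text)  # 在西直: wins although 直 (0.4) is not the acoustic argmax
```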
In some embodiments, the application scenario diagram of conversational speech recognition shown in fig. 9 can be explained in terms of the conversational speech recognition device 30. In fig. 9, the device 30 acquires conversational speech and performs channel segmentation 901 to obtain first-channel speech and second-channel speech; it then performs speech feature extraction 902 to obtain first-channel and second-channel speech features; the features are input 903 to the acoustic model, which recognizes them and outputs a text probability distribution; finally, the text probability distribution is input 904 to the language model, which corrects it and outputs 905 the text corresponding to the conversational speech.
In some embodiments, the conversational speech recognition device 30 may use the point-of-interest-based language model to improve recognition accuracy in special scenarios, such as the recognition of points of interest in conversational speech in a ride-hailing scenario, and to reduce the character error rate (CER). Table 1 shows the recognition results of dialogue recordings under two language models, where LM is a language model trained on a general-purpose corpus and poi_LM is the point-of-interest-based language model. Table 2 shows the CER results on the test set under the two language models.
Table 1. Recognition results of dialogue recordings based on the two language models (provided as an image in the original publication; not reproduced here)
Table 2. CER test results on the test set based on the two language models

Model name                CER
Acoustic model + LM       0.318
Acoustic model + poi_LM   0.302
It can be seen that the point-of-interest-based language model scheme provided by the embodiments of the present disclosure reduces the character error rate (CER).
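For reference, CER is the character-level Levenshtein (edit) distance between the hypothesis and the reference, normalized by the reference length; a straightforward implementation (not from the patent) is:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    r, h = list(reference), list(hypothesis)
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                        # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                        # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(r)][len(h)] / len(r)

print(cer("在西直门", "在西值门"))  # 0.25: one substitution over four characters
```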
In some embodiments, the division of the units in the conversational speech recognition device 30 is only a logical function division; in actual implementation there may be other divisions, for example, multiple units may be implemented as one unit, and one unit may be divided into multiple sub-units. It can be understood that each unit or sub-unit can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints of the solution. Those skilled in the art may implement the described functionality in different ways for each particular application.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure. In some embodiments, the electronic device may be implemented as the server side 14 in fig. 1. As shown in fig. 4, the electronic device includes: at least one processor 401, at least one memory 402, and at least one communication interface 403. The components of the electronic device are coupled together by a bus system 404, and the communication interface 403 is used for information transmission with external devices. It can be understood that the bus system 404 implements connection and communication among these components; in addition to a data bus, it includes a power bus, a control bus, and a status signal bus, but for clarity of illustration all the buses are labeled as bus system 404 in fig. 4.
It will be appreciated that the memory 402 in this embodiment can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory.
In some implementations, the memory 402 stores the following elements, executable units or data structures, or a subset thereof, or an extended set thereof: an operating system and application programs.
The operating system includes various system programs, such as a framework layer, a core library layer, and a driver layer, for implementing various basic tasks and processing hardware-based tasks. The applications include various application programs, such as media players and browsers, for implementing various application tasks. A program implementing the language model training method or the conversational speech recognition method provided by the embodiments of the present disclosure may be included among the applications.
In the embodiments of the present disclosure, the processor 401 is configured to execute the steps of any embodiment of the language model training method or of the conversational speech recognition method by calling a program or instructions stored in the memory 402, specifically a program or instructions stored in an application program.
The language model training method or the conversational speech recognition method provided by the embodiments of the present disclosure may be applied to, or implemented by, the processor 401. The processor 401 may be an integrated circuit chip with signal processing capability. During implementation, the steps of the above methods may be completed by integrated logic circuits of hardware in the processor 401 or by instructions in the form of software. The processor 401 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The steps of the language model training method or the conversational speech recognition method provided by the embodiments of the present disclosure may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software units in a decoding processor. The software units may reside in a storage medium well known in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory 402, and the processor 401 reads the information in the memory 402 and completes the steps of the method in combination with its hardware.
Fig. 5 is an exemplary flowchart of a language model training method provided by an embodiment of the present disclosure. The language model is used for recognizing conversational speech, which may be conversational speech in a shared mobility scenario or in other application scenarios. The method is executed by an electronic device, which may be implemented as a server, i.e., a background server of the travel service platform. For convenience of description, the following embodiment describes the flow of the training method with the electronic device as the execution subject.
As shown in fig. 5, in step 501, the electronic device collects in advance one or more points of interest describing feature elements. A point of interest can be understood as an item of data in a map that describes a feature element in the real world, where a feature element may be an object usable for navigation, such as an entity, a place name, or a location in the real world, for example a building, a shop, a scenic spot, or a bus station on the map. The map may be a high-definition map used by autonomous or intelligent-driving vehicles, or a standard-definition map provided by a travel service platform; compared with the standard-definition map, the high-definition map describes richer feature elements with higher coordinate accuracy.
In some embodiments, the electronic device collects the points of interest in a variety of ways, enriching their sources and types so that the finally trained language model fits the recognition requirements of different scenarios.
For example, the electronic device may obtain one or more original corpora, e.g., through social media or news web pages that are related to feature elements, such as travel-related or traffic-related pages, and then perform text analysis on the one or more original corpora to determine one or more points of interest.
For another example, the electronic device may obtain one or more segments of original speech, recognize them to obtain the corresponding original corpora, and perform text analysis on those corpora to determine one or more points of interest. The speech can be acquired in various ways, for example by receiving manually entered speech or by capturing speech with a speech acquisition device. In some embodiments, the electronic device may use a speech acquisition device to capture original speech broadcast in real time by a vehicle radio, recognize it to obtain the corresponding original corpus, and perform text analysis on that corpus to determine one or more points of interest.
In some embodiments, the electronic device can perform text analysis on the original corpora in a variety of ways, including, for example, but not limited to, one or more of the following: word segmentation, named entity recognition, and template matching. These approaches can be freely combined to extract the points of interest and improve extraction accuracy. For example, the electronic device may analyze an original corpus using one or more of word segmentation, named entity recognition, and template matching, and mark the feature elements in the corpus, thereby obtaining points of interest describing those elements. Word segmentation, named entity recognition, and template matching are all mature technologies and are not described again.
In some embodiments, after collecting the points of interest, the electronic device may construct a point-of-interest vocabulary based on them, so that the points of interest are managed by maintaining the vocabulary, where maintenance at least includes deleting and adding points of interest. When the number of points of interest is large, this improves management efficiency and also facilitates retrieval and reuse of the points of interest.
In step 502, the electronic device acquires a first training corpus for training the language model, where the first training corpus includes an event and a place where the event occurs. In some embodiments, the electronic device may acquire the first training corpus in a variety of ways, which can be freely combined to collect the first training corpus while ensuring the diversity of the training corpus.
For example, the electronic device may select the first training corpus from one or more original corpora obtained through social media or news web pages.
For another example, the electronic device may select the first training corpus from the original corpora corresponding to original speech, i.e., the corpora obtained by recognizing the original speech it collected.
As yet another example, the electronic device may acquire a corpus that includes an event and the place where the event occurs as the first training corpus; for example, it may receive a manually input corpus of this kind and use it as the first training corpus.
In step 503, the electronic device selects a point of interest from the collected one or more points of interest and modifies the place where the event occurs in the first training corpus based on the selected point of interest to obtain a second training corpus. The second training corpus is the corpus used directly for language model training, that is, it is input into the language model to train it. Augmenting the first training corpus can be understood as increasing the number of training samples through some processing based on the first training corpus. The electronic device thus needs no additional manual labeling to generate the second training corpus, which reduces the cost of generating it.
In some embodiments, the electronic device may select one point of interest from the collected points of interest by random sampling. Further, if the electronic device has constructed a point-of-interest vocabulary from the collected points of interest, it may randomly sample over all points of interest in the vocabulary to obtain one point of interest; by sampling multiple times, multiple points of interest can be obtained.
In some embodiments, the electronic device replaces the place where the event occurs in the first training corpus with the selected point of interest to obtain the second training corpus. Different selected points of interest therefore generate different second training corpora, and each generated corpus is a valid corpus that effectively contains a point of interest. The electronic device thus needs no additional manual labeling, reducing the generation cost while ensuring the diversity and validity of the training corpus.
In step 504, the electronic device trains the language model based at least on the second training corpus. The language model may be a statistical language model, such as an n-gram model, or a neural network language model, such as a feedforward or recurrent neural network language model. In this embodiment, the output of the language model is text. It should be noted that the embodiments of the present disclosure only define the input of language model training as the second training corpus and the output as text; the training process itself is not limited and may follow the prior art.
In some embodiments, considering that not all of the first training corpus is processed and the unprocessed first training corpus may itself contain points of interest, the electronic device trains the language model based on both the second training corpus and the first training corpus. Training on both corpora not only satisfies the recognition of points of interest in special scenarios but also satisfies the recognition of text in general scenarios.
It should be noted that, for simplicity of description, the foregoing method embodiments are described as a series of action combinations, but those skilled in the art will appreciate that the disclosed embodiments are not limited by the order of actions described, as some steps may occur in other orders or concurrently. In addition, those skilled in the art will appreciate that the embodiments described in the specification are all optional embodiments.
Fig. 6 is an exemplary flowchart of a conversational speech recognition method provided by an embodiment of the present disclosure. The method is executed by an electronic device, which may be implemented as a server, i.e., a background server of the travel service platform. For convenience of description, the following embodiment describes the flow of the conversational speech recognition method with the server as the execution subject.
As shown in Fig. 6, in step 601, the server acquires a dialogue voice and performs channel segmentation on it to obtain a first channel voice and a second channel voice. The dialogue voice is, for example, a conversation between the driver and the passenger; it may be collected by at least one of the passenger side 11 and the driver side 12 in Fig. 1 and transmitted to the voice processing unit 31. The first channel is the driver's channel and the second channel is the passenger's channel, or vice versa. In this embodiment, because the dialogue voice is two-channel, with one channel carrying the passenger's voice and the other the driver's, the two voices cannot be recognized together; the server therefore separates the driver's and the passenger's voices through channel segmentation in preparation for the subsequent recognition.
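A rough sketch of this channel segmentation, assuming the dialogue voice arrives as a 16-bit two-channel WAV file (the file format and the channel-to-speaker assignment are assumptions of this sketch):

    import wave
    import numpy as np

    def split_channels(path):
        # Read a 16-bit stereo WAV in which one channel carries the driver
        # and the other the passenger.
        with wave.open(path, "rb") as wav:
            assert wav.getnchannels() == 2, "dialogue voice must be two-channel"
            pcm = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)
        # Stereo PCM interleaves the channels sample by sample: L, R, L, R, ...
        first_channel = pcm[0::2]   # e.g. the driver's channel
        second_channel = pcm[1::2]  # e.g. the passenger's channel
        return first_channel, second_channel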
In step 602, the server performs voice feature extraction on the first channel voice and the second channel voice to obtain first channel voice features and second channel voice features. In some embodiments, the server may first perform noise reduction on the two channel voices and then extract features from the denoised voices; noise reduction is a common technique in the field of speech recognition and is not described here. In some embodiments, the voice features extracted by the server may be any of the following: LPC features, PLP features, MFCC features, and the like. The extraction of these features is mature technology in the field of speech recognition and is not repeated.
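If MFCC features are chosen, the extraction for one channel could be sketched with the librosa library; librosa itself and the parameter values are implementation assumptions, as the disclosure equally permits LPC or PLP features:

    import numpy as np
    import librosa

    def extract_mfcc(channel_pcm, sample_rate=16000, n_mfcc=13):
        # Convert 16-bit PCM samples to floats in [-1, 1], then compute
        # MFCCs; the result is an (n_mfcc, n_frames) feature matrix.
        waveform = channel_pcm.astype(np.float32) / 32768.0
        return librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=n_mfcc)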
In step 603, the server recognizes the first channel voice features and the second channel voice features based on an acoustic model and outputs a text probability distribution. The acoustic model may be an existing one, for example a deep learning-based acoustic model, which is not described in detail.
For example, the passenger's voice mentions the point of interest "Xixi", and based on the acoustic model the server can output a text probability distribution, where the numbers represent probabilities and the candidates for the second word are homophones or near-homophones of the interest point (romanized here from the original Chinese characters):
first word: "at" 0.7, "this" 0.3;
second word: "Xixi" 0.2, "Niuxi" 0.3, "Xi" 0.3, "Xi" 0.2;
third word: "straight" 0.4, "day" 0.6;
fourth word: 0.4, "door" 0.3;
fifth word: "this" 0.35, "Zhe" 0.3.
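In code, such an output can be pictured as one candidate-to-probability map per position; the sketch below uses romanized stand-ins for the Chinese candidate characters and truncates the candidate lists (a real decoder would emit a full lattice):

    # Per-position candidate probabilities emitted by the acoustic model
    # (simplified; candidates truncated, romanized stand-ins for characters).
    text_distribution = [
        {"at": 0.7, "this": 0.3},       # first word
        {"Niuxi": 0.3, "Xixi": 0.2},    # second word: homophones of the POI
        {"day": 0.6, "straight": 0.4},  # third word
    ]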
In step 604, the server corrects the text probability distribution based on a language model and outputs the text corresponding to the dialogue voice, where the language model is a model obtained by training according to any embodiment of the language model training method above. In this embodiment, based on the corrected text probability distribution, the language model outputs the text with the highest joint probability as the text corresponding to the dialogue voice.
For example, although the passenger's voice is "at Xixi", the joint probability of that reading in the text probability distribution obtained by the server is not the maximum; the server therefore corrects the text probability distribution based on the language model so that the reading containing the interest point attains the maximum joint probability, and outputs the corresponding text as the text of the passenger's voice.
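Continuing the sketch above, one simple way to realize this correction is to rescore every candidate sequence by the product of its acoustic probability and a language-model score and output the argmax; the stub lm_score below, biased toward a known interest point, merely stands in for the model trained in step 504:

    import itertools

    def lm_score(words):
        # Stub: a model trained on POI-augmented corpora assigns a higher
        # score to sequences containing a known interest point ("Xixi").
        return 5.0 if "Xixi" in words else 1.0

    def decode(text_distribution):
        best_words, best_joint = None, 0.0
        for path in itertools.product(*(d.items() for d in text_distribution)):
            words = [w for w, _ in path]
            acoustic = 1.0
            for _, p in path:
                acoustic *= p                    # acoustic path probability
            joint = acoustic * lm_score(words)   # rescored joint probability
            if joint > best_joint:
                best_words, best_joint = words, joint
        return best_words

    print(decode(text_distribution))  # the reading containing "Xixi" now wins

Without the language-model term, the acoustically likelier homophone would win; with it, the interest-point reading attains the maximum joint probability, matching the behavior described above.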
It should be noted that, for simplicity of description, the foregoing method embodiments are expressed as series of action combinations; however, those skilled in the art will appreciate that the disclosed embodiments are not limited by the described order of actions, as some steps may be performed in other orders or concurrently. In addition, those skilled in the art will appreciate that the embodiments described in the specification are all optional embodiments.
The embodiments of the present disclosure further provide a non-transitory computer-readable storage medium storing a program or instructions that cause a computer to perform the steps of the embodiments of the language model training method or the steps of the embodiments of the conversational speech recognition method; to avoid repetition, these are not described again here.
The embodiments of the present disclosure also provide a computer program product comprising a computer program stored in a non-transitory computer-readable storage medium; at least one processor of a computer reads and executes the computer program from the storage medium, causing the computer to perform the steps of the embodiments of the language model training method or the steps of the embodiments of the conversational speech recognition method; to avoid repetition, these are not described again here.
It should be noted that, in this document, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
Those skilled in the art will appreciate that, although some embodiments described herein include some but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the present disclosure and to form different embodiments.
Those skilled in the art will appreciate that the description of each embodiment has its own focus; for portions not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
Although embodiments of the present disclosure have been described with reference to the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the disclosure, and such modifications and variations fall within the scope defined by the appended claims.

Claims (10)

1. A method of training a language model for recognition of conversational speech, the method comprising:
collecting, in advance, one or more interest points for describing feature elements;
acquiring a first training corpus for training the language model, wherein the first training corpus comprises an event and an event place;
selecting an interest point from the collected one or more interest points, and modifying the event place in the first training corpus based on the selected interest point to obtain a second training corpus; and
training the language model based at least on the second training corpus.
2. The method of claim 1, wherein the collecting, in advance, one or more interest points for describing feature elements comprises:
acquiring one or more segments of original voice and/or one or more original corpora;
recognizing the one or more segments of original voice to obtain original corpora corresponding to the original voice; and
performing text analysis on the original corpora to determine the one or more interest points.
3. The method of claim 2, wherein the acquiring a first training corpus for training the language model comprises:
selecting the first training corpus from the one or more original corpora; and/or,
acquiring a corpus comprising the event and the event place as the first training corpus.
4. The method of claim 1, wherein the selecting an interest point from the collected one or more interest points comprises:
selecting the interest point from the collected one or more interest points by random sampling.
5. The method of claim 4, wherein the modifying the event place in the first training corpus based on the selected interest point to obtain a second training corpus comprises:
replacing the event place in the first training corpus with the selected interest point to obtain the second training corpus.
6. The method of claim 2, wherein the training the language model based at least on the second training corpus comprises: training the language model based on the second training corpus and the first training corpus.
7. The method of claim 1, wherein, after the collecting, in advance, one or more interest points for describing feature elements, the method further comprises:
constructing an interest point vocabulary based on the collected interest points;
correspondingly, the selecting an interest point from the collected one or more interest points comprises: selecting the interest point from the interest point vocabulary.
8. A conversational speech recognition method, comprising:
acquiring a dialogue voice, and performing channel segmentation on the dialogue voice to obtain a first channel voice and a second channel voice;
performing voice feature extraction on the first channel voice and the second channel voice to obtain a first channel voice feature and a second channel voice feature;
recognizing the first channel voice feature and the second channel voice feature based on an acoustic model, and outputting a text probability distribution; and
correcting the text probability distribution based on a language model, and outputting a text corresponding to the dialogue voice, wherein the language model is a language model trained by the method of any one of claims 1 to 7.
9. A training apparatus for a language model for recognition of conversational speech, the apparatus comprising:
a collecting unit, configured to collect, in advance, one or more interest points for describing feature elements;
an acquiring unit, configured to acquire a first training corpus for training the language model, wherein the first training corpus comprises an event and an event place;
a corpus processing unit, configured to select an interest point from the collected one or more interest points and modify the event place in the first training corpus based on the selected interest point to obtain a second training corpus; and
a training unit, configured to train the language model based at least on the second training corpus.
10. A conversational speech recognition device, comprising:
a voice processing unit, configured to acquire a dialogue voice and perform channel segmentation on the dialogue voice to obtain a first channel voice and a second channel voice;
a feature extraction unit, configured to perform voice feature extraction on the first channel voice and the second channel voice to obtain a first channel voice feature and a second channel voice feature;
a voice recognition unit, configured to recognize the first channel voice feature and the second channel voice feature based on an acoustic model and output a text probability distribution; and
a text correction unit, configured to correct the text probability distribution based on a language model and output a text corresponding to the dialogue voice, wherein the language model is a language model trained by the method of any one of claims 1 to 7.
CN202111531961.1A 2021-12-14 2021-12-14 Language model training method and device and dialogue voice recognition method and device Pending CN116312487A (en)

Priority Applications (1)

Application Number: CN202111531961.1A; Priority Date / Filing Date: 2021-12-14; Title: Language model training method and device and dialogue voice recognition method and device


Publications (1)

Publication Number: CN116312487A; Publication Date: 2023-06-23

Family ID: 86794671

Family Applications (1)

Application Number: CN202111531961.1A (Pending); Priority Date / Filing Date: 2021-12-14; Title: Language model training method and device and dialogue voice recognition method and device

Country Status (1): CN, CN116312487A (en)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
TA01: Transfer of patent application right
Effective date of registration: 2024-03-06
Address after: #03-06, Lai Zan Da Building 1, 51 Belarusian Road, Singapore
Applicant after: Alibaba Innovation Co.
Country or region after: Singapore
Address before: Room 01, 45th Floor, AXA Building, 8 Shanton Road, Singapore
Applicant before: Alibaba Singapore Holdings Ltd.
Country or region before: Singapore