WO2018205389A1

WO2018205389A1 - Voice recognition method and system, electronic apparatus and medium

Info

Publication number: WO2018205389A1
Application number: PCT/CN2017/091353
Authority: WO
Inventors: 王健宗; 程宁; 查高密; 肖京
Original assignee: 平安科技（深圳）有限公司
Priority date: 2017-05-10
Filing date: 2017-06-30
Publication date: 2018-11-15
Also published as: TWI636452B; CN107204184A; TW201901661A; CN107204184B

Abstract

A voice recognition method and system, an electronic apparatus and a medium. The method comprises: obtaining information texts of specific types from previously determined data sources (S10); performing sentence segmentation on the obtained information texts to obtain several sentences, performing word segmentation processing on the sentences to obtain corresponding words, and forming first mapping corpora from the sentences and the corresponding words (S20); according to the obtained first mapping corpora, training a first language model of a preset type, and performing voice recognition on the basis of the trained first language model (S30). The present solution effectively increases voice recognition accuracy, and effectively reduces voice recognition costs.

Description

Speech recognition method, system, electronic device and medium

Priority claim

This application is based on the priority of the Paris Convention, which is hereby incorporated by reference. In the application.

Technical field

The present invention relates to the field of computer technologies, and in particular, to a voice recognition method, system, electronic device, and medium.

Background technique

The language model plays an important role in the speech recognition task. In the existing speech recognition, the language model is generally established by using the annotated dialogue text, and the probability of each word is determined by the language model. However, in the prior art, the manner in which the language model is built using the labeled dialog text is too small because the current user needs to use the voice recognition technology in daily life (for example, the more common scenes are voice search, voice control, etc.) ), and the types and scopes of corpus that can be collected are too concentrated, which makes the following two shortcomings: one is expensive to purchase and the cost is high; the other is that it is difficult to obtain a sufficient amount of corpus to obtain an annotated dialogue. The text is difficult, and the timeliness and accuracy of the upgrade and expansion are difficult to guarantee, which in turn affects the training effect and recognition accuracy of the language model, thus affecting the accuracy of speech recognition.

Therefore, how to use existing corpus resources to effectively improve the accuracy of speech recognition and effectively reduce the cost of speech recognition has become a technical problem to be solved.

Summary of the invention

The main object of the present invention is to provide a speech recognition method, system, electronic device and medium, which aim to effectively improve the accuracy of speech recognition and effectively reduce the cost of speech recognition.

To achieve the above objective, a first aspect of the present application provides a voice recognition method, where the method includes the following steps:

A. Obtaining a specific type of information text from a predetermined data source;

B. Performing segmentation of the obtained information texts to obtain a plurality of sentences, performing word segmentation processing on each sentence to obtain corresponding word segments, and each sentence and corresponding word segmentation constitute a first mapping corpus;

C. Train a preset first language model according to the obtained first mapping corpus, and perform speech recognition based on the trained first language model.

A second aspect of the present application provides a voice recognition system, where the voice recognition system includes:

An obtaining module, configured to obtain a specific type of information text from a predetermined data source;

The word segmentation module is used for segmenting the obtained information texts to obtain a plurality of sentences, and performing word segmentation processing on each sentence to obtain corresponding word segments, and each sentence and corresponding word segmentation constitute a first mapping corpus;

And a training identification module, configured to train a preset first type language model according to the obtained first mapping corpus, and perform speech recognition based on the trained first language model.

A third aspect of the present application provides an electronic device, including a processing device, a storage device, and a voice recognition system, the voice recognition system being stored in the storage device, including at least one computer readable instruction, the at least one computer readable instruction The processing device executes to:

A fourth aspect of the present application provides a computer readable storage medium having stored thereon at least one computer readable instruction executable by a processing device to:

The speech recognition method, system, electronic device and medium provided by the invention perform segmentation of a specific type of information text acquired from a predetermined data source, and perform word segmentation processing on each segmented sentence to obtain each segmentation. And the first mapping corpus of the corresponding participle, training the first language model of the preset type according to the first mapping corpus, and performing speech recognition based on the first language model of the training. Since the corpus resource can be obtained by performing segmentation and corresponding word segmentation on the information text obtained from a plurality of predetermined data sources, and training the language model based on the corpus resource, it is not necessary to obtain the labeled dialogue text, and Obtaining a sufficient number of corpus resources can ensure the training effect and recognition accuracy of the language model, thereby effectively improving the accuracy of speech recognition and effectively reducing the cost of speech recognition.

DRAWINGS

1 is a schematic diagram of an application environment of a preferred embodiment of a voice recognition method according to the present invention;

2 is a schematic flow chart of a first embodiment of a voice recognition method according to the present invention;

3 is a schematic flow chart of a second embodiment of a voice recognition method according to the present invention;

4 is a schematic diagram of functional modules of an embodiment of a speech recognition system of the present invention.

The implementation, functional features, and advantages of the present invention will be further described in conjunction with the embodiments.

detailed description

In order to make the technical problems, technical solutions and beneficial effects to be solved by the present invention clearer and clearer The present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It is understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Referring to FIG. 1 , it is a schematic diagram of an application environment of a preferred embodiment of the speech recognition method of the present invention. The application environment diagram includes an electronic device 1 and a terminal device 2. The electronic device 1 can perform data interaction with the terminal device 2 through a suitable technology such as a network or a near field communication technology.

The terminal device 2 includes, but is not limited to, any electronic product that can interact with a user through a keyboard, a mouse, a remote controller, a touch pad, or a voice control device, for example, a personal computer, a tablet computer, a smart phone, or an individual. Digital Assistant (PDA), game console, Internet Protocol Television (IPTV), smart wearable device, etc.

The electronic device 1 is an apparatus capable of automatically performing numerical calculation and/or information processing in accordance with an instruction set or stored in advance. The electronic device 1 may be a computer, a single network server, a server group composed of multiple network servers, or a cloud-based cloud composed of a large number of hosts or network servers, where cloud computing is a type of distributed computing, A super virtual computer consisting of a loosely coupled set of computers.

In the present embodiment, the electronic device 1 includes, but is not limited to, a storage device 11, a processing device 12, and a network interface 13 that are communicably connected to each other through a system bus. It should be noted that FIG. 1 only shows the electronic device 1 having the components 11-13, but it should be understood that not all illustrated components are required to be implemented, and more or fewer components may be implemented instead.

The storage device 11 includes a memory and at least one type of readable storage medium. The memory provides a cache for the operation of the electronic device 1; the readable storage medium may be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card, a card type memory, or the like. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1; in other embodiments, the non-volatile storage medium may also be external to the electronic device 1. A storage device, such as a plug-in hard disk equipped with an electronic device 1, a smart memory card (SMC), a Secure Digital (SD) card, a flash card, or the like. In this embodiment, the readable storage medium of the storage device 11 is generally used to store an operating system installed in the electronic device 1 and various types of application software, such as program codes of the voice recognition system 10 in an embodiment of the present application. Further, the storage device 11 can also be used to temporarily store various types of data that have been output or are to be output.

Processing device 12 may, in some embodiments, include one or more microprocessors, microcontrollers, digital processors, and the like. The processing device 12 is generally used to control the operation of the electronic device 1, for example, to perform control and processing related to data interaction or communication with the terminal device 2. In the present embodiment, the processing device 12 is configured to run program code or process data stored in the storage device 11, such as running the speech recognition system 10 or the like.

The network interface 13 may comprise a wireless network interface or a wired network interface, which is typically used to establish a communication connection between the electronic device 1 and other electronic devices. In this embodiment, the network interface 13 is mainly used to connect the electronic device 1 with one or more terminal devices 2, and establish a data transmission channel and a communication connection between the electronic device 1 and one or more terminal devices 2.

The speech recognition system 10 includes at least one computer readable instruction stored in the storage device 11, The at least one computer readable instruction can be executed by processing device 12 to implement a method of picture recognition for embodiments of the present application. As described later, the at least one computer readable instruction can be classified into different logic modules depending on the functions implemented by its various parts.

In an embodiment, when the speech recognition system 10 is executed by the processing device 12, the following operations are performed: first, acquiring a specific type of information text from a predetermined data source; and performing segmentation of the obtained information text to obtain a plurality of statements, Each sentence is processed by word segmentation to obtain a corresponding segmentation word, and each sentence and corresponding word segmentation constitute a first mapping corpus; then, according to each obtained first mapping corpus, a first language model of a preset type is trained, and the terminal device 2 is received. After the sent voice is to be recognized, the voice to be recognized is input into the trained first language model for identification, and the recognition result is fed back to the terminal device 2 for display on the terminal device 2 to the terminal user.

In an embodiment, the speech recognition system 10 is stored in the storage device 11 and includes at least one computer readable instruction stored in the storage device 11, the at least one computer readable instruction being executable by the processing device 12 to implement the present application. A method of picture recognition of each embodiment. As described later, the at least one computer readable instruction can be classified into different logic modules depending on the functions implemented by its various parts.

The invention provides a speech recognition method.

Referring to FIG. 2, FIG. 2 is a schematic flowchart of a first embodiment of a voice recognition method according to the present invention.

In a first embodiment, the speech recognition method comprises:

Step S10: Acquire a specific type of information text from a predetermined data source.

In this embodiment, before training the language model, a specific type of information text (for example, a word) is obtained from a predetermined plurality of data sources (for example, Sina Weibo, Baidu Encyclopedia, Wikipedia, Sina News, etc.) in real time or at a time. Articles and their explanations, news headlines, news summaries, Weibo content, etc.). For example, specific types of information (eg, news headline information, index information, profile information, etc.) may be obtained from a predetermined data source (eg, major news websites, forums, etc.) in real time or by means of tools such as web crawlers.

Step S20, performing segmentation of the obtained information texts to obtain a plurality of sentences, performing word segmentation processing on the respective sentences to obtain corresponding segmentation words, and each sentence and the corresponding word segmentation constitute a first mapping corpus.

After obtaining each information text of a specific type from a plurality of predetermined data sources, the obtained information texts may be segmented into sentences, for example, the information texts may be divided into complete statements according to punctuation marks. Then, word segmentation is performed on each segmented sentence. For example, a word segmentation method can be used to perform segmentation processing on each segmented sentence, such as a forward maximum matching method, and a string in a segmented statement is Left to right to word segmentation; or, reverse maximum matching method, to divide the string in a segmented statement from right to left; or, shortest path segmentation, a string in a segmented statement requires cutting The number of words is the least; or, the two-way maximum matching method, the positive and negative simultaneous word segmentation. Word segmentation can also be used to classify each segmented sentence. Word segmentation is a segmentation method for machine speech judgment. It uses syntactic information and semantic information to deal with ambiguity phenomena to segment words. You can also use statistical segmentation to enter the sentences of each segmentation. Line word segmentation processing, from the current user's historical search record or the public user's historical search record, according to the statistics of the phrase, it will be counted that some two adjacent words appear more frequently, then the two adjacent words can be As a phrase to perform word segmentation.

After the word segmentation processing is completed on the obtained segmented sentences, the first mapping corpus composed of the respective segmented sentences and the corresponding segmentation words can be obtained. By obtaining information text from a plurality of predetermined data sources and generating a large number of sentences by segmenting the information text to perform word segmentation processing, a plurality of data sources can be obtained, and the corpus types are rich, the scope is wide, and the number is large. Corpus resources.

Step S30: Train a preset first type language model according to the obtained first mapping corpus, and perform speech recognition based on the trained first language model.

Based on the first mapping corpus, a first language model of a preset type is trained, and the first language model may be a generative model, an analytical model, an identifying model, or the like. Since the first mapping corpus is obtained from multiple data sources, the corpus of the corpus resources is rich in scope, wide in scope and large in number. Therefore, the training effect of using the first mapping corpus to train the first language model is better. Preferably, the recognition accuracy of the speech recognition based on the first language model of the training is higher.

In this embodiment, a sentence segmentation is performed on a specific type of information text acquired from a predetermined data source, and word segmentation processing is performed on each segmented sentence to obtain a first mapping corpus of each segmented sentence and a corresponding segmentation word. A first language model of a preset type is trained according to the first mapping corpus, and speech recognition is performed based on the first language model of the training. Since the corpus resource can be obtained by performing segmentation and corresponding word segmentation on the information text obtained from a plurality of predetermined data sources, and training the language model based on the corpus resource, it is not necessary to obtain the labeled dialogue text, and Obtaining a sufficient number of corpus resources can ensure the training effect and recognition accuracy of the language model, thereby effectively improving the accuracy of speech recognition and effectively reducing the cost of speech recognition.

Further, in other embodiments, the foregoing step S20 may include:

Clean and denoise each obtained information text. For example, for the microblog content, the step of cleaning and denoising includes: deleting the user name, id, and the like from the microblog content, and retaining only the actual content of the microblog; deleting the forwarded microblog content, and generally obtaining the microblog. There is a large amount of Weibo content forwarded in the content. Repeated forwarding of Weibo content will affect the frequency of words. Therefore, the translated Weibo content must be filtered out. The filtering method is to delete all the contents including "forwarding" or "http". Microblog content; filter out the special symbols in the microblog content, and filter out all the preset types of symbols in the microblog content; traditional to simplified, microblog content has a large number of traditional characters, using a predetermined simplified and complex correspondence table Convert all traditional characters to simplified characters, and more.

Sentence segmentation of each information text after cleaning and denoising, for example, a statement between two preset types of break characters "for example, comma, period, exclamation point, etc." as a statement to be segmented, and for each The segmented statements are processed by word segmentation to obtain mapping corpora for each segmented statement and corresponding segmentation (including phrases and words).

As shown in FIG. 3, a second embodiment of the present invention provides a voice recognition method, in the above embodiment. Based on the above, the above step S30 is replaced by:

Step S40: Train a preset first language model according to each of the obtained first mapping corpora.

Step S50, training a preset second language model according to each predetermined sample sentence and a second mapping corpus of the corresponding segmentation. For example, a number of sample statements can be predetermined, such as finding a number of the most frequently occurring or most commonly used sample sentences from a predetermined data source, and determining the correct word segmentation (including phrases and words) for each sample statement to A second language model of a preset type is trained according to each of the predetermined sample sentences and the second mapping corpus of the corresponding word segmentation.

Step S60, mixing the trained first language model and the second language model according to a predetermined model mixing formula to obtain a mixed language model, and performing speech recognition based on the obtained mixed language model. The predetermined model mixing formula can be:

M=a*M1+b*M2

Where M is a mixed language model, M1 represents a first language model of a preset type, a represents a weighting coefficient of a preset model M1, M2 represents a second language model of a preset type, and b represents a weight of a preset model M2. coefficient.

In this embodiment, based on the first mapping model obtained by training the first mapping corpus obtained from the plurality of data sources, the training is obtained according to each of the predetermined sample sentences and the second mapping corpus of the corresponding segmentation. The second language model, for example, the predetermined sample sentence may be a preset most commonly used and correct number of sentences, and thus the trained second language model can correctly recognize the commonly used speech. The trained first language model and the second language model are mixed according to preset different weight ratios to obtain a mixed language model, and the voice recognition is performed based on the obtained mixed language model, which can ensure the richness of the voice recognition type. The range is wide, and it can ensure the correct recognition of commonly used speech, and further improve the accuracy of speech recognition.

Further, in other embodiments, the training process of the preset type of the first language model or the second language model is as follows:

A. Divide each first mapping corpus or each second mapping corpus into a training set of a first ratio (for example, 70%) and a verification set of a second ratio (for example, 30%);

B. training the first language model or the second language model by using the training set;

C. Using the verification set to verify the accuracy of the trained first language model or the second language model, if the accuracy rate is greater than or equal to the preset accuracy rate, the training ends, or if the accuracy rate is less than the preset accuracy rate, then The number of the first mapping corpus or the second mapping corpus is increased and steps A, B, and C are re-executed until the accuracy of the first language model or the second language model of the training is greater than or equal to the preset accuracy rate.

Further, in other embodiments, the preset type of the first language model and/or the second language model is an n-gram language model. The n-gram language model is a commonly used language model in large vocabulary continuous speech recognition. For Chinese, it is called Chinese Language Model (CLM). The Chinese language model uses the collocation information between adjacent words in the context, and when it is necessary to convert a pinyin, a stroke, or a letter representing a letter or a stroke without a space into a Chinese character string (ie, a sentence), The sentence with the highest probability can be calculated, thereby realizing the automatic conversion to the Chinese character, avoiding the problem of the heavy code of many Chinese characters corresponding to the same pinyin (or stroke string, number string). N-gram is a statistical language model used to predict the nth item based on the first (n-1) items. At the application level, these items can be phonemes (speech recognition applications), characters (input method applications), words (word-of-word applications) or base pairs (gene information), and n-gram models can be generated from large-scale text or audio corpora.

The n-gram language model is based on the assumption that the occurrence of the nth word is only related to the first n-1 words, but not to any other words. The probability of the whole sentence is the product of the probability of occurrence of each word. Probability can be obtained by counting the number of simultaneous occurrences of n words directly from the mapping corpus. For a sentence T, assuming that T is composed of the word sequences W1, W2, ..., Wn, then the probability of occurrence of the sentence T P(T) = P(W1W2...Wn) = P(W1)P(W2|W1)P (W3|W1W2)...P(Wn|W1W2...Wn-1). In this embodiment, in order to solve the n-gram with an appearance probability of 0, in the training of the first language model and/or the second language model, the present embodiment adopts a maximum likelihood estimation method, namely:

P(Wn|W1W2...Wn-1)=C(W1W2...Wn)/C(W1W2...Wn-1)

That is to say, in the language model training process, by the number of occurrences of the statistical sequence W1W2...Wn and the number of occurrences of W1W2...Wn-1, the probability of occurrence of the nth word can be calculated to determine the probability of the corresponding word, Speech Recognition.

Further, in other embodiments, the step of performing word segmentation processing on each segmented statement in the above step S20 may include:

According to the forward maximum matching method, the character string to be processed in each sentence is combined with a predetermined word dictionary library (for example, the word dictionary library may be a general word dictionary library, or may be a scalable learning word dictionary library). Matching to get the first matching result;

According to the inverse maximum matching method, the character string to be processed in each sentence is combined with a predetermined word dictionary library (for example, the word dictionary library may be a general word dictionary library, or may be a scalable learning word dictionary library) Matching is performed to obtain a second matching result. The first matching result includes a first number of first phrases, and the second matching result includes a second number of second phrases; the first matching result includes a third number of words The second matching result includes a fourth number of words.

If the first quantity is equal to the second quantity, and the third quantity is less than or equal to the fourth quantity, outputting the first matching result (including a phrase and a single word) corresponding to the segmentation statement ;

If the first quantity is equal to the second quantity, and the third quantity is greater than the fourth quantity, outputting the second matching result (including a phrase and a single word) corresponding to the segmented statement;

If the first quantity is not equal to the second quantity, and the first quantity is greater than the second quantity, outputting the second matching result (including a phrase and a single word) corresponding to the segmented statement;

If the first quantity is not equal to the second quantity, and the first quantity is less than the second quantity, outputting the first matching result (including a phrase and a single word) corresponding to the segmented statement.

In this embodiment, the two-way matching method is used to perform segmentation processing on the obtained segmented sentences, and the segmentation matching is performed by forward and reverse simultaneous segmentation to analyze the sentences to be processed in each segmented sentence. The stickiness of the combined content, because the probability that the phrase can represent the core viewpoint information is usually greater, that is, the core viewpoint information can be expressed more by the phrase. Therefore, through the simultaneous matching of the word segmentation, the word segment matching result with fewer words and more phrases is used as the word segmentation result of the segmented sentence, thereby improving the accuracy of the word segmentation and ensuring the training effect of the language model. And recognition accuracy.

Please refer to FIG. 4, which is a functional block diagram of a preferred embodiment of the speech recognition system 10 of the present invention. In this embodiment, the speech recognition system 10 may be divided into one or more modules, the one or more modules being stored in the memory 11 and being executed by one or more processors (this implementation) For example, the processor 12) is executed to complete the present invention. For example, in FIG. 4, the speech recognition system 10 can be divided into an acquisition module 01, a word segmentation module 02, and a training recognition module 03. A module referred to in the present invention refers to a series of computer program instructions that are capable of performing a particular function, and are more suitable than the program to describe the execution of the speech recognition system 10 in the electronic device 1. The following description will specifically describe the functions of the acquisition module 01, the word segmentation module 02, and the training recognition module 03.

The obtaining module 01 is configured to obtain a specific type of information text from a predetermined data source.

The word segmentation module 02 is configured to perform segmentation of the obtained information texts to obtain a plurality of sentences, perform word segmentation processing on the respective sentences to obtain corresponding segmentation words, and each sentence and the corresponding word segmentation constitute a first mapping corpus.

After obtaining each information text of a specific type from a plurality of predetermined data sources, the obtained information texts may be segmented into sentences, for example, the information texts may be divided into complete statements according to punctuation marks. Then, word segmentation is performed on each segmented sentence. For example, a word segmentation method can be used to perform segmentation processing on each segmented sentence, such as a forward maximum matching method, and a string in a segmented statement is Left to right to word segmentation; or, reverse maximum matching method, to divide the string in a segmented statement from right to left; or, shortest path segmentation, a string in a segmented statement requires cutting The number of words is the least; or, the two-way maximum matching method, the positive and negative simultaneous word segmentation. Word segmentation can also be used to classify each segmented sentence. Word segmentation is a segmentation method for machine speech judgment. It uses syntactic information and semantic information to deal with ambiguity phenomena to segment words. Statistical segmentation can also be used to process word segmentation of each segmented sentence. From the historical search record of the current user or the historical search record of the public user, according to the statistics of the phrase, the frequency of occurrence of some two adjacent words will be compared. If you have more, you can use these two adjacent words as a phrase to perform word segmentation.

After the word segmentation processing is completed on the obtained segmented sentences, the first mapping corpus composed of the respective segmented sentences and the corresponding segmentation words can be obtained. By extracting information text from a plurality of predetermined data sources, and segmenting the information text to generate a large number of sentences for word segmentation processing, multiple data can be obtained The source has access to a corpus resource with a rich corpus type, a wide range, and a large number.

The training identification module 03 is configured to train a preset first language model according to the obtained first mapping corpus, and perform speech recognition based on the trained first language model.

Further, in other embodiments, the word segmentation module 02 is further configured to:

Further, in other embodiments, the training identification module 03 is further configured to:

A first language model of a preset type is trained according to each of the obtained first mapping corpora.

A second language model of a preset type is trained according to each of the predetermined sample sentences and the second mapping corpus of the corresponding word segmentation. For example, a number of sample statements can be predetermined, such as finding a number of the most frequently occurring or most commonly used sample sentences from a predetermined data source, and determining the correct word segmentation (including phrases and words) for each sample statement to A second language model of a preset type is trained according to each of the predetermined sample sentences and the second mapping corpus of the corresponding word segmentation.

The trained first language model and the second language model are mixed according to a predetermined model mixing formula to obtain a mixed language model, and speech recognition is performed based on the obtained mixed language model. The predetermined model mixing formula can be:

M=a*M1+b*M2

Further, in other embodiments, the preset type of the first language model and/or the second language model is an n-gram language model. The n-gram language model is a commonly used language model in large vocabulary continuous speech recognition. For Chinese, it is called Chinese Language Model (CLM). The Chinese language model uses the collocation information between adjacent words in the context. When it is necessary to convert a pinyin, a stroke, or a letter representing a letter or a stroke without a space into a Chinese character string (ie, a sentence), the maximum probability can be calculated. Sentences, thus achieving automatic conversion to Chinese characters, avoiding the problem of repetitive codes in which many Chinese characters correspond to the same pinyin (or stroke string, number string). N-gram is a statistical language model used to predict the nth item based on the first (n-1) items. At the application level, these items can be phonemes (speech recognition applications), characters (input method applications), words (word-of-word applications) or base pairs (gene information), and n-gram models can be generated from large-scale text or audio corpora.

The n-gram language model is based on the assumption that the occurrence of the nth word is only related to the first n-1 words, and is not related to any other words. The probability of the whole sentence is the product of the probability of occurrence of each word. These probabilities can be obtained by counting the number of simultaneous occurrences of n words directly from the mapped corpus. For a sentence T, assuming that T is composed of the word sequences W1, W2, ..., Wn, then the probability of occurrence of the sentence T P(T) = P(W1W2...Wn) = P(W1)P(W2|W1)P (W3|W1W2)...P(Wn|W1W2...Wn-1). In this embodiment, in order to solve the n-gram with an appearance probability of 0, in the training of the first language model and/or the second language model, the present embodiment adopts a maximum likelihood estimation method, namely:

P(Wn|W1W2...Wn-1)=C(W1W2...Wn)/C(W1W2...Wn-1)

In this embodiment, the two-way matching method is adopted to perform word segmentation processing on each segmented sentence obtained, and the word segmentation matching is performed by forward and reverse simultaneous segmentation to analyze the viscosity of the combined content in the string to be processed of each segmented sentence, since usually In the case where the phrase can represent the core viewpoint information, the probability is greater, that is, the core viewpoint information can be expressed more by the phrase. Therefore, through the simultaneous matching of the word segmentation, the word segment matching result with fewer words and more phrases is used as the word segmentation result of the segmented sentence, thereby improving the accuracy of the word segmentation and ensuring the training effect of the language model. And recognition accuracy

Moreover, the present invention also provides a computer readable storage medium storing a speech recognition system, the speech recognition system being executable by at least one processing device to cause the at least one processing device to perform The steps of the speech recognition method in the above embodiment, the language The specific implementation processes of steps S10, S20, and S30 of the tone recognition method are as described above, and are not described herein again.

It is to be understood that the term "comprises", "comprising", or any other variants thereof, is intended to encompass a non-exclusive inclusion, such that a process, method, article, or device comprising a series of elements includes those elements. It also includes other elements that are not explicitly listed, or elements that are inherent to such a process, method, article, or device. An element that is defined by the phrase "comprising a ..." does not exclude the presence of additional equivalent elements in the process, method, item, or device that comprises the element.

Through the description of the above embodiments, those skilled in the art can clearly understand that the foregoing embodiment method can be implemented by means of software plus a necessary general hardware platform, and can also be implemented by hardware, but in many cases, the former is A better implementation. Based on such understanding, the technical solution of the present invention, which is essential or contributes to the prior art, may be embodied in the form of a software product stored in a storage medium (such as ROM/RAM, disk, The optical disc includes a number of instructions for causing a terminal device (which may be a cell phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the methods described in various embodiments of the present invention.

The preferred embodiments of the present invention have been described above with reference to the drawings, and are not intended to limit the scope of the invention. The serial numbers of the embodiments of the present invention are merely for the description, and do not represent the advantages and disadvantages of the embodiments. Additionally, although logical sequences are shown in the flowcharts, in some cases the steps shown or described may be performed in a different order than the ones described herein.

A person skilled in the art can implement the invention in various variants without departing from the scope and spirit of the invention. For example, the features of one embodiment can be used in another embodiment to obtain a further embodiment. Any modifications, equivalent substitutions and improvements made within the technical concept of the invention are intended to be included within the scope of the invention.

Claims

A speech recognition method, characterized in that the method comprises the following steps:

A. Obtaining a specific type of information text from a predetermined data source;

B. Performing segmentation of the obtained information texts to obtain a plurality of sentences, performing word segmentation processing on each sentence to obtain corresponding word segments, and each sentence and corresponding word segmentation constitute a first mapping corpus;

C. Train a preset first language model according to the obtained first mapping corpus, and perform speech recognition based on the trained first language model.
The speech recognition method according to claim 1, wherein said step C is replaced by:

Training a first language model of a preset type according to each of the obtained first mapping corpora;

Training a second language model of a preset type according to each predetermined sample sentence and a second mapping corpus of the corresponding word segment;

The trained first language model and the second language model are mixed according to a predetermined model mixing formula to obtain a mixed language model, and speech recognition is performed based on the obtained mixed language model.
The speech recognition method according to claim 2, wherein said predetermined model mixing formula is:

M=a*M1+b*M2

Where M is a mixed language model, M1 represents a first language model of a preset type, a represents a weighting coefficient of a preset model M1, M2 represents a second language model of a preset type, and b represents a weight of a preset model M2. coefficient.
The speech recognition method according to claim 2 or 3, wherein the first language model and/or the second language model of the preset type is an n-gram language model, and the first language of the preset type The training process for the model or the second language model is as follows:

S1, dividing each first mapping corpus or each second mapping corpus into a training set of a first ratio and a verification set of a second ratio;

S2, training the first language model or the second language model by using the training set;

S3. Verify the accuracy of the first language model or the second language model of the training by using the verification set. If the accuracy rate is greater than or equal to the preset accuracy rate, the training ends, or if the accuracy rate is less than the preset accuracy rate, The number of the first mapping corpus or the second mapping corpus is increased and steps S1, S2, S3 are re-executed.
The speech recognition method according to claim 1, wherein the step of performing word segmentation processing on each of the segmented sentences comprises:

When a segmented statement is selected for word segmentation processing, the segmented statement is matched with a predetermined word dictionary library according to a forward maximum matching method to obtain a first matching result, and the first matching result The matching result includes a first number of first phrases and a third number of words;

Matching the segmented statement with the predetermined word dictionary library according to the inverse maximum matching method to obtain a second matching result, where the second matching result includes a second number of second phrases and a fourth number of words;

If the first quantity is equal to the second quantity, and the third quantity is less than or equal to the fourth quantity, the first matching result is used as a word segmentation result of the segmented statement;

If the first quantity is equal to the second quantity, and the third quantity is greater than the fourth quantity, the second matching result is used as a word segmentation result of the segmented statement;

If the first quantity is not equal to the second quantity, and the first quantity is greater than the second quantity, the second matching result is used as a word segmentation result of the segmented statement;

If the first quantity is not equal to the second quantity, and the first quantity is less than the second quantity, the first matching result is used as a word segmentation result of the segmented statement.
A speech recognition system, characterized in that the speech recognition system comprises:

An obtaining module, configured to obtain a specific type of information text from a predetermined data source;

The word segmentation module is used for segmenting the obtained information texts to obtain a plurality of sentences, and performing word segmentation processing on each sentence to obtain corresponding word segments, and each sentence and corresponding word segmentation constitute a first mapping corpus;

And a training identification module, configured to train a preset first type language model according to the obtained first mapping corpus, and perform speech recognition based on the trained first language model.
The speech recognition system according to claim 6, wherein the training recognition module is further configured to:

Training a first language model of a preset type according to each of the obtained first mapping corpora;

Training a second language model of a preset type according to each predetermined sample sentence and a second mapping corpus of the corresponding word segment;

The trained first language model and the second language model are mixed according to a predetermined model mixing formula to obtain a mixed language model, and speech recognition is performed based on the obtained mixed language model.
The speech recognition system of claim 7 wherein said predetermined model blending formula is:

M=a*M1+b*M2

Where M is a mixed language model, M1 represents a first language model of a preset type, a represents a weighting coefficient of a preset model M1, M2 represents a second language model of a preset type, and b represents a weight of a preset model M2. coefficient.
The speech recognition system according to claim 7 or 8, wherein the first language model and/or the second language model of the preset type is an n-gram language model, and the first type of the preset type The training process of the language model or the second language model is as follows:

S1, dividing each first mapping corpus or each second mapping corpus into a training set of a first ratio and a verification set of a second ratio;

S2, training the first language model or the second language model by using the training set;

S3. Verify the accuracy of the first language model or the second language model of the training by using the verification set. If the accuracy rate is greater than or equal to the preset accuracy rate, the training ends, or if the accuracy rate is less than the preset accuracy rate, The number of the first mapping corpus or the second mapping corpus is increased and steps S1, S2, S3 are re-executed.
The speech recognition system according to claim 6, wherein said word segmentation module is further configured to:

When a segmented statement is selected for word segmentation processing, the segmented statement is matched with a predetermined word dictionary library according to a forward maximum matching method to obtain a first matching result, where the first matching result includes a first number of first phrases and a third number of words;

Matching the segmented statement with the predetermined word dictionary library according to the inverse maximum matching method to obtain a second matching result, where the second matching result includes a second number of second phrases and a fourth number of words;

If the first quantity is equal to the second quantity, and the third quantity is less than or equal to the fourth quantity, the first matching result is used as a word segmentation result of the segmented statement;

If the first quantity is equal to the second quantity, and the third quantity is greater than the fourth quantity, the second matching result is used as a word segmentation result of the segmented statement;

If the first quantity is not equal to the second quantity, and the first quantity is greater than the second quantity, the second matching result is used as a word segmentation result of the segmented statement;

If the first quantity is not equal to the second quantity, and the first quantity is less than the second quantity, the first matching result is used as a word segmentation result of the segmented statement.
An electronic device comprising a processing device, a storage device, and a voice recognition system, the voice recognition system being stored in the storage device, comprising at least one computer readable instruction executable by the processing device, To achieve the following:

A. Obtaining a specific type of information text from a predetermined data source;

B. Performing segmentation of the obtained information texts to obtain a plurality of sentences, performing word segmentation processing on each sentence to obtain corresponding word segments, and each sentence and corresponding word segmentation constitute a first mapping corpus;

C. Train a preset first language model according to the obtained first mapping corpus, and perform speech recognition based on the trained first language model.
The electronic device of claim 11 wherein said at least one computer readable instruction is further executable by said processing device to:

Training a first language model of a preset type according to each of the obtained first mapping corpora;

Training presets according to each predetermined sample sentence and the second mapping corpus of the corresponding word segmentation a second language model of the type;

The trained first language model and the second language model are mixed according to a predetermined model mixing formula to obtain a mixed language model, and speech recognition is performed based on the obtained mixed language model.
The electronic device according to claim 12, wherein said predetermined model mixing formula is:

M=a*M1+b*M2

Where M is a mixed language model, M1 represents a first language model of a preset type, a represents a weighting coefficient of a preset model M1, M2 represents a second language model of a preset type, and b represents a weight of a preset model M2. coefficient.
The electronic device according to claim 12 or 13, wherein the first language model and/or the second language model of the preset type is an n-gram language model, and the first language model of the preset type Or the training process of the second language model is as follows:

S1, dividing each first mapping corpus or each second mapping corpus into a training set of a first ratio and a verification set of a second ratio;

S2, training the first language model or the second language model by using the training set;

S3. Verify the accuracy of the first language model or the second language model of the training by using the verification set. If the accuracy rate is greater than or equal to the preset accuracy rate, the training ends, or if the accuracy rate is less than the preset accuracy rate, The number of the first mapping corpus or the second mapping corpus is increased and steps S1, S2, S3 are re-executed.
The electronic device according to claim 11, wherein said word segmentation processing for each segmented sentence comprises:

When a segmented statement is selected for word segmentation processing, the segmented statement is matched with a predetermined word dictionary library according to a forward maximum matching method to obtain a first matching result, where the first matching result includes a first number of first phrases and a third number of words;

Matching the segmented statement with the predetermined word dictionary library according to the inverse maximum matching method to obtain a second matching result, where the second matching result includes a second number of second phrases and a fourth number of words;

If the first quantity is equal to the second quantity, and the third quantity is less than or equal to the fourth quantity, the first matching result is used as a word segmentation result of the segmented statement;

If the first quantity is equal to the second quantity, and the third quantity is greater than the fourth quantity, the second matching result is used as a word segmentation result of the segmented statement;

If the first quantity is not equal to the second quantity, and the first quantity is greater than the second quantity, the second matching result is used as a word segmentation result of the segmented statement;

If the first quantity is not equal to the second quantity, and the first quantity is less than the second quantity, the first matching result is used as a word segmentation result of the segmented statement.
A computer readable storage medium having stored thereon at least one computer readable instruction executable by a processing device to:

A. Obtaining a specific type of information text from a predetermined data source;

B. Performing segmentation of the obtained information texts to obtain a plurality of sentences, performing word segmentation processing on each sentence to obtain corresponding word segments, and each sentence and corresponding word segmentation constitute a first mapping corpus;

C. Train a preset first language model according to the obtained first mapping corpus, and perform speech recognition based on the trained first language model.
The computer readable storage medium of claim 16 wherein said at least one computer readable instruction is further executable by said processing device to:

Training a first language model of a preset type according to each of the obtained first mapping corpora;

Training a second language model of a preset type according to each predetermined sample sentence and a second mapping corpus of the corresponding word segment;

The trained first language model and the second language model are mixed according to a predetermined model mixing formula to obtain a mixed language model, and speech recognition is performed based on the obtained mixed language model.
The computer readable storage medium of claim 17 wherein said predetermined model blending formula is:

M=a*M1+b*M2

Where M is a mixed language model, M1 represents a first language model of a preset type, a represents a weighting coefficient of a preset model M1, M2 represents a second language model of a preset type, and b represents a weight of a preset model M2. coefficient.
The computer readable storage medium according to claim 17 or 18, wherein the preset type of the first language model and/or the second language model is an n-gram language model, the preset type of The training process of a language model or a second language model is as follows:

S1, dividing each first mapping corpus or each second mapping corpus into a training set of a first ratio and a verification set of a second ratio;

S2, training the first language model or the second language model by using the training set;

S3. Verify the accuracy of the first language model or the second language model of the training by using the verification set. If the accuracy rate is greater than or equal to the preset accuracy rate, the training ends, or if the accuracy rate is less than the preset accuracy rate, The number of the first mapping corpus or the second mapping corpus is increased and steps S1, S2, S3 are re-executed.
The computer readable storage medium of claim 16 wherein said word segmentation of each segmented statement comprises:

When a segmented statement is selected for word segmentation, the slice is cut according to the forward maximum matching method. The segmentation statement is matched with a predetermined word dictionary library to obtain a first matching result, where the first matching result includes a first number of first phrases and a third number of words;

Matching the segmented statement with the predetermined word dictionary library according to the inverse maximum matching method to obtain a second matching result, where the second matching result includes a second number of second phrases and a fourth number of words;

If the first quantity is equal to the second quantity, and the third quantity is less than or equal to the fourth quantity, the first matching result is used as a word segmentation result of the segmented statement;

If the first quantity is equal to the second quantity, and the third quantity is greater than the fourth quantity, the second matching result is used as a word segmentation result of the segmented statement;

If the first quantity is not equal to the second quantity, and the first quantity is greater than the second quantity, the second matching result is used as a word segmentation result of the segmented statement;

If the first quantity is not equal to the second quantity, and the first quantity is less than the second quantity, the first matching result is used as a word segmentation result of the segmented statement.