WO2018205389A1 - Voice recognition method and system, electronic apparatus and medium - Google Patents


Info

Publication number
WO2018205389A1
Authority
WO
WIPO (PCT)
Prior art keywords
language model
segmented
training
preset
word segmentation
Application number
PCT/CN2017/091353
Other languages
French (fr)
Chinese (zh)
Inventor
王健宗
程宁
查高密
肖京
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2018205389A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models

Definitions

  • the present invention relates to the field of computer technologies, and in particular, to a voice recognition method, system, electronic device, and medium.
  • the language model plays an important role in the speech recognition task.
  • the language model is generally established by using the annotated dialogue text, and the probability of each word is determined by the language model.
  • However, the corpus available for establishing a language model from labeled dialogue text is too limited: although users now need voice recognition technology throughout daily life (common scenarios include voice search, voice control, etc.), the types and scope of corpus that can be collected are too concentrated. This brings two shortcomings: first, labeled dialogue text is expensive to purchase, so the cost is high; second, it is difficult to obtain a sufficient amount of labeled dialogue text, and the timeliness and accuracy of its upgrade and expansion are difficult to guarantee. This in turn affects the training effect and recognition accuracy of the language model, and thus the accuracy of speech recognition.
  • the main object of the present invention is to provide a speech recognition method, system, electronic device and medium, which aim to effectively improve the accuracy of speech recognition and effectively reduce the cost of speech recognition.
  • a first aspect of the present application provides a voice recognition method, where the method includes the following steps:
  • a second aspect of the present application provides a voice recognition system, where the voice recognition system includes:
  • An obtaining module configured to obtain a specific type of information text from a predetermined data source
  • the word segmentation module is used for segmenting the obtained information texts to obtain a plurality of sentences, and performing word segmentation processing on each sentence to obtain corresponding word segments, and each sentence and corresponding word segmentation constitute a first mapping corpus;
  • a training identification module configured to train a preset first type language model according to the obtained first mapping corpus, and perform speech recognition based on the trained first language model.
  • A third aspect of the present application provides an electronic device, including a processing device, a storage device, and a voice recognition system, the voice recognition system being stored in the storage device and including at least one computer readable instruction, the at least one computer readable instruction being executable by the processing device to:
  • a fourth aspect of the present application provides a computer readable storage medium having stored thereon at least one computer readable instruction executable by a processing device to:
  • The speech recognition method, system, electronic device and medium provided by the invention segment a specific type of information text acquired from a predetermined data source into sentences, and perform word segmentation processing on each segmented sentence to obtain a first mapping corpus of each segmented sentence and its corresponding word segments; a first language model of a preset type is trained according to the first mapping corpora, and speech recognition is performed based on the trained first language model.
  • Because corpus resources can be obtained by sentence segmentation and corresponding word segmentation of information text acquired from a plurality of predetermined data sources, and the language model is trained on these corpus resources, there is no need to obtain labeled dialogue text; moreover, a sufficient quantity of corpus resources can be obtained to guarantee the training effect and recognition accuracy of the language model, thereby effectively improving the accuracy of speech recognition and effectively reducing its cost.
  • FIG. 1 is a schematic diagram of an application environment of a preferred embodiment of a voice recognition method according to the present invention
  • FIG. 2 is a schematic flow chart of a first embodiment of a voice recognition method according to the present invention.
  • FIG. 3 is a schematic flow chart of a second embodiment of a voice recognition method according to the present invention.
  • FIG. 4 is a schematic diagram of functional modules of an embodiment of a speech recognition system of the present invention.
  • Referring to FIG. 1, a schematic diagram of an application environment of a preferred embodiment of the speech recognition method of the present invention is shown.
  • the application environment diagram includes an electronic device 1 and a terminal device 2.
  • the electronic device 1 can perform data interaction with the terminal device 2 through a suitable technology such as a network or a near field communication technology.
  • The terminal device 2 includes, but is not limited to, any electronic product that can interact with a user through a keyboard, a mouse, a remote controller, a touch pad, or a voice control device, for example, a personal computer, a tablet computer, a smart phone, a personal digital assistant (PDA), a game console, an Internet Protocol Television (IPTV), a smart wearable device, and the like.
  • the electronic device 1 is an apparatus capable of automatically performing numerical calculation and/or information processing in accordance with an instruction set or stored in advance.
  • The electronic device 1 may be a computer, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing, where cloud computing is a type of distributed computing: a super virtual computer consisting of a loosely coupled set of computers.
  • the electronic device 1 includes, but is not limited to, a storage device 11, a processing device 12, and a network interface 13 that are communicably connected to each other through a system bus. It should be noted that FIG. 1 only shows the electronic device 1 having the components 11-13, but it should be understood that not all illustrated components are required to be implemented, and more or fewer components may be implemented instead.
  • the storage device 11 includes a memory and at least one type of readable storage medium.
  • the memory provides a cache for the operation of the electronic device 1;
  • the readable storage medium may be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card, a card type memory, or the like.
  • In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1; in other embodiments, the non-volatile storage medium may also be an external storage device of the electronic device 1, such as a plug-in hard disk, a smart memory card (SMC), a Secure Digital (SD) card, or a flash card equipped on the electronic device 1.
  • the readable storage medium of the storage device 11 is generally used to store an operating system installed in the electronic device 1 and various types of application software, such as program codes of the voice recognition system 10 in an embodiment of the present application. Further, the storage device 11 can also be used to temporarily store various types of data that have been output or are to be output.
  • Processing device 12 may, in some embodiments, include one or more microprocessors, microcontrollers, digital processors, and the like.
  • the processing device 12 is generally used to control the operation of the electronic device 1, for example, to perform control and processing related to data interaction or communication with the terminal device 2.
  • the processing device 12 is configured to run program code or process data stored in the storage device 11, such as running the speech recognition system 10 or the like.
  • the network interface 13 may comprise a wireless network interface or a wired network interface, which is typically used to establish a communication connection between the electronic device 1 and other electronic devices.
  • the network interface 13 is mainly used to connect the electronic device 1 with one or more terminal devices 2, and establish a data transmission channel and a communication connection between the electronic device 1 and one or more terminal devices 2.
  • the speech recognition system 10 includes at least one computer readable instruction stored in the storage device 11, and the at least one computer readable instruction can be executed by the processing device 12 to implement the voice recognition method of the embodiments of the present application. As described later, the at least one computer readable instruction can be classified into different logic modules depending on the functions implemented by its various parts.
  • When the speech recognition system 10 is executed by the processing device 12, the following operations are performed: first, a specific type of information text is acquired from a predetermined data source; the obtained information text is segmented to obtain a plurality of sentences, word segmentation processing is performed on each sentence to obtain the corresponding word segments, and each sentence together with its corresponding word segments constitutes a first mapping corpus; then, a first language model of a preset type is trained according to each obtained first mapping corpus, and after the to-be-recognized speech sent by the terminal device 2 is received, it is input into the trained first language model for recognition, and the recognition result is fed back to the terminal device 2 for display to the end user.
  • the invention provides a speech recognition method.
  • FIG. 2 is a schematic flowchart of a first embodiment of a voice recognition method according to the present invention.
  • the speech recognition method comprises:
  • Step S10: Acquire a specific type of information text from a predetermined data source.
  • In this embodiment, specific types of information text (for example, entries and their explanations, news headlines, news summaries, Weibo content, etc.) are obtained, in real time or at regular intervals, from a plurality of predetermined data sources (for example, Sina Weibo, Baidu Encyclopedia, Wikipedia, Sina News, etc.).
  • Specific types of information include, for example, news headline information, index information, and profile information.
  • A predetermined data source may be, for example, a major news website or a forum.
  • Step S20: Segment each obtained information text to obtain a plurality of sentences, and perform word segmentation processing on each sentence to obtain the corresponding word segments; each sentence and its corresponding word segments constitute a first mapping corpus.
  • the obtained information texts may be segmented into sentences, for example, the information texts may be divided into complete statements according to punctuation marks.
  • word segmentation is performed on each segmented sentence.
  • For example, a dictionary-based word segmentation method can be used to process each segmented sentence: the forward maximum matching method segments the character string in a sentence from left to right; the reverse maximum matching method segments the character string in a sentence from right to left; the shortest-path segmentation method requires that the number of words cut out of the character string in a sentence be the smallest; and the bidirectional maximum matching method performs forward and reverse segmentation simultaneously.
  • Understanding-based word segmentation can also be used on each segmented sentence: the machine simulates human understanding of the sentence, using syntactic information and semantic information to resolve ambiguity when segmenting words. Alternatively, statistical word segmentation can be applied: from the current user's historical search records or the public users' historical search records, phrase statistics are collected; if certain two adjacent characters appear together frequently, the two adjacent characters can be treated as a phrase for word segmentation.
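As an illustrative sketch (not part of the original disclosure), the forward maximum matching method described above can be implemented as follows; the toy dictionary and the maximum word length are assumptions for the example.

```python
def forward_max_match(sentence, dictionary, max_len=4):
    """Greedy left-to-right segmentation: at each position take the longest
    dictionary word, falling back to a single character."""
    tokens, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

# Classic ambiguous string with a toy dictionary
vocab = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", vocab))  # ['研究生', '命', '起源']
```

Reverse maximum matching is the same greedy scan run from right to left; on this example it would instead yield ['研究', '生命', '起源'], which is why bidirectional matching compares the two results.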
  • In this way, the first mapping corpus composed of each segmented sentence and its corresponding word segments can be obtained. Because the information text is acquired from multiple data sources, corpus resources that are rich in type, wide in scope, and large in number can be obtained.
  • Step S30: Train a first language model of a preset type according to each obtained first mapping corpus, and perform speech recognition based on the trained first language model.
  • According to each obtained first mapping corpus, a first language model of a preset type is trained; the first language model may be, for example, a generative model, an analytical model, or a discriminative model. Because the first mapping corpora are obtained from multiple data sources, the corpus resources are rich in type, wide in scope and large in number, so the training effect of using the first mapping corpora to train the first language model is better, and the recognition accuracy of speech recognition based on the trained first language model is correspondingly higher.
  • a sentence segmentation is performed on a specific type of information text acquired from a predetermined data source, and word segmentation processing is performed on each segmented sentence to obtain a first mapping corpus of each segmented sentence and a corresponding segmentation word.
  • a first language model of a preset type is trained according to the first mapping corpus, and speech recognition is performed based on the first language model of the training.
  • Because corpus resources can be obtained by sentence segmentation and corresponding word segmentation of information text acquired from a plurality of predetermined data sources, and the language model is trained on these corpus resources, there is no need to obtain labeled dialogue text; moreover, a sufficient quantity of corpus resources can be obtained to guarantee the training effect and recognition accuracy of the language model, thereby effectively improving the accuracy of speech recognition and effectively reducing its cost.
  • step S20 may include:
  • The step of cleaning and denoising includes: deleting user names, ids, and the like from the Weibo content, retaining only the actual content of each post; deleting forwarded Weibo content, since the obtained content generally includes reposts, and repeatedly forwarded content would distort word frequencies, so forwarded content must be filtered out, for example by deleting all content containing "forwarding" or "http"; filtering out special symbols, i.e., removing all symbols of preset types from the Weibo content; and converting traditional characters to simplified characters, since Weibo content contains a large number of traditional characters, using a predetermined traditional-simplified correspondence table to convert all traditional characters into simplified ones.
  • Each information text after cleaning and denoising is then segmented into sentences; for example, the text between two break characters of preset types (for example, comma, period, exclamation point, etc.) is taken as a sentence to be segmented, and word segmentation processing is performed on each segmented sentence to obtain the mapping corpus of each segmented sentence and its corresponding word segments (including phrases and single words).
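The cleaning, denoising, and sentence-splitting steps above can be sketched as follows; the regular expressions, the symbol set, and the use of "转发" ("repost") as a forwarding marker are illustrative assumptions, not the patent's exact filters.

```python
import re

def clean_weibo(posts):
    """Drop reposts (content containing 'http' or the repost marker), strip
    user names/ids and preset special symbols, keep only actual content."""
    kept = []
    for post in posts:
        if "http" in post or "转发" in post:    # filter forwarded content
            continue
        post = re.sub(r"@\S+", "", post)         # delete user names / ids
        post = re.sub(r"[#\[\]【】…]", "", post)  # filter preset special symbols
        if post.strip():
            kept.append(post.strip())
    return kept

def split_sentences(text):
    """Treat the text between preset break characters as one sentence."""
    return [s for s in re.split(r"[，。！？,.!?]", text) if s]
```

For example, `clean_weibo(["@user1 转发微博", "看新闻 http://t.cn/abc", "@user2 今天发布了新系统。效果不错！"])` keeps only the third post, stripped of its user name, and `split_sentences` then divides it into two sentences to be segmented.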
  • A second embodiment of the present invention provides a voice recognition method; on the basis of the above embodiment, the above step S30 is replaced by:
  • Step S40: Train a first language model of a preset type according to each obtained first mapping corpus.
  • Step S50: Train a second language model of a preset type according to each predetermined sample sentence and the second mapping corpus of its corresponding word segments.
  • In this embodiment, a number of sample sentences can be predetermined, for example by finding a number of the most frequently occurring or most commonly used sentences from a predetermined data source, and the correct word segmentation (including phrases and single words) is determined for each sample sentence to construct the second mapping corpora of sample sentences and corresponding word segments; a second language model of a preset type is then trained according to each predetermined sample sentence and the second mapping corpus of its corresponding word segments.
  • Step S60: Mix the trained first language model and the second language model according to a predetermined model mixing formula to obtain a mixed language model, and perform speech recognition based on the obtained mixed language model.
  • the predetermined model mixing formula can be: M = a × M1 + b × M2, where:
  • M represents the mixed language model;
  • M1 represents the first language model of the preset type;
  • a represents the preset weighting coefficient of the model M1;
  • M2 represents the second language model of the preset type;
  • b represents the preset weighting coefficient of the model M2.
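Read as a weighted mixture of the two models (the linear form is the standard reading of the weighted combination described above, not a quotation of the patent's own equation), mixing amounts to interpolating the probabilities the two models assign to the same word sequence; the weight values below are illustrative assumptions.

```python
def mixed_prob(p_first, p_second, a=0.7, b=0.3):
    """Probability the mixed model assigns to a word sequence, given the
    probabilities p_first (from M1) and p_second (from M2); a and b are
    the preset weighting coefficients (values here are assumptions)."""
    return a * p_first + b * p_second
```

Typically the weights are chosen so that a + b = 1, which keeps the mixture a valid probability.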
  • The second language model is trained according to each predetermined sample sentence and the second mapping corpus of its corresponding word segments; for example, the predetermined sample sentences may be a preset number of the most commonly used, correctly segmented sentences, so the trained second language model can correctly recognize commonly used speech.
  • The trained first language model and the second language model are mixed according to preset weight ratios to obtain a mixed language model, and speech recognition is performed based on the obtained mixed language model. This ensures that the recognizable speech is rich in type and wide in range while also guaranteeing correct recognition of commonly used speech, further improving the accuracy of speech recognition.
  • The training process of the first language model or the second language model of the preset type is as follows:
  • A. Divide each first mapping corpus or each second mapping corpus into a training set at a first ratio (for example, 70%) and a verification set at a second ratio (for example, 30%);
  • B. Train the first language model or the second language model using the training set; C. Verify the accuracy of the trained model using the verification set: if the accuracy is greater than or equal to a preset accuracy rate, the training ends; or, if the accuracy is less than the preset accuracy rate, the number of first mapping corpora or second mapping corpora is increased and steps A, B, and C are re-executed until the accuracy of the trained first language model or second language model is greater than or equal to the preset accuracy rate.
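A minimal sketch of the corpus split and the retrain-until-accurate loop described above; `train` and `evaluate` are hypothetical stand-ins for the model-specific training and verification steps.

```python
import random

def split_corpus(corpora, train_ratio=0.7, seed=42):
    """Step A: divide mapping corpora into a training set (first ratio,
    e.g. 70%) and a verification set (second ratio, e.g. 30%)."""
    items = list(corpora)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_ratio)
    return items[:cut], items[cut:]

def train_until_accurate(get_more_corpora, train, evaluate, target=0.9):
    """Repeat steps A (split), B (train), C (verify), enlarging the corpus
    whenever verification accuracy is below the preset rate."""
    corpora = get_more_corpora()
    while True:
        train_set, verify_set = split_corpus(corpora)   # step A
        model = train(train_set)                        # step B
        if evaluate(model, verify_set) >= target:       # step C
            return model
        corpora = corpora + get_more_corpora()          # increase corpora, redo A/B/C
```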
  • the preset type of the first language model and/or the second language model is an n-gram language model.
  • the n-gram language model is a commonly used language model in large vocabulary continuous speech recognition. For Chinese, it is called Chinese Language Model (CLM).
  • CLM Chinese Language Model
  • The Chinese language model uses the collocation information between adjacent words in context: when a pinyin string, stroke string, or number string representing letters or strokes, written without spaces, needs to be converted into a Chinese character string (i.e., a sentence), the sentence with the highest probability can be calculated, thereby realizing automatic conversion to Chinese characters and avoiding the duplicate-candidate problem of many Chinese characters corresponding to the same pinyin (or stroke string, or number string).
  • An n-gram is a statistical language model used to predict the nth item from the preceding (n-1) items.
  • The items can be phonemes (in speech recognition applications), characters (in input method applications), words (in word segmentation applications), or base pairs (in gene sequence analysis), and an n-gram model can be trained from large-scale text or audio corpora.
  • The n-gram language model is based on the assumption that the occurrence of the nth word is related only to the preceding n-1 words and not to any other words.
  • To estimate these conditional probabilities, the present embodiment adopts the maximum likelihood estimation method, namely: P(w_n | w_1 ... w_(n-1)) = C(w_1 ... w_n) / C(w_1 ... w_(n-1)), where C(·) denotes the number of times the given word sequence occurs in the training corpus.
  • In this way, the probability of occurrence of the nth word can be calculated to determine the probability of each candidate word sequence, and speech recognition can be performed accordingly.
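For n = 2 the maximum likelihood estimate reduces to counting bigrams and unigrams; a minimal sketch follows, where the toy corpus is an assumption for illustration.

```python
from collections import Counter

def bigram_mle(sentences):
    """P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}), estimated by counting
    over tokenized sentences."""
    unigram, bigram = Counter(), Counter()
    for sent in sentences:
        unigram.update(sent)
        bigram.update(zip(sent, sent[1:]))
    def prob(prev, word):
        return bigram[(prev, word)] / unigram[prev] if unigram[prev] else 0.0
    return prob

p = bigram_mle([["语音", "识别"], ["语音", "识别"], ["语音", "搜索"]])
print(p("语音", "识别"))  # 2/3: "识别" follows "语音" in 2 of its 3 occurrences
```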
  • Further, the step of performing word segmentation processing on each segmented sentence in the above step S20 may include:
  • matching the character string to be processed in each sentence, from front to back, against a predetermined word dictionary library (for example, the word dictionary library may be a general word dictionary library, or may be a scalable, learning word dictionary library) to obtain a first matching result;
  • matching the character string to be processed in each sentence, from back to front, against the predetermined word dictionary library to obtain a second matching result.
  • The first matching result includes a first number of phrases and a third number of single words; the second matching result includes a second number of phrases and a fourth number of single words.
  • If the first number is equal to the second number and the third number is less than or equal to the fourth number, the first matching result (including phrases and single words) is output as the word segmentation of the sentence;
  • if the first number is equal to the second number and the third number is greater than the fourth number, the second matching result (including phrases and single words) is output;
  • if the first number is not equal to the second number and the first number is greater than the second number, the second matching result (including phrases and single words) is output;
  • if the first number is not equal to the second number and the first number is less than the second number, the first matching result (including phrases and single words) is output.
  • In this embodiment, the bidirectional matching method is used to segment each obtained sentence: forward and reverse segmentation are performed simultaneously, and the cohesion of the character combinations in each sentence to be processed is analyzed.
  • Because a phrase usually has a greater probability of representing core viewpoint information, that is, core viewpoint information is more often expressed by phrases, the matching result with fewer single words and more phrases is taken, through simultaneous forward and reverse matching, as the word segmentation result of the sentence, thereby improving the accuracy of word segmentation and ensuring the training effect and recognition accuracy of the language model.
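The selection rules above can be sketched as follows, treating tokens longer than one character as phrases and single-character tokens as single words; the example segmentations are assumptions for illustration.

```python
def pick_segmentation(first, second):
    """Choose between the forward (first) and reverse (second) matching
    results per the rules above: with equal phrase counts, fewer single
    words wins; with unequal phrase counts, fewer phrases wins."""
    p1 = sum(1 for t in first if len(t) > 1)    # first number: phrases
    p2 = sum(1 for t in second if len(t) > 1)   # second number: phrases
    s1 = sum(1 for t in first if len(t) == 1)   # third number: single words
    s2 = sum(1 for t in second if len(t) == 1)  # fourth number: single words
    if p1 == p2:
        return first if s1 <= s2 else second
    return second if p1 > p2 else first

forward = ["南京市", "长江大桥"]           # 2 phrases, 0 single words
reverse = ["南京", "市长", "江", "大桥"]   # 3 phrases, 1 single word
print(pick_segmentation(forward, reverse))  # ['南京市', '长江大桥']
```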
  • FIG. 4 is a functional block diagram of a preferred embodiment of the speech recognition system 10 of the present invention.
  • The speech recognition system 10 may be divided into one or more modules, the one or more modules being stored in the storage device 11 and executed by one or more processors (in this embodiment, the processing device 12) to complete the present invention.
  • the speech recognition system 10 can be divided into an acquisition module 01, a word segmentation module 02, and a training recognition module 03.
  • A module referred to in the present invention is a series of computer program instructions capable of performing a particular function, and is more suitable than a whole program for describing the execution of the speech recognition system 10 in the electronic device 1. The following description specifically describes the functions of the acquisition module 01, the word segmentation module 02, and the training recognition module 03.
  • the obtaining module 01 is configured to obtain a specific type of information text from a predetermined data source.
  • In this embodiment, specific types of information text (for example, entries and their explanations, news headlines, news summaries, Weibo content, etc.) are obtained, in real time or at regular intervals, from a plurality of predetermined data sources (for example, Sina Weibo, Baidu Encyclopedia, Wikipedia, Sina News, etc.).
  • Specific types of information include, for example, news headline information, index information, and profile information.
  • A predetermined data source may be, for example, a major news website or a forum.
  • The word segmentation module 02 is configured to segment each obtained information text to obtain a plurality of sentences and to perform word segmentation processing on each sentence to obtain the corresponding word segments; each sentence and its corresponding word segments constitute a first mapping corpus.
  • the obtained information texts may be segmented into sentences, for example, the information texts may be divided into complete statements according to punctuation marks.
  • word segmentation is performed on each segmented sentence.
  • For example, a dictionary-based word segmentation method can be used to process each segmented sentence: the forward maximum matching method segments the character string in a sentence from left to right; the reverse maximum matching method segments the character string in a sentence from right to left; the shortest-path segmentation method requires that the number of words cut out of the character string in a sentence be the smallest; and the bidirectional maximum matching method performs forward and reverse segmentation simultaneously.
  • Understanding-based word segmentation can also be used on each segmented sentence: the machine simulates human understanding of the sentence, using syntactic information and semantic information to resolve ambiguity when segmenting words.
  • Statistical word segmentation can also be used to process each segmented sentence: from the current user's historical search records or the public users' historical search records, phrase statistics are collected; if certain two adjacent characters appear together frequently, the two adjacent characters can be treated as a phrase for word segmentation.
  • In this way, the first mapping corpus composed of each segmented sentence and its corresponding word segments can be obtained. Because the information text is acquired from multiple data sources, corpus resources that are rich in type, wide in scope, and large in number can be obtained.
  • the training identification module 03 is configured to train a preset first language model according to the obtained first mapping corpus, and perform speech recognition based on the trained first language model.
  • According to each obtained first mapping corpus, a first language model of a preset type is trained; the first language model may be, for example, a generative model, an analytical model, or a discriminative model. Because the first mapping corpora are obtained from multiple data sources, the corpus resources are rich in type, wide in scope and large in number, so the training effect of using the first mapping corpora to train the first language model is better, and the recognition accuracy of speech recognition based on the trained first language model is correspondingly higher.
  • a sentence segmentation is performed on a specific type of information text acquired from a predetermined data source, and word segmentation processing is performed on each segmented sentence to obtain a first mapping corpus of each segmented sentence and a corresponding segmentation word.
  • a first language model of a preset type is trained according to the first mapping corpus, and speech recognition is performed based on the first language model of the training.
  • Because corpus resources can be obtained by sentence segmentation and corresponding word segmentation of information text acquired from a plurality of predetermined data sources, and the language model is trained on these corpus resources, there is no need to obtain labeled dialogue text; moreover, a sufficient quantity of corpus resources can be obtained to guarantee the training effect and recognition accuracy of the language model, thereby effectively improving the accuracy of speech recognition and effectively reducing its cost.
  • the word segmentation module 02 is further configured to:
  • the step of cleaning and denoising includes: deleting user names, ids, and the like from the microblog content, retaining only the actual content of each microblog; and deleting forwarded microblog content, since the acquired content generally includes reposted microblogs, and repeatedly forwarded content would distort word frequencies, so forwarded content must be filtered out.
  • the filtering method is to delete all content that contains "forwarding" or "http".
  • the step further includes filtering out special symbols, i.e. removing all symbols of preset types from the microblog content, and converting traditional to simplified characters: because microblog content contains a large number of traditional characters, a predetermined traditional-to-simplified correspondence table is used to convert all traditional characters into simplified ones; and so on.
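As an illustrative sketch only (not part of the disclosure): the repost markers, the symbol set, and the tiny traditional-to-simplified table below are assumptions — a production system would use a full correspondence table such as the OpenCC project's mappings.

```python
import re

# Minimal traditional-to-simplified table for illustration only.
T2S = {"語": "语", "音": "音", "識": "识", "別": "别"}

def clean_weibo(posts):
    cleaned = []
    for post in posts:
        # Drop reposts: content containing "转发" ("forwarding") or "http".
        if "转发" in post or "http" in post:
            continue
        # Strip user names/ids such as "@name " prefixes.
        post = re.sub(r"@\S+\s*", "", post)
        # Filter out a preset set of special symbols (assumed set).
        post = re.sub(r"[#\[\]【】~^*]", "", post)
        # Convert traditional characters via the correspondence table.
        post = "".join(T2S.get(ch, ch) for ch in post)
        cleaned.append(post)
    return cleaned
```

Only posts that survive the repost filter are normalized; everything else is discarded before word frequencies are counted.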
  • sentence segmentation is performed on each information text after cleaning and denoising, for example by taking the text between two break characters of preset types (for example, comma, period, exclamation point, etc.) as a sentence to be segmented, and word segmentation is performed on each segmented sentence to obtain a mapping corpus of each segmented sentence and its corresponding segmentation (including phrases and single words).
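The sentence-splitting rule above (text between two break characters of preset types is one sentence to be segmented) can be sketched as follows; the exact break-character set is an assumption:

```python
import re

# Preset break characters: comma, period, exclamation point, question mark,
# in both full-width and half-width forms (assumed set).
BREAKS = r"[，。！？,.!?]"

def split_sentences(text):
    # Text between two break characters is one sentence to be segmented.
    return [s for s in re.split(BREAKS, text) if s.strip()]
```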
  • the training identification module 03 is further configured to:
  • a first language model of a preset type is trained according to each of the obtained first mapping corpora.
  • a second language model of a preset type is trained according to each of the predetermined sample sentences and the second mapping corpus of the corresponding word segmentation.
  • a number of sample sentences can be predetermined, for example by finding the most frequently occurring or most commonly used sentences in a predetermined data source, and the correct word segmentation (including phrases and single words) is determined for each sample sentence to form a second mapping corpus of each sample sentence and its corresponding segmentation.
  • a second language model of a preset type is trained according to each predetermined sample sentence and the second mapping corpus of the corresponding word segmentation.
  • the trained first language model and the second language model are mixed according to a predetermined model mixing formula to obtain a mixed language model, and speech recognition is performed based on the obtained mixed language model.
  • the predetermined model mixing formula can be: M = a × M1 + b × M2, where
  • M represents the mixed language model,
  • M1 represents the first language model of the preset type,
  • a represents the preset weighting coefficient of the model M1,
  • M2 represents the second language model of the preset type,
  • b represents the preset weighting coefficient of the model M2.
  • the second language model is trained according to each predetermined sample sentence and the second mapping corpus of the corresponding segmentation; for example, since the predetermined sample sentences may be a preset number of the most commonly used, correctly segmented sentences, the trained second language model can correctly recognize commonly used speech.
  • the trained first language model and the second language model are mixed according to preset weight ratios to obtain a mixed language model, and speech recognition is performed based on the mixed language model; this ensures both that the types of recognizable speech are rich and wide in range and that commonly used speech is recognized correctly, further improving the accuracy of speech recognition.
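As a hedged sketch of the weighted mixing described above — `m1_prob` and `m2_prob` are hypothetical probability functions standing in for the two trained language models:

```python
def mix_models(m1_prob, m2_prob, a, b):
    """Linearly interpolate two language models' word probabilities with
    preset weights a and b (a + b typically equals 1)."""
    def mixed_prob(word, context):
        return a * m1_prob(word, context) + b * m2_prob(word, context)
    return mixed_prob
```

A recognizer then queries the returned `mixed_prob` exactly as it would query a single model.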
  • the training process of the first language model or the second language model of the preset type is as follows:
  • A. divide the first mapping corpora or the second mapping corpora into a training set of a first ratio (for example, 70%) and a verification set of a second ratio (for example, 30%);
  • B. train the first language model or the second language model using the mapping corpora in the training set;
  • C. verify the accuracy of the trained model using the verification set; if the accuracy is greater than or equal to a preset accuracy rate, the training ends, or if the accuracy rate is less than the preset accuracy rate, the number of first mapping corpora or second mapping corpora is increased and steps A, B, and C are re-executed until the accuracy of the trained first language model or second language model is greater than or equal to the preset accuracy rate.
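A minimal sketch of the A/B/C loop above, assuming caller-supplied `train_model`, `evaluate`, and `grow` functions (all hypothetical names, not part of the disclosure):

```python
import random

def train_until_accurate(corpora, train_model, evaluate, target_acc,
                         train_ratio=0.7, grow=None):
    """Split the mapping corpora (step A), train (step B), verify (step C),
    and grow the corpus until the preset accuracy rate is reached."""
    while True:
        random.shuffle(corpora)
        cut = int(len(corpora) * train_ratio)           # step A: 70/30 split
        train_set, verify_set = corpora[:cut], corpora[cut:]
        model = train_model(train_set)                   # step B: train
        if evaluate(model, verify_set) >= target_acc:    # step C: verify
            return model
        if grow is None:
            raise RuntimeError("accuracy below target and no way to grow the corpus")
        corpora = grow(corpora)                          # add more mapping corpora
```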
  • the preset type of the first language model and/or the second language model is an n-gram language model.
  • the n-gram language model is a language model commonly used in large-vocabulary continuous speech recognition; for Chinese it is called the Chinese Language Model (CLM).
  • the Chinese language model uses the collocation information between adjacent words in the context: when a pinyin string, stroke string, or digit string entered without spaces needs to be converted into a Chinese character string (i.e. a sentence), the sentence with the maximum probability can be calculated, thereby achieving automatic conversion to Chinese characters and avoiding the ambiguity in which many Chinese characters correspond to the same pinyin (or stroke string, or digit string).
  • an n-gram is a statistical language model used to predict the nth item from the preceding (n−1) items.
  • the items can be phonemes (in speech recognition applications), characters (in input-method applications), words (in word segmentation applications), or base pairs (in gene sequence analysis), and n-gram models can be estimated from large-scale text or audio corpora.
  • the n-gram language model is based on the assumption that the occurrence of the nth word is related only to the preceding n−1 words, and not to any other words.
  • the present embodiment adopts the maximum likelihood estimation method, namely: p(w_n | w_1 … w_(n−1)) = count(w_1 … w_n) / count(w_1 … w_(n−1)).
  • with this, the probability of occurrence of the nth word can be calculated, the probability of each candidate word determined, and speech recognition performed.
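As a hedged illustration of the maximum likelihood estimate for the n = 2 (bigram) case, with counts taken from a toy corpus rather than the corpora of the disclosure:

```python
from collections import Counter

def bigram_mle(sentences):
    """Maximum likelihood estimation for a bigram model:
    p(w_n | w_{n-1}) = count(w_{n-1} w_n) / count(w_{n-1})."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    def prob(prev, word):
        if unigrams[prev] == 0:
            return 0.0
        return bigrams[(prev, word)] / unigrams[prev]
    return prob

# Toy corpus of already-segmented sentences.
p = bigram_mle([["语音", "识别"], ["语音", "识别"], ["语音", "搜索"]])
```

Here "识别" follows "语音" in two of the three sentences, so its estimated conditional probability is 2/3.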
  • the word segmentation module 02 is further configured to:
  • forward maximum matching: the character string to be processed in each segmented sentence is matched from left to right against a predetermined word dictionary library (for example, a general word dictionary library, or a scalable learning word dictionary library) to obtain a first matching result;
  • reverse maximum matching: the character string to be processed in each segmented sentence is matched from right to left against the predetermined word dictionary library to obtain a second matching result.
  • the first matching result includes a first number of first phrases
  • the second matching result includes a second number of second phrases
  • the first matching result includes a third number of words
  • the second matching result includes a fourth number of words.
  • if the first quantity is equal to the second quantity and the third quantity is less than or equal to the fourth quantity, the first matching result (including phrases and single words) corresponding to the segmented sentence is output;
  • if the third quantity is greater than the fourth quantity, the second matching result (including phrases and single words) corresponding to the segmented sentence is output;
  • if the first quantity is not equal to the second quantity and the first quantity is greater than the second quantity, the second matching result (including phrases and single words) corresponding to the segmented sentence is output;
  • if the first quantity is not equal to the second quantity and the first quantity is less than the second quantity, the first matching result (including phrases and single words) corresponding to the segmented sentence is output.
  • the bidirectional matching method is adopted to perform word segmentation on each segmented sentence: forward and reverse segmentation are performed simultaneously and their matching results are compared, in order to analyze how tightly the content of the string to be processed combines. Since a phrase is usually more likely than single words to carry the core viewpoint information of a sentence, the matching result with fewer single words and more phrases is taken as the word segmentation result of the segmented sentence, which improves the accuracy of word segmentation and thereby ensures the training effect and recognition accuracy of the language model.
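A minimal sketch of the bidirectional matching just described, assuming a greedy maximum-match scan in each direction and a simplified selection rule (keep the result with fewer single characters); the dictionary and maximum word length are assumptions for illustration:

```python
def max_match(text, dictionary, max_len=4, reverse=False):
    """Greedy maximum matching of `text` against a word dictionary,
    scanning left-to-right (forward) or right-to-left (reverse)."""
    result = []
    while text:
        for size in range(min(max_len, len(text)), 0, -1):
            piece = text[-size:] if reverse else text[:size]
            if size == 1 or piece in dictionary:
                result.append(piece)
                text = text[:-size] if reverse else text[size:]
                break
    return result[::-1] if reverse else result

def bidirectional_match(text, dictionary):
    """Keep the direction whose result has fewer single characters
    (i.e. more multi-character phrases); ties keep the forward result."""
    fwd = max_match(text, dictionary)
    rev = max_match(text, dictionary, reverse=True)
    singles = lambda r: sum(1 for w in r if len(w) == 1)
    return fwd if singles(fwd) <= singles(rev) else rev
```

On the classic example "研究生命起源", forward matching greedily takes "研究生" and strands "命", while reverse matching recovers the intended phrases.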
  • the present invention also provides a computer readable storage medium storing a speech recognition system, the speech recognition system being executable by at least one processing device to cause the at least one processing device to perform the steps of the speech recognition method in the above embodiments; the specific implementation processes of steps S10, S20, and S30 of the speech recognition method are as described above and are not repeated here.
  • the methods of the foregoing embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and can also be implemented by hardware, but in many cases the former is the better implementation.
  • the technical solution of the present invention, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and including a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the various embodiments of the present invention.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

A voice recognition method and system, an electronic apparatus and a medium. The method comprises: obtaining information texts of specific types from previously determined data sources (S10); performing sentence segmentation on the obtained information texts to obtain several sentences, performing word segmentation processing on the sentences to obtain corresponding words, and forming first mapping corpora from the sentences and the corresponding words (S20); according to the obtained first mapping corpora, training a first language model of a preset type, and performing voice recognition on the basis of the trained first language model (S30). The present solution effectively increases voice recognition accuracy, and effectively reduces voice recognition costs.

Description

Speech recognition method, system, electronic device and medium
Priority claim

This application claims priority under the Paris Convention to the Chinese patent application No. CN2017103273748, filed on May 10, 2017 and entitled "语音识别方法及系统" (Speech recognition method and system), the entire content of which is incorporated herein by reference.
Technical field

The present invention relates to the field of computer technologies, and in particular to a speech recognition method, system, electronic device, and medium.
Background

The language model plays an important role in speech recognition tasks. In existing speech recognition, a language model is generally built from annotated dialogue text, and the probability of each word is determined by that language model. However, this prior-art approach suffers because the scenarios in which users currently need speech recognition in daily life are too few (the more common scenarios being voice search, voice control, and the like), and the types and range of corpus that can be collected are too concentrated. The approach therefore has two shortcomings: such text is expensive to purchase, and it is difficult to obtain a sufficient quantity of corpus — annotated dialogue text is hard to acquire, and the timeliness and accuracy of upgrades and expansion are hard to guarantee — which in turn affects the training effect and recognition accuracy of the language model, and thus the accuracy of speech recognition.

Therefore, how to use existing corpus resources to effectively improve the accuracy of speech recognition while effectively reducing its cost has become a technical problem to be solved urgently.
Summary of the invention

The main object of the present invention is to provide a speech recognition method, system, electronic device, and medium, aiming to effectively improve the accuracy of speech recognition and effectively reduce its cost.

To achieve the above objective, a first aspect of the present application provides a speech recognition method, the method including the following steps:

A. acquiring a specific type of information text from predetermined data sources;

B. performing sentence segmentation on each acquired information text to obtain a number of sentences, and performing word segmentation on each sentence to obtain the corresponding segmentation words, the sentences and their corresponding segmentation words forming first mapping corpora;

C. training a first language model of a preset type according to the obtained first mapping corpora, and performing speech recognition based on the trained first language model.

A second aspect of the present application provides a speech recognition system, including:

an acquisition module, configured to acquire a specific type of information text from predetermined data sources;

a word segmentation module, configured to perform sentence segmentation on each acquired information text to obtain a number of sentences, and to perform word segmentation on each sentence to obtain the corresponding segmentation words, the sentences and their corresponding segmentation words forming first mapping corpora;

a training and recognition module, configured to train a first language model of a preset type according to the obtained first mapping corpora, and to perform speech recognition based on the trained first language model.
A third aspect of the present application provides an electronic device, including a processing device, a storage device, and a speech recognition system stored in the storage device, the speech recognition system including at least one computer readable instruction executable by the processing device to implement the following operations:

A. acquiring a specific type of information text from predetermined data sources;

B. performing sentence segmentation on each acquired information text to obtain a number of sentences, and performing word segmentation on each sentence to obtain the corresponding segmentation words, the sentences and their corresponding segmentation words forming first mapping corpora;

C. training a first language model of a preset type according to the obtained first mapping corpora, and performing speech recognition based on the trained first language model.

A fourth aspect of the present application provides a computer readable storage medium on which is stored at least one computer readable instruction executable by a processing device to implement the following operations:

A. acquiring a specific type of information text from predetermined data sources;

B. performing sentence segmentation on each acquired information text to obtain a number of sentences, and performing word segmentation on each sentence to obtain the corresponding segmentation words, the sentences and their corresponding segmentation words forming first mapping corpora;

C. training a first language model of a preset type according to the obtained first mapping corpora, and performing speech recognition based on the trained first language model.

In the speech recognition method, system, electronic device, and medium proposed by the present invention, sentence segmentation is performed on the specific types of information text acquired from predetermined data sources, and word segmentation is performed on each segmented sentence to obtain first mapping corpora of the segmented sentences and their corresponding segmentation words; a first language model of a preset type is trained on these first mapping corpora, and speech recognition is performed based on the trained first language model. Since corpus resources can be obtained by sentence segmentation and corresponding word segmentation of information text acquired from multiple predetermined data sources, and the language model is trained on those resources, there is no need to obtain annotated dialogue text, a sufficient quantity of corpus resources can be acquired, and the training effect and recognition accuracy of the language model can be guaranteed, thereby effectively improving the accuracy of speech recognition and effectively reducing its cost.
Brief description of the drawings

FIG. 1 is a schematic diagram of the application environment of a preferred embodiment of the speech recognition method of the present invention;

FIG. 2 is a schematic flowchart of a first embodiment of the speech recognition method of the present invention;

FIG. 3 is a schematic flowchart of a second embodiment of the speech recognition method of the present invention;

FIG. 4 is a schematic diagram of the functional modules of an embodiment of the speech recognition system of the present invention.

The implementation, functional features, and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed description

In order to make the technical problems to be solved, the technical solutions, and the beneficial effects of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it.

Referring to FIG. 1, it is a schematic diagram of the application environment of a preferred embodiment of the speech recognition method of the present invention. The application environment includes an electronic device 1 and a terminal device 2. The electronic device 1 can exchange data with the terminal device 2 through a suitable technology such as a network or near field communication.

The terminal device 2 includes, but is not limited to, any electronic product that can interact with a user through a keyboard, mouse, remote controller, touch pad, voice control device, or the like, for example a personal computer, tablet computer, smartphone, personal digital assistant (PDA), game console, Internet Protocol Television (IPTV), or smart wearable device.

The electronic device 1 is an apparatus capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions. The electronic device 1 may be a computer, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing, where cloud computing is a type of distributed computing: a super virtual computer consisting of a group of loosely coupled computers.

In this embodiment, the electronic device 1 includes, but is not limited to, a storage device 11, a processing device 12, and a network interface 13 that are communicably connected to one another through a system bus. It should be noted that FIG. 1 only shows the electronic device 1 with components 11-13, but it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.

The storage device 11 includes a memory and at least one type of readable storage medium. The memory provides a cache for the operation of the electronic device 1; the readable storage medium may be a non-volatile storage medium such as a flash memory, hard disk, multimedia card, or card-type memory. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1, for example its hard disk; in other embodiments, the non-volatile storage medium may also be external to the electronic device 1, for example a plug-in hard disk, smart media card (SMC), secure digital (SD) card, or flash card equipped on the electronic device 1. In this embodiment, the readable storage medium of the storage device 11 is generally used to store the operating system installed on the electronic device 1 and various application software, such as the program code of the speech recognition system 10 in an embodiment of the present application. The storage device 11 can also be used to temporarily store various types of data that have been output or are to be output.

The processing device 12 may, in some embodiments, include one or more microprocessors, microcontrollers, digital processors, and the like. The processing device 12 is generally used to control the operation of the electronic device 1, for example to perform control and processing related to data exchange or communication with the terminal device 2. In this embodiment, the processing device 12 is configured to run the program code stored in the storage device 11 or to process data, for example to run the speech recognition system 10.

The network interface 13 may include a wireless network interface or a wired network interface, and is typically used to establish communication connections between the electronic device 1 and other electronic devices. In this embodiment, the network interface 13 is mainly used to connect the electronic device 1 with one or more terminal devices 2, and to establish data transmission channels and communication connections between the electronic device 1 and the one or more terminal devices 2.
The speech recognition system 10 includes at least one computer readable instruction stored in the storage device 11, which can be executed by the processing device 12 to implement the speech recognition method of the embodiments of the present application. As described later, the at least one computer readable instruction can be divided into different logic modules according to the functions implemented by its various parts.

In an embodiment, when the speech recognition system 10 is executed by the processing device 12, the following operations are performed: first, a specific type of information text is acquired from predetermined data sources; sentence segmentation is performed on each acquired information text to obtain a number of sentences, word segmentation is performed on each sentence to obtain the corresponding segmentation words, and the sentences with their corresponding segmentation words form first mapping corpora; then a first language model of a preset type is trained according to the obtained first mapping corpora, and after speech to be recognized is received from the terminal device 2, the speech is input into the trained first language model for recognition, and the recognition result is fed back to the terminal device 2 for display to the end user.
The present invention provides a speech recognition method.

Referring to FIG. 2, FIG. 2 is a schematic flowchart of a first embodiment of the speech recognition method of the present invention.

In the first embodiment, the speech recognition method includes:

Step S10: acquiring a specific type of information text from predetermined data sources.

In this embodiment, before the language model is trained, specific types of information text (for example, encyclopedia entries and their explanations, news headlines, news summaries, microblog content, and the like) are acquired, in real time or at scheduled times, from a plurality of predetermined data sources (for example, websites such as Sina Weibo, Baidu Baike, Wikipedia, and Sina News). For example, specific types of information (for example, news headline information, index information, and summary information) can be acquired from predetermined data sources (for example, major news websites and forums) in real time or at scheduled times through tools such as web crawlers.

Step S20: performing sentence segmentation on each acquired information text to obtain a number of sentences, and performing word segmentation on each sentence to obtain the corresponding segmentation words, the sentences and their corresponding segmentation words forming first mapping corpora.
After the specific types of information text are acquired from the plurality of predetermined data sources, sentence segmentation can be performed on each acquired text, for example by splitting each text into complete sentences according to punctuation. Word segmentation is then performed on each segmented sentence. For example, string-matching segmentation methods can be used, such as the forward maximum matching method, which segments the character string of a sentence from left to right; the reverse maximum matching method, which segments the string from right to left; the shortest-path segmentation method, which requires the number of words cut out of the string to be minimal; or the bidirectional maximum matching method, which performs forward and reverse matching simultaneously. Word-sense segmentation can also be used: a machine-judgment segmentation method that uses syntactic and semantic information to resolve ambiguity. Statistical segmentation can also be used: from the current user's historical search records or the historical search records of users at large, adjacent characters that are found to co-occur frequently can be treated as a phrase for segmentation.
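As a hedged sketch of the statistical segmentation idea just described — the frequency threshold and the restriction to adjacent character pairs are assumptions for illustration, not part of the disclosure:

```python
from collections import Counter

def statistical_bigrams(search_records, threshold=2):
    """Count adjacent-character pairs across historical search records and
    treat pairs occurring at least `threshold` times as candidate phrases."""
    pairs = Counter()
    for record in search_records:
        pairs.update(zip(record, record[1:]))
    return {a + b for (a, b), n in pairs.items() if n >= threshold}

# Toy historical search records.
phrases = statistical_bigrams(["语音识别", "语音搜索", "识别语音"])
```

Pairs such as "语音" and "识别" recur across records and are promoted to phrases, while one-off pairs are ignored.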
After word segmentation of each segmented sentence is completed, the first mapping corpora composed of the segmented sentences and their corresponding segmentation words are obtained. By acquiring information text from a plurality of predetermined data sources and splitting it into a large number of sentences for word segmentation, corpus resources that are rich in type, wide in range, and large in quantity can be obtained from multiple data sources.
Step S30: training a first language model of a preset type according to the obtained first mapping corpora, and performing speech recognition based on the trained first language model.
基于所述第一映射语料,训练预设类型的第一语言模型,该第一语言模型可以是生成性模型、分析性模型、辨识性模型等。由于第一映射语料是从多个数据源中获取到的,其语料资源的语料类型丰富、范围较广且数量较多,因此,利用该第一映射语料来训练第一语言模型的训练效果较好,进而使得基于训练的所述第一语言模型进行语音识别的识别精度较高。Based on the first mapping corpus, a first language model of a preset type is trained, and the first language model may be a generative model, an analytical model, an identifying model, or the like. Since the first mapping corpus is obtained from multiple data sources, the corpus of the corpus resources is rich in scope, wide in scope and large in number. Therefore, the training effect of using the first mapping corpus to train the first language model is better. Preferably, the recognition accuracy of the speech recognition based on the first language model of the training is higher.
本实施例通过对从预先确定的数据源获取的特定类型的信息文本进行语句切分,并对各个切分的语句进行分词处理,得到各个切分的语句与对应的分词的第一映射语料,根据该第一映射语料训练预设类型的第一语言模型,并基于训练的所述第一语言模型进行语音识别。由于可通过对从预先确定的多个数据源中获取的信息文本进行语句切分及相应的分词处理来得到语料资源,并基于该语料资源训练语言模型,无需获取标注过的对话文本,且能获取到足够数量的语料资源,能保证语言模型的训练效果和识别精度,从而有效提高语音识别的精度且有效降低语音识别的成本。In this embodiment, a sentence segmentation is performed on a specific type of information text acquired from a predetermined data source, and word segmentation processing is performed on each segmented sentence to obtain a first mapping corpus of each segmented sentence and a corresponding segmentation word. A first language model of a preset type is trained according to the first mapping corpus, and speech recognition is performed based on the first language model of the training. Since the corpus resource can be obtained by performing segmentation and corresponding word segmentation on the information text obtained from a plurality of predetermined data sources, and training the language model based on the corpus resource, it is not necessary to obtain the labeled dialogue text, and Obtaining a sufficient number of corpus resources can ensure the training effect and recognition accuracy of the language model, thereby effectively improving the accuracy of speech recognition and effectively reducing the cost of speech recognition.
Further, in other embodiments, the above step S20 may include:
Cleaning and denoising each obtained information text. For example, for microblog content, the cleaning and denoising steps include: deleting the user name, id, and similar fields, retaining only the actual content of the microblog; deleting reposted content (the obtained microblog content generally includes a large number of reposts, and repeated reposts distort word frequencies), which is done by deleting all content containing "转发" ("repost") or "http"; filtering out special symbols, removing all symbols of preset types from the microblog content; and converting traditional Chinese characters to simplified ones using a predetermined traditional-to-simplified correspondence table, since microblog content contains many traditional characters; and so on.
Performing sentence segmentation on each cleaned and denoised information text, for example by taking the text between two sentence delimiters of preset types (e.g., comma, period, exclamation mark) as one sentence to be segmented, and performing word segmentation on each segmented sentence to obtain the mapping corpus of each segmented sentence and its corresponding segmented words (including word groups and single characters).
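A small sketch of the sentence-segmentation step above: split the cleaned text at preset delimiters and drop empty fragments. The exact delimiter set below is an illustrative assumption:

```python
import re

# preset sentence delimiters (Chinese and ASCII); an illustrative assumption
DELIMITERS = r"[，。！？；,.!?;]"

def split_sentences(text):
    """Split text at the preset delimiters and keep non-empty fragments."""
    return [p.strip() for p in re.split(DELIMITERS, text) if p.strip()]

print(split_sentences("今天天气很好。我们去爬山！好不好？"))
# → ['今天天气很好', '我们去爬山', '好不好']
```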
As shown in FIG. 3, a second embodiment of the present invention provides a speech recognition method. On the basis of the above embodiment, the above step S30 is replaced with:
Step S40: train a first language model of a preset type according to the obtained first mapping corpora.
Step S50: train a second language model of a preset type according to second mapping corpora of predetermined sample sentences and their corresponding segmented words. For example, a number of sample sentences may be predetermined, for instance by finding the most frequent or most commonly used sentences in a predetermined data source, and the correct segmentation (including word groups and single characters) of each sample sentence determined, so that the second language model of the preset type is trained on the second mapping corpora of the predetermined sample sentences and their corresponding segmented words.
Step S60: mix the trained first language model and second language model according to a predetermined model mixing formula to obtain a mixed language model, and perform speech recognition based on the obtained mixed language model. The predetermined model mixing formula may be:
M = a*M1 + b*M2
where M is the mixed language model, M1 is the first language model of the preset type, a is the preset weight coefficient of model M1, M2 is the second language model of the preset type, and b is the preset weight coefficient of model M2.
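As a hedged illustration of the mixing formula: in practice the formula is applied to the probabilities the two component models assign to each candidate word, not to the model objects themselves. The weights a = 0.7 and b = 0.3 and the probabilities below are illustrative assumptions, not values from the disclosure:

```python
def mixed_prob(p1, p2, a=0.7, b=0.3):
    """Linearly interpolate the probabilities assigned by models M1 and M2.
    a and b are the preset weight coefficients; a + b should equal 1 so the
    mixture remains a valid probability distribution."""
    return a * p1 + b * p2

# e.g. M1 assigns 0.2 to a candidate word in context and M2 assigns 0.6
print(round(mixed_prob(0.2, 0.6), 2))  # → 0.32
```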
In this embodiment, in addition to the first language model trained on the first mapping corpora obtained from multiple data sources, a second language model is trained on the second mapping corpora of predetermined sample sentences and their corresponding segmented words. Since the predetermined sample sentences may be a preset set of the most commonly used, verified-correct sentences, the trained second language model can correctly recognize common speech. The trained first and second language models are mixed in preset weight proportions to obtain the mixed language model, and speech recognition is performed based on it; this both guarantees that the recognizable speech is rich in type and wide in scope and ensures that common speech is recognized correctly, further improving the accuracy of speech recognition.
Further, in other embodiments, the training process of the first language model or second language model of the preset type is as follows:
A. divide the first mapping corpora (or the second mapping corpora) into a training set of a first proportion (e.g., 70%) and a validation set of a second proportion (e.g., 30%);
B. train the first language model (or the second language model) on the training set;
C. validate the accuracy of the trained first language model (or second language model) on the validation set; if the accuracy is greater than or equal to a preset accuracy, training ends; otherwise, increase the quantity of first mapping corpora (or second mapping corpora) and repeat steps A, B, and C until the accuracy of the trained first language model (or second language model) is greater than or equal to the preset accuracy.
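The A/B/C training loop above can be sketched as follows. `train_model` and `evaluate` stand in for the actual training and validation routines, and the accuracy threshold is an illustrative assumption:

```python
import random

def train_until_accurate(corpora, fetch_more, train_model, evaluate,
                         preset_accuracy=0.9, train_ratio=0.7):
    while True:
        random.shuffle(corpora)
        cut = int(len(corpora) * train_ratio)   # step A: 70%/30% split
        train_set, valid_set = corpora[:cut], corpora[cut:]
        model = train_model(train_set)          # step B: train
        if evaluate(model, valid_set) >= preset_accuracy:
            return model                        # step C: accuracy reached
        corpora = corpora + fetch_more()        # otherwise enlarge the corpus
```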
Further, in other embodiments, the first language model and/or the second language model of the preset type is an n-gram language model. The n-gram language model is a language model commonly used in large-vocabulary continuous speech recognition; for Chinese it is called the Chinese Language Model (CLM). The Chinese language model uses collocation information between adjacent words in context: when continuous, unspaced pinyin or strokes (or digits representing letters or strokes) need to be converted into a Chinese character string (i.e., a sentence), the sentence with the maximum probability can be computed, achieving automatic conversion to Chinese characters while avoiding the ambiguity of many characters sharing the same pinyin (or stroke string, or digit string). The n-gram is a statistical language model that predicts the n-th item from the preceding (n-1) items. At the application level, the items can be phonemes (speech recognition), characters (input methods), words (word segmentation), or base pairs (genetic information), and n-gram models can be generated from large-scale text or audio corpora.
The n-gram language model is based on the assumption that the occurrence of the n-th word depends only on the preceding n-1 words and on no other word, so that the probability of a whole sentence is the product of the conditional probabilities of its words; these probabilities can be obtained by directly counting how often n words occur together in the mapping corpora. For a sentence T composed of the word sequence W1, W2, ..., Wn, the probability of T is P(T) = P(W1W2...Wn) = P(W1)P(W2|W1)P(W3|W1W2)...P(Wn|W1W2...Wn-1). In this embodiment, to handle n-grams whose occurrence probability is 0, the maximum likelihood estimation method is adopted in the training of the first language model and/or the second language model, namely:
P(Wn|W1W2...Wn-1) = C(W1W2...Wn) / C(W1W2...Wn-1)
That is, during language model training, the occurrence probability of the n-th word can be computed by counting the number of occurrences of the sequence W1W2...Wn and the number of occurrences of W1W2...Wn-1, so as to determine the probability of the corresponding character and perform speech recognition.
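A minimal sketch of the maximum likelihood estimate above for the bigram case (n = 2), P(Wn|Wn-1) = C(Wn-1 Wn) / C(Wn-1), with counts taken directly from a mapping corpus. The toy corpus is an illustrative assumption:

```python
from collections import Counter

# toy mapping corpus: each sentence is already segmented into words
corpus = [["我", "爱", "北京"], ["我", "爱", "你"], ["我", "在", "北京"]]

# C(Wn-1): words in history position; C(Wn-1 Wn): adjacent word pairs
history_counts = Counter(w for sent in corpus for w in sent[:-1])
pair_counts = Counter((sent[i], sent[i + 1])
                      for sent in corpus for i in range(len(sent) - 1))

def bigram_prob(prev, word):
    """Maximum likelihood estimate P(word | prev) = C(prev word) / C(prev)."""
    return pair_counts[(prev, word)] / history_counts[prev]

print(bigram_prob("我", "爱"))  # 2/3: "我 爱" occurs twice among three "我" histories
```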
Further, in other embodiments, the step of performing word segmentation on each segmented sentence in the above step S20 may include:
matching, according to the forward maximum matching method, the character string to be processed in each segmented sentence against a predetermined word dictionary (which may be, for example, a general-purpose word dictionary or an expandable learning word dictionary) to obtain a first matching result;
matching, according to the reverse maximum matching method, the character string to be processed in each segmented sentence against a predetermined word dictionary (which may be, for example, a general-purpose word dictionary or an expandable learning word dictionary) to obtain a second matching result; wherein the first matching result contains a first number of word groups and a third number of single characters, and the second matching result contains a second number of word groups and a fourth number of single characters.
if the first number is equal to the second number and the third number is less than or equal to the fourth number, outputting the first matching result (including word groups and single characters) for the segmented sentence;
if the first number is equal to the second number and the third number is greater than the fourth number, outputting the second matching result (including word groups and single characters) for the segmented sentence;
if the first number is not equal to the second number and the first number is greater than the second number, outputting the second matching result (including word groups and single characters) for the segmented sentence;
if the first number is not equal to the second number and the first number is less than the second number, outputting the first matching result (including word groups and single characters) for the segmented sentence.
In this embodiment, the bidirectional matching method is used to segment the obtained sentences: word segmentation matching is performed forward and in reverse simultaneously to analyze the cohesion of adjacent content in the character string of each segmented sentence to be processed. Since word groups usually have a higher probability of carrying the core viewpoint information, that is, word groups express the core viewpoint better than single characters, forward and reverse matching are compared to find the matching result with fewer single characters and more word groups, which is taken as the segmentation result of the sentence. This improves the accuracy of word segmentation and thus guarantees the training effect and recognition accuracy of the language model.
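The selection logic of the output rules above can be sketched as follows; the helper only implements the comparison between the two matching results, which are assumed to be given (e.g., by forward and reverse maximum matching against the dictionary):

```python
def choose_result(forward, backward):
    """Select between the forward and reverse matching results per the rules."""
    fwd_groups = sum(1 for w in forward if len(w) > 1)   # "first number"
    bwd_groups = sum(1 for w in backward if len(w) > 1)  # "second number"
    fwd_singles = len(forward) - fwd_groups              # "third number"
    bwd_singles = len(backward) - bwd_groups             # "fourth number"
    if fwd_groups == bwd_groups:
        # equal word-group counts: prefer the result with fewer single characters
        return forward if fwd_singles <= bwd_singles else backward
    # unequal word-group counts: output the other result, per the stated rules
    return forward if fwd_groups < bwd_groups else backward

print(choose_result(["北京", "大学", "生"], ["北京", "大学生"]))
# → ['北京', '大学生']
```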
Please refer to FIG. 4, which is a functional module diagram of a preferred embodiment of the speech recognition system 10 of the present invention. In this embodiment, the speech recognition system 10 may be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (in this embodiment, the processor 12) to carry out the present invention. For example, in FIG. 4, the speech recognition system 10 may be divided into an acquisition module 01, a word segmentation module 02, and a training and recognition module 03. A module in the present invention refers to a series of computer program instruction segments capable of performing a specific function, and is more suitable than a program for describing the execution of the speech recognition system 10 in the electronic device 1. The following description details the functions of the acquisition module 01, the word segmentation module 02, and the training and recognition module 03.
The acquisition module 01 is configured to obtain information texts of a specific type from predetermined data sources.
In this embodiment, before the language model is trained, information texts of specific types (for example, dictionary entries and their explanations, news headlines, news summaries, microblog content, and so on) are obtained in real time or periodically from a plurality of predetermined data sources (for example, websites such as Sina Weibo, Baidu Baike, Wikipedia, and Sina News). For example, specific types of information (e.g., news headline information, index information, summary information) may be obtained in real time or periodically from predetermined data sources (e.g., major news websites and forums) by tools such as web crawlers.
The word segmentation module 02 is configured to perform sentence segmentation on each obtained information text to obtain a number of sentences, perform word segmentation on each sentence to obtain the corresponding segmented words, and form the first mapping corpora from the sentences and their corresponding segmented words.
After information texts of the specific type are obtained from the plurality of predetermined data sources, each obtained information text may be segmented into sentences, for example by splitting it at punctuation marks into complete sentences. Word segmentation is then performed on each segmented sentence. For example, string-matching segmentation methods may be used, such as the forward maximum matching method, which segments the character string of a sentence from left to right; the reverse maximum matching method, which segments the character string from right to left; the shortest-path segmentation method, which requires the smallest number of words to be cut from the character string; or the bidirectional maximum matching method, which performs forward and reverse matching simultaneously. Semantic word segmentation may also be applied to each segmented sentence; it is a segmentation method in which the machine judges meaning, using syntactic and semantic information to resolve ambiguity. Statistical word segmentation may also be applied: based on word-group statistics from the current user's or general users' historical search records, two adjacent characters that are found to co-occur frequently may be treated as a word group during segmentation.
After word segmentation of the obtained segmented sentences is completed, the first mapping corpora composed of each segmented sentence and its corresponding segmented words are obtained. By obtaining information texts from a plurality of predetermined data sources and splitting them into a large number of sentences for word segmentation, corpus resources that are rich in type, wide in scope, and large in quantity can be collected from multiple data sources.
The training and recognition module 03 is configured to train a first language model of a preset type according to the obtained first mapping corpora, and perform speech recognition based on the trained first language model.
Based on the first mapping corpora, a first language model of a preset type is trained; the first language model may be a generative model, an analytical model, a discriminative model, or the like. Because the first mapping corpora are obtained from multiple data sources, the corpus resources are rich in type, wide in scope, and large in quantity; the first language model therefore trains well on them, which in turn yields higher recognition accuracy when speech recognition is performed based on the trained first language model.
In this embodiment, sentence segmentation is performed on information texts of a specific type obtained from predetermined data sources, and word segmentation is performed on each segmented sentence to obtain the first mapping corpora of segmented sentences and their corresponding segmented words; a first language model of a preset type is trained on these corpora, and speech recognition is performed based on the trained first language model. Because the corpus resources are obtained by sentence segmentation and word segmentation of information texts from a plurality of predetermined data sources, and the language model is trained on those resources, there is no need to obtain annotated dialogue transcripts, and a sufficient quantity of corpus resources can be collected, guaranteeing the training effect and recognition accuracy of the language model. This effectively improves the accuracy of speech recognition while effectively reducing its cost.
Further, in other embodiments, the word segmentation module 02 is further configured to:
clean and denoise each obtained information text. For example, for microblog content, the cleaning and denoising steps include: deleting the user name, id, and similar fields, retaining only the actual content of the microblog; deleting reposted content (the obtained microblog content generally includes a large number of reposts, and repeated reposts distort word frequencies), which is done by deleting all content containing "转发" ("repost") or "http"; filtering out special symbols, removing all symbols of preset types from the microblog content; and converting traditional Chinese characters to simplified ones using a predetermined traditional-to-simplified correspondence table, since microblog content contains many traditional characters; and so on;
perform sentence segmentation on each cleaned and denoised information text, for example by taking the text between two sentence delimiters of preset types (e.g., comma, period, exclamation mark) as one sentence to be segmented, and perform word segmentation on each segmented sentence to obtain the mapping corpus of each segmented sentence and its corresponding segmented words (including word groups and single characters).
Further, in other embodiments, the training and recognition module 03 is further configured to:
train a first language model of a preset type according to the obtained first mapping corpora;
train a second language model of a preset type according to second mapping corpora of predetermined sample sentences and their corresponding segmented words. For example, a number of sample sentences may be predetermined, for instance by finding the most frequent or most commonly used sentences in a predetermined data source, and the correct segmentation (including word groups and single characters) of each sample sentence determined, so that the second language model of the preset type is trained on the second mapping corpora of the predetermined sample sentences and their corresponding segmented words;
mix the trained first language model and second language model according to a predetermined model mixing formula to obtain a mixed language model, and perform speech recognition based on the obtained mixed language model. The predetermined model mixing formula may be:
M = a*M1 + b*M2
where M is the mixed language model, M1 is the first language model of the preset type, a is the preset weight coefficient of model M1, M2 is the second language model of the preset type, and b is the preset weight coefficient of model M2.
In this embodiment, in addition to the first language model trained on the first mapping corpora obtained from multiple data sources, a second language model is trained on the second mapping corpora of predetermined sample sentences and their corresponding segmented words. Since the predetermined sample sentences may be a preset set of the most commonly used, verified-correct sentences, the trained second language model can correctly recognize common speech. The trained first and second language models are mixed in preset weight proportions to obtain the mixed language model, and speech recognition is performed based on it; this both guarantees that the recognizable speech is rich in type and wide in scope and ensures that common speech is recognized correctly, further improving the accuracy of speech recognition.
Further, in other embodiments, the training process of the first language model or second language model of the preset type is as follows:
A. divide the first mapping corpora (or the second mapping corpora) into a training set of a first proportion (e.g., 70%) and a validation set of a second proportion (e.g., 30%);
B. train the first language model (or the second language model) on the training set;
C. validate the accuracy of the trained first language model (or second language model) on the validation set; if the accuracy is greater than or equal to a preset accuracy, training ends; otherwise, increase the quantity of first mapping corpora (or second mapping corpora) and repeat steps A, B, and C until the accuracy of the trained first language model (or second language model) is greater than or equal to the preset accuracy.
Further, in other embodiments, the first language model and/or the second language model of the preset type is an n-gram language model. The n-gram language model is a language model commonly used in large-vocabulary continuous speech recognition; for Chinese it is called the Chinese Language Model (CLM). The Chinese language model uses collocation information between adjacent words in context: when continuous, unspaced pinyin or strokes (or digits representing letters or strokes) need to be converted into a Chinese character string (i.e., a sentence), the sentence with the maximum probability can be computed, achieving automatic conversion to Chinese characters while avoiding the ambiguity of many characters sharing the same pinyin (or stroke string, or digit string). The n-gram is a statistical language model that predicts the n-th item from the preceding (n-1) items. At the application level, the items can be phonemes (speech recognition), characters (input methods), words (word segmentation), or base pairs (genetic information), and n-gram models can be generated from large-scale text or audio corpora.
The n-gram language model is based on the assumption that the occurrence of the n-th word depends only on the preceding n-1 words and on no other word, so that the probability of a whole sentence is the product of the conditional probabilities of its words; these probabilities can be obtained by directly counting how often n words occur together in the mapping corpora. For a sentence T composed of the word sequence W1, W2, ..., Wn, the probability of T is P(T) = P(W1W2...Wn) = P(W1)P(W2|W1)P(W3|W1W2)...P(Wn|W1W2...Wn-1). In this embodiment, to handle n-grams whose occurrence probability is 0, the maximum likelihood estimation method is adopted in the training of the first language model and/or the second language model, namely:
P(Wn|W1W2...Wn-1) = C(W1W2...Wn) / C(W1W2...Wn-1)
That is, during language model training, the occurrence probability of the n-th word can be computed by counting the number of occurrences of the sequence W1W2...Wn and the number of occurrences of W1W2...Wn-1, so as to determine the probability of the corresponding character and perform speech recognition.
进一步地,在其他实施例中,上述分词模块02还用于:Further, in other embodiments, the word segmentation module 02 is further configured to:
According to the forward maximum matching method, the character string to be processed in each segmented sentence is matched against a predetermined word dictionary (for example, a general-purpose dictionary, or an expandable learning dictionary) to obtain a first matching result;
According to the reverse maximum matching method, the character string to be processed in each segmented sentence is matched against the predetermined word dictionary (for example, a general-purpose dictionary, or an expandable learning dictionary) to obtain a second matching result. The first matching result contains a first number of first phrases and a third number of single characters; the second matching result contains a second number of second phrases and a fourth number of single characters.
If the first number is equal to the second number and the third number is less than or equal to the fourth number, the first matching result (including phrases and single characters) corresponding to the segmented sentence is output;
If the first number is equal to the second number and the third number is greater than the fourth number, the second matching result (including phrases and single characters) corresponding to the segmented sentence is output;
If the first number is not equal to the second number and the first number is greater than the second number, the second matching result (including phrases and single characters) corresponding to the segmented sentence is output;
If the first number is not equal to the second number and the first number is less than the second number, the first matching result (including phrases and single characters) corresponding to the segmented sentence is output.
In this embodiment, the bidirectional matching method is used for word segmentation of each obtained segmented sentence: segmentation matching is performed in the forward and reverse directions simultaneously to analyze the cohesion of adjacent content in the character string to be processed. Since a phrase is generally more likely than isolated characters to carry the core meaning, the matching result with fewer single characters and more phrases is selected as the segmentation result, thereby improving segmentation accuracy and, in turn, the training effect and recognition accuracy of the language model.
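A minimal sketch of this bidirectional selection rule follows. It is an illustration only, not the patent's implementation: the dictionary contents, the maximum word length of 4, and the helper names are assumptions for the example.

```python
def max_match(text, dictionary, max_len=4, reverse=False):
    """Greedy maximum matching against a word dictionary.

    reverse=False gives forward maximum matching; reverse=True scans from the
    end of the string (reverse maximum matching). Unmatched characters are
    emitted as single-character tokens.
    """
    tokens, s = [], text
    while s:
        for size in range(min(max_len, len(s)), 0, -1):
            piece = s[-size:] if reverse else s[:size]
            if size == 1 or piece in dictionary:
                if reverse:
                    tokens.insert(0, piece)
                    s = s[:-size]
                else:
                    tokens.append(piece)
                    s = s[size:]
                break
    return tokens

def bidirectional_segment(text, dictionary):
    """Choose between the two matching results using the rules stated above."""
    first = max_match(text, dictionary)                 # first matching result
    second = max_match(text, dictionary, reverse=True)  # second matching result
    n1 = sum(1 for t in first if len(t) > 1)    # first number: phrases in result 1
    n2 = sum(1 for t in second if len(t) > 1)   # second number: phrases in result 2
    n3 = sum(1 for t in first if len(t) == 1)   # third number: single chars in result 1
    n4 = sum(1 for t in second if len(t) == 1)  # fourth number: single chars in result 2
    if n1 == n2:
        return first if n3 <= n4 else second
    return second if n1 > n2 else first

print(bidirectional_segment("今天天气很好", {"今天", "天气", "很好"}))
# → ['今天', '天气', '很好'] (both directions agree on this sentence)
```

When the two directions disagree, the counts n1 through n4 decide which result is kept, exactly as in the four conditions above.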
In addition, the present invention further provides a computer-readable storage medium storing a speech recognition system, the speech recognition system being executable by at least one processing device to cause the at least one processing device to perform the steps of the speech recognition method of the above embodiments; the specific implementation of steps S10, S20 and S30 of the speech recognition method is as described above and is not repeated here.
It should be noted that, as used herein, the terms "comprises", "comprising" or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or apparatus comprising a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or apparatus that comprises the element.
From the description of the above embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and certainly also by hardware, but in many cases the former is the better implementation. Based on this understanding, the part of the technical solution of the present invention that is essential or that contributes over the prior art can be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc), including a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, etc.) to perform the methods described in the various embodiments of the present invention.
The preferred embodiments of the present invention have been described above with reference to the accompanying drawings, which does not thereby limit the scope of the invention. The above serial numbers of the embodiments are for description only and do not represent the relative merits of the embodiments. In addition, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from that herein.
本领域技术人员不脱离本发明的范围和实质,可以有多种变型方案实现本发明,比如作为一个实施例的特征可用于另一实施例而得到又一实施例。凡在运用本发明的技术构思之内所作的任何修改、等同替换和改进,均应在本发明的权利范围之内。 A person skilled in the art can implement the invention in various variants without departing from the scope and spirit of the invention. For example, the features of one embodiment can be used in another embodiment to obtain a further embodiment. Any modifications, equivalent substitutions and improvements made within the technical concept of the invention are intended to be included within the scope of the invention.

Claims (20)

  1. 一种语音识别方法,其特征在于,所述方法包括以下步骤:A speech recognition method, characterized in that the method comprises the following steps:
    A、从预先确定的数据源获取特定类型的信息文本;A. Obtaining a specific type of information text from a predetermined data source;
    B、对获取的各个信息文本进行语句切分得到若干语句,对各个语句进行分词处理得到对应的分词,由各个语句与对应的分词构成第一映射语料;B. Performing segmentation of the obtained information texts to obtain a plurality of sentences, performing word segmentation processing on each sentence to obtain corresponding word segments, and each sentence and corresponding word segmentation constitute a first mapping corpus;
    C、根据得到的各个第一映射语料,训练预设类型的第一语言模型,并基于训练的所述第一语言模型进行语音识别。C. Train a preset first language model according to the obtained first mapping corpus, and perform speech recognition based on the trained first language model.
  2. 如权利要求1所述的语音识别方法,其特征在于,所述步骤C替换为:The speech recognition method according to claim 1, wherein said step C is replaced by:
    根据得到的各个第一映射语料,训练预设类型的第一语言模型;Training a first language model of a preset type according to each of the obtained first mapping corpora;
    根据各个预先确定的样本语句与对应的分词的第二映射语料,训练预设类型的第二语言模型;Training a second language model of a preset type according to each predetermined sample sentence and a second mapping corpus of the corresponding word segment;
    根据预先确定的模型混合公式,将训练的所述第一语言模型及第二语言模型进行混合,以获得混合语言模型,并基于获得的所述混合语言模型进行语音识别。The trained first language model and the second language model are mixed according to a predetermined model mixing formula to obtain a mixed language model, and speech recognition is performed based on the obtained mixed language model.
  3. 如权利要求2所述的语音识别方法,其特征在于,所述预先确定的模型混合公式为:The speech recognition method according to claim 2, wherein said predetermined model mixing formula is:
M = a*M1 + b*M2
where M is the mixed language model, M1 denotes the first language model of the preset type, a denotes the preset weight coefficient of model M1, M2 denotes the second language model of the preset type, and b denotes the preset weight coefficient of model M2.
4. The speech recognition method according to claim 2 or 3, wherein the first language model and/or the second language model of the preset type is an n-gram language model, and the training process of the first language model or the second language model of the preset type is as follows:
    S1、将各个第一映射语料或者各个第二映射语料分为第一比例的训练集和第二比例的验证集;S1, dividing each first mapping corpus or each second mapping corpus into a training set of a first ratio and a verification set of a second ratio;
    S2、利用所述训练集训练所述第一语言模型或者第二语言模型;S2, training the first language model or the second language model by using the training set;
S3. Verify the accuracy of the trained first language model or second language model using the verification set; if the accuracy is greater than or equal to the preset accuracy, the training ends; otherwise, if the accuracy is less than the preset accuracy, increase the number of first mapping corpora or second mapping corpora and re-execute steps S1, S2 and S3.
  5. 如权利要求1所述的语音识别方法,其特征在于,所述对各个切分的语句进行分词处理的步骤包括:The speech recognition method according to claim 1, wherein the step of performing word segmentation processing on each of the segmented sentences comprises:
When a segmented sentence is selected for word segmentation, the segmented sentence is matched against a predetermined word dictionary according to the forward maximum matching method to obtain a first matching result, the first matching result containing a first number of first phrases and a third number of single characters;
The segmented sentence is matched against the predetermined word dictionary according to the reverse maximum matching method to obtain a second matching result, the second matching result containing a second number of second phrases and a fourth number of single characters;
    若所述第一数量与所述第二数量相等,且所述第三数量小于或者等于所述第四数量,则将所述第一匹配结果作为该切分的语句的分词结果;If the first quantity is equal to the second quantity, and the third quantity is less than or equal to the fourth quantity, the first matching result is used as a word segmentation result of the segmented statement;
    若所述第一数量与所述第二数量相等,且所述第三数量大于所述第四数量,则将所述第二匹配结果作为该切分的语句的分词结果;If the first quantity is equal to the second quantity, and the third quantity is greater than the fourth quantity, the second matching result is used as a word segmentation result of the segmented statement;
    若所述第一数量与所述第二数量不相等,且所述第一数量大于所述第二数量,则将所述第二匹配结果作为该切分的语句的分词结果;If the first quantity is not equal to the second quantity, and the first quantity is greater than the second quantity, the second matching result is used as a word segmentation result of the segmented statement;
    若所述第一数量与所述第二数量不相等,且所述第一数量小于所述第二数量,则将所述第一匹配结果作为该切分的语句的分词结果。If the first quantity is not equal to the second quantity, and the first quantity is less than the second quantity, the first matching result is used as a word segmentation result of the segmented statement.
  6. 一种语音识别系统,其特征在于,所述语音识别系统包括:A speech recognition system, characterized in that the speech recognition system comprises:
    获取模块,用于从预先确定的数据源获取特定类型的信息文本;An obtaining module, configured to obtain a specific type of information text from a predetermined data source;
    分词模块,用于对获取的各个信息文本进行语句切分得到若干语句,对各个语句进行分词处理得到对应的分词,由各个语句与对应的分词构成第一映射语料;The word segmentation module is used for segmenting the obtained information texts to obtain a plurality of sentences, and performing word segmentation processing on each sentence to obtain corresponding word segments, and each sentence and corresponding word segmentation constitute a first mapping corpus;
    训练识别模块,用于根据得到的各个第一映射语料,训练预设类型的第一语言模型,并基于训练的所述第一语言模型进行语音识别。And a training identification module, configured to train a preset first type language model according to the obtained first mapping corpus, and perform speech recognition based on the trained first language model.
  7. 如权利要求6所述的语音识别系统,其特征在于,所述训练识别模块还用于:The speech recognition system according to claim 6, wherein the training recognition module is further configured to:
    根据得到的各个第一映射语料,训练预设类型的第一语言模型;Training a first language model of a preset type according to each of the obtained first mapping corpora;
    根据各个预先确定的样本语句与对应的分词的第二映射语料,训练预设类型的第二语言模型;Training a second language model of a preset type according to each predetermined sample sentence and a second mapping corpus of the corresponding word segment;
    根据预先确定的模型混合公式,将训练的所述第一语言模型及第二语言模型进行混合,以获得混合语言模型,并基于获得的所述混合语言模型进行语音识别。The trained first language model and the second language model are mixed according to a predetermined model mixing formula to obtain a mixed language model, and speech recognition is performed based on the obtained mixed language model.
  8. 如权利要求7所述的语音识别系统,其特征在于,所述预先确定的模型混合公式为:The speech recognition system of claim 7 wherein said predetermined model blending formula is:
M = a*M1 + b*M2
where M is the mixed language model, M1 denotes the first language model of the preset type, a denotes the preset weight coefficient of model M1, M2 denotes the second language model of the preset type, and b denotes the preset weight coefficient of model M2.
9. The speech recognition system according to claim 7 or 8, wherein the first language model and/or the second language model of the preset type is an n-gram language model, and the training process of the first language model or the second language model of the preset type is as follows:
    S1、将各个第一映射语料或者各个第二映射语料分为第一比例的训练集和第二比例的验证集;S1, dividing each first mapping corpus or each second mapping corpus into a training set of a first ratio and a verification set of a second ratio;
    S2、利用所述训练集训练所述第一语言模型或者第二语言模型;S2, training the first language model or the second language model by using the training set;
S3. Verify the accuracy of the trained first language model or second language model using the verification set; if the accuracy is greater than or equal to the preset accuracy, the training ends; otherwise, if the accuracy is less than the preset accuracy, increase the number of first mapping corpora or second mapping corpora and re-execute steps S1, S2 and S3.
  10. 如权利要求6所述的语音识别系统,其特征在于,所述分词模块还用于:The speech recognition system according to claim 6, wherein said word segmentation module is further configured to:
When a segmented sentence is selected for word segmentation, the segmented sentence is matched against a predetermined word dictionary according to the forward maximum matching method to obtain a first matching result, the first matching result containing a first number of first phrases and a third number of single characters;
The segmented sentence is matched against the predetermined word dictionary according to the reverse maximum matching method to obtain a second matching result, the second matching result containing a second number of second phrases and a fourth number of single characters;
    若所述第一数量与所述第二数量相等,且所述第三数量小于或者等于所述第四数量,则将所述第一匹配结果作为该切分的语句的分词结果;If the first quantity is equal to the second quantity, and the third quantity is less than or equal to the fourth quantity, the first matching result is used as a word segmentation result of the segmented statement;
    若所述第一数量与所述第二数量相等,且所述第三数量大于所述第四数量,则将所述第二匹配结果作为该切分的语句的分词结果;If the first quantity is equal to the second quantity, and the third quantity is greater than the fourth quantity, the second matching result is used as a word segmentation result of the segmented statement;
    若所述第一数量与所述第二数量不相等,且所述第一数量大于所述第二数量,则将所述第二匹配结果作为该切分的语句的分词结果;If the first quantity is not equal to the second quantity, and the first quantity is greater than the second quantity, the second matching result is used as a word segmentation result of the segmented statement;
    若所述第一数量与所述第二数量不相等,且所述第一数量小于所述第二数量,则将所述第一匹配结果作为该切分的语句的分词结果。If the first quantity is not equal to the second quantity, and the first quantity is less than the second quantity, the first matching result is used as a word segmentation result of the segmented statement.
11. An electronic device, comprising a processing device, a storage device, and a speech recognition system stored in the storage device, the speech recognition system comprising at least one computer-readable instruction executable by the processing device to implement the following operations:
    A、从预先确定的数据源获取特定类型的信息文本;A. Obtaining a specific type of information text from a predetermined data source;
    B、对获取的各个信息文本进行语句切分得到若干语句,对各个语句进行分词处理得到对应的分词,由各个语句与对应的分词构成第一映射语料;B. Performing segmentation of the obtained information texts to obtain a plurality of sentences, performing word segmentation processing on each sentence to obtain corresponding word segments, and each sentence and corresponding word segmentation constitute a first mapping corpus;
    C、根据得到的各个第一映射语料,训练预设类型的第一语言模型,并基于训练的所述第一语言模型进行语音识别。C. Train a preset first language model according to the obtained first mapping corpus, and perform speech recognition based on the trained first language model.
  12. 如权利要求11所述的电子装置,其特征在于,所述至少一个计算机可读指令还可被所述处理设备执行,以实现以下操作:The electronic device of claim 11 wherein said at least one computer readable instruction is further executable by said processing device to:
    根据得到的各个第一映射语料,训练预设类型的第一语言模型;Training a first language model of a preset type according to each of the obtained first mapping corpora;
Training a second language model of a preset type according to each predetermined sample sentence and the second mapping corpus of the corresponding word segmentation;
    根据预先确定的模型混合公式,将训练的所述第一语言模型及第二语言模型进行混合,以获得混合语言模型,并基于获得的所述混合语言模型进行语音识别。The trained first language model and the second language model are mixed according to a predetermined model mixing formula to obtain a mixed language model, and speech recognition is performed based on the obtained mixed language model.
  13. 如权利要求12所述的电子装置,其特征在于,所述预先确定的模型混合公式为:The electronic device according to claim 12, wherein said predetermined model mixing formula is:
M = a*M1 + b*M2
where M is the mixed language model, M1 denotes the first language model of the preset type, a denotes the preset weight coefficient of model M1, M2 denotes the second language model of the preset type, and b denotes the preset weight coefficient of model M2.
14. The electronic device according to claim 12 or 13, wherein the first language model and/or the second language model of the preset type is an n-gram language model, and the training process of the first language model or the second language model of the preset type is as follows:
    S1、将各个第一映射语料或者各个第二映射语料分为第一比例的训练集和第二比例的验证集;S1, dividing each first mapping corpus or each second mapping corpus into a training set of a first ratio and a verification set of a second ratio;
    S2、利用所述训练集训练所述第一语言模型或者第二语言模型;S2, training the first language model or the second language model by using the training set;
S3. Verify the accuracy of the trained first language model or second language model using the verification set; if the accuracy is greater than or equal to the preset accuracy, the training ends; otherwise, if the accuracy is less than the preset accuracy, increase the number of first mapping corpora or second mapping corpora and re-execute steps S1, S2 and S3.
  15. 如权利要求11所述的电子装置,其特征在于,所述对各个切分的语句进行分词处理包括:The electronic device according to claim 11, wherein said word segmentation processing for each segmented sentence comprises:
When a segmented sentence is selected for word segmentation, the segmented sentence is matched against a predetermined word dictionary according to the forward maximum matching method to obtain a first matching result, the first matching result containing a first number of first phrases and a third number of single characters;
The segmented sentence is matched against the predetermined word dictionary according to the reverse maximum matching method to obtain a second matching result, the second matching result containing a second number of second phrases and a fourth number of single characters;
    若所述第一数量与所述第二数量相等,且所述第三数量小于或者等于所述第四数量,则将所述第一匹配结果作为该切分的语句的分词结果;If the first quantity is equal to the second quantity, and the third quantity is less than or equal to the fourth quantity, the first matching result is used as a word segmentation result of the segmented statement;
    若所述第一数量与所述第二数量相等,且所述第三数量大于所述第四数量,则将所述第二匹配结果作为该切分的语句的分词结果;If the first quantity is equal to the second quantity, and the third quantity is greater than the fourth quantity, the second matching result is used as a word segmentation result of the segmented statement;
    若所述第一数量与所述第二数量不相等,且所述第一数量大于所述第二数量,则将所述第二匹配结果作为该切分的语句的分词结果;If the first quantity is not equal to the second quantity, and the first quantity is greater than the second quantity, the second matching result is used as a word segmentation result of the segmented statement;
    若所述第一数量与所述第二数量不相等,且所述第一数量小于所述第二数量,则将所述第一匹配结果作为该切分的语句的分词结果。 If the first quantity is not equal to the second quantity, and the first quantity is less than the second quantity, the first matching result is used as a word segmentation result of the segmented statement.
  16. 一种计算机可读存储介质,其上存储有至少一个可被处理设备执行以实现以下操作的计算机可读指令:A computer readable storage medium having stored thereon at least one computer readable instruction executable by a processing device to:
    A、从预先确定的数据源获取特定类型的信息文本;A. Obtaining a specific type of information text from a predetermined data source;
    B、对获取的各个信息文本进行语句切分得到若干语句,对各个语句进行分词处理得到对应的分词,由各个语句与对应的分词构成第一映射语料;B. Performing segmentation of the obtained information texts to obtain a plurality of sentences, performing word segmentation processing on each sentence to obtain corresponding word segments, and each sentence and corresponding word segmentation constitute a first mapping corpus;
    C、根据得到的各个第一映射语料,训练预设类型的第一语言模型,并基于训练的所述第一语言模型进行语音识别。C. Train a preset first language model according to the obtained first mapping corpus, and perform speech recognition based on the trained first language model.
  17. 如权利要求16所述的计算机可读存储介质,其特征在于,所述至少一个计算机可读指令还可被所述处理设备执行,以实现以下操作:The computer readable storage medium of claim 16 wherein said at least one computer readable instruction is further executable by said processing device to:
    根据得到的各个第一映射语料,训练预设类型的第一语言模型;Training a first language model of a preset type according to each of the obtained first mapping corpora;
    根据各个预先确定的样本语句与对应的分词的第二映射语料,训练预设类型的第二语言模型;Training a second language model of a preset type according to each predetermined sample sentence and a second mapping corpus of the corresponding word segment;
    根据预先确定的模型混合公式,将训练的所述第一语言模型及第二语言模型进行混合,以获得混合语言模型,并基于获得的所述混合语言模型进行语音识别。The trained first language model and the second language model are mixed according to a predetermined model mixing formula to obtain a mixed language model, and speech recognition is performed based on the obtained mixed language model.
  18. 如权利要求17所述的计算机可读存储介质,其特征在于,所述预先确定的模型混合公式为:The computer readable storage medium of claim 17 wherein said predetermined model blending formula is:
M = a*M1 + b*M2
where M is the mixed language model, M1 denotes the first language model of the preset type, a denotes the preset weight coefficient of model M1, M2 denotes the second language model of the preset type, and b denotes the preset weight coefficient of model M2.
19. The computer-readable storage medium according to claim 17 or 18, wherein the first language model and/or the second language model of the preset type is an n-gram language model, and the training process of the first language model or the second language model of the preset type is as follows:
    S1、将各个第一映射语料或者各个第二映射语料分为第一比例的训练集和第二比例的验证集;S1, dividing each first mapping corpus or each second mapping corpus into a training set of a first ratio and a verification set of a second ratio;
    S2、利用所述训练集训练所述第一语言模型或者第二语言模型;S2, training the first language model or the second language model by using the training set;
S3. Verify the accuracy of the trained first language model or second language model using the verification set; if the accuracy is greater than or equal to the preset accuracy, the training ends; otherwise, if the accuracy is less than the preset accuracy, increase the number of first mapping corpora or second mapping corpora and re-execute steps S1, S2 and S3.
  20. 如权利要求16所述的计算机可读存储介质,其特征在于,所述对各个切分的语句进行分词处理包括:The computer readable storage medium of claim 16 wherein said word segmentation of each segmented statement comprises:
When a segmented sentence is selected for word segmentation, the segmented sentence is matched against a predetermined word dictionary according to the forward maximum matching method to obtain a first matching result, the first matching result containing a first number of first phrases and a third number of single characters;
The segmented sentence is matched against the predetermined word dictionary according to the reverse maximum matching method to obtain a second matching result, the second matching result containing a second number of second phrases and a fourth number of single characters;
    若所述第一数量与所述第二数量相等,且所述第三数量小于或者等于所述第四数量,则将所述第一匹配结果作为该切分的语句的分词结果;If the first quantity is equal to the second quantity, and the third quantity is less than or equal to the fourth quantity, the first matching result is used as a word segmentation result of the segmented statement;
    若所述第一数量与所述第二数量相等,且所述第三数量大于所述第四数量,则将所述第二匹配结果作为该切分的语句的分词结果;If the first quantity is equal to the second quantity, and the third quantity is greater than the fourth quantity, the second matching result is used as a word segmentation result of the segmented statement;
    若所述第一数量与所述第二数量不相等,且所述第一数量大于所述第二数量,则将所述第二匹配结果作为该切分的语句的分词结果;If the first quantity is not equal to the second quantity, and the first quantity is greater than the second quantity, the second matching result is used as a word segmentation result of the segmented statement;
    若所述第一数量与所述第二数量不相等,且所述第一数量小于所述第二数量,则将所述第一匹配结果作为该切分的语句的分词结果。 If the first quantity is not equal to the second quantity, and the first quantity is less than the second quantity, the first matching result is used as a word segmentation result of the segmented statement.
PCT/CN2017/091353 2017-05-10 2017-06-30 Voice recognition method and system, electronic apparatus and medium WO2018205389A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710327374.8 2017-05-10
CN201710327374.8A CN107204184B (en) 2017-05-10 2017-05-10 Audio recognition method and system

Publications (1)

Publication Number Publication Date
WO2018205389A1 true WO2018205389A1 (en) 2018-11-15

Family

ID=59905515

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/091353 WO2018205389A1 (en) 2017-05-10 2017-06-30 Voice recognition method and system, electronic apparatus and medium

Country Status (3)

Country Link
CN (1) CN107204184B (en)
TW (1) TWI636452B (en)
WO (1) WO2018205389A1 (en)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108257593B (en) * 2017-12-29 2020-11-13 深圳和而泰数据资源与云技术有限公司 Voice recognition method and device, electronic equipment and storage medium
CN108831442A (en) * 2018-05-29 2018-11-16 平安科技(深圳)有限公司 Point of interest recognition methods, device, terminal device and storage medium
CN110648657B (en) * 2018-06-27 2024-02-02 北京搜狗科技发展有限公司 Language model training method, language model building method and language model building device
CN109033082B (en) * 2018-07-19 2022-06-10 深圳创维数字技术有限公司 Learning training method and device of semantic model and computer readable storage medium
CN109344221B (en) * 2018-08-01 2021-11-23 创新先进技术有限公司 Recording text generation method, device and equipment
CN109582791B (en) * 2018-11-13 2023-01-24 创新先进技术有限公司 Text risk identification method and device
CN109377985B (en) * 2018-11-27 2022-03-18 北京分音塔科技有限公司 Speech recognition enhancement method and device for domain words
CN109582775B (en) * 2018-12-04 2024-03-26 平安科技(深圳)有限公司 Information input method, device, computer equipment and storage medium
CN109992769A (en) * 2018-12-06 2019-07-09 平安科技(深圳)有限公司 Sentence reasonability judgment method, device, computer equipment based on semanteme parsing
CN109461459A (en) * 2018-12-07 2019-03-12 平安科技(深圳)有限公司 Speech assessment method, apparatus, computer equipment and storage medium
CN109558596A (en) * 2018-12-14 2019-04-02 平安城市建设科技(深圳)有限公司 Recognition methods, device, terminal and computer readable storage medium
CN109783648B (en) * 2018-12-28 2020-12-29 北京声智科技有限公司 Method for improving ASR language model by using ASR recognition result
CN109815991B (en) * 2018-12-29 2021-02-19 北京城市网邻信息技术有限公司 Training method and device of machine learning model, electronic equipment and storage medium
CN110223674B (en) * 2019-04-19 2023-05-26 平安科技(深圳)有限公司 Speech corpus training method, device, computer equipment and storage medium
WO2020244150A1 (en) * 2019-06-06 2020-12-10 平安科技(深圳)有限公司 Speech retrieval method and apparatus, computer device, and storage medium
CN110222182B (en) * 2019-06-06 2022-12-27 腾讯科技(深圳)有限公司 Statement classification method and related equipment
CN110288980A (en) * 2019-06-17 2019-09-27 平安科技(深圳)有限公司 Audio recognition method, the training method of model, device, equipment and storage medium
CN110784603A (en) * 2019-10-18 2020-02-11 深圳供电局有限公司 Intelligent voice analysis method and system for offline quality inspection
CN113055017A (en) * 2019-12-28 2021-06-29 华为技术有限公司 Data compression method and computing device
CN111326160A (en) * 2020-03-11 2020-06-23 南京奥拓电子科技有限公司 Speech recognition method, system and storage medium for correcting noise text
CN112712794A (en) * 2020-12-25 2021-04-27 苏州思必驰信息科技有限公司 Combined speech recognition and annotation training system and device
CN113127621A (en) * 2021-04-28 2021-07-16 平安国际智慧城市科技股份有限公司 Dialogue module pushing method, device, equipment and storage medium
CN113658585B (en) * 2021-08-13 2024-04-09 北京百度网讯科技有限公司 Training method of voice interaction model, voice interaction method and device
CN113948065B (en) * 2021-09-01 2022-07-08 北京数美时代科技有限公司 Method and system for screening error blocking words based on n-gram model

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101593518A (en) * 2008-05-28 2009-12-02 中国科学院自动化研究所 Method for balancing actual-scene corpus and finite-state-network corpus
CN102495837A (en) * 2011-11-01 2012-06-13 中国科学院计算技术研究所 Training method and system for digital information recommendation and forecasting model
CN103577386A (en) * 2012-08-06 2014-02-12 腾讯科技(深圳)有限公司 Method and device for dynamically loading language model based on user input scene
CN103971677A (en) * 2013-02-01 2014-08-06 腾讯科技(深圳)有限公司 Acoustic language model training method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100511248B1 (en) * 2003-06-13 2005-08-31 홍광석 An Amplitude Warping Approach to Intra-Speaker Normalization for Speech Recognition

Also Published As

Publication number Publication date
TWI636452B (en) 2018-09-21
CN107204184A (en) 2017-09-26
TW201901661A (en) 2019-01-01
CN107204184B (en) 2018-08-03

Similar Documents

Publication Publication Date Title
WO2018205389A1 (en) Voice recognition method and system, electronic apparatus and medium
US11693894B2 (en) Conversation oriented machine-user interaction
US9910886B2 (en) Visual representation of question quality
US11521603B2 (en) Automatically generating conference minutes
WO2019232991A1 (en) Method for recognizing conference voice as text, electronic device and storage medium
AU2017408800B2 (en) Method and system of mining information, electronic device and readable storage medium
CN110457672B (en) Keyword determination method and device, electronic equipment and storage medium
WO2014117553A1 (en) Method and system of adding punctuation and establishing language model
US9811517B2 (en) Method and system of adding punctuation and establishing language model using a punctuation weighting applied to chinese speech recognized text
WO2009026850A1 (en) Domain dictionary creation
CN111209363B (en) Corpus data processing method, corpus data processing device, server and storage medium
CN112347241A (en) Abstract extraction method, device, equipment and storage medium
US20220365956A1 (en) Method and apparatus for generating patent summary information, and electronic device and medium
JP7309811B2 (en) Data annotation method, apparatus, electronics and storage medium
CN113836316B (en) Processing method, training method, device, equipment and medium for ternary group data
CN110442696B (en) Query processing method and device
CN113254578B (en) Method, apparatus, device, medium and product for data clustering
CN113158693A (en) Uygur language keyword generation method and device based on Chinese keywords, electronic equipment and storage medium
US11989500B2 (en) Framework agnostic summarization of multi-channel communication
CN108932326B (en) Instance extension method, device, equipment and medium
CN113779990B (en) Chinese word segmentation method, device, equipment and storage medium
US20230222149A1 (en) Embedding performance optimization through use of a summary model
JP2022064137A (en) Estimation device, estimation method, and program
CN115828925A (en) Text selection method and device, electronic equipment and readable storage medium
CN117851542A (en) Information query method, device, equipment, storage medium and program product

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17909445

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17909445

Country of ref document: EP

Kind code of ref document: A1