WO2018205389A1 - Speech recognition method and system, electronic device, and medium - Google Patents

Speech recognition method and system, electronic device, and medium

Info

Publication number
WO2018205389A1
WO2018205389A1 PCT/CN2017/091353 CN2017091353W WO2018205389A1 WO 2018205389 A1 WO2018205389 A1 WO 2018205389A1 CN 2017091353 W CN2017091353 W CN 2017091353W WO 2018205389 A1 WO2018205389 A1 WO 2018205389A1
Authority
WO
WIPO (PCT)
Prior art keywords
language model
segmented
training
preset
word segmentation
Prior art date
Application number
PCT/CN2017/091353
Other languages
English (en)
Chinese (zh)
Inventor
王健宗
程宁
查高密
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2018205389A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models

Definitions

  • the present invention relates to the field of computer technologies, and in particular, to a voice recognition method, system, electronic device, and medium.
  • the language model plays an important role in the speech recognition task.
  • the language model is generally established by using the annotated dialogue text, and the probability of each word is determined by the language model.
  • Building the language model from annotated dialogue text has shortcomings. Because users currently use speech recognition in only a few everyday scenarios (the more common ones being voice search, voice control, and the like), the types and scope of the corpus that can be collected are too concentrated. This leads to two problems: first, annotated dialogue text is expensive to purchase; second, it is difficult to obtain a sufficient amount of corpus.
  • Because annotated dialogue text is difficult to obtain, the timeliness and accuracy of upgrades and expansions are hard to guarantee, which in turn affects the training effect and recognition accuracy of the language model, and thus the accuracy of speech recognition.
  • the main object of the present invention is to provide a speech recognition method, system, electronic device and medium, which aim to effectively improve the accuracy of speech recognition and effectively reduce the cost of speech recognition.
  • a first aspect of the present application provides a voice recognition method, where the method includes the following steps:
  • a second aspect of the present application provides a voice recognition system, where the voice recognition system includes:
  • An obtaining module configured to obtain a specific type of information text from a predetermined data source
  • the word segmentation module is used for segmenting the obtained information texts to obtain a plurality of sentences, and performing word segmentation processing on each sentence to obtain corresponding word segments, and each sentence and corresponding word segmentation constitute a first mapping corpus;
  • a training identification module configured to train a preset first type language model according to the obtained first mapping corpus, and perform speech recognition based on the trained first language model.
  • a third aspect of the present application provides an electronic device, including a processing device, a storage device, and a voice recognition system, the voice recognition system being stored in the storage device and including at least one computer readable instruction, the at least one computer readable instruction being executable by the processing device to:
  • a fourth aspect of the present application provides a computer readable storage medium having stored thereon at least one computer readable instruction executable by a processing device to:
  • The speech recognition method, system, electronic device, and medium provided by the invention perform sentence segmentation on a specific type of information text acquired from a predetermined data source, and perform word segmentation on each segmented sentence to obtain a first mapping corpus of each segmented sentence and its corresponding word segments. A first language model of a preset type is trained according to the first mapping corpus, and speech recognition is performed based on the trained first language model.
  • Because corpus resources can be obtained by performing sentence segmentation and the corresponding word segmentation on information text obtained from a plurality of predetermined data sources, and the language model is trained on those corpus resources, there is no need to obtain annotated dialogue text. A sufficient number of corpus resources can be obtained, which ensures the training effect and recognition accuracy of the language model, thereby effectively improving the accuracy of speech recognition and effectively reducing its cost.
  • FIG. 1 is a schematic diagram of an application environment of a preferred embodiment of a voice recognition method according to the present invention
  • FIG. 2 is a schematic flow chart of a first embodiment of a voice recognition method according to the present invention.
  • FIG. 3 is a schematic flow chart of a second embodiment of a voice recognition method according to the present invention.
  • FIG. 4 is a schematic diagram of functional modules of an embodiment of a speech recognition system of the present invention.
  • Refer to FIG. 1, which is a schematic diagram of an application environment of a preferred embodiment of the speech recognition method of the present invention.
  • the application environment diagram includes an electronic device 1 and a terminal device 2.
  • the electronic device 1 can perform data interaction with the terminal device 2 through a suitable technology such as a network or a near field communication technology.
  • the terminal device 2 includes, but is not limited to, any electronic product that can interact with a user through a keyboard, a mouse, a remote controller, a touch pad, or a voice control device, for example, a personal computer, a tablet computer, a smart phone, a personal digital assistant (PDA), a game console, an Internet Protocol Television (IPTV), a smart wearable device, etc.
  • the electronic device 1 is an apparatus capable of automatically performing numerical calculation and/or information processing in accordance with instructions that are set or stored in advance.
  • the electronic device 1 may be a computer, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing, where cloud computing is a type of distributed computing: a super virtual computer consisting of a loosely coupled set of computers.
  • the electronic device 1 includes, but is not limited to, a storage device 11, a processing device 12, and a network interface 13 that are communicably connected to each other through a system bus. It should be noted that FIG. 1 only shows the electronic device 1 having the components 11-13, but it should be understood that not all illustrated components are required to be implemented, and more or fewer components may be implemented instead.
  • the storage device 11 includes a memory and at least one type of readable storage medium.
  • the memory provides a cache for the operation of the electronic device 1;
  • the readable storage medium may be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card, a card type memory, or the like.
  • In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1; in other embodiments, the non-volatile storage medium may also be a storage device external to the electronic device 1, such as a plug-in hard disk fitted to the electronic device 1, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, or the like.
  • the readable storage medium of the storage device 11 is generally used to store an operating system installed in the electronic device 1 and various types of application software, such as program codes of the voice recognition system 10 in an embodiment of the present application. Further, the storage device 11 can also be used to temporarily store various types of data that have been output or are to be output.
  • Processing device 12 may, in some embodiments, include one or more microprocessors, microcontrollers, digital processors, and the like.
  • the processing device 12 is generally used to control the operation of the electronic device 1, for example, to perform control and processing related to data interaction or communication with the terminal device 2.
  • the processing device 12 is configured to run program code or process data stored in the storage device 11, such as running the speech recognition system 10 or the like.
  • the network interface 13 may comprise a wireless network interface or a wired network interface, which is typically used to establish a communication connection between the electronic device 1 and other electronic devices.
  • the network interface 13 is mainly used to connect the electronic device 1 with one or more terminal devices 2, and establish a data transmission channel and a communication connection between the electronic device 1 and one or more terminal devices 2.
  • The speech recognition system 10 includes at least one computer readable instruction stored in the storage device 11. The at least one computer readable instruction can be executed by the processing device 12 to implement the speech recognition method of the embodiments of the present application. As described later, the at least one computer readable instruction can be classified into different logic modules depending on the functions implemented by its various parts.
  • When the speech recognition system 10 is executed by the processing device 12, the following operations are performed. First, a specific type of information text is acquired from a predetermined data source. The obtained information text is segmented into a plurality of sentences, word segmentation is performed on each sentence to obtain the corresponding word segments, and each sentence and its corresponding word segments constitute a first mapping corpus. Then, a first language model of a preset type is trained according to the obtained first mapping corpora. After the speech to be recognized sent by the terminal device 2 is received, it is input into the trained first language model for recognition, and the recognition result is fed back to the terminal device 2 for display to the end user.
  • the speech recognition system 10 is stored in the storage device 11 and includes at least one computer readable instruction stored in the storage device 11, the at least one computer readable instruction being executable by the processing device 12 to implement the present application.
  • the at least one computer readable instruction can be classified into different logic modules depending on the functions implemented by its various parts.
  • the invention provides a speech recognition method.
  • FIG. 2 is a schematic flowchart of a first embodiment of a voice recognition method according to the present invention.
  • the speech recognition method comprises:
  • Step S10 Acquire a specific type of information text from a predetermined data source.
  • In this embodiment, a specific type of information text (for example, entries and their explanations, news headlines, news summaries, Weibo content, etc.) is obtained from a plurality of predetermined data sources (for example, Sina Weibo, Baidu Encyclopedia, Wikipedia, Sina News, etc.) in real time or at scheduled times.
  • For example, specific types of information (e.g., news headline information, index information, profile information, etc.) may be obtained from a predetermined data source (e.g., major news websites, forums, etc.).
  • Step S20 performing segmentation of the obtained information texts to obtain a plurality of sentences, performing word segmentation processing on the respective sentences to obtain corresponding segmentation words, and each sentence and the corresponding word segmentation constitute a first mapping corpus.
  • the obtained information texts may be segmented into sentences, for example, the information texts may be divided into complete statements according to punctuation marks.
  • word segmentation is performed on each segmented sentence.
  • A string-matching word segmentation method can be used to segment each sentence, for example: the forward maximum matching method, which segments the string in a sentence from left to right; the reverse maximum matching method, which segments the string in a sentence from right to left; the shortest path method, which requires that the string in a sentence be cut into the smallest number of words; or the bidirectional maximum matching method, which performs forward and reverse segmentation matching at the same time.
  • Word-meaning segmentation can also be used to segment each sentence. Word-meaning segmentation is a segmentation method based on machine speech judgment; it uses syntactic and semantic information to resolve ambiguity when segmenting words.
  • Statistical segmentation can also be used to segment each sentence. Based on phrase statistics from the historical search records of the current user or of public users, if two adjacent words are found to occur together frequently, those two adjacent words can be treated as a phrase during word segmentation.
  • After word segmentation is completed, the first mapping corpus composed of each segmented sentence and its corresponding word segments is obtained.
  • Because the information text is acquired from a plurality of predetermined data sources and split into a large number of sentences for word segmentation, the resulting corpus resources are rich in type, wide in scope, and large in number.
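The sentence segmentation and forward maximum matching described above can be sketched as follows. This is only an illustrative sketch: the toy dictionary, the punctuation set, and the function names `split_sentences`, `forward_max_match`, and `build_mapping_corpus` are assumptions made for the example and do not come from the patent.

```python
import re

# Hypothetical toy dictionary; a real system would load a large lexicon.
DICTIONARY = {"语音", "识别", "语音识别", "技术", "很", "有用"}
MAX_WORD_LEN = max(len(w) for w in DICTIONARY)

def split_sentences(text):
    """Split text into sentences at an assumed set of break characters."""
    parts = re.split(r"[，。！？；,.!?;]", text)
    return [p.strip() for p in parts if p.strip()]

def forward_max_match(sentence, dictionary=DICTIONARY, max_len=MAX_WORD_LEN):
    """Scan left to right, always taking the longest dictionary entry."""
    tokens, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            # Fall back to a single character when nothing longer matches.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

def build_mapping_corpus(text):
    """Pair every segmented sentence with its word segments."""
    return [(s, forward_max_match(s)) for s in split_sentences(text)]

corpus = build_mapping_corpus("语音识别技术。很有用！")
```

Each element of `corpus` is a (sentence, word segments) pair, which is one possible in-memory shape for the "first mapping corpus".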
  • Step S30 Train a preset first type language model according to the obtained first mapping corpus, and perform speech recognition based on the trained first language model.
  • A first language model of a preset type is trained based on each first mapping corpus; the first language model may be a generative model, an analytical model, an identifying model, or the like. Since the first mapping corpus is obtained from multiple data sources, its corpus resources are rich in type, wide in scope, and large in number. Training the first language model on the first mapping corpus therefore gives a better training effect, and speech recognition based on the trained first language model is correspondingly more accurate.
  • a sentence segmentation is performed on a specific type of information text acquired from a predetermined data source, and word segmentation processing is performed on each segmented sentence to obtain a first mapping corpus of each segmented sentence and a corresponding segmentation word.
  • a first language model of a preset type is trained according to the first mapping corpus, and speech recognition is performed based on the first language model of the training.
  • Because corpus resources can be obtained by performing sentence segmentation and the corresponding word segmentation on information text obtained from a plurality of predetermined data sources, and the language model is trained on those corpus resources, there is no need to obtain annotated dialogue text. A sufficient number of corpus resources can be obtained, which ensures the training effect and recognition accuracy of the language model, thereby effectively improving the accuracy of speech recognition and effectively reducing its cost.
  • step S20 may include:
  • Cleaning and denoising each obtained information text. For Weibo content, for example, the cleaning and denoising steps include: deleting user names, IDs, and similar information from the Weibo content and keeping only the actual content; deleting forwarded Weibo content; filtering out special symbols; converting traditional characters to simplified characters; and so on.
  • The obtained Weibo content generally contains forwarded Weibo content. Repeated forwarded content distorts word frequencies, so forwarded Weibo content must be filtered out.
  • The filtering method is to delete all Weibo content that contains "forwarding" or "http".
  • Filtering out special symbols means removing all symbols of preset types from the Weibo content. For traditional-to-simplified conversion, Weibo content contains a large number of traditional characters, and a predetermined traditional-simplified correspondence table is used to convert all traditional characters to simplified characters.
  • Performing sentence segmentation on each cleaned and denoised information text, for example by treating the text between two break characters of preset types (for example, comma, period, exclamation point, etc.) as one sentence to be segmented, and performing word segmentation on each segmented sentence to obtain the mapping corpus of each segmented sentence and its corresponding word segments (including phrases and single words).
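A minimal sketch of the cleaning and denoising steps for Weibo content follows. All specifics here are assumptions for illustration: the marker strings used to detect forwarded posts, the user-name pattern, the set of "special" symbols, and the three-entry traditional-to-simplified table are hypothetical stand-ins for the preset rules and the predetermined correspondence table mentioned in the text.

```python
import re

# Hypothetical, deliberately tiny traditional-to-simplified table.
TRAD_TO_SIMP = {"語": "语", "識": "识", "別": "别"}
FORWARD_MARKERS = ("转发", "轉發", "http")   # assumed markers of forwarded posts
MENTION = re.compile(r"@[\w\-]+[:：]?\s*")     # assumed user-name pattern
SPECIAL = re.compile(r"[#\[\]【】~★☆]")        # assumed "preset" symbol set

def clean_post(post):
    """Return the cleaned post, or None when the post should be dropped."""
    # Drop forwarded posts entirely so they do not distort word frequencies.
    if any(marker in post for marker in FORWARD_MARKERS):
        return None
    post = MENTION.sub("", post)    # remove user names / ids
    post = SPECIAL.sub("", post)    # remove special symbols
    post = "".join(TRAD_TO_SIMP.get(ch, ch) for ch in post)  # traditional to simplified
    return post.strip() or None

def clean_posts(posts):
    """Clean a batch of posts and discard the ones that were dropped."""
    cleaned = (clean_post(p) for p in posts)
    return [p for p in cleaned if p is not None]

posts = ["@user_1: 語音識別很有用", "转发微博 http://t.cn/xyz", "#话题# 今天天气好"]
result = clean_posts(posts)
```

A production pipeline would replace the toy table with a full conversion table and tune the patterns to the actual data source.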
  • A second embodiment of the present invention provides a speech recognition method in which, on the basis of the above embodiment, step S30 is replaced by:
  • Step S40 Train a preset first language model according to each of the obtained first mapping corpora.
  • Step S50 training a preset second language model according to each predetermined sample sentence and a second mapping corpus of the corresponding segmentation.
  • A number of sample sentences can be predetermined, for example by finding the most frequently occurring or most commonly used sentences in a predetermined data source and determining the correct word segments (including phrases and single words) for each sample sentence, so that a second language model of a preset type can be trained according to each predetermined sample sentence and the second mapping corpus of its corresponding word segments.
  • Step S60 mixing the trained first language model and the second language model according to a predetermined model mixing formula to obtain a mixed language model, and performing speech recognition based on the obtained mixed language model.
  • the predetermined model mixing formula can be:
  • M1 represents a first language model of a preset type
  • a represents a weighting coefficient of a preset model M1
  • M2 represents a second language model of a preset type
  • b represents a weight of a preset model M2.
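The mixing formula itself is not reproduced in this text. One reading that is consistent with the definitions of M1, a, M2, and b is a linear interpolation of the two models, M = a·M1 + b·M2; the sketch below illustrates that assumed form only. The toy models, the weights 0.7 and 0.3, and the function name `mix_models` are hypothetical.

```python
def mix_models(m1, m2, a, b):
    """Return a model whose probability is a*P1 + b*P2 (assumed linear mix)."""
    def mixed(word, history):
        return a * m1(word, history) + b * m2(word, history)
    return mixed

# Hypothetical toy models: callables mapping (word, history) to a probability.
def m1(word, history):
    return {"识别": 0.6, "合成": 0.4}.get(word, 0.0)

def m2(word, history):
    return {"识别": 0.9, "合成": 0.1}.get(word, 0.0)

# With a + b = 1 the mixed scores remain a probability distribution.
mixed = mix_models(m1, m2, a=0.7, b=0.3)
p = mixed("识别", ("语音",))
```

Choosing a larger `b` would give more influence to the model trained on the predetermined sample sentences.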
  • In this embodiment, a second language model is also trained according to each predetermined sample sentence and the second mapping corpus of its corresponding word segments. For example, the predetermined sample sentences may be a preset number of the most commonly used, correct sentences, so the trained second language model can correctly recognize commonly used speech.
  • The trained first language model and second language model are mixed according to preset weight ratios to obtain a mixed language model, and speech recognition is performed based on the mixed language model. This keeps the range of recognizable speech rich and wide while ensuring correct recognition of commonly used speech, further improving the accuracy of speech recognition.
  • the training process of the preset type of the first language model or the second language model is as follows:
  • A. Dividing each first mapping corpus or each second mapping corpus into a training set of a first ratio (for example, 70%) and a verification set of a second ratio (for example, 30%);
  • If the accuracy rate is greater than or equal to the preset accuracy rate, the training ends; or, if the accuracy rate is less than the preset accuracy rate, the number of first mapping corpora or second mapping corpora is increased and steps A, B, and C are re-executed until the accuracy of the trained first language model or second language model is greater than or equal to the preset accuracy rate.
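The split, train, verify, and retry loop can be sketched as below. The intermediate steps are not fully reproduced in this text, so the sketch assumes they consist of training on the training set and measuring accuracy on the verification set. `train_fn`, `accuracy_fn`, `more_corpus`, the 0.9 target, and the round limit are all hypothetical placeholders supplied by the caller.

```python
import random

def split_corpus(corpus, train_ratio=0.7, seed=0):
    """Shuffle and split into a training set and a verification set."""
    items = list(corpus)
    random.Random(seed).shuffle(items)
    cut = round(len(items) * train_ratio)
    return items[:cut], items[cut:]

def train_until_accurate(corpus, more_corpus, train_fn, accuracy_fn,
                         target=0.9, max_rounds=10):
    """Repeat split/train/verify, growing the corpus until the target is met.

    Assumes max_rounds >= 1. Returns the last model and its accuracy.
    """
    corpus = list(corpus)
    for _ in range(max_rounds):
        train_set, verify_set = split_corpus(corpus)
        model = train_fn(train_set)                 # assumed training step
        accuracy = accuracy_fn(model, verify_set)   # assumed verification step
        if accuracy >= target:
            break
        extra = more_corpus()                       # fetch additional corpus
        if not extra:
            break
        corpus.extend(extra)
    return model, accuracy
```

The round limit is a design choice added for the sketch so that the loop terminates even if the target accuracy is never reached.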
  • the preset type of the first language model and/or the second language model is an n-gram language model.
  • The n-gram language model is a commonly used language model in large-vocabulary continuous speech recognition. For Chinese, it is called the Chinese Language Model (CLM).
  • The Chinese language model uses the collocation information between adjacent words in the context. When continuous pinyin, strokes, or digits representing letters or strokes, written without spaces, need to be converted into a Chinese character string (i.e., a sentence), the sentence with the highest probability can be calculated. This achieves automatic conversion to Chinese characters and avoids the duplicate-code problem in which many Chinese characters correspond to the same pinyin (or stroke string, or digit string).
  • N-gram is a statistical language model used to predict the nth item based on the first (n-1) items.
  • These items can be phonemes (speech recognition applications), characters (input method applications), words (word segmentation applications), or base pairs (genetic information), and n-gram models can be generated from large-scale text or audio corpora.
  • the n-gram language model is based on the assumption that the occurrence of the nth word is only related to the first n-1 words, but not to any other words.
  • the present embodiment adopts a maximum likelihood estimation method, namely:
  • From this, the probability of occurrence of the nth word can be calculated to determine the probability of the corresponding word, thereby realizing speech recognition.
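The estimation formula is not reproduced in this text. For reference, the standard maximum likelihood estimate for a bigram model (n = 2) is P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}), where C(·) is a count over the training corpus. The sketch below implements that standard form for bigrams; it is not taken from the patent, and the sentence boundary markers `<s>` and `</s>` and the toy corpus are assumptions.

```python
from collections import Counter

def train_bigram(segmented_sentences):
    """Count histories and bigrams from sentences given as word lists."""
    history_counts, bigram_counts = Counter(), Counter()
    for tokens in segmented_sentences:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        for prev, word in zip(padded, padded[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return history_counts, bigram_counts

def bigram_prob(word, prev, history_counts, bigram_counts):
    """Maximum likelihood estimate: C(prev, word) / C(prev)."""
    if history_counts[prev] == 0:
        return 0.0  # unseen history; no smoothing in this sketch
    return bigram_counts[(prev, word)] / history_counts[prev]

# Toy word-segmented corpus, in the shape produced by word segmentation.
corpus = [["语音", "识别"], ["语音", "合成"], ["语音", "识别", "技术"]]
h, b = train_bigram(corpus)
p = bigram_prob("识别", "语音", h, b)
```

Here "语音" occurs three times as a history and is followed by "识别" twice, so the estimate is 2/3. A practical model would add smoothing so that unseen word pairs do not receive zero probability.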
  • the step of performing word segmentation processing on each segmented statement in the above step S20 may include:
  • Using the forward maximum matching method, the character string to be processed in each segmented sentence is matched against a predetermined word dictionary library (for example, a general word dictionary library or an extensible, learning word dictionary library) to obtain the first matching result;
  • Using the reverse maximum matching method, the character string to be processed in each segmented sentence is matched against the predetermined word dictionary library to obtain the second matching result.
  • the first matching result includes a first number of first phrases
  • the second matching result includes a second number of second phrases
  • the first matching result includes a third number of words
  • the second matching result includes a fourth number of words.
  • If the first quantity is equal to the second quantity and the third quantity is less than or equal to the fourth quantity, outputting the first matching result (including phrases and single words) for the segmented sentence;
  • If the first quantity is equal to the second quantity and the third quantity is greater than the fourth quantity, outputting the second matching result (including phrases and single words) for the segmented sentence;
  • If the first quantity is not equal to the second quantity and the first quantity is greater than the second quantity, outputting the second matching result (including phrases and single words) for the segmented sentence;
  • If the first quantity is not equal to the second quantity and the first quantity is less than the second quantity, outputting the first matching result (including phrases and single words) for the segmented sentence.
  • In this embodiment, the bidirectional matching method is used to perform word segmentation on each segmented sentence. Forward and reverse segmentation matching are performed at the same time to analyze how strongly adjacent content in the character string to be processed of each segmented sentence binds together.
  • A phrase is usually more likely than a single word to represent core viewpoint information; that is, core viewpoint information is better expressed by phrases. Therefore, through simultaneous forward and reverse matching, the matching result with fewer single words and more phrases is used as the word segmentation result of the segmented sentence, which improves the accuracy of word segmentation and helps ensure the training effect and recognition accuracy of the language model.
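The selection between the two matching results can be sketched as follows, implementing the four output conditions exactly as they are listed. The sketch assumes that the first matching result is the forward match and the second is the reverse match, and that a "phrase" is any segment longer than one character. The toy dictionary and the function names are hypothetical, and the example sentence is chosen only to exercise the tie-breaking condition.

```python
def max_match(sentence, dictionary, max_len, reverse=False):
    """Greedy maximum matching, left to right or right to left."""
    tokens, text = [], sentence
    while text:
        for size in range(min(max_len, len(text)), 0, -1):
            piece = text[-size:] if reverse else text[:size]
            # Fall back to a single character when nothing longer matches.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                text = text[:-size] if reverse else text[size:]
                break
    return tokens[::-1] if reverse else tokens

def count_phrases_and_singles(tokens):
    """Return (number of multi-character phrases, number of single words)."""
    phrases = sum(1 for t in tokens if len(t) > 1)
    return phrases, len(tokens) - phrases

def bidirectional_match(sentence, dictionary):
    max_len = max(map(len, dictionary))
    first = max_match(sentence, dictionary, max_len)                 # forward
    second = max_match(sentence, dictionary, max_len, reverse=True)  # reverse
    n1, n3 = count_phrases_and_singles(first)    # first and third quantities
    n2, n4 = count_phrases_and_singles(second)   # second and fourth quantities
    if n1 == n2:
        return first if n3 <= n4 else second
    return second if n1 > n2 else first

toy_dictionary = {"北京", "京剧团", "演出"}
segments = bidirectional_match("北京剧团演出", toy_dictionary)
```

With this toy dictionary both directions yield two phrases, but the forward result contains two single words and the reverse result only one, so the reverse result is returned.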
  • FIG. 4 is a functional block diagram of a preferred embodiment of the speech recognition system 10 of the present invention.
  • The speech recognition system 10 may be divided into one or more modules, which are stored in the storage device 11 and executed by one or more processors (in this embodiment, the processing device 12) to complete the present invention.
  • the speech recognition system 10 can be divided into an acquisition module 01, a word segmentation module 02, and a training recognition module 03.
  • A module as referred to in the present invention is a series of computer program instructions capable of performing a particular function, and is more suitable than a program for describing the execution of the speech recognition system 10 in the electronic device 1. The following description details the functions of the acquisition module 01, the word segmentation module 02, and the training recognition module 03.
  • the obtaining module 01 is configured to obtain a specific type of information text from a predetermined data source.
  • In this embodiment, a specific type of information text (for example, entries and their explanations, news headlines, news summaries, Weibo content, etc.) is obtained from a plurality of predetermined data sources (for example, Sina Weibo, Baidu Encyclopedia, Wikipedia, Sina News, etc.) in real time or at scheduled times.
  • For example, specific types of information (e.g., news headline information, index information, profile information, etc.) may be obtained from a predetermined data source (e.g., major news websites, forums, etc.).
  • the word segmentation module 02 is configured to perform segmentation of the obtained information texts to obtain a plurality of sentences, perform word segmentation processing on the respective sentences to obtain corresponding segmentation words, and each sentence and the corresponding word segmentation constitute a first mapping corpus.
  • the obtained information texts may be segmented into sentences, for example, the information texts may be divided into complete statements according to punctuation marks.
  • word segmentation is performed on each segmented sentence.
  • A string-matching word segmentation method can be used to segment each sentence, for example: the forward maximum matching method, which segments the string in a sentence from left to right; the reverse maximum matching method, which segments the string in a sentence from right to left; the shortest path method, which requires that the string in a sentence be cut into the smallest number of words; or the bidirectional maximum matching method, which performs forward and reverse segmentation matching at the same time.
  • Word-meaning segmentation can also be used to segment each sentence. Word-meaning segmentation is a segmentation method based on machine speech judgment; it uses syntactic and semantic information to resolve ambiguity when segmenting words.
  • Statistical segmentation can also be used to segment each sentence. Based on phrase statistics from the historical search records of the current user or of public users, if two adjacent words are found to occur together frequently, those two adjacent words can be treated as a phrase during word segmentation.
  • After word segmentation is completed, the first mapping corpus composed of each segmented sentence and its corresponding word segments is obtained.
  • Because the information text is acquired from a plurality of predetermined data sources and split into a large number of sentences for word segmentation, the resulting corpus resources are rich in type, wide in scope, and large in number.
  • the training identification module 03 is configured to train a preset first language model according to the obtained first mapping corpus, and perform speech recognition based on the trained first language model.
  • A first language model of a preset type is trained based on each first mapping corpus; the first language model may be a generative model, an analytical model, an identifying model, or the like. Since the first mapping corpus is obtained from multiple data sources, its corpus resources are rich in type, wide in scope, and large in number. Training the first language model on the first mapping corpus therefore gives a better training effect, and speech recognition based on the trained first language model is correspondingly more accurate.
  • a sentence segmentation is performed on a specific type of information text acquired from a predetermined data source, and word segmentation processing is performed on each segmented sentence to obtain a first mapping corpus of each segmented sentence and a corresponding segmentation word.
  • a first language model of a preset type is trained according to the first mapping corpus, and speech recognition is performed based on the first language model of the training.
  • Because corpus resources can be obtained by performing sentence segmentation and the corresponding word segmentation on information text obtained from a plurality of predetermined data sources, and the language model is trained on those corpus resources, there is no need to obtain annotated dialogue text. A sufficient number of corpus resources can be obtained, which ensures the training effect and recognition accuracy of the language model, thereby effectively improving the accuracy of speech recognition and effectively reducing its cost.
  • The word segmentation module 02 is further configured to clean and denoise the acquired information text. The cleaning and denoising steps include:
  • deleting user names, ids, and the like from the microblog content, retaining only the actual content of the microblog;
  • deleting forwarded microblog content. The microblog content obtained generally includes forwarded content, and repeatedly forwarded content distorts word frequencies, so forwarded microblog content must be filtered out. The filtering method is to delete all microblog content containing "forwarding" or "http";
  • filtering out special symbols, that is, all symbols of preset types in the microblog content;
  • converting traditional characters to simplified characters. Microblog content contains a large number of traditional characters, and all of them are converted to simplified characters using a predetermined traditional-to-simplified correspondence table; and so on.
  • Sentence segmentation is then performed on each information text after cleaning and denoising. For example, the text between two break characters of preset types (for example, comma, period, exclamation point, etc.) is taken as one segmented sentence, and word segmentation is performed on each segmented sentence to obtain the mapping corpus of each segmented sentence and its corresponding segmented words (including phrases and single words).
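The cleaning and sentence-segmentation steps above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function names, the marker list, the symbol set, the mention pattern, the break-character set, and the three-entry traditional-to-simplified table are all assumptions made for the example.

```python
import re

# All of the following constants are illustrative assumptions.
FORWARD_MARKERS = ("转发", "forwarding", "http")      # posts containing these are dropped
MENTION = re.compile(r"@[^\s:：,，。！？!?]+[:：]?")    # "@username:" style prefixes
SPECIAL_SYMBOLS = re.compile(r"[#\[\]【】~*]")         # a preset set of symbols to strip
TRAD_TO_SIMP = {"語": "语", "識": "识", "別": "别"}     # toy excerpt of a conversion table
BREAKS = re.compile(r"[，,。.！!？?；;]")               # preset break characters


def clean_posts(posts):
    """Drop forwarded posts, strip user names and symbols, convert to simplified."""
    cleaned = []
    for post in posts:
        if any(marker in post for marker in FORWARD_MARKERS):
            continue  # forwarded content would distort word frequencies
        text = MENTION.sub("", post)
        text = SPECIAL_SYMBOLS.sub("", text)
        text = "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text).strip()
        if text:
            cleaned.append(text)
    return cleaned


def split_sentences(text):
    """Treat the text between two break characters as one segmented sentence."""
    return [part.strip() for part in BREAKS.split(text) if part.strip()]
```

Each sentence returned by `split_sentences` would then be passed to the word segmentation step to build one entry of the mapping corpus.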
  • The training identification module 03 is further configured to:
  • train a first language model of a preset type according to each of the obtained first mapping corpora;
  • train a second language model of a preset type according to the second mapping corpus of each predetermined sample sentence and its corresponding word segmentation. For example, a number of sample sentences can be predetermined, such as by finding the most frequently occurring or most commonly used sentences in a predetermined data source, and the correct word segmentation (including phrases and single words) is determined for each sample sentence, so that the second language model is trained on the resulting second mapping corpus;
  • mix the trained first language model and second language model according to a predetermined model mixing formula to obtain a mixed language model, and perform speech recognition based on the mixed language model.
  • The predetermined model mixing formula can be:
  • M = a*M1 + b*M2
  • where M represents the mixed language model, M1 represents the first language model of the preset type, a represents the preset weight coefficient of model M1, M2 represents the second language model of the preset type, and b represents the preset weight coefficient of model M2.
  • The second language model is trained according to the second mapping corpus of the predetermined sample sentences and their corresponding word segmentations. For example, the predetermined sample sentences may be a preset number of the most commonly used, correct sentences, so the trained second language model can correctly recognize commonly used speech.
  • The trained first language model and second language model are mixed according to different preset weight ratios to obtain a mixed language model, and speech recognition is performed based on the mixed language model. This keeps the recognizable speech rich in type and wide in range while ensuring that commonly used speech is recognized correctly, further improving the accuracy of speech recognition.
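One plausible reading of the weighted mixing is a linear interpolation of the probabilities that the two models assign to the same word. The sketch below assumes that reading. The names `mix_language_models` and `table_model` are hypothetical, and the requirement that the two weights sum to 1 is an added assumption that keeps the mixed scores a valid probability distribution.

```python
def mix_language_models(m1, m2, a, b):
    """Return a model scoring a * m1 + b * m2 (assumed linear interpolation)."""
    if abs(a + b - 1.0) > 1e-9:
        # Assumption for this sketch: the weights form a convex combination.
        raise ValueError("weights a and b are assumed to sum to 1")

    def mixed(word, history):
        return a * m1(word, history) + b * m2(word, history)

    return mixed


def table_model(table):
    """Toy language model backed by a {(history, word): probability} table."""
    return lambda word, history: table.get((tuple(history), word), 0.0)


# M1: model trained on the large, varied corpus; M2: model trained on common sentences.
general = table_model({(("语音",), "识别"): 0.2})
common = table_model({(("语音",), "识别"): 0.6})
mixed = mix_language_models(general, common, a=0.7, b=0.3)
```

Raising `b` shifts the mixed model toward the commonly used sentences, while raising `a` favors the broader coverage of the first model.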
  • The training process of the first language model or the second language model of the preset type is as follows:
  • each first mapping corpus or each second mapping corpus is divided into a training set of a first ratio (for example, 70%) and a verification set of a second ratio (for example, 30%);
  • if the accuracy of the trained model is greater than or equal to the preset accuracy, the training ends; if the accuracy is less than the preset accuracy, the number of first mapping corpora or second mapping corpora is increased and steps A, B, and C are re-executed until the accuracy of the trained first language model or second language model is greater than or equal to the preset accuracy.
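The split, train, and verify loop can be sketched as follows. The text does not spell out every step, so this sketch assumes that steps A, B, and C are: split the corpus, train on the training set, and measure accuracy on the verification set. The callables `train`, `evaluate`, and `fetch_more`, and the `max_rounds` cap, are hypothetical placeholders supplied by the caller.

```python
import random


def split_corpus(corpus, train_ratio=0.7, seed=0):
    """Shuffle and split a mapping corpus into training and verification sets."""
    items = list(corpus)
    random.Random(seed).shuffle(items)
    cut = round(len(items) * train_ratio)
    return items[:cut], items[cut:]


def train_until_accurate(corpus, fetch_more, train, evaluate, target, max_rounds=10):
    """Repeat split/train/verify, enlarging the corpus until accuracy >= target."""
    corpus = list(corpus)
    for _ in range(max_rounds):
        train_set, verification_set = split_corpus(corpus)   # assumed step A
        model = train(train_set)                              # assumed step B
        accuracy = evaluate(model, verification_set)          # assumed step C
        if accuracy >= target:
            return model, accuracy
        corpus.extend(fetch_more())  # increase the number of mapping corpora
    raise RuntimeError("preset accuracy not reached within max_rounds")
```

The `max_rounds` cap is a safeguard added for the sketch so the loop terminates even if the preset accuracy is never reached.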
  • The preset type of the first language model and/or the second language model is an n-gram language model.
  • The n-gram language model is a commonly used language model in large-vocabulary continuous speech recognition. For Chinese, it is called the Chinese Language Model (CLM).
  • The Chinese language model uses the collocation information between adjacent words in the context. When continuous, unspaced pinyin, strokes, or digits representing letters or strokes need to be converted into a Chinese character string (i.e., a sentence), the sentence with the maximum probability can be calculated. This achieves automatic conversion to Chinese characters and avoids the problem of duplicate codes, in which many Chinese characters correspond to the same pinyin (or stroke string, or digit string).
  • An n-gram model is a statistical language model used to predict the nth item based on the preceding (n-1) items.
  • These items can be phonemes (speech recognition applications), characters (input method applications), words (word segmentation applications), or base pairs (genetic information), and n-gram models can be generated from large-scale text or audio corpora.
  • The n-gram language model is based on the assumption that the occurrence of the nth word depends only on the preceding n-1 words and is independent of any other words.
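  • The present embodiment adopts the maximum likelihood estimation method, namely:
  • P(Wn | W1 W2 ... Wn-1) = C(W1 W2 ... Wn) / C(W1 W2 ... Wn-1), where C(·) is the number of times the word sequence occurs in the mapping corpus.
  • In this way, the probability of occurrence of the nth word can be calculated to determine the probability of the corresponding word and thereby realize speech recognition.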
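Maximum likelihood estimation for an n-gram model reduces to counting: the probability of a word given its n-1 predecessors is the count of the full n-gram divided by the count of its (n-1)-word context. A minimal sketch over already segmented sentences follows. The function names and the `<s>` and `</s>` padding symbols are assumptions for the example, and no smoothing is applied, so unseen n-grams receive probability 0.

```python
from collections import Counter


def train_ngram(sentences, n=2):
    """Count n-grams and their (n-1)-word contexts over segmented sentences."""
    ngrams, contexts = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] * (n - 1) + list(words) + ["</s>"]
        for i in range(len(padded) - n + 1):
            gram = tuple(padded[i:i + n])
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1
    return ngrams, contexts


def mle_probability(ngrams, contexts, history, word):
    """P(word | history) = C(history + word) / C(history), unsmoothed."""
    history = tuple(history)
    if contexts[history] == 0:
        return 0.0
    return ngrams[history + (word,)] / contexts[history]


corpus = [["语音", "识别"], ["语音", "识别"], ["语音", "合成"]]
ngrams, contexts = train_ngram(corpus, n=2)
p = mle_probability(ngrams, contexts, ["语音"], "识别")  # 2 of the 3 occurrences
```

In practice a smoothing scheme is usually layered on top of these counts, since an unsmoothed model assigns zero probability to any word sequence absent from the corpus.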
  • The word segmentation module 02 is further configured to:
  • match the character string to be processed in each segmented sentence, in one direction, against a predetermined word dictionary library (for example, a general word dictionary library or an extensible, learnable word dictionary library) to obtain a first matching result;
  • match the character string to be processed in each segmented sentence, in the opposite direction, against the predetermined word dictionary library to obtain a second matching result.
  • The first matching result includes a first number of first phrases and a third number of single words; the second matching result includes a second number of second phrases and a fourth number of single words.
  • If the first number is equal to the second number and the third number is less than or equal to the fourth number, the first matching result (including phrases and single words) corresponding to the segmented sentence is output;
  • if the first number is equal to the second number and the third number is greater than the fourth number, the second matching result (including phrases and single words) corresponding to the segmented sentence is output;
  • if the first number is not equal to the second number and the first number is greater than the second number, the second matching result (including phrases and single words) corresponding to the segmented sentence is output;
  • if the first number is not equal to the second number and the first number is less than the second number, the first matching result (including phrases and single words) corresponding to the segmented sentence is output.
  • The two-way matching method is adopted to perform word segmentation on each segmented sentence. Matching forward and in reverse at the same time makes it possible to analyze the cohesion of the combined content in the character string to be processed. Phrases are usually more likely than single words to carry the core viewpoint information, that is, the core viewpoint information is better expressed by phrases. Therefore, after matching in both directions, the matching result with fewer single words and more phrases is used as the word segmentation result of the segmented sentence, which improves the accuracy of word segmentation and in turn ensures the training effect and recognition accuracy of the language model.
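The two matching passes and the selection rule can be sketched with forward and reverse maximum matching, a common dictionary-based way to realize two-way matching. Treating the first matching result as the forward pass and the second as the reverse pass is an assumption, as are the function names and the `max_len` window. `choose_segmentation` follows the four listed cases literally, counting tokens longer than one character as phrases.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy longest-match segmentation scanning left to right."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, dictionary, max_len=4):
    """Greedy longest-match segmentation scanning right to left."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                j -= size
                break
    return tokens[::-1]


def count_tokens(tokens):
    """Return (number of phrases, number of single words)."""
    phrases = sum(1 for t in tokens if len(t) > 1)
    singles = sum(1 for t in tokens if len(t) == 1)
    return phrases, singles


def choose_segmentation(first, second):
    """Apply the four selection cases exactly as listed in the text."""
    p1, s1 = count_tokens(first)
    p2, s2 = count_tokens(second)
    if p1 == p2:
        return first if s1 <= s2 else second
    return second if p1 > p2 else first
```

Applied exactly as listed, the rule returns the result with fewer single words when the phrase counts are equal, and the result with fewer phrases when they differ.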
  • The present invention also provides a computer-readable storage medium storing a speech recognition system, the speech recognition system being executable by at least one processing device to cause the at least one processing device to perform the steps of the speech recognition method in the above embodiments. The specific implementation processes of steps S10, S20, and S30 of the speech recognition method are as described above and are not repeated here.
  • The methods of the foregoing embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and can also be implemented by hardware, but in many cases the former is the better implementation.
  • The technical solution of the present invention, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product that is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the methods described in the various embodiments of the present invention.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a speech recognition method and system, an electronic device, and a medium. The method comprises the following steps: obtaining information texts of specific types from predetermined data sources (S10); performing sentence segmentation on the obtained information texts to obtain a plurality of sentences, performing word segmentation on the sentences to obtain corresponding words, and forming first mapping corpora from the sentences and the corresponding words (S20); and, according to the first mapping corpora obtained, training a first language model of a preset type and performing speech recognition based on the trained first language model (S30). The solution effectively increases the accuracy of speech recognition and effectively reduces its cost.
PCT/CN2017/091353 2017-05-10 2017-06-30 Speech recognition method and system, electronic apparatus, and medium WO2018205389A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710327374.8 2017-05-10
CN201710327374.8A CN107204184B (zh) 2017-05-10 2017-05-10 Speech recognition method and system

Publications (1)

Publication Number Publication Date
WO2018205389A1 true WO2018205389A1 (fr) 2018-11-15

Family

ID=59905515

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/091353 WO2018205389A1 (fr) 2017-05-10 2017-06-30 Speech recognition method and system, electronic apparatus, and medium

Country Status (3)

Country Link
CN (1) CN107204184B (fr)
TW (1) TWI636452B (fr)
WO (1) WO2018205389A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12019976B1 (en) * 2022-12-13 2024-06-25 Calabrio, Inc. Call tagging using machine learning model

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
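CN108257593B (zh) * 2017-12-29 2020-11-13 深圳和而泰数据资源与云技术有限公司 Speech recognition method and apparatus, electronic device, and storage medium
CN108831442A (zh) * 2018-05-29 2018-11-16 平安科技(深圳)有限公司 Point-of-interest recognition method and apparatus, terminal device, and storage medium
CN110648657B (zh) * 2018-06-27 2024-02-02 北京搜狗科技发展有限公司 Language model training method, construction method, and apparatus
CN109033082B (zh) * 2018-07-19 2022-06-10 深圳创维数字技术有限公司 Learning and training method and apparatus for a semantic model, and computer-readable storage medium
CN109344221B (zh) * 2018-08-01 2021-11-23 创新先进技术有限公司 Recording text generation method, apparatus, and device
CN109582791B (zh) * 2018-11-13 2023-01-24 创新先进技术有限公司 Text risk identification method and apparatus
CN109377985B (zh) * 2018-11-27 2022-03-18 北京分音塔科技有限公司 Speech recognition enhancement method and apparatus for domain words
CN109582775B (zh) * 2018-12-04 2024-03-26 平安科技(深圳)有限公司 Information entry method and apparatus, computer device, and storage medium
CN109992769A (zh) * 2018-12-06 2019-07-09 平安科技(深圳)有限公司 Method and apparatus for judging sentence plausibility based on semantic parsing, and computer device
CN109461459A (zh) * 2018-12-07 2019-03-12 平安科技(深圳)有限公司 Speech scoring method and apparatus, computer device, and storage medium
CN109558596A (zh) * 2018-12-14 2019-04-02 平安城市建设科技(深圳)有限公司 Recognition method and apparatus, terminal, and computer-readable storage medium
CN109783648B (zh) * 2018-12-28 2020-12-29 北京声智科技有限公司 Method for improving an ASR language model using ASR recognition results
CN109815991B (zh) * 2018-12-29 2021-02-19 北京城市网邻信息技术有限公司 Training method and apparatus for a machine learning model, electronic device, and storage medium
CN110223674B (zh) * 2019-04-19 2023-05-26 平安科技(深圳)有限公司 Speech corpus training method and apparatus, computer device, and storage medium
CN110222182B (zh) * 2019-06-06 2022-12-27 腾讯科技(深圳)有限公司 Sentence classification method and related device
CN110349568B (zh) * 2019-06-06 2024-05-31 平安科技(深圳)有限公司 Speech retrieval method and apparatus, computer device, and storage medium
CN110288980A (zh) * 2019-06-17 2019-09-27 平安科技(深圳)有限公司 Speech recognition method, model training method, apparatus, device, and storage medium
CN110784603A (zh) * 2019-10-18 2020-02-11 深圳供电局有限公司 Intelligent speech analysis method and system for offline quality inspection
CN113055017A (zh) * 2019-12-28 2021-06-29 华为技术有限公司 Data compression method and computing device
CN111326160A (zh) * 2020-03-11 2020-06-23 南京奥拓电子科技有限公司 Speech recognition method and system for correcting noisy text, and storage medium
CN112712794A (zh) * 2020-12-25 2021-04-27 苏州思必驰信息科技有限公司 Joint system and apparatus for speech recognition, annotation, and training
CN113127621B (zh) * 2021-04-28 2024-10-18 平安国际智慧城市科技股份有限公司 Push method and apparatus for a dialogue module, device, and storage medium
CN113658585B (zh) * 2021-08-13 2024-04-09 北京百度网讯科技有限公司 Training method for a speech interaction model, and speech interaction method and apparatus
CN113948065B (zh) * 2021-09-01 2022-07-08 北京数美时代科技有限公司 Method and system for screening erroneously intercepted words based on an n-gram model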

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
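CN101593518A (zh) * 2008-05-28 2009-12-02 中国科学院自动化研究所 Method for balancing real-scenario corpora and finite-state-network corpora
CN102495837A (zh) * 2011-11-01 2012-06-13 中国科学院计算技术研究所 Training method and system for a digital information recommendation prediction model
CN103577386A (zh) * 2012-08-06 2014-02-12 腾讯科技(深圳)有限公司 Method and apparatus for dynamically loading a language model based on the user input scenario
CN103971677A (zh) * 2013-02-01 2014-08-06 腾讯科技(深圳)有限公司 Acoustic language model training method and apparatus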

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100511248B1 (ko) * 2003-06-13 2005-08-31 홍광석 Amplitude modification method for intra-speaker normalization in speech recognition

Also Published As

Publication number Publication date
CN107204184B (zh) 2018-08-03
TW201901661A (zh) 2019-01-01
CN107204184A (zh) 2017-09-26
TWI636452B (zh) 2018-09-21

Similar Documents

Publication Publication Date Title
WO2018205389A1 (fr) Speech recognition method and system, electronic apparatus, and medium
US11521603B2 (en) Automatically generating conference minutes
US11693894B2 (en) Conversation oriented machine-user interaction
US9910886B2 (en) Visual representation of question quality
AU2017408800B2 (en) Method and system of mining information, electronic device and readable storable medium
CN110457672B (zh) Keyword determination method and apparatus, electronic device, and storage medium
WO2014117553A1 (fr) Method and system of adding punctuation and establishing language model
US9811517B2 (en) Method and system of adding punctuation and establishing language model using a punctuation weighting applied to chinese speech recognized text
WO2009026850A1 (fr) Domain dictionary creation
CN111209363B (zh) Corpus data processing method and apparatus, server, and storage medium
CN112347241A (zh) Summary extraction method, apparatus, device, and storage medium
WO2014114175A1 (fr) Method and apparatus for providing search engine tags
US20220365956A1 (en) Method and apparatus for generating patent summary information, and electronic device and medium
US12099539B2 (en) Embedding performance optimization through use of a summary model
JP7309811B2 (ja) Data annotation method and apparatus, electronic device, and storage medium
CN113779990B (zh) Chinese word segmentation method, apparatus, device, and storage medium
CN113254578B (zh) Method, apparatus, device, medium, and product for data clustering
US20210209166A1 (en) Relationship network generation method and device, electronic apparatus, and storage medium
CN113158693A (zh) Method and apparatus for generating Uyghur keywords based on Chinese keywords, electronic device, and storage medium
CN110704623A (zh) Method, apparatus, system, and storage medium for improving entity recognition rate based on the Rasa_Nlu framework
US11989500B2 (en) Framework agnostic summarization of multi-channel communication
CN108932326B (zh) Instance extension method, apparatus, device, and medium
JP7553314B2 (ja) Estimation device, estimation method, and program
CN115828925A (zh) Text selection method and apparatus, electronic device, and readable storage medium
CN114490976A (zh) Method, apparatus, and device for generating dialogue summary training data, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17909445

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17909445

Country of ref document: EP

Kind code of ref document: A1