CN114743551A - Method, system, device and medium for recognizing domain words in speech - Google Patents

Method, system, device and medium for recognizing domain words in speech

Info

Publication number
CN114743551A
Authority
CN
China
Prior art keywords
word
probability
domain
data
original
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210278367.4A
Other languages
Chinese (zh)
Inventor
陈文浩
罗超
邹宇
郝竹林
张启祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ctrip Travel Information Technology Shanghai Co Ltd
Original Assignee
Ctrip Travel Information Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ctrip Travel Information Technology Shanghai Co Ltd filed Critical Ctrip Travel Information Technology Shanghai Co Ltd
Priority to CN202210278367.4A
Publication of CN114743551A
Legal status: Pending

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06F ELECTRIC DIGITAL DATA PROCESSING
          • G06F 40/00 Handling natural language data
            • G06F 40/20 Natural language analysis
              • G06F 40/279 Recognition of textual entities
                • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
      • G10 MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
          • G10L 15/00 Speech recognition
            • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
            • G10L 15/08 Speech classification or search
              • G10L 15/16 Speech classification or search using artificial neural networks
              • G10L 2015/088 Word spotting
            • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
              • G10L 2015/223 Execution procedure of a spoken command
            • G10L 15/26 Speech to text systems

Abstract

The invention discloses a method, a system, a device and a medium for recognizing domain words in speech. The method comprises the following steps: converting original voice data into original text data; preprocessing the original text data to generate candidate new word data; calculating degree of freedom information of the candidate new word data and determining a first domain word probability; generating a fusion feature vector from the acoustic features corresponding to the original voice data and the vector features corresponding to the original text data, inputting the fusion feature vector into a sequence prediction model, and outputting a second domain word probability; and determining the domain word probability value corresponding to the original voice data from the first domain word probability and the second domain word probability. Because the probability value is determined jointly by the first domain word probability obtained from the degree of freedom information and the second domain word probability obtained from the sequence prediction model, the accuracy of domain word recognition and the precision of prediction are improved.

Description

Method, system, device and medium for recognizing domain words in speech
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a method, a system, a device, and a medium for recognizing a field word in speech.
Background
In recent years, with the continuous development of speech recognition technology, speech recognition is being applied in more and more scenarios. However, in many speech recognition scenarios the original corpus is insufficient, so numerous domain words are missing from the vocabulary and cannot be recognized.
In the prior art, speech is generally transcribed into text in advance and domain word judgment is then performed on the text. However, transcription errors introduced while converting voice into text accumulate into the subsequent domain word judgment.
Therefore, as application scenarios increase, users place ever higher requirements on the domain word recognition accuracy of speech recognition systems in different domains.
Disclosure of Invention
The invention aims to overcome the low domain word recognition accuracy of the prior art, and provides a method, a system, a device and a medium for recognizing domain words in speech.
The invention solves the technical problems through the following technical scheme:
in a first aspect, the present invention provides a method for recognizing a domain word in a speech, where the method includes:
converting original voice data into original text data;
preprocessing the original text data to generate candidate new word data;
calculating the degree of freedom information of the candidate new word data, and determining the probability of the first domain word;
generating a fusion feature vector according to the acoustic features corresponding to the original voice data and the vector features corresponding to the original text data, inputting the fusion feature vector into a sequence prediction model, and outputting a second domain word probability;
determining a probability value of a domain word corresponding to the original voice data based on the first domain word probability and the second domain word probability;
the sequence prediction model is obtained by training based on a recurrent neural network according to sample voice data to be trained and labeled sample voice data, and each frame of the labeled sample voice data is provided with a domain word label or a non-domain word label.
Preferably, the step of calculating the degree of freedom information of the candidate new word data and determining the probability of the first domain word includes:
determining a degree of freedom measurement index of the candidate new word data;
calculating the cohesion degree index of the candidate new word data;
and calculating the first domain word probability based on the degree of freedom measurement index and the cohesion degree index.
Preferably, the step of preprocessing the original text data to generate candidate new word data includes:
performing word segmentation processing on the original text data to generate a plurality of original words;
and removing words of a preset word frequency and modal words from the original words to generate the candidate new word data.
Preferably, the step of determining the probability value of the domain word corresponding to the original voice data based on the first domain word probability and the second domain word probability includes:
calculating the probability value of the domain word according to a weighted result or a summation result of the first domain word probability and the second domain word probability.
In a second aspect, the present invention provides a system for recognizing a domain word in speech, the system comprising:
the conversion module is used for converting the original voice data into original text data;
the preprocessing module is used for preprocessing the original text data to generate candidate new word data;
the calculation module is used for calculating the degree of freedom information of the candidate new word data and determining the first domain word probability;
the model prediction module is used for generating a fusion feature vector according to the acoustic feature corresponding to the original voice data and the vector feature corresponding to the original text data, inputting the fusion feature vector into a sequence prediction model and outputting the probability of a second domain word;
a determining module, configured to determine a probability value of a domain word corresponding to the original voice data based on the first domain word probability and the second domain word probability;
the sequence prediction model is obtained by training based on a recurrent neural network according to sample voice data to be trained and labeled sample voice data, and each frame of the labeled sample voice data is provided with a domain word label or a non-domain word label.
Preferably, the calculation module includes:
the determining unit is used for determining the degree of freedom measurement index of the candidate new word data;
the first calculation unit is used for calculating the cohesion degree index of the candidate new word data;
and the second calculating unit is used for calculating the first domain word probability based on the degree of freedom measurement index and the cohesion degree index.
Preferably, the preprocessing module includes:
the first processing unit is used for carrying out word segmentation processing on the original text data to generate a plurality of original words;
and the second processing unit is used for removing words of a preset word frequency and modal words from the original words to generate the candidate new word data.
Preferably, the determining module includes:
and the third calculating unit is used for calculating the probability value of the domain word according to a weighted result or a summation result of the first domain word probability and the second domain word probability.
In a third aspect, the present invention provides an electronic device, including a processor, a memory, and a computer program stored on the memory and executable on the processor, where the computer program, when executed by the processor, implements the method for recognizing the domain words in the speech according to the first aspect.
In a fourth aspect, the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for recognizing a domain word in speech according to the first aspect.
The positive effects of the invention are as follows: the recognition method determines the domain word probability value jointly from the first domain word probability obtained by calculating the degree of freedom information and the second domain word probability obtained from the sequence prediction model, which improves the accuracy of domain word recognition and the precision of prediction; acoustic features are extracted from the original voice data, vector features are extracted from the original text data, the two are fused, and the second domain word probability is then calculated with the sequence prediction model, which avoids mining a large number of non-domain words and thereby further improves prediction accuracy.
Drawings
Fig. 1 is a flowchart of a method for recognizing a domain word in speech according to embodiment 1 of the present invention.
Fig. 2 is a flowchart of step S12 of the method for recognizing a domain word in speech according to embodiment 1 of the present invention.
Fig. 3 is a flowchart of step S13 of the method for recognizing a domain word in speech according to embodiment 1 of the present invention.
Fig. 4 is a flowchart of domain word mining of a multi-stage policy mechanism of a method for recognizing a domain word in speech according to embodiment 1 of the present invention.
Fig. 5 is a block diagram of a system for recognizing a domain word in speech according to embodiment 2 of the present invention.
Fig. 6 is a schematic diagram of a hardware structure of an electronic device according to embodiment 3 of the present invention.
Detailed Description
The invention is further illustrated by the following examples, which are not intended to limit the scope of the invention.
Example 1
In this embodiment, a method for recognizing a domain word in speech is provided, as shown in fig. 1, the method includes:
s11, converting the original voice data into original text data.
And S12, preprocessing the original text data to generate candidate new word data.
And S13, calculating the degree of freedom information of the candidate new word data, and determining the probability of the first field word.
And S14, generating a fusion feature vector according to the acoustic feature corresponding to the original voice data and the vector feature corresponding to the original text data, inputting the fusion feature vector into the sequence prediction model, and outputting the probability of the second domain word.
And S15, determining the probability value of the domain word corresponding to the original voice data based on the first domain word probability and the second domain word probability.
The sequence prediction model is trained, based on a recurrent neural network, on sample voice data to be trained and labeled sample voice data, and each frame of the labeled sample voice data carries a domain word label or a non-domain word label.
In step S11, the original voice data, which serves as the source speech of the speech recognition sample, can be captured from the network, intercepted from a voice call, or collected from a recording device. The original voice data may include speech segments in one language or in multiple languages; for example, it may be a purely Chinese utterance ("I love singing") or a mixed English and Chinese utterance ("I love to sing"). The original voice data is converted into original text data by an existing speech recognition technique, which reduces errors introduced by later processing and improves the efficiency and accuracy of domain word recognition in speech.
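As a concrete illustration of step S11, the following minimal sketch transcribes an audio file into text; the SpeechRecognition package, the Google Web Speech backend, the file name and the language code are illustrative assumptions and are not prescribed by this embodiment.

```python
# Minimal sketch of step S11: transcribing original voice data into original text data.
# The SpeechRecognition package and its Google Web Speech backend are illustrative choices;
# any existing ASR system could be substituted.
import speech_recognition as sr

def transcribe(wav_path: str, language: str = "zh-CN") -> str:
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)   # read the whole audio file
    return recognizer.recognize_google(audio, language=language)

original_text = transcribe("call_segment.wav")   # hypothetical file name
```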
In step S12, the original text data is cleaned: special characters are filtered out and modal particles are removed; the result is compared against the original word list formed by the domain words that have already been mined and the difference is taken, and the remaining words are used as candidate new word data.
In step S13, the number of candidate words in the candidate new word data is counted, the left information entropy and the right information entropy of each candidate word are computed in turn, and a weighted sum of the two entropies is calculated for each candidate word. Whether a candidate is a complete word is then judged from this weighted sum, and candidates that meet the word-forming requirement are retained as effective candidate words. The degree of freedom information is further calculated from the left and right information entropies of the effective candidate words, and from it the first domain word probability of each effective candidate word in the original text data is determined.
In step S14, acoustic features are extracted from the original speech data, vector features are extracted from the original text data corresponding to the original speech data, and the acoustic features and the vector features are fused and input into a sequence prediction model, for example, the sequence prediction model is BI-LSTM + CRF.
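The fused-feature sequence prediction of step S14 could be sketched roughly as follows, assuming PyTorch and the third-party pytorch-crf package; the feature dimensions, the frame-level alignment of acoustic and text features, and the two-tag labeling scheme are illustrative assumptions rather than details fixed by this embodiment.

```python
# Sketch of step S14: fuse acoustic and text features, then tag each frame with a
# domain-word / non-domain-word label using a BiLSTM + CRF.
import torch
import torch.nn as nn
from torchcrf import CRF   # pip install pytorch-crf

class FusionBiLSTMCRF(nn.Module):
    def __init__(self, acoustic_dim=80, text_dim=128, hidden=256, num_tags=2):
        super().__init__()
        # num_tags=2: domain-word label vs. non-domain-word label
        self.lstm = nn.LSTM(acoustic_dim + text_dim, hidden,
                            batch_first=True, bidirectional=True)
        self.emit = nn.Linear(2 * hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, acoustic, text_vec, tags=None):
        # acoustic: (B, T, acoustic_dim), text_vec: (B, T, text_dim), aligned per frame
        fused = torch.cat([acoustic, text_vec], dim=-1)      # fusion feature vector
        emissions = self.emit(self.lstm(fused)[0])
        if tags is not None:                                  # training: negative log-likelihood
            return -self.crf(emissions, tags)
        return self.crf.decode(emissions)                     # inference: best tag sequence
```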
In step S15, in the process of calculating the second domain word probability, since the length of the word is not fixed, the calculated second domain word probability needs to be length-normalized.
[Equation image: length normalization of the second domain word probability, computed from the per-character probabilities P(y_i)]
where P(y_i) represents the probability value of the i-th character of the word, obtained from the output of the last layer of the network; the length-normalized probability value of the whole word is then derived from the P(y_i).
[Equation image: the normalization applied to a two-character example word, where y1 denotes "携" ("carry") and y2 denotes "程" ("trip"), i.e. the word "携程" (Ctrip)]
A weighted summation formula, shown below, is then set so as to flexibly control how the sequence prediction model and the degree of freedom information influence the domain word probability value.
L_final = β * P_cheng + (1 - β) * P
where P_cheng represents the first domain word probability, P represents the normalized second domain word probability, and L_final represents the domain word probability value corresponding to the original voice data; the value of β is adjusted so as to achieve a better prediction effect.
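The combination step can be sketched as follows. Because the exact length normalization formula is given only as an image, the geometric mean of the per-character probabilities P(y_i) is used here as an assumed stand-in, and the decision threshold is likewise a placeholder.

```python
import math

def length_normalize(char_probs):
    """Assumed length normalization: geometric mean of the per-character
    probabilities P(y_i) taken from the last layer of the sequence prediction model."""
    n = len(char_probs)
    return math.exp(sum(math.log(p) for p in char_probs) / n)

def domain_word_score(p_cheng, char_probs, beta=0.5):
    """L_final = beta * P_cheng + (1 - beta) * P, the weighted sum used in this embodiment."""
    p = length_normalize(char_probs)
    return beta * p_cheng + (1 - beta) * p

# Example: a two-character candidate with model probabilities 0.8 and 0.6
score = domain_word_score(p_cheng=0.7, char_probs=[0.8, 0.6], beta=0.6)
is_domain_word = score > 0.5   # the threshold is an assumption; the embodiment leaves it tunable
```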
In this embodiment, as shown in fig. 2, step S12 includes:
and S121, performing word segmentation processing on the original text data to generate a plurality of original words.
And S122, removing words of a preset word frequency and modal words from the original words to generate candidate new word data.
In steps S121-S122, a word segmenter may be employed to segment the original text data. For example, "你好携程旅行网" ("hello, Ctrip travel network") can be segmented into the words "你好" ("hello"), "携程" ("Ctrip"), "旅行" ("travel") and "网" ("net"). N-gram word frequency statistics are then performed and words of the preset word frequency are deleted, for example 2-gram and 3-gram modal words are removed, or words longer than 7 characters are removed; finally, after comparison with the original word list, old words are removed, and the candidate new word data is thereby determined.
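A rough sketch of steps S121-S122 follows, assuming the jieba segmenter; the modal word list, the original word list, the frequency threshold and the length limit are placeholders for illustration only.

```python
# Sketch of steps S121-S122: segmentation, frequency statistics and cleaning.
from collections import Counter
import jieba

MODAL_WORDS = {"吗", "呢", "吧", "啊"}      # illustrative modal particles
ORIGINAL_WORDLIST = {"你好", "旅行"}         # previously mined domain words (placeholder)

def candidate_new_words(texts, min_freq=2, max_len=7):
    counts = Counter(w for t in texts for w in jieba.lcut(t))   # word segmentation + frequency
    return {
        w for w, c in counts.items()
        if c >= min_freq                    # keep words meeting the frequency requirement
        and len(w) <= max_len               # drop over-long segments
        and w not in MODAL_WORDS            # drop modal particles
        and w not in ORIGINAL_WORDLIST      # drop old words already in the word list
    }
```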
In this embodiment, as shown in fig. 3, step S13 includes:
s131, determining the degree of freedom measurement index of the candidate new word data.
And S132, calculating the cohesion degree index of the candidate new word data.
And S133, calculating the first domain word probability based on the degree of freedom measurement index and the cohesion degree index.
In step S131, if the text segment of the current candidate new word data is x, its left-adjacent word is yl, and its right-adjacent word is yr, the information entropy is calculated as follows:
E(x) = -Σy P(y|x) logP(y|x), computed separately over the left-adjacent words yl (left information entropy LE) and the right-adjacent words yr (right information entropy RE)
Here P(y|x) represents the probability that a word y appears adjacent to the text segment x. For example, for "你好 携程 旅行 网" ("hello", "Ctrip", "travel", "net"), if "携程" is selected as x, then "你好" is the left-adjacent word yl and "旅行" is the right-adjacent word yr; substituting the corresponding P(y|x) into the formula above in turn gives the left information entropy and the right information entropy. For instance, the left information entropy can be written as LE = -Σyl P(yl | "携程") logP(yl | "携程"), summed over left-adjacent words such as "你好".
In the process of further measuring the degree of freedom, the sizes of left and right information entropies (LE, RE) of the text segment x are comprehensively considered. According to the absolute value (| LE-RE |) of the difference between LE and RE, the calculation formula of the degree of freedom measurement index is obtained as follows:
[Equation image: degree of freedom measurement index L(w), computed from LE, RE and |LE - RE|]
that is, LE represents the left information entropy, RE represents the right information entropy, and w represents the current word.
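The left and right information entropies can be estimated from corpus counts as sketched below. Since the exact degree of freedom formula is available only as an image, the final combination of LE and RE is replaced here by an assumed placeholder (the minimum of the two).

```python
import math
from collections import Counter

def neighbour_entropies(candidate, texts):
    """Estimate the left and right information entropies of `candidate` from raw texts."""
    left, right = Counter(), Counter()
    for t in texts:
        start = t.find(candidate)
        while start != -1:
            if start > 0:
                left[t[start - 1]] += 1          # left-adjacent character
            end = start + len(candidate)
            if end < len(t):
                right[t[end]] += 1               # right-adjacent character
            start = t.find(candidate, start + 1)

    def entropy(counter):
        total = sum(counter.values())
        if total == 0:
            return 0.0
        return -sum((c / total) * math.log(c / total) for c in counter.values())

    return entropy(left), entropy(right)

def freedom_index(le, re):
    # Placeholder for the image-only formula L(w); min(LE, RE) is an assumed stand-in
    # that also implicitly penalizes a large |LE - RE| gap.
    return min(le, re)
```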
In step S132, in order to further quantify the cohesion degree PMI of a character combination, the probability of occurrence of the character combination is divided by the product of the probabilities of occurrence of its components, and an improved version of the pointwise mutual information formula is used as the index for measuring the cohesion of the character combination. The improved cohesion degree index is calculated as follows:
[Equation image: improved cohesion degree index PMI, based on the word probability P(w) divided by the product P(c1)P(c2)……P(cn), with word length n and adjustable weight k]
that is, P(w) represents the probability of occurrence of the current word, n represents the length of the word, k represents an adjustable weight, and P(c1), P(c2), ……, P(cn) represent the probabilities of the characters that make up the word. For example, for "携程" ("Ctrip"), P(c1) corresponds to "携", P(c2) corresponds to "程", and n is 2; for "旅行网" ("travel network"), P(c1) corresponds to "旅", P(c2) corresponds to "行", P(c3) corresponds to "网", and n is 3.
In step S133, based on the degree of freedom measurement index and the cohesion degree index calculated in the above steps, whether a word is a domain word is determined by using a weighted sum formula; the formula for the first domain word probability P_cheng is as follows, where λ lies in (0, 1).
P_cheng = λ * L(w) + (1 - λ) * PMI
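A sketch of the cohesion measure and the first domain word probability follows; the plain pointwise mutual information is used because the improved variant with the weight k is given only as an image, and the degree of freedom term L(w) reuses the same assumed placeholder as above, with λ a tunable weight in (0, 1).

```python
import math

def cohesion_pmi(word, char_prob, word_prob):
    """Base pointwise mutual information of a character combination:
    log( P(w) / (P(c1) * P(c2) * ... * P(cn)) ).
    The improved variant with length n and weight k is shown only as an image,
    so this simplification is an assumption."""
    denom = math.prod(char_prob[c] for c in word)
    return math.log(word_prob[word] / denom)

def first_domain_word_probability(word, char_prob, word_prob, le, re, lam=0.5):
    """P_cheng = lambda * L(w) + (1 - lambda) * PMI, per the weighted sum above."""
    l_w = min(le, re)   # assumed placeholder for the degree of freedom index L(w)
    return lam * l_w + (1 - lam) * cohesion_pmi(word, char_prob, word_prob)
```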
In one implementation, step S15 specifically includes: calculating the probability value of the domain word according to a weighted result or a summation result of the first domain word probability and the second domain word probability.
The first domain word probability P_cheng and the second domain word probability P corresponding to the original voice data are combined, that is, the result calculated from the degree of freedom information and the prediction result of the model are weighted and summed; the ratio between the two is adjusted through β to obtain the output result, and finally whether a candidate is a real domain word is judged against a suitable threshold.
As shown in fig. 4, the original voice data is transcribed into original text data; the policy mechanism of the first-layer domain word judgment criterion uses the improved degree of freedom information to make a first judgment on the original text data; the policy mechanism of the second-layer domain word judgment criterion uses the sequence prediction model to judge the original text data; the first and second judgment results are then combined, which constitutes the policy mechanism of the third-layer domain word judgment criterion. This three-level strategy is markedly more accurate than any single strategy, avoids mining a large number of non-domain words, and greatly reduces the cost of manual verification.
This embodiment provides a method for mining domain words for speech recognition, in which the first domain word probability obtained from the degree of freedom information and the second domain word probability obtained from the sequence prediction model jointly determine the domain word probability value, which improves the accuracy of domain word recognition and the precision of domain word prediction. By extracting acoustic features from the original voice data and vector features from the original text data, fusing the two, and then calculating the second domain word probability with the sequence prediction model, mining a large number of non-domain words is avoided and prediction accuracy is further improved.
example 2
As shown in fig. 5, the present embodiment provides a system for recognizing a domain word in speech, including: a conversion module 110, a pre-processing module 120, a calculation module 130, a model prediction module 140, and a determination module 150.
The conversion module 110 converts the original voice data into original text data.
And the preprocessing module 120 is configured to preprocess the original text data to generate candidate new word data.
And the calculating module 130 is configured to calculate the degree of freedom information of the candidate new word data, and determine the probability of the first domain word.
And the model prediction module 140 is configured to generate a fusion feature vector according to the acoustic feature corresponding to the original speech data and the vector feature corresponding to the original text data, input the fusion feature vector into the sequence prediction model, and output the second domain word probability.
The determining module 150 is configured to determine a probability value of a domain word corresponding to the original voice data based on the first domain word probability and the second domain word probability.
The sequence prediction model is obtained by training based on a recurrent neural network according to sample voice data to be trained and labeled sample voice data, and each frame of the labeled sample voice data is provided with a domain word label or a non-domain word label.
The original voice data, which serves as the source speech of the speech recognition sample, can be captured from the network, intercepted from a voice call, or collected from a recording device. The original voice data may include speech segments in one language or in multiple languages; for example, it may be a purely Chinese utterance ("I love singing") or a mixed English and Chinese utterance ("I love to sing"). The conversion module 110 converts the original voice data into the original text data, which reduces errors introduced by later processing and improves the efficiency and accuracy of domain word recognition in speech.
The preprocessing module 120 cleans the original text data: special characters are filtered out and modal particles are removed; the result is compared against the original word list formed by the domain words that have already been mined and the difference is taken, and the remaining words are used as candidate new word data.
The calculating module 130 counts the number of candidate words in the candidate new word data, computes the left information entropy and the right information entropy of each candidate word in turn, and then calculates a weighted sum of the two entropies for each candidate word. Whether a candidate is a complete word is judged from this weighted sum, and candidates that meet the word-forming requirement are retained as effective candidate words. The degree of freedom information is further calculated from the left and right information entropies of the effective candidate words, and from it the first domain word probability of each effective candidate word in the original text data is determined.
Acoustic features are extracted from original speech data, vector features are extracted from original text data corresponding to the original speech data, and the acoustic features and the vector features are fused by the model prediction module 140 and input into a sequence prediction model, for example, the sequence prediction model is BI-LSTM + CRF.
In the process of calculating the second domain word probability, since the length of the word is not fixed, the calculated second domain word probability needs to be length-normalized.
[Equation image: length normalization of the second domain word probability, computed from the per-character probabilities P(y_i)]
where P(y_i) represents the probability value of the i-th character of the word, obtained from the output of the last layer of the network; the length-normalized probability value of the whole word is then derived from the P(y_i).
[Equation image: the normalization applied to a two-character example word, where y1 denotes "携" ("carry") and y2 denotes "程" ("trip"), i.e. the word "携程" (Ctrip)]
The determination module 150 uses the weighted summation formula shown below to flexibly control how the sequence prediction model and the degree of freedom information influence the domain word probability value.
L_final = β * P_cheng + (1 - β) * P
where P_cheng represents the first domain word probability, P represents the normalized second domain word probability, and L_final represents the domain word probability value corresponding to the original voice data; the value of β is adjusted so as to achieve a better prediction effect.
As shown in fig. 5, in the present embodiment, the preprocessing module 120 includes:
the first processing unit 121 is configured to perform word segmentation on the original text data to generate a plurality of original words.
The second processing unit 122 is configured to remove words of a preset word frequency and modal words from the original words, generating the candidate new word data.
The first processing unit 121 may use a word segmenter to segment the original text data. For example, "你好携程旅行网" ("hello, Ctrip travel network") can be segmented into the words "你好" ("hello"), "携程" ("Ctrip"), "旅行" ("travel") and "网" ("net"). N-gram word frequency statistics are then performed, and the second processing unit 122 deletes words of the preset word frequency, for example removing 2-gram and 3-gram modal words, or removing words longer than 7 characters; finally, after comparison with the original word list, old words are removed, and the candidate new word data is thereby determined.
As shown in fig. 5, in this embodiment, the calculation module 130 includes:
the determining unit 131 is configured to determine a degree of freedom measure of the candidate new word data.
The first calculating unit 132 is configured to calculate the cohesion degree index of the candidate new word data.
The second calculating unit 133 is configured to calculate the first domain word probability based on the degree of freedom measurement index and the cohesion degree index.
If the text segment of the current candidate new word data is x, its left-adjacent word is yl, and its right-adjacent word is yr, the information entropy is calculated as follows:
E(x) = -Σy P(y|x) logP(y|x), computed separately over the left-adjacent words yl (left information entropy LE) and the right-adjacent words yr (right information entropy RE)
Here P(y|x) represents the probability that a word y appears adjacent to the text segment x. For example, for "你好 携程 旅行 网" ("hello", "Ctrip", "travel", "net"), if "携程" is selected as x, then "你好" is the left-adjacent word yl and "旅行" is the right-adjacent word yr; substituting the corresponding P(y|x) into the formula above in turn gives the left information entropy and the right information entropy. For instance, the left information entropy can be written as LE = -Σyl P(yl | "携程") logP(yl | "携程"), summed over left-adjacent words such as "你好".
In the process of further measuring the degree of freedom, the sizes of left and right information entropies (LE, RE) of the text fragment x are comprehensively considered. The determining unit 131 obtains the calculation formula of the degree of freedom metric according to the absolute value (| LE-RE |) of the difference between LE and RE as follows:
[Equation image: degree of freedom measurement index L(w), computed from LE, RE and |LE - RE|]
that is, LE represents the left information entropy, RE represents the right information entropy, and w represents the current word.
In order to further quantify the cohesion degree PMI of a character combination, the first calculating unit 132 divides the probability of occurrence of the character combination by the product of the probabilities of occurrence of its components, and uses an improved version of the pointwise mutual information formula as the index for measuring the cohesion of the character combination. The improved cohesion degree index is calculated as follows:
[Equation image: improved cohesion degree index PMI, based on the word probability P(w) divided by the product P(c1)P(c2)……P(cn), with word length n and adjustable weight k]
that is, P(w) represents the probability of occurrence of the current word, n represents the length of the word, k represents an adjustable weight, and P(c1), P(c2), ……, P(cn) represent the probabilities of the characters that make up the word. For example, for "携程" ("Ctrip"), P(c1) corresponds to "携", P(c2) corresponds to "程", and n is 2; for "旅行网" ("travel network"), P(c1) corresponds to "旅", P(c2) corresponds to "行", P(c3) corresponds to "网", and n is 3.
Based on the degree of freedom measurement index and the cohesion degree index calculated above, the second calculating unit 133 determines whether a word is a domain word by using a weighted sum formula; the formula for the first domain word probability P_cheng is as follows, where λ lies in (0, 1).
P_cheng = λ * L(w) + (1 - λ) * PMI
As shown in fig. 5, in this embodiment, the determining module 150 includes:
and a third calculating unit 151, configured to calculate a probability value of the domain word according to a weighted result or a summed result of the first domain word probability and the second domain word probability.
The first domain word probability P_cheng and the second domain word probability P corresponding to the original voice data are combined, that is, the result calculated from the degree of freedom information and the prediction result of the model are weighted and summed; the ratio between the two is adjusted through β to obtain the output result, and finally whether a candidate is a real domain word is judged against a suitable threshold.
This embodiment provides a system for recognizing domain words in speech, in which the first domain word probability obtained by calculating the degree of freedom information and the second domain word probability calculated by the sequence prediction model jointly determine the domain word probability value, which improves the accuracy of domain word recognition and the precision of domain word prediction. Acoustic features are extracted from the original voice data, vector features are extracted from the original text data, the two are fused, and the second domain word probability is then calculated with the sequence prediction model, which avoids mining a large number of non-domain words and thereby improves prediction accuracy.
example 3
Fig. 6 is a schematic structural diagram of the electronic device provided in this embodiment. The electronic device includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the method for recognizing domain words in speech of embodiment 1 is implemented. The electronic device 60 shown in fig. 6 is only an example and should not impose any limitation on the functions or scope of the embodiments of the present invention.
The electronic device 60 may be embodied in the form of a general purpose computing device, which may be, for example, a server device. The components of the electronic device 60 may include, but are not limited to: the at least one processor 61, the at least one memory 62, and a bus 63 connecting the various system components (including the memory 62 and the processor 61).
The bus 63 includes a data bus, an address bus, and a control bus.
The memory 62 may include volatile memory, such as Random Access Memory (RAM) 621 and/or cache memory 622, and may further include Read Only Memory (ROM) 623.
The memory 62 may also include a program/utility 625 having a set (at least one) of program modules 624, such program modules 624 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
The processor 61 executes various functional applications and data processing, such as a recognition method of a domain word in speech according to embodiment 1 of the present invention, by executing the computer program stored in the memory 62.
The electronic device 60 may also communicate with one or more external devices 64 (e.g., a keyboard, a pointing device, etc.). Such communication may occur through an input/output (I/O) interface 65. The electronic device 60 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via a network adapter 66. As shown, the network adapter 66 communicates with the other modules of the electronic device 60 via the bus 63. It should be understood that, although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 60, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID (disk array) systems, tape drives, data backup storage systems, and the like.
It should be noted that although in the above detailed description several units/modules or sub-units/modules of the electronic device are mentioned, such a division is merely exemplary and not mandatory. Indeed, the features and functionality of two or more of the units/modules described above may be embodied in one unit/module according to embodiments of the invention. Conversely, the features and functions of one unit/module described above may be further divided into embodiments by a plurality of units/modules.
Example 4
The present embodiment provides a computer-readable storage medium on which a computer program is stored, which when executed by a processor, implements the steps of the recognition method of a domain word in speech of embodiment 1.
More specific examples of the readable storage medium may include, but are not limited to: a portable disk, a hard disk, a random access memory, a read-only memory, an erasable programmable read-only memory, an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In a possible implementation, the present invention can also be implemented as a program product including program code which, when the program product is run on a terminal device, causes the terminal device to execute the steps of the method for recognizing domain words in speech of embodiment 1.
The program code for carrying out the invention may be written in any combination of one or more programming languages, and may be executed entirely on the user device, partly on the user device, as a stand-alone software package, partly on the user device and partly on a remote device, or entirely on the remote device.
While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that this is by way of example only, and that the scope of the invention is defined by the appended claims. Various changes and modifications to these embodiments may be made by those skilled in the art without departing from the spirit and scope of the invention, and these changes and modifications are within the scope of the invention.

Claims (10)

1. A method for recognizing a domain word in speech, the method comprising:
converting original voice data into original text data;
preprocessing the original text data to generate candidate new word data;
calculating the degree of freedom information of the candidate new word data, and determining the probability of the first domain word;
generating a fusion feature vector according to the acoustic features corresponding to the original voice data and the vector features corresponding to the original text data, inputting the fusion feature vector into a sequence prediction model, and outputting a second domain word probability;
determining a probability value of a domain word corresponding to the original voice data based on the first domain word probability and the second domain word probability;
the sequence prediction model is obtained by training based on a recurrent neural network according to sample voice data to be trained and labeled sample voice data, and each frame of the labeled sample voice data is provided with a domain word label or a non-domain word label.
2. The method for recognizing a domain word in speech according to claim 1, wherein the step of calculating the degree of freedom information of the candidate new word data and determining the probability of the first domain word comprises:
determining a degree of freedom measurement index of the candidate new word data;
calculating the cohesion degree index of the candidate new word data;
and calculating the first domain word probability based on the degree of freedom measurement index and the cohesion degree index.
3. The method for recognizing the domain word in the speech according to claim 1, wherein the step of preprocessing the original text data to generate the candidate new word data comprises:
performing word segmentation processing on the original text data to generate a plurality of original words;
and removing words of a preset word frequency and modal words from the original words to generate the candidate new word data.
4. The method for recognizing a domain word in speech according to claim 1, wherein the step of determining the probability value of the domain word corresponding to the original speech data based on the first domain word probability and the second domain word probability comprises:
and calculating the probability value of the domain word according to a weighted result or a summation result of the first domain word probability and the second domain word probability.
5. A system for recognizing a domain word in speech, the system comprising:
the conversion module is used for converting the original voice data into original text data;
the preprocessing module is used for preprocessing the original text data to generate candidate new word data;
the calculation module is used for calculating the freedom degree information of the candidate new word data and determining the probability of the first field word;
the model prediction module is used for generating a fusion feature vector according to the acoustic feature corresponding to the original voice data and the vector feature corresponding to the original text data, inputting the fusion feature vector into a sequence prediction model and outputting the probability of a second domain word;
a determining module, configured to determine a probability value of a domain word corresponding to the original voice data based on the first domain word probability and the second domain word probability;
the sequence prediction model is obtained by training based on a recurrent neural network according to sample voice data to be trained and labeled sample voice data, and each frame of the labeled sample voice data is provided with a domain word label or a non-domain word label.
6. The system for recognizing domain words in speech according to claim 5, wherein said computation module comprises:
the determining unit is used for determining the degree of freedom measurement index of the candidate new word data;
the first calculation unit is used for calculating the cohesion degree index of the candidate new word data;
and the second calculating unit is used for calculating the first domain word probability based on the degree of freedom measurement index and the cohesion degree index.
7. The system for recognizing domain words in speech according to claim 5, wherein said preprocessing module comprises:
the first processing unit is used for carrying out word segmentation processing on the original text data to generate a plurality of original words;
and the second processing unit is used for removing words of a preset word frequency and modal words from the original words to generate the candidate new word data.
8. The system for recognizing domain words in speech according to claim 5, wherein said determining module comprises:
and the third calculating unit is used for calculating the probability value of the domain word according to a weighted result or a summation result of the first domain word probability and the second domain word probability.
9. An electronic device comprising a processor, a memory, and a computer program stored on the memory and executable on the processor, the computer program being executed by the processor to perform the method for recognizing a domain word in speech according to any one of claims 1 to 4.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, implements a method for recognizing a domain word in speech according to any one of claims 1 to 4.
CN202210278367.4A 2022-03-17 2022-03-17 Method, system, device and medium for recognizing domain words in speech Pending CN114743551A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210278367.4A CN114743551A (en) 2022-03-17 2022-03-17 Method, system, device and medium for recognizing domain words in speech

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210278367.4A CN114743551A (en) 2022-03-17 2022-03-17 Method, system, device and medium for recognizing domain words in speech

Publications (1)

Publication Number Publication Date
CN114743551A true CN114743551A (en) 2022-07-12

Family

ID=82276799

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210278367.4A Pending CN114743551A (en) 2022-03-17 2022-03-17 Method, system, device and medium for recognizing domain words in speech

Country Status (1)

Country Link
CN (1) CN114743551A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination