CN111192572A - Semantic recognition method, device and system - Google Patents


Info

Publication number
CN111192572A
CN111192572A
Authority
CN
China
Prior art keywords
pinyin
semantic
voice
information
semantic recognition
Legal status
Pending
Application number
CN201911421165.5A
Other languages
Chinese (zh)
Inventor
蔡勇
Current Assignee
Zebra Network Technology Co Ltd
Original Assignee
Zebra Network Technology Co Ltd
Application filed by Zebra Network Technology Co Ltd


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1815 Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G10L15/1822 Parsing for meaning understanding

Abstract

The invention provides a semantic recognition method, device, and system. The method includes: acquiring voice information and extracting a voice state from the voice information; and inputting the voice state into a target semantic recognition model, where the target semantic recognition model obtains pinyin characteristics, or pinyin characteristics together with character characteristics, from the voice state so as to obtain the semantic information corresponding to the voice information. The method achieves domain-specific semantic extraction, improves semantic-understanding accuracy, reduces semantic-understanding errors caused by mis-recognized homophones, is broadly applicable, and suits command-style voice recognition scenarios such as automobiles and homes.

Description

Semantic recognition method, device and system
Technical Field
The invention relates to the technical field of computer natural language processing, in particular to a semantic recognition method, a semantic recognition device and a semantic recognition system.
Background
With the rapid development of ASR (automatic speech recognition), the semantic understanding technology based on the characters recognized by ASR has also been widely applied and developed.
Although ASR technology has matured, its recognition quality is not ideal in certain application domains. For example, in medicine, biology, and chemistry, ASR can produce output, but its recognition accuracy is low. Because requirements differ from field to field, separate development is needed for each one, development costs are high, and ASR performs poorly in professional domains.
Since semantic understanding relies on the words recognized by ASR, any deviation in those words severely affects semantic understanding.
Disclosure of Invention
The invention provides a semantic recognition method, device, and system that achieve domain-specific semantic recognition, improve recognition accuracy, reduce semantic-understanding errors caused by ASR homophone mis-recognition, are broadly applicable, and suit command-style voice recognition scenarios such as automobiles and homes.
In a first aspect, a method for semantic recognition provided in an embodiment of the present invention includes:
acquiring voice information and extracting a voice state according to the voice information;
and inputting the voice state into a target semantic recognition model, wherein the target semantic recognition model is used for obtaining pinyin characteristics or pinyin characteristics and character characteristics according to the voice state to obtain semantic information corresponding to the voice information.
In one possible design, before inputting the speech state into the target semantic recognition model, the method further includes:
acquiring a training data set;
inputting the training data set into an initial semantic recognition model, wherein the initial semantic recognition model comprises a pinyin conversion branch and a matching branch, the pinyin conversion branch is used for obtaining pinyin characteristics or pinyin characteristics and character characteristics according to the voice state, and the matching branch is used for obtaining corresponding semantic information according to the pinyin characteristics, so as to obtain the target semantic recognition model.
In one possible design,
obtaining pinyin characteristics or pinyin characteristics and character characteristics according to the voice state, including:
sequentially obtaining character characteristics corresponding to each voice state according to a plurality of sequentially arranged voice states, and sequentially obtaining corresponding pinyin characteristics according to the character characteristics;
or obtaining corresponding character characteristics according to a plurality of sequentially arranged voice states, wherein the character characteristics comprise character characteristics corresponding to the first voice state, and sequentially obtaining corresponding pinyin characteristics from the character characteristics corresponding to the first voice state to the character characteristics at the front end and the rear end until obtaining pinyin characteristics corresponding to all the character characteristics.
In one possible design, further comprising:
and marking the corresponding tone characteristics for the pinyin characteristics, wherein the tone characteristics are used for obtaining corresponding semantic information by combining the pinyin characteristics.
In one possible design, further comprising:
space marks are arranged among a plurality of pinyin characteristics, and the pinyin characteristics are connected into a pinyin characteristic string.
In one possible design, obtaining corresponding semantic information according to the pinyin features includes:
acquiring the highest semantic information probability corresponding to the pinyin feature string according to the pinyin feature string;
and if the highest semantic information probability is not less than a probability threshold, determining semantic information corresponding to the pinyin features.
In one possible design, after obtaining semantic information corresponding to the speech information, the method further includes:
and displaying the semantic information.
In a second aspect, an embodiment of the present invention provides a method for semantic recognition, including:
acquiring voice information and extracting a voice state;
and inputting the voice state into a target semantic recognition model, wherein the target semantic recognition model is used for recognizing the voice state to obtain semantic information corresponding to the voice information.
In a third aspect, an apparatus for semantic recognition provided in an embodiment of the present invention includes:
the acquisition module is used for acquiring voice information and extracting a voice state according to the voice information;
and the recognition module is used for obtaining pinyin characteristics or pinyin characteristics and character characteristics according to the voice state and obtaining semantic information corresponding to the voice information.
In one possible design, before inputting the speech state into the target semantic recognition model, the method further includes:
acquiring a training data set;
inputting the training data set into an initial semantic recognition model, wherein the initial semantic recognition model comprises a pinyin conversion branch and a matching branch, the pinyin conversion branch is used for obtaining pinyin characteristics or pinyin characteristics and character characteristics according to the voice state, and the matching branch is used for obtaining corresponding semantic information according to the pinyin characteristics, so as to obtain the target semantic recognition model.
In one possible design,
obtaining pinyin characteristics or pinyin characteristics and character characteristics according to the voice state, including:
sequentially obtaining character characteristics corresponding to each voice state according to a plurality of sequentially arranged voice states, and sequentially obtaining corresponding pinyin characteristics according to the character characteristics;
or obtaining corresponding character characteristics according to a plurality of sequentially arranged voice states, wherein the character characteristics comprise character characteristics corresponding to the first voice state, and sequentially obtaining corresponding pinyin characteristics from the character characteristics corresponding to the first voice state to the character characteristics at the front end and the rear end until obtaining pinyin characteristics corresponding to all the character characteristics.
In one possible design, further comprising:
and marking the corresponding tone characteristics for the pinyin characteristics, wherein the tone characteristics are used for obtaining corresponding semantic information by combining the pinyin characteristics.
In one possible design, further comprising:
space marks are arranged among a plurality of pinyin characteristics, and the pinyin characteristics are connected into a pinyin characteristic string.
In one possible design, obtaining corresponding semantic information according to the pinyin features includes:
acquiring the highest semantic information probability corresponding to the pinyin feature string according to the pinyin feature string;
and if the highest semantic information probability is not less than a probability threshold, determining semantic information corresponding to the pinyin features.
In one possible design, after obtaining semantic information corresponding to the speech information, the method further includes:
and displaying the semantic information.
In a fourth aspect, an apparatus for semantic recognition provided in an embodiment of the present invention includes:
the acquisition module is used for acquiring voice information and extracting a voice state;
and the recognition module is used for inputting the voice state into a target semantic recognition model, wherein the target semantic recognition model is used for recognizing the voice state to obtain semantic information corresponding to the voice information.
In a fifth aspect, a system for semantic recognition provided in an embodiment of the present invention includes: a memory and a processor, wherein the memory stores executable instructions of the processor; and the processor is configured to perform the method of semantic recognition of any one of the first aspect via execution of the executable instructions.
In a sixth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for semantic recognition according to any one of the first aspect.
The invention provides a semantic recognition method, device, and system. The method includes: acquiring voice information and extracting a voice state from the voice information; and inputting the voice state into a target semantic recognition model, where the target semantic recognition model obtains pinyin characteristics, or pinyin characteristics together with character characteristics, from the voice state so as to obtain the semantic information corresponding to the voice information. The method achieves domain-specific semantic extraction, improves semantic-understanding accuracy, reduces semantic-understanding errors caused by mis-recognized homophones, is broadly applicable, and suits command-style voice recognition scenarios such as automobiles and homes.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a schematic diagram of an exemplary scenario in accordance with the present invention;
FIG. 2 is a flowchart of a semantic recognition method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a target semantic recognition model in the semantic recognition method according to an embodiment of the present invention;
FIG. 4 is a first schematic diagram of a target semantic recognition model in the semantic recognition method according to the first embodiment of the present invention;
FIG. 5 is a second schematic diagram of a target semantic recognition model in the semantic recognition method according to the second embodiment of the present invention;
FIG. 6 is a flowchart of a semantic recognition method according to a third embodiment of the present invention;
FIG. 7 is a diagram illustrating a target semantic recognition model in the semantic recognition method according to the third embodiment of the present invention;
FIG. 8 is a flowchart of a semantic recognition method according to a fourth embodiment of the present invention;
FIG. 9 is a schematic diagram illustrating a partial effect in the semantic recognition method according to the fourth embodiment of the present invention;
FIG. 10 is a schematic structural diagram of an apparatus for semantic recognition according to a fifth embodiment of the present invention;
FIG. 11 is a schematic structural diagram of a semantic recognition system according to a sixth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The following describes the technical solutions of the present invention and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present invention will be described below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of a typical scenario of the present invention. As shown in Fig. 1, voice information is obtained through a voice acquisition device 11, and the semantic recognition system of the present invention can then recognize it and output the corresponding semantic information. The semantic information may include the recognized text and may be displayed in JSON format; this improves recognition accuracy, reduces the homophone recognition error rate, is broadly applicable, and suits command-style voice recognition scenarios such as automobiles and homes. Fig. 2 is a flowchart of a semantic recognition method according to an embodiment of the present invention; as shown in Fig. 2, the method in this embodiment may include:
s201, acquiring voice information, and extracting a voice state according to the voice information.
In the present embodiment, voice information is collected as a continuous audio stream, usually in a compressed format such as mp3 or wmv. In an alternative embodiment, the voice information must be converted into an uncompressed waveform file for processing, such as a Windows PCM file, which consists of a file header and the sound waveform. In an alternative embodiment, the acquired voice information is pre-processed, for example by cutting off the silence at the beginning and end, to reduce interference with subsequent processing. The sound is then framed, i.e. cut into small segments, each of which is a frame. Framing is not a simple cut: frames generally overlap one another, and a moving window is used to implement the framing.
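The overlapping-window framing step can be sketched as follows; the frame and hop sizes are assumptions for illustration, since the text does not specify values:

```python
def frame_signal(samples, frame_len=400, hop_len=160):
    """Split a waveform (a list of samples) into overlapping frames with a
    moving window. At a 16 kHz sample rate, frame_len=400 and hop_len=160
    correspond to a common 25 ms window with a 10 ms hop; these values are
    assumptions, not figures taken from the patent."""
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames

# Each frame would then be reduced to an acoustic feature vector
# (e.g. MFCCs) before being fed to the recognition model.
samples = [0.0] * 1600              # 100 ms of silence at 16 kHz
print(len(frame_signal(samples)))   # 8 overlapping frames
```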
After framing, the voice state is extracted, i.e. acoustic features such as MFCCs (Mel-frequency cepstral coefficients) are computed, so that the extracted voice state can later be input into the target semantic recognition model to obtain the semantic information corresponding to the voice information. The voice state can be represented by splitting each basic syllable in the voice information into 3 states; a syllable represented in this way is called a tri-phone. A segment of speech can therefore be represented by a series of states, with every 3 states representing one syllable.
S202, inputting the voice state into a target semantic recognition model, wherein the target semantic recognition model is used for obtaining pinyin characteristics or pinyin characteristics and character characteristics according to the voice state to obtain semantic information corresponding to the voice information.
Continuing the above example, the extracted voice state is input into the target semantic recognition model. Referring to fig. 3, a schematic diagram of the target semantic recognition model in the semantic recognition method provided in the embodiment of the present invention: as shown in fig. 3, the target semantic recognition model may include a pinyin conversion branch and a matching branch, where the pinyin conversion branch obtains pinyin features and character features from the voice state, and the matching branch obtains the corresponding semantic information from the pinyin features.
In an alternative embodiment, the target semantic recognition model is established based on semantic rules: each production of the grammar is given a set of attributed rules, which represent the combination relationships among the components of a Chinese sentence and are called semantic rules. Using the semantic rules, sentences can be generalized for semantic understanding, and a label, i.e. a representation of the semantic information, is output to express that the sentence carries a certain class of semantics. The rules are generally grammars, for example: <ac_down> [please] turn the air conditioner (down|lower) {intent=airconditioner_down}. This grammar means that whenever the user speaks one of the four sentences it covers, such as "turn the air conditioner down", the semantic intent airconditioner_down is emitted. A grammar is expressed as a tree-shaped state-diagram data structure. Driven by the input characters, the state diagram moves along the arrow directions; when it can reach the endNode, the input sentence matches the semantic rule, and the semantic information airconditioner_down represented by the diagram is output at that moment.
Because existing ASR language models are not accurate enough, they can output errors, especially for homophones: for example, "turn the air conditioner down" may be recognized as "turn the air conditioner on". With the existing character-based rules, the state diagram then gets stuck at a node and cannot advance, so the sentence cannot be understood.
In this embodiment, however, when the target semantic recognition model based on ASR semantic rules outputs the text, it also outputs the corresponding pinyin feature, here ba3 kong1 tiao2 da3 di1 (including tone features). Starting from the existing character-based rules, rules rewritten in pinyin by the compiler can be adapted to the pinyin features to establish semantic rules; for example, <ac_down> is modified to: [qing] ba kong tiao (da|tiao) di {intent=airconditioner_down} (without regard to tone features).
Preferably, the tone features can also be considered, giving: [qing3] ba3 kong1 tiao2 (da3|tiao2) di1 {intent=airconditioner_down}.
Preferably, when the pinyin features of two or more characters are joined together, a space is inserted between them. The pinyin feature sequence ba3 kong1 tiao2 da3 di1 (combined with tone features) is then used in the matching branch to drive the state diagram forward; when it reaches the endNode, the match succeeds and the semantic information is output.
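The pinyin-rule matching described in this section can be sketched as follows. This is an illustrative reimplementation, not the patent's compiler: the rule table, the intent name, and the use of a regular expression in place of the tree-shaped state diagram are assumptions for demonstration.

```python
import re

# Hypothetical rule table mirroring the grammar form
# "[qing3] ba3 kong1 tiao2 (da3|tiao2) di1 {intent=airconditioner_down}":
# optional tokens become (?:...)? groups, alternatives become (?:a|b).
RULES = {
    "airconditioner_down": r"^(?:qing3 )?ba3 kong1 tiao2 (?:da3|tiao2) di1$",
}

def match_intent(pinyin_string):
    """Return the intent whose pinyin rule matches the feature string, or None.

    pinyin_string is the space-separated, tone-marked pinyin feature string
    produced by the conversion branch, e.g. "ba3 kong1 tiao2 da3 di1".
    """
    for intent, pattern in RULES.items():
        if re.match(pattern, pinyin_string):
            return intent
    return None

print(match_intent("ba3 kong1 tiao2 da3 di1"))          # airconditioner_down
print(match_intent("qing3 ba3 kong1 tiao2 tiao2 di1"))  # airconditioner_down
print(match_intent("da3 kai1 kong1 tiao2"))             # None
```

Because matching is done on tone-marked pinyin rather than characters, two homophonic character sequences produce the same feature string and hit the same rule, which is the robustness the text describes.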
The semantic-rule-based approach in this embodiment can thus solve the prior-art ASR problem of homophones that are written as different characters. In an alternative embodiment, the pinyin conversion branch may obtain the pinyin features directly; the technical principle and implementation process are similar to the above and are not repeated here.
In an alternative embodiment, a target semantic recognition model based on a neural network is established, and semantic information is output through it using pinyin features and labels, as listed in Table 1 below.
TABLE 1
[Table 1 is rendered as an image in the original publication.]
Specifically, referring to fig. 4, a schematic diagram of the target semantic recognition model in the semantic recognition method provided in the first embodiment of the present invention: as shown in fig. 4, the pinyin features are fed into models such as a CNN/DNN, and after multiple convolutions a predicted label is output. Comparing the distributions of the predicted label and the annotated label gives a distribution distance (the loss), which is used to update the CNN/DNN weights. After many iterations, the predicted distribution comes closest to the annotated one at the highest semantic-information probability, yielding the prediction probability p(label) for the label. The voice state is converted into the corresponding pinyin features by the pinyin conversion branch, and the CNN/DNN matching branch then obtains the corresponding semantic information from the pinyin features. In an alternative embodiment, the pinyin conversion branch may be established based on semantic rules; its implementation process and technical principles are as described in the examples above and are not repeated here.
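The final decision step, where the label with the highest probability is emitted only if p(label) reaches the threshold T, can be sketched as follows. This is a pure-Python stand-in: the scores, label names, and threshold value are illustrative assumptions, and in the described system the scores would come from the CNN/DNN.

```python
import math

def softmax(scores):
    """Turn raw model scores into a probability distribution."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(scores, labels, threshold=0.5):
    """Emit the label with the highest p(label) only if it reaches the
    threshold T, mirroring the p(label) > T check described in the text;
    otherwise emit nothing (no semantic information)."""
    best_p, best_label = max(zip(softmax(scores), labels))
    return best_label if best_p >= threshold else None

print(predict([2.0, 0.1, -1.0], ["ac_down", "ac_up", "none"]))  # ac_down
print(predict([0.1, 0.0, -0.1], ["ac_down", "ac_up", "none"]))  # None
```

The second call returns None because no candidate clears the threshold, which corresponds to the model declining to output semantic information for an ambiguous input.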
In an optional embodiment, the target semantic recognition model in this embodiment is built on a pre-trained language model, an important achievement in natural language understanding developed by Google (BERT). Referring to fig. 5, a schematic diagram of the target semantic recognition model in the semantic recognition method provided by the second embodiment of the present invention: as shown in fig. 5, the voice state is converted into the corresponding pinyin features through the pinyin conversion branch; then, through the pre-trained BERT model and models such as a CNN/DNN, multiple iterations yield the predicted label closest to the distribution of the annotated label, together with its prediction probability p(label). If p(label) is greater than T (a preset threshold), the predicted label is confirmed and the corresponding semantic information is output. In other words, the voice state is converted into the corresponding pinyin features by the pinyin conversion branch, and the matching branch (BERT, CNN/DNN, and the like) obtains the corresponding semantic information from the pinyin features. In an alternative embodiment, the pinyin conversion branch may be established based on semantic rules; its implementation process and technical principles are as described in the examples above and are not repeated here.
In the embodiment, the voice state is converted into the pinyin characteristics, so that the error rate of homophone recognition is reduced, and the accuracy rate of semantic recognition is improved.
Fig. 6 is a flowchart of a semantic recognition method provided in the third embodiment of the present invention, and as shown in fig. 6, the semantic recognition method in this embodiment may include:
s301, acquiring voice information and extracting a voice state.
In this embodiment, voice information is collected as a continuous audio stream, commonly in compressed formats such as mp3 and wmv; it may be collected in real time or in advance. In an alternative embodiment, the audio is converted into an uncompressed waveform file for processing, such as a Windows PCM file, which consists of a file header and the sound waveform.
S302, inputting the voice state into a target semantic recognition model, wherein the target semantic recognition model obtains pinyin features, or pinyin features and character features, from the voice state so as to obtain the semantic information corresponding to the voice information.
Referring specifically to fig. 7, a schematic diagram of the target semantic recognition model in the semantic recognition method according to the third embodiment of the present invention: as shown in fig. 7, the collected voice information is input into the target semantic model, where it is recognized by an acoustic model and then passed to the language model to obtain the semantic information corresponding to the voice information. In an optional embodiment, the matching branch may also use a language model such as BERT or a CNN/DNN to obtain, through multiple iterations, the predicted label closest to the distribution of the annotated label, together with its prediction probability p(label); if p(label) is greater than T (a preset threshold), the predicted label is confirmed and the corresponding semantic information is output. In another optional embodiment, when the maximum p(label) reaches T (the preset threshold), the predicted label is confirmed and the corresponding semantic information is output.
In an optional embodiment, if a key feature corresponding to the voice information is detected while the voice state is being extracted (for example, ambient sound suggesting "buying things"), semantic recognition can be combined with that key feature (for example, the corresponding environment, a shopping mall, is recognized from the ambient sound, and semantic recognition is then performed in that context) to improve recognition accuracy. This embodiment can achieve domain-specific semantic extraction, improve semantic-understanding accuracy, and reduce semantic-understanding errors caused by mis-recognized homophones.
Fig. 8 is a flowchart of a semantic recognition method according to a fourth embodiment of the present invention, as shown in fig. 8, the semantic recognition method in this embodiment may add step S200 on the basis of fig. 2 before inputting the speech state into the target semantic recognition model, specifically,
s200: acquiring a training data set; inputting the training data set into an initial semantic recognition model, wherein the initial semantic recognition model comprises a pinyin conversion branch and a matching branch, the pinyin conversion branch is used for obtaining pinyin characteristics or pinyin characteristics and character characteristics according to a voice state, and the matching branch is used for obtaining corresponding semantic information according to the pinyin characteristics to obtain a target semantic recognition model.
In this embodiment, a large volume of text is crawled from the web and its pinyin features are derived. This yields not only the probability of individual Chinese characters and phrases for a given pinyin but, most importantly, the probabilities of semantic information for pinyin features of different lengths. In a prior-art semantic recognition method, the training data set may give, for example, P("open-empty" | the sound "da kai kong") = 2/3 and P("open-control" | the sound "da kai kong") = 1/3, while two longer candidates such as "open the air conditioner" and "open the ice[box]" may each have probability 1/2 under their shared sound. The character probability is obtained by counting the probabilities of different Chinese characters under the same sound. Similarly, the probabilities of different semantics appearing under the same pinyin feature are counted to obtain the semantic probability, so that the semantic information with the highest probability is output.
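The probability statistics described above can be sketched by counting how often each character string occurs under the same pinyin in a corpus. The corpus below is a toy stand-in for web-crawled text, and the English glosses stand in for the actual Chinese character strings.

```python
from collections import Counter, defaultdict

# Toy (pinyin feature string, recognized characters) pairs standing in for
# text crawled from the web; the entries are illustrative, not real data.
corpus = [
    ("da kai kong", "open-empty"),
    ("da kai kong", "open-empty"),
    ("da kai kong", "open-control"),
]

counts = defaultdict(Counter)
for pinyin, text in corpus:
    counts[pinyin][text] += 1

def prob(text, pinyin):
    """Estimate P(text | pinyin) by relative frequency in the corpus."""
    total = sum(counts[pinyin].values())
    return counts[pinyin][text] / total if total else 0.0

print(round(prob("open-empty", "da kai kong"), 3))   # 0.667, i.e. the 2/3 above
```

The same counting scheme extends to P(semantic label | pinyin feature string) by replacing the character strings with semantic labels.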
The training data set is then input into the initial semantic recognition model: corresponding pinyin features are obtained from the voice state through the pinyin conversion branch of the initial semantic recognition model, corresponding semantic information is obtained from the pinyin features through the matching branch, and training yields the target semantic recognition model.
In an alternative embodiment, obtaining the pinyin characteristics or the pinyin characteristics and the text characteristics according to the voice state includes:
sequentially obtaining character characteristics corresponding to each voice state according to a plurality of sequentially arranged voice states, and sequentially obtaining corresponding pinyin characteristics according to the character characteristics;
or obtaining corresponding character characteristics according to a plurality of sequentially arranged voice states, wherein the character characteristics comprise character characteristics corresponding to the first voice state, and sequentially obtaining corresponding pinyin characteristics from the character characteristics corresponding to the first voice state to the character characteristics at the front end and the rear end until obtaining the pinyin characteristics corresponding to all the character characteristics.
In this embodiment, the pinyin features are obtained from the voice state based on ASR semantic rules, expressed as a tree-diagram data structure in the target semantic recognition model. The arranged character features can be obtained sequentially according to the time order of the voice states, and each character feature is then converted into a pinyin feature. Taking the voice state for "please turn the air conditioner low" as an example, refer to fig. 9, which is a partial effect schematic diagram of the semantic recognition method provided by the fourth embodiment of the present invention: the voice states are traversed in the arrow direction, and when the end node (endNode) is reached, each voice state is sequentially converted into a pinyin feature.
Alternatively, to improve conversion speed and efficiency, in an optional embodiment corresponding character features may be obtained from the voice states, the character feature corresponding to a first voice state is selected from among them, and conversion starts from that character feature: pinyin features are obtained sequentially from the character feature of the first voice state outward toward the character features at the front end and the rear end, until the pinyin features corresponding to all character features are obtained. For example, starting from the character feature of the first voice state and proceeding in the arrow direction, pinyin features are obtained toward the character features at the front and rear ends until all character features have been converted. This embodiment does not limit which voice state is the first voice state or which character feature corresponds to it. Obtaining pinyin features from the voice state can also be based on Chinese character rules; for example, a dictionary is used to convert the voice state into character features, and the pinyin features are then obtained from the character features.
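The outward, front-and-rear expansion from the first voice state can be sketched as below. The character-to-pinyin dictionary and the choice of starting index are hypothetical stand-ins for the dictionary rules and the unrestricted first voice state mentioned above.

```python
def to_pinyin_from_first(chars, pinyin_dict, first):
    """Convert character features to pinyin features, starting from the
    character at index `first` and expanding alternately toward the
    front and rear ends until every character is converted."""
    n = len(chars)
    order = [first]
    left, right = first - 1, first + 1
    while left >= 0 or right < n:
        if right < n:               # advance toward the rear end
            order.append(right)
            right += 1
        if left >= 0:               # advance toward the front end
            order.append(left)
            left -= 1
    pinyin = [None] * n
    for i in order:                 # convert in expansion order
        pinyin[i] = pinyin_dict[chars[i]]
    return pinyin

# Hypothetical toy dictionary for the example utterance above.
d = {"把": "ba3", "空": "kong1", "调": "tiao2", "打": "da3", "低": "di1"}
features = to_pinyin_from_first(list("把空调打低"), d, first=2)
# → ["ba3", "kong1", "tiao2", "da3", "di1"]
```

The result is position-ordered regardless of the conversion order, so the choice of first voice state affects only speed, not output.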
In an alternative embodiment, the method further comprises:
and marking the corresponding tone characteristics for the pinyin characteristics, wherein the tone characteristics are used for obtaining corresponding semantic information by combining the pinyin characteristics.
Specifically, the corresponding tone features can be labeled during conversion of the voice state, and the corresponding semantic information is obtained by combining them with the pinyin features; the tone features thus help identify more accurate semantic information from the voice information. In combination with the above example, the pinyin features are labeled with the corresponding tone features "ba3 kong1 tiao2 da3 di1".
In an optional embodiment, further comprising:
space marks are arranged among the pinyin characteristics, and the pinyin characteristics are connected into a pinyin characteristic string.
In combination with the above example, the pinyin features may be connected into a pinyin feature string, with space identifiers set between the pinyin features to separate each feature, thereby avoiding confusion and improving recognition accuracy.
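A minimal sketch of the feature-string construction, using the tone-labeled example above (the function name is illustrative):

```python
def make_feature_string(pinyin_features):
    """Connect pinyin features into one feature string, with a space
    identifier between adjacent features so segmentations cannot be
    confused (e.g. "xi an" vs. "xian")."""
    return " ".join(pinyin_features)

s = make_feature_string(["ba3", "kong1", "tiao2", "da3", "di1"])
# → "ba3 kong1 tiao2 da3 di1"
```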
Wherein, obtain corresponding semantic information according to the spelling characteristic, include:
acquiring the highest semantic information probability corresponding to the pinyin feature string according to the pinyin feature string;
and if the highest semantic information probability is not less than the probability threshold, determining the semantic information corresponding to the pinyin features.
Specifically, the pinyin features are matched to corresponding semantic information through the matching branch in the initial semantic recognition model, based on semantic rules established over ASR. In an alternative implementation, the highest semantic information probability corresponding to the pinyin feature string is obtained from the pinyin feature string; for example, p(label | s1, s2, …) is output through the matching branch, and if max(p(label | s1, s2, …) × p(s1, s2, …)) is greater than a probability threshold T, the semantic information corresponding to the pinyin features is determined and the input voice information is determined to correspond to the semantic label. Here, label is the semantic information, s1, s2, … are voice states, and the probability threshold T is not limited in this embodiment. Each basic syllable in the voice information may be split into 3 such states, and a syllable represented in this way is called a triphone.
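The threshold decision can be sketched as follows. The label names and probability values are made up for illustration; `label_probs` stands in for the matching branch's output p(label | s1, s2, …).

```python
def decide_label(label_probs, threshold):
    """Pick the semantic label with the highest probability for a pinyin
    feature string; return it only if that probability is not less than
    the threshold T, otherwise report no match."""
    best = max(label_probs, key=label_probs.get)
    if label_probs[best] >= threshold:
        return best
    return None

# Hypothetical matching-branch output for "ba3 kong1 tiao2 da3 di1".
probs = {"ac_temp_down": 0.82, "ac_power_off": 0.11, "unknown": 0.07}
best = decide_label(probs, threshold=0.5)   # accepted: 0.82 >= 0.5
miss = decide_label(probs, threshold=0.9)   # rejected: 0.82 < 0.9
```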
In an alternative embodiment, the initial semantic recognition model is based on a neural network model, for example various models including CNN/DNN. The predicted label is output through the matching branch, the distribution distance (loss) is obtained by comparing the distributions of the predicted label and the labeled label, the weights of the CNN/DNN are modified through the loss, and through multiple iterations the distributions of the predicted label and the labeled label become closest, thereby determining the semantic information corresponding to the pinyin features.
In an optional embodiment, the initial semantic recognition model is established based on a pre-trained language model developed by Google, for example a pre-trained model such as BERT together with various models such as CNN/DNN. The predicted label is output through the matching branch of the model, the distribution distance (loss) is obtained by comparing the distributions of the predicted label and the labeled label, the weights of the CNN/DNN are modified through the loss, and multiple iterations make the distributions of the predicted label and the labeled label closest, thereby determining the semantic information corresponding to the pinyin features.
In an optional embodiment, the initial semantic recognition model may be established either without any pre-trained model, or based on a pre-trained language model developed by Google. Through the training data set, voice states are input into either initial semantic recognition model, the voice states are converted into corresponding pinyin features, and corresponding semantic information is further obtained from the pinyin features through the matching branch, so as to train the target semantic recognition model. For example, the target semantic recognition model can be obtained based on a pre-trained pinyin BERT model developed by Google.
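The iterative loop described in the embodiments above (predict a label, compare its distribution to the labeled label, adjust the weights through the loss) can be reduced to the following toy sketch. A one-layer logistic model stands in for the CNN/DNN or BERT matching branch; the feature vectors and labels are invented for illustration.

```python
import math

def train(samples, labels, epochs=200, lr=0.5):
    """Minimal sketch of the iterative training: predict, measure the
    cross-entropy distance to the labeled distribution, and adjust the
    weights until predicted and labeled labels are close."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # predicted probability
            grad = p - y                     # gradient of cross-entropy loss
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b

def predict(w, b, x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Toy binary task: does a (made-up) pinyin feature vector mean "turn down"?
X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = [1, 1, 0, 0]
w, b = train(X, y)
# After training, predictions track the labels: predict(w, b, [1, 0]) > 0.5
```

A real matching branch would replace the one-layer model with the CNN/DNN or pre-trained BERT weights, but the predict-compare-update cycle is the same.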
In an optional embodiment, after obtaining the semantic information corresponding to the voice information, the method further includes:
and displaying the semantic information.
For example, when semantic information corresponding to the voice information is obtained through the target semantic recognition model, the semantic information is displayed; for instance, the semantic information corresponding to the voice information "turn the air conditioner low" is displayed.
In the speech recognition process, Chinese characters may correspond to homophones; that is, one pinyin feature may correspond to a plurality of Chinese characters. This embodiment therefore converts the voice state into pinyin features, obtains the highest semantic information probability from the determined pinyin features, and, if the highest semantic information probability is not less than the probability threshold, determines the semantic information corresponding to the pinyin features. The method not only improves recognition accuracy and reduces the error rate of homophone recognition, but also has wide applicability and is suitable for control-type voice recognition scenarios such as automobiles and homes.
In an alternative embodiment, the method for semantic recognition in this embodiment may add step S300 (not shown) on the basis of fig. 6 before inputting the speech information into the target semantic recognition model, specifically,
S300: acquiring a training data set, and inputting the training data set into an initial semantic recognition model, wherein the initial semantic recognition model recognizes sound features of a large amount of pre-stored voice information and then obtains semantic information corresponding to the voice information through the sound features, so as to obtain the target semantic recognition model. In an alternative embodiment, the initial semantic recognition model may include a matching branch used to derive the corresponding semantic information from the sound features. In an optional embodiment, the matching branch may also use a language model such as BERT or CNN/DNN to obtain, through multiple iterations, the predicted label closest to the distribution of the labeled label, together with the corresponding predicted probability p(label); if p(label) is greater than a preset threshold T, the predicted label is determined and the corresponding semantic information is output. In an optional embodiment, when the maximum p(label) reaches the preset threshold T, the predicted label is determined and the corresponding semantic information is output.
In an optional embodiment, after obtaining the semantic information corresponding to the voice information, the method further includes: and displaying the semantic information.
The embodiment not only improves the recognition accuracy rate and reduces the error rate of homophone character recognition, but also has wide universality and is suitable for control type voice recognition scenes of automobiles, homes and the like.
Fig. 10 is a schematic structural diagram of a semantic recognition apparatus according to a fifth embodiment of the present invention, and as shown in fig. 10, the semantic recognition apparatus according to this embodiment may include:
an obtaining module 31, configured to obtain voice information and extract a voice state according to the voice information;
an identification module 32, configured to input the voice state into a target semantic recognition model, wherein the target semantic recognition model is used for obtaining pinyin characteristics or pinyin characteristics and character characteristics according to the voice state to obtain semantic information corresponding to the voice information.
In one possible design, before inputting the speech state into the target semantic recognition model, the method further includes:
acquiring a training data set;
inputting the training data set into an initial semantic recognition model, wherein the initial semantic recognition model comprises a pinyin conversion branch and a matching branch, the pinyin conversion branch is used for obtaining pinyin characteristics or pinyin characteristics and character characteristics according to the voice state, and the matching branch is used for obtaining corresponding semantic information according to the pinyin characteristics, to obtain a target semantic recognition model.
In one possible design, obtaining pinyin characteristics or pinyin characteristics and character characteristics according to the voice state includes:
sequentially obtaining character characteristics corresponding to each voice state according to a plurality of sequentially arranged voice states, and sequentially obtaining corresponding pinyin characteristics according to the character characteristics;
or obtaining corresponding character characteristics according to a plurality of sequentially arranged voice states, wherein the character characteristics comprise character characteristics corresponding to the first voice state, and sequentially obtaining corresponding pinyin characteristics from the character characteristics corresponding to the first voice state to the character characteristics at the front end and the rear end until obtaining the pinyin characteristics corresponding to all the character characteristics.
In one possible design, further comprising:
and marking the corresponding tone characteristics for the pinyin characteristics, wherein the tone characteristics are used for obtaining corresponding semantic information by combining the pinyin characteristics.
In one possible design, further comprising:
space marks are arranged among the pinyin characteristics, and the pinyin characteristics are connected into a pinyin characteristic string.
In one possible design, obtaining corresponding semantic information according to the pinyin features includes:
acquiring the highest semantic information probability corresponding to the pinyin feature string according to the pinyin feature string;
and if the highest semantic information probability is not less than the probability threshold, determining the semantic information corresponding to the pinyin features.
In one possible design, after obtaining semantic information corresponding to the speech information, the method further includes:
and displaying the semantic information.
The device for semantic recognition in this embodiment may execute the technical solutions in the methods shown in fig. 2 and fig. 8, and the specific implementation process and technical principle of the device refer to the related descriptions in the methods shown in fig. 2 and fig. 8, which are not described herein again.
Fig. 11 is a schematic structural diagram of a semantic recognition system according to a sixth embodiment of the present invention, and as shown in fig. 11, the semantic recognition system 40 according to this embodiment may include: a processor 41 and a memory 42.
A memory 42 for storing a computer program (e.g., an application program, a functional module, etc. implementing the above-described method of semantic recognition), computer instructions, etc.;
the computer programs, computer instructions, etc. described above may be stored in one or more memories 42 in partitions. And the above-mentioned computer program, computer instructions, data, etc. can be called by the processor 41.
A processor 41 for executing the computer program stored in the memory 42 to implement the steps of the method according to the above embodiments.
Reference may be made in particular to the description relating to the preceding method embodiment.
The processor 41 and the memory 42 may be separate structures or may be integrated structures integrated together. When the processor 41 and the memory 42 are separate structures, the memory 42 and the processor 41 may be coupled by a bus 43.
The server in this embodiment may execute the technical solutions in the methods shown in fig. 2 and fig. 8, and the specific implementation process and technical principle of the server refer to the relevant descriptions in the methods shown in fig. 2 and fig. 8, which are not described herein again.
In addition, embodiments of the present application further provide a computer-readable storage medium, in which computer-executable instructions are stored, and when at least one processor of the user equipment executes the computer-executable instructions, the user equipment performs the above-mentioned various possible methods.
Computer-readable media include both computer storage media and communication media, including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC. Additionally, the ASIC may reside in user equipment. Of course, the processor and the storage medium may also reside as discrete components in a communication device.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (12)

1. A method of semantic recognition, comprising:
acquiring voice information and extracting a voice state according to the voice information;
and inputting the voice state into a target semantic recognition model, wherein the target semantic recognition model is used for obtaining pinyin characteristics or pinyin characteristics and character characteristics according to the voice state to obtain semantic information corresponding to the voice information.
2. The method of claim 1, further comprising, prior to inputting the speech state into the target semantic recognition model:
acquiring a training data set;
inputting the training data set into an initial semantic recognition model, wherein the initial semantic recognition model comprises a pinyin conversion branch and a matching branch, the pinyin conversion branch is used for obtaining pinyin characteristics or pinyin characteristics and character characteristics according to the voice state, and the matching branch is used for obtaining corresponding semantic information according to the pinyin characteristics to obtain the target semantic recognition model.
3. The method of claim 2, wherein obtaining the pinyin feature, or the pinyin feature and the text feature, based on the voice state comprises:
sequentially obtaining character characteristics corresponding to each voice state according to a plurality of sequentially arranged voice states, and sequentially obtaining corresponding pinyin characteristics according to the character characteristics;
or obtaining corresponding character characteristics according to a plurality of sequentially arranged voice states, wherein the character characteristics comprise character characteristics corresponding to the first voice state, and sequentially obtaining corresponding pinyin characteristics from the character characteristics corresponding to the first voice state to the character characteristics at the front end and the rear end until obtaining pinyin characteristics corresponding to all the character characteristics.
4. The method of claim 3, further comprising:
and marking the corresponding tone characteristics for the pinyin characteristics, wherein the tone characteristics are used for obtaining corresponding semantic information by combining the pinyin characteristics.
5. The method of claim 3, further comprising:
space marks are arranged among a plurality of pinyin characteristics, and the pinyin characteristics are connected into a pinyin characteristic string.
6. The method of claim 5, wherein obtaining corresponding semantic information according to the pinyin features comprises:
acquiring the highest semantic information probability corresponding to the pinyin feature string according to the pinyin feature string;
and if the highest semantic information probability is not less than a probability threshold, determining semantic information corresponding to the pinyin features.
7. The method according to any one of claims 1-6, further comprising, after obtaining semantic information corresponding to the speech information:
and displaying the semantic information.
8. A method of semantic recognition, comprising:
acquiring voice information and extracting a voice state;
and inputting the voice state into a target semantic recognition model, wherein the target semantic recognition model is used for recognizing the voice state to obtain semantic information corresponding to the voice information.
9. An apparatus for semantic recognition, comprising:
the acquisition module is used for acquiring voice information and extracting a voice state according to the voice information;
and the recognition module is used for inputting the voice state into a target semantic recognition model, wherein the target semantic recognition model is used for obtaining pinyin characteristics or pinyin characteristics and character characteristics according to the voice state to obtain semantic information corresponding to the voice information.
10. An apparatus for semantic recognition, comprising:
the acquisition module is used for acquiring voice information and extracting a voice state;
and the recognition module is used for inputting the voice state into a target semantic recognition model, wherein the target semantic recognition model is used for recognizing the voice state to obtain semantic information corresponding to the voice information.
11. A system for semantic recognition, comprising: the device comprises a memory and a processor, wherein the memory stores executable instructions of the processor; wherein the processor is configured to perform the method of semantic recognition of any one of claims 1-7 via execution of the executable instructions.
12. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method of semantic recognition according to any one of claims 1 to 7.
CN201911421165.5A 2019-12-31 2019-12-31 Semantic recognition method, device and system Pending CN111192572A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911421165.5A CN111192572A (en) 2019-12-31 2019-12-31 Semantic recognition method, device and system


Publications (1)

Publication Number Publication Date
CN111192572A true CN111192572A (en) 2020-05-22

Family

ID=70709799

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911421165.5A Pending CN111192572A (en) 2019-12-31 2019-12-31 Semantic recognition method, device and system

Country Status (1)

Country Link
CN (1) CN111192572A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112017647A (en) * 2020-09-04 2020-12-01 北京蓦然认知科技有限公司 Semantic-combined speech recognition method, device and system
CN112185356A (en) * 2020-09-29 2021-01-05 北京百度网讯科技有限公司 Speech recognition method, speech recognition device, electronic device and storage medium
CN115148189A (en) * 2022-07-27 2022-10-04 中国第一汽车股份有限公司 Multifunctional synchronous implementation system and method for driver-defined voice command
US11862143B2 (en) 2020-07-27 2024-01-02 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for processing speech dialogues
CN112017647B (en) * 2020-09-04 2024-05-03 深圳海冰科技有限公司 Semantic-combined voice recognition method, device and system

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1499484A (en) * 2002-11-06 2004-05-26 北京天朗语音科技有限公司 Recognition system of Chinese continuous speech
CN103578467A (en) * 2013-10-18 2014-02-12 威盛电子股份有限公司 Acoustic model building method, voice recognition method and electronic device
CN103578464A (en) * 2013-10-18 2014-02-12 威盛电子股份有限公司 Language model establishing method, speech recognition method and electronic device
US8700404B1 (en) * 2005-08-27 2014-04-15 At&T Intellectual Property Ii, L.P. System and method for using semantic and syntactic graphs for utterance classification
CN107644642A (en) * 2017-09-20 2018-01-30 广东欧珀移动通信有限公司 Method for recognizing semantics, device, storage medium and electronic equipment
CN108446278A (en) * 2018-07-17 2018-08-24 弗徕威智能机器人科技(上海)有限公司 A kind of semantic understanding system and method based on natural language
CN108549637A (en) * 2018-04-19 2018-09-18 京东方科技集团股份有限公司 Method for recognizing semantics, device based on phonetic and interactive system
CN109192202A (en) * 2018-09-21 2019-01-11 平安科技(深圳)有限公司 Voice safety recognizing method, device, computer equipment and storage medium
CN109326285A (en) * 2018-10-23 2019-02-12 出门问问信息科技有限公司 Voice information processing method, device and non-transient computer readable storage medium
CN109410918A (en) * 2018-10-15 2019-03-01 百度在线网络技术(北京)有限公司 For obtaining the method and device of information
CN109545190A (en) * 2018-12-29 2019-03-29 联动优势科技有限公司 A kind of audio recognition method based on keyword
CN109976702A (en) * 2019-03-20 2019-07-05 青岛海信电器股份有限公司 A kind of audio recognition method, device and terminal
CN110008471A (en) * 2019-03-26 2019-07-12 北京博瑞彤芸文化传播股份有限公司 A kind of intelligent semantic matching process based on phonetic conversion
CN110060677A (en) * 2019-04-04 2019-07-26 平安科技(深圳)有限公司 Voice remote controller control method, device and computer readable storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200522