CN112735394A - Semantic parsing method and device for voice - Google Patents

Semantic parsing method and device for voice

Info

Publication number
CN112735394A
Authority: CN (China)
Prior art keywords: recognition results, text, target, text recognition, determining
Prior art date
Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number
CN202011488961.3A
Other languages: Chinese (zh)
Other versions: CN112735394B (en)
Inventor
苏腾荣 (Su Tengrong)
朱文博 (Zhu Wenbo)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Haier Technology Co Ltd
Haier Smart Home Co Ltd
Original Assignee
Qingdao Haier Technology Co Ltd
Haier Smart Home Co Ltd
Application filed by Qingdao Haier Technology Co Ltd and Haier Smart Home Co Ltd
Priority to CN202011488961.3A
Publication of CN112735394A
Application granted
Publication of CN112735394B
Legal status: Active

Classifications

    • G10L 15/1822 — Speech recognition; natural language modelling; parsing for meaning understanding
    • G06F 40/279 — Handling natural language data; natural language analysis; recognition of textual entities
    • G06F 40/30 — Handling natural language data; semantic analysis
    • G10L 15/02 — Feature extraction for speech recognition; selection of recognition unit
    • G10L 2015/025 — Phonemes, fenemes or fenones being the recognition units

Abstract

The invention provides a semantic parsing method and device for voice. The method includes: acquiring a plurality of text recognition results of voice data and the phoneme recognition results corresponding to the text recognition results; acquiring the target recognition result with the highest confidence from the plurality of text recognition results; determining the domain classification result to which the voice data belongs according to the target recognition result; and, when the voice data belongs to the preset text domain, determining the music name of the voice data according to the plurality of text recognition results and their corresponding phoneme recognition results. This solves the problem in the related art that homophone errors can only be handled by post-hoc error correction, which leaves a voice interaction system with low music-name recognition accuracy and a low interaction success rate; the accuracy of music-name recognition is improved, and the interaction success rate of users in the music domain is improved as well.

Description

Semantic parsing method and device for voice
Technical Field
The invention relates to the field of communication, in particular to a semantic parsing method and device for voice.
Background
In modern daily life, users often like to ask terminal devices such as smart speakers and mobile phones to play songs through an intelligent voice dialogue system. However, because some song titles have little contextual correlation, the recognition accuracy and interaction success rate are unsatisfactory. Aiming at the technical problem of the poor interaction success rate for song titles in the music domain of an intelligent dialogue system, the invention proposes a method that uses the Lattice search paths in the speech recognition decoder to output the phoneme-level N-Best recognition results with the highest confidence scores, and then screens them by semantic parsing and a phoneme edit distance algorithm to obtain the final parsing result, so as to improve the interaction success rate in the music domain.
In an existing voice dialogue system, natural speech audio from the user is captured by an input device of the semantic interaction system, the audio data is uploaded to a cloud recognition engine, which returns a recognition result in text form, and semantic parsing is then performed on that recognition result, namely: the recognition text is filtered by a rejection module to check whether it is covered by the recognition grammar, a classification result is obtained through a classifier, and rule matching is then performed using BNF (Backus-Naur Form) grammars, with the matching result returned.
This scheme depends too heavily on the recognition result of the recognition engine: common error forms such as homophones can only be handled by other means such as post-hoc error correction, so the interaction success rate of the voice interaction system in the music domain is low and cannot meet users' needs. For example, the song title "三生石" (san sheng shi) may be recognized as a homophonous string written with different characters. Interaction failures caused by such recognition results are a common reason why semantic parsing fails in the music domain.
For the problem in the related art that homophone errors can only be handled by error correction, leaving the voice interaction system with low music-name recognition accuracy and a low interaction success rate, no solution has yet been proposed.
Disclosure of Invention
The embodiments of the invention provide a semantic parsing method and device for voice, so as to at least solve the problem in the related art that semantic parsing accuracy in the music domain is low because homophone errors can only be handled by error correction.
According to an embodiment of the present invention, there is provided a semantic parsing method for speech, including:
acquiring a plurality of text recognition results of voice data and phoneme recognition results corresponding to the text recognition results, wherein one text recognition result corresponds to one phoneme recognition result;
acquiring a target recognition result with the highest confidence from the plurality of text recognition results;
determining a domain classification result to which the voice data belongs according to the target recognition result;
and when the voice data belongs to the preset text domain, determining the music name of the voice data according to the plurality of text recognition results and the phoneme recognition results corresponding to the plurality of text recognition results.
Optionally, determining the domain classification result to which the voice data belongs according to the target recognition result includes:
extracting slot positions of the text recognition result to obtain slot position information;
acquiring text information of the voice data according to the slot position information;
and determining a domain classification result to which the voice data belongs according to the text information.
Optionally, determining, according to the text information, a domain classification result to which the speech data belongs includes:
and inputting the text information into a pre-trained deep neural network model to obtain a field classification result corresponding to the text information output by the deep neural network model.
Optionally, determining the music name of the speech data according to the plurality of text recognition results and the phoneme recognition results corresponding to the plurality of text recognition results includes:
performing slot position extraction on the plurality of text recognition results to obtain slot position information corresponding to the plurality of text recognition results;
acquiring name text information of the text recognition results according to the slot position information corresponding to the text recognition results;
respectively acquiring target phonemes corresponding to the slot position information from phoneme recognition results corresponding to the text recognition results to obtain a plurality of target phoneme recognition results;
and determining the music name of the voice data according to the target phoneme recognition results.
Optionally, determining the music name of the speech data according to the plurality of target phoneme recognition results comprises:
respectively determining the edit distances between the plurality of target phoneme recognition results and the phonemes corresponding to the music names in a music list, wherein the music list stores the correspondence between music names and phonemes;
acquiring the target phoneme recognition result corresponding to the minimum edit distance;
and determining the music name of the voice data according to the target phoneme recognition result and one or more text recognition results corresponding to the target phoneme recognition result.
Optionally, determining the music name of the speech data according to the target phoneme recognition result and one or more text recognition results corresponding to the target phoneme recognition result includes:
if the target phoneme recognition result corresponds to only one text recognition result, determining that the music name corresponding to the target phoneme recognition result is the music name of the voice data;
if the target phoneme recognition result corresponds to a plurality of text recognition results, respectively determining the edit distances between those text recognition results and the music names in the music list, acquiring the target text recognition result corresponding to the minimum edit distance, and determining that the music name corresponding to the target text recognition result is the music name of the voice data.
Optionally, acquiring the plurality of text recognition results of the speech data and the phoneme recognition results corresponding to the plurality of text recognition results includes:
performing voice recognition on the voice data to obtain a plurality of text recognition results;
and performing phoneme conversion on the plurality of text recognition results to obtain phoneme recognition results corresponding to the plurality of text recognition results.
According to another embodiment of the present invention, there is also provided a speech semantic analysis device, including:
a first acquisition module, configured to acquire a plurality of text recognition results of voice data and the phoneme recognition results corresponding to the text recognition results, wherein one text recognition result corresponds to one phoneme recognition result;
the second acquisition module is used for acquiring a target recognition result with the highest confidence coefficient from the plurality of text recognition results;
the first determining module is used for determining a domain classification result to which the voice data belongs according to the target recognition result;
and a second determining module, configured to determine, when the voice data belongs to the preset text domain, the music name of the voice data according to the multiple text recognition results and the phoneme recognition results corresponding to the multiple text recognition results.
Optionally, the first determining module includes:
the first extraction submodule is used for extracting the slot position of the text recognition result to obtain slot position information;
the first obtaining submodule is used for obtaining the text information of the voice data according to the slot position information;
and the first determining submodule is used for determining a domain classification result to which the voice data belongs according to the text information.
Optionally, the first determining submodule is further configured to input the text information into a pre-trained deep neural network model to obtain the domain classification result corresponding to the text information output by the deep neural network model.
Optionally, the second determining module includes:
the second extraction submodule is used for performing slot position extraction on the plurality of text recognition results to obtain slot position information corresponding to the plurality of text recognition results;
the second obtaining submodule is used for obtaining name text information of the text recognition results according to the slot position information corresponding to the text recognition results;
a third obtaining submodule, configured to obtain target phonemes corresponding to the slot position information from the phoneme recognition results corresponding to the multiple text recognition results, respectively, to obtain multiple target phoneme recognition results;
and the second determining submodule is used for determining the music name of the voice data according to the multiple target phoneme recognition results.
Optionally, the second determining submodule is further configured to:
respectively determine the edit distances between the plurality of target phoneme recognition results and the phonemes corresponding to the music names in a music list, wherein the music list stores the correspondence between music names and phonemes;
acquire the target phoneme recognition result corresponding to the minimum edit distance;
and determine the music name of the voice data according to the target phoneme recognition result and one or more text recognition results corresponding to the target phoneme recognition result.
Optionally, the second determining submodule is further configured to:
if the target phoneme recognition result corresponds to only one text recognition result, determine that the music name corresponding to the target phoneme recognition result is the music name of the voice data;
and if the target phoneme recognition result corresponds to a plurality of text recognition results, respectively determine the edit distances between those text recognition results and the music names in the music list, acquire the target text recognition result corresponding to the minimum edit distance, and determine that the music name corresponding to the target text recognition result is the music name of the voice data.
Optionally, the first obtaining module includes:
the voice recognition submodule is used for carrying out voice recognition on the voice data to obtain a plurality of text recognition results;
and the conversion submodule is used for carrying out phoneme conversion on the plurality of text recognition results to obtain phoneme recognition results corresponding to the plurality of text recognition results.
According to a further embodiment of the present invention, a computer-readable storage medium is also provided, in which a computer program is stored, wherein the computer program is configured to perform the steps of any of the above-described method embodiments when executed.
According to yet another embodiment of the present invention, there is also provided an electronic device, including a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.
According to the invention, a plurality of text recognition results of voice data and the phoneme recognition results corresponding to them are acquired; the target recognition result with the highest confidence is acquired from the plurality of text recognition results; the domain classification result to which the voice data belongs is determined according to the target recognition result; and, when the voice data belongs to the preset text domain, the music name of the voice data is determined according to the plurality of text recognition results and their corresponding phoneme recognition results. This solves the problem in the related art that homophone errors can only be handled by post-hoc error correction, which leaves the voice interaction system with low music-name recognition accuracy and a low interaction success rate; the accuracy of music-name recognition is improved, and the interaction success rate of users in the music domain is improved as well.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a block diagram of the hardware structure of a mobile terminal running the semantic parsing method for voice according to an embodiment of the present invention;
FIG. 2 is a flow diagram of a method for semantic parsing of speech according to an embodiment of the invention;
FIG. 3 is a flow chart of a semantic parsing method based on a result of polyphonic recognition according to an embodiment of the present invention;
FIG. 4 is a block diagram of an apparatus for semantic parsing of speech according to an embodiment of the present invention;
FIG. 5 is a first block diagram of a speech semantic parsing apparatus according to a preferred embodiment of the present invention;
FIG. 6 is a second block diagram of a speech semantic parsing apparatus according to a preferred embodiment of the present invention.
Detailed Description
The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Example 1
The method provided by the first embodiment of the present application may be executed on a mobile terminal, a computer terminal, or a similar computing device. Taking a mobile terminal as an example, fig. 1 is a block diagram of the hardware structure of a mobile terminal running the semantic parsing method for voice according to an embodiment of the present invention. As shown in fig. 1, the mobile terminal may include one or more processors 102 (only one is shown in fig. 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data; optionally, the mobile terminal may further include a transmission device 106 for communication and an input/output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is only illustrative and does not limit the structure of the mobile terminal. For example, the mobile terminal may include more or fewer components than shown in fig. 1, or have a different configuration.
The memory 104 may be used to store a computer program, for example, a software program and a module of an application software, such as a computer program corresponding to the semantic analysis method of speech in the embodiment of the present invention, and the processor 102 executes various functional applications and data processing by running the computer program stored in the memory 104, so as to implement the method described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the mobile terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network may include a wireless network provided by the communication provider of the mobile terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC), which can be connected to other network devices through a base station so as to communicate with the internet. In another example, the transmission device 106 may be a Radio Frequency (RF) module, which communicates with the internet wirelessly.
Based on the above mobile terminal or network architecture, in this embodiment, a semantic parsing method for voice is provided, and fig. 2 is a flowchart of the semantic parsing method for voice according to the embodiment of the present invention, as shown in fig. 2, the flowchart includes the following steps:
step S202, acquiring a plurality of text recognition results of the voice data and phoneme recognition results corresponding to the text recognition results, wherein one text recognition result corresponds to one phoneme recognition result;
in an embodiment of the present invention, the step S202 may specifically include: performing voice recognition on the voice data to obtain a plurality of text recognition results, when performing voice recognition, obtaining a plurality of recognition results and corresponding confidence degrees, sorting the plurality of recognition results according to the confidence degree being greater than that, selecting a preset number of recognition results to obtain a plurality of text recognition results, wherein the preset data can be set according to actual conditions, for example, 5, 10, 15, and the like;
and obtaining the target text recognition result with the highest confidence from the plurality of text recognition results, and performing phoneme conversion on the plurality of text recognition results to obtain the phoneme recognition results corresponding to them; that is, the recognition result with the highest confidence is determined to be the target text recognition result, since by default it can be assumed to have the highest recognition accuracy.
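As a rough sketch of this selection and conversion (not the patent's implementation), the snippet below sorts hypotheses by confidence, keeps the top N, and converts text to phonemes by dictionary lookup; the hypothesis format, the PRONUNCIATION_DICT entries, and the function names are all assumptions introduced for illustration.

```python
from typing import List, Tuple

# Toy pronunciation dictionary (illustrative entries only; 石 and 世 are homophones).
PRONUNCIATION_DICT = {"三": ["s", "an"], "生": ["sh", "eng"], "石": ["sh", "i"], "世": ["sh", "i"]}

def select_nbest(hypotheses: List[Tuple[str, float]], n: int = 10) -> Tuple[str, List[str]]:
    """Sort (text, confidence) hypotheses in descending confidence, keep the top n,
    and take the highest-confidence text as the target recognition result."""
    ranked = sorted(hypotheses, key=lambda h: h[1], reverse=True)[:n]
    nbest_texts = [text for text, _ in ranked]
    return nbest_texts[0], nbest_texts

def text_to_phonemes(text: str) -> List[str]:
    """Convert a text hypothesis to its phoneme sequence via dictionary lookup."""
    phonemes: List[str] = []
    for ch in text:
        phonemes.extend(PRONUNCIATION_DICT.get(ch, []))
    return phonemes
```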
Step S204, obtaining a target recognition result with the highest confidence from the plurality of text recognition results;
step S206, determining the domain classification result of the voice data according to the target recognition result;
in an embodiment of the present invention, the step S206 may specifically include:
extracting the slot position of the text recognition result to obtain slot position information;
acquiring text information of the voice data according to the slot position information;
determining the domain classification result to which the voice data belongs according to the text information; further, inputting the text information into a pre-trained deep neural network model to obtain the domain classification result corresponding to the text information output by the model, where the domain classification result may be the music domain, a non-music domain, the literature domain, and so on. The classification mainly distinguishes whether the text belongs to the preset text domain, and the preset text domain may be set to the music domain.
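The classification step might be exercised as follows; the keyword rule here is only a stand-in for the pre-trained deep neural network model, and the cue list and labels are assumptions of this sketch.

```python
MUSIC_DOMAIN = "music"
NON_MUSIC_DOMAIN = "non-music"

def classify_domain(text_info: str) -> str:
    """Placeholder for the pre-trained DNN classifier: returns a domain label
    for the slot-extracted text. A real system would run a trained model here."""
    music_cues = ("播放", "唱", "听", "歌")  # toy cues standing in for learned features
    return MUSIC_DOMAIN if any(cue in text_info for cue in music_cues) else NON_MUSIC_DOMAIN

print(classify_domain("播放三生石"))  # -> "music"
```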
Step S208, when the voice data belongs to the preset text domain, determining the music name of the speech data according to the plurality of text recognition results and the phoneme recognition results corresponding to the plurality of text recognition results.
In an embodiment of the present invention, the step S208 may specifically include:
s2081, extracting the slot positions of the text recognition results to obtain slot position information corresponding to the text recognition results;
s2082, respectively obtaining name text information of the plurality of text recognition results according to the slot position information corresponding to the plurality of text recognition results;
s2083, respectively obtaining target phonemes corresponding to the slot position information from the phoneme recognition results corresponding to the text recognition results to obtain a plurality of target phoneme recognition results;
s2084, determining the music name of the speech data according to the multiple target phoneme recognition results.
In an optional embodiment, the step S2084 may specifically include:
respectively determining the edit distances between the plurality of target phoneme recognition results and the phonemes corresponding to the music names in a music list, wherein the music list stores the correspondence between music names and phonemes (it should be noted that the music names in the music list are converted into phonemes in the same way that the text recognition results are converted into phoneme recognition results);
acquiring the target phoneme recognition result corresponding to the minimum edit distance;
determining the music name of the voice data according to the target phoneme recognition result and the one or more text recognition results corresponding to it. Further, if the target phoneme recognition result corresponds to only one text recognition result, the music name corresponding to the target phoneme recognition result is determined to be the music name of the voice data; if the target phoneme recognition result corresponds to a plurality of text recognition results, the edit distances between those text recognition results and the music names in the music list are determined respectively, the target text recognition result corresponding to the minimum edit distance is acquired, and the music name corresponding to that target text recognition result is determined to be the music name of the voice data.
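A minimal sketch of this two-stage selection follows, assuming each candidate is a (text, phoneme sequence) pair and that the music list maps each title to a phoneme sequence built with the same dictionary; the data layout and function names are illustrative rather than the patent's code.

```python
from typing import Dict, List, Sequence, Tuple

def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance over token sequences (phoneme lists or character strings)."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,         # deletion
                                     dp[j - 1] + 1,     # insertion
                                     prev + (x != y))   # substitution
    return dp[-1]

def pick_music_name(candidates: List[Tuple[str, List[str]]],
                    music_list: Dict[str, List[str]]) -> str:
    """Stage 1: minimum phoneme edit distance over all (candidate, title) pairs;
    stage 2: break ties on the character-level distance between text and title."""
    scored = [(edit_distance(phonemes, title_phonemes), text, title)
              for text, phonemes in candidates
              for title, title_phonemes in music_list.items()]
    best_dist = min(score for score, _, _ in scored)
    tied = [(text, title) for score, text, title in scored if score == best_dist]
    if len(tied) == 1:
        return tied[0][1]
    return min(tied, key=lambda pair: edit_distance(pair[0], pair[1]))[1]
```

Scoring every (candidate, title) pair once and then filtering keeps the tie-breaking logic explicit; a production system would prune or index the music list rather than scan it linearly.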
Through the above steps S202 to S208, the problem in the related art that homophone errors can only be handled by error correction, leaving the voice interaction system with low music-name recognition accuracy and thus a low interaction success rate, can be solved; the accuracy of music-name recognition is improved, and the interaction success rate of users in the music domain is improved as well.
In the embodiment of the invention, voice recognition is performed on the audio data (i.e., the voice data) to obtain the OneBest text recognition result (corresponding to the target text recognition result) and the NBest phoneme recognition results (corresponding to the phoneme recognition results of the plurality of text recognition results). Semantic parsing is performed on the OneBest text recognition result, a domain classification result is obtained through a language model and a binary classifier, and whether the recognition result belongs to the music domain is judged. If it does, phoneme edit distance calculation is performed one by one between the NBest phoneme-level recognition results and the music list (the music list must also be converted to phonemes through the same dictionary to ensure the phoneme sets are consistent), so as to obtain a more accurate music title.
After the intelligent voice dialogue system is started, the user's audio data is uploaded to the cloud for voice recognition. The recognition result has two parts: one part is the optimal text result returned by the Lattice of the recognition decoder, i.e., the OneBest result; the other part is the NBest recognition result, the top N outputs of the decoder Lattice network ranked by confidence, each comprising a text and its corresponding phoneme sequence. Domain analysis is performed on the optimal text recognition result, and a domain classification result is obtained through a language model and a binary classifier to judge whether it is the music domain; if it is, the phoneme sequences of the music list are retrieved for edit distance calculation. Edit distance calculation is performed one by one between the NBest results and the phoneme sequences of the music list, and the minimum result is the optimal result. The result is returned to the terminal device to complete one interaction.
Fig. 3 is a flowchart of a semantic parsing method based on a polyphonic identification result according to an embodiment of the present invention, as shown in fig. 3, which mainly includes the following steps:
Step S301, the terminal device acquires the user's audio data frame by frame;
Step S302, inputting the audio data into the cloud recognition engine and performing voice recognition to obtain a OneBest text recognition result and NBest phoneme recognition results;
the speech recognition result is stored as map < string >, vector < string > >, first stores the text of the OneBest recognition result in the decoder Lattice, second stores the recognition result of topN (N is 10 in the test) and the corresponding phoneme sequence, the form of expression is the same as the dictionary, for example: "s Aa N sh ea NG sh Ib sh ad NG on Sansheng stone". When semantic analysis is carried out, the music list in the database is stored in the dictionary form so as to facilitate calculation; at the same time, a common dictionary is prepared so as to remove the semantic instruction part and reserve the song name part.
Step S303, inputting the OneBest result into semantic parsing and performing domain recognition on the text to obtain a domain recognition result;
Step S304, determining whether the domain recognition result is the music domain; if yes, executing step S305; if no, executing step S307;
Step S305, extracting slot position information from the plurality of NBest text recognition results, and acquiring the music-name phonemes (namely the target phonemes) from the phoneme recognition results corresponding to the text recognition results, to obtain a plurality of target phoneme recognition results;
if the utterance is in the music domain, the user intention and the slot position are extracted through a music parser, the intention words are deleted from the recognition result using the dictionary, and the corresponding phoneme sequences are deleted at the same time;
Step S306, performing phoneme edit distance calculation one by one between the multiple target phoneme sequences corresponding to the obtained music names and the music names in the music list, finding the minimum edit distance, and taking its return value as the final return value (see the sketch following these steps);
Step S307, returning the result to the terminal to complete the interaction task.
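Steps S305–S306 could then be sketched as below, reusing text_to_phonemes, pick_music_name, and music_list from the earlier snippets; the intent-word list and helper names are simplified placeholders, not the patent's parser.

```python
def extract_song_slot(text: str, intent_words=("播放", "我想听")) -> str:
    """Step S305 (simplified): delete intent words so only the song-name slot remains."""
    for word in intent_words:
        text = text.replace(word, "")
    return text

def resolve_song(nbest_texts: list, music_list: dict) -> str:
    """Step S306: minimum phoneme edit distance between each candidate and the music list."""
    candidates = [(t, text_to_phonemes(extract_song_slot(t))) for t in nbest_texts]
    return pick_music_name(candidates, music_list)

print(resolve_song(["播放三生石", "播放三生世"], music_list))  # -> 三生石
```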
According to the embodiment of the invention, semantic parsing is performed on the OneBest recognition result, classification judges whether it belongs to the music domain, and the music-name slot position information is extracted; the engine's NBest recognition results are retained and stored in a form that includes the texts and their corresponding phonemes, where each phoneme sequence is generated from a dictionary consistent with the phone set of the music list. Phoneme edit distance calculation is performed between each NBest recognition result and the music library list, and the recognition result text with the minimum edit distance is obtained.
Because song titles correlate poorly with their context, the voice recognition of song titles is often unsatisfactory; meanwhile, common error forms such as homophones also cause frequent interaction failures. The present scheme can improve the interaction success rate of users in the music domain.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Example 2
In this embodiment, a semantic parsing apparatus for speech is further provided, and the apparatus is used to implement the foregoing embodiments and preferred embodiments, which have already been described and are not described again. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
Fig. 4 is a block diagram of a semantic parsing apparatus for speech according to an embodiment of the present invention, as shown in fig. 4, including:
a first obtaining module 42, configured to obtain a plurality of text recognition results of the speech data and phoneme recognition results corresponding to the plurality of text recognition results, where one text recognition result corresponds to one phoneme recognition result;
a second obtaining module 44, configured to obtain a target recognition result with a highest confidence from the multiple text recognition results;
a first determining module 46, configured to determine, according to the target recognition result, a domain classification result to which the speech data belongs;
a second determining module 48, configured to determine, when the voice data belongs to the preset text domain, the music name of the speech data according to the multiple text recognition results and the phoneme recognition results corresponding to the multiple text recognition results.
Fig. 5 is a first block diagram of a speech semantic parsing apparatus according to an embodiment of the present invention; as shown in fig. 5, the first determining module 46 includes:
the first extraction submodule 52 is configured to perform slot extraction on the text recognition result to obtain slot information;
the first obtaining submodule 54 is configured to obtain text information of the voice data according to the slot position information;
and the first determining submodule 56 is used for determining a domain classification result to which the voice data belongs according to the text information.
Optionally, the first determining submodule 56 is further configured to input the text information into a pre-trained deep neural network model to obtain the domain classification result corresponding to the text information output by the deep neural network model.
Fig. 6 is a second block diagram of a speech semantic parsing apparatus according to an embodiment of the present invention; as shown in fig. 6, the second determining module 48 includes:
the second extraction submodule 62 is configured to perform slot extraction on the multiple text recognition results to obtain slot information corresponding to the multiple text recognition results;
a second obtaining submodule 64, configured to obtain name text information of the multiple text recognition results according to the slot position information corresponding to the multiple text recognition results, respectively;
a third obtaining submodule 66, configured to obtain target phonemes corresponding to the slot position information from the phoneme recognition results corresponding to the multiple text recognition results, respectively, to obtain multiple target phoneme recognition results;
a second determining sub-module 68 for determining the music name of the speech data according to the plurality of target phoneme recognition results.
Optionally, the second determining submodule 68 is further configured to:
respectively determine the edit distances between the plurality of target phoneme recognition results and the phonemes corresponding to the music names in a music list, wherein the music list stores the correspondence between music names and phonemes;
acquire the target phoneme recognition result corresponding to the minimum edit distance;
and determine the music name of the voice data according to the target phoneme recognition result and one or more text recognition results corresponding to the target phoneme recognition result.
Optionally, the second determining submodule 68 is further configured to:
if the target phoneme recognition result corresponds to only one text recognition result, determine that the music name corresponding to the target phoneme recognition result is the music name of the voice data;
and if the target phoneme recognition result corresponds to a plurality of text recognition results, respectively determine the edit distances between those text recognition results and the music names in the music list, acquire the target text recognition result corresponding to the minimum edit distance, and determine that the music name corresponding to the target text recognition result is the music name of the voice data.
Optionally, the first obtaining module 42 includes:
the voice recognition submodule is used for carrying out voice recognition on the voice data to obtain a plurality of text recognition results;
and the conversion submodule is used for carrying out phoneme conversion on the plurality of text recognition results to obtain phoneme recognition results corresponding to the plurality of text recognition results.
It should be noted that the above modules may be implemented by software or by hardware; for the latter, the implementation may take, but is not limited to, the following forms: all of the modules located in the same processor, or the modules located in different processors in any combination.
Example 3
Embodiments of the present invention also provide a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the steps of:
S1, acquiring a plurality of text recognition results of the voice data and the phoneme recognition results corresponding to the text recognition results, wherein one text recognition result corresponds to one phoneme recognition result;
S2, acquiring the target recognition result with the highest confidence from the plurality of text recognition results;
S3, determining the domain classification result to which the voice data belongs according to the target recognition result;
S4, when the voice data belongs to the preset text domain, determining the music name of the voice data according to the plurality of text recognition results and the phoneme recognition results corresponding to the plurality of text recognition results.
Optionally, in this embodiment, the storage medium may include, but is not limited to: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or any other medium that can store a computer program.
Example 4
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
S1, acquiring a plurality of text recognition results of the voice data and the phoneme recognition results corresponding to the text recognition results, wherein one text recognition result corresponds to one phoneme recognition result;
S2, acquiring the target recognition result with the highest confidence from the plurality of text recognition results;
S3, determining the domain classification result to which the voice data belongs according to the target recognition result;
S4, when the voice data belongs to the preset text domain, determining the music name of the voice data according to the plurality of text recognition results and the phoneme recognition results corresponding to the plurality of text recognition results.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A semantic parsing method of speech is characterized by comprising the following steps:
acquiring a plurality of text recognition results of voice data and phoneme recognition results corresponding to the text recognition results, wherein one text recognition result corresponds to one phoneme recognition result;
acquiring a target recognition result with the highest confidence from the plurality of text recognition results;
determining a domain classification result to which the voice data belongs according to the target recognition result;
and when the voice data belongs to the preset text domain, determining the music name of the voice data according to the plurality of text recognition results and the phoneme recognition results corresponding to the plurality of text recognition results.
2. The method of claim 1, wherein determining the domain classification result to which the speech data belongs according to the target recognition result comprises:
extracting slot positions of the text recognition result to obtain slot position information;
acquiring text information of the voice data according to the slot position information;
and determining a domain classification result to which the voice data belongs according to the text information.
3. The method of claim 2, wherein determining the domain classification result to which the speech data belongs according to the text information comprises:
and inputting the text information into a pre-trained deep neural network model to obtain a field classification result corresponding to the text information output by the deep neural network model.
4. The method of claim 1, wherein determining the music name of the speech data according to the plurality of text recognition results and the phoneme recognition results corresponding to the plurality of text recognition results comprises:
performing slot position extraction on the plurality of text recognition results to obtain slot position information corresponding to the plurality of text recognition results;
acquiring name text information of the text recognition results according to the slot position information corresponding to the text recognition results;
respectively acquiring target phonemes corresponding to the slot position information from phoneme recognition results corresponding to the text recognition results to obtain a plurality of target phoneme recognition results;
and determining the music name of the voice data according to the target phoneme recognition results.
5. The method of claim 4, wherein determining the music name of the speech data from the plurality of target phoneme recognition results comprises:
respectively determining the edit distances between the plurality of target phoneme recognition results and the phonemes corresponding to the music names in a music list, wherein the music list stores the correspondence between music names and phonemes;
acquiring the target phoneme recognition result corresponding to the minimum edit distance;
and determining the music name of the voice data according to the target phoneme recognition result and one or more text recognition results corresponding to the target phoneme recognition result.
6. The method of claim 5, wherein determining the music name of the speech data according to the target phoneme recognition result and one or more text recognition results corresponding to the target phoneme recognition result comprises:
if the target phoneme recognition result corresponds to only one text recognition result, determining that the music name corresponding to the target phoneme recognition result is the music name of the voice data;
if the target phoneme recognition result corresponds to a plurality of text recognition results, respectively determining the edit distances between those text recognition results and the music names in the music list, acquiring the target text recognition result corresponding to the minimum edit distance, and determining that the music name corresponding to the target text recognition result is the music name of the voice data.
7. The method according to any one of claims 1 to 6, wherein acquiring a plurality of text recognition results of the speech data and the phoneme recognition results corresponding to the plurality of text recognition results comprises:
performing voice recognition on the voice data to obtain a plurality of text recognition results;
and performing phoneme conversion on the plurality of text recognition results to obtain phoneme recognition results corresponding to the plurality of text recognition results.
8. A semantic parsing apparatus for speech, comprising:
a first acquisition module, configured to acquire a plurality of text recognition results of voice data and the phoneme recognition results corresponding to the text recognition results, wherein one text recognition result corresponds to one phoneme recognition result;
the second acquisition module is used for acquiring a target recognition result with the highest confidence coefficient from the plurality of text recognition results;
the first determining module is used for determining a domain classification result to which the voice data belongs according to the target recognition result;
and a second determining module, configured to determine, when the voice data belongs to the preset text domain, the music name of the voice data according to the multiple text recognition results and the phoneme recognition results corresponding to the multiple text recognition results.
9. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to carry out the method of any one of claims 1 to 7 when executed.
10. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 7.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202011488961.3A (granted as CN112735394B) | 2020-12-16 | 2020-12-16 | Semantic parsing method and device for voice

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202011488961.3A (granted as CN112735394B) | 2020-12-16 | 2020-12-16 | Semantic parsing method and device for voice

Publications (2)

Publication Number | Publication Date
CN112735394A | 2021-04-30
CN112735394B | 2022-12-30

Family

ID=75603733

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202011488961.3A (granted as CN112735394B, active) | Semantic parsing method and device for voice | 2020-12-16 | 2020-12-16

Country Status (1)

Country | Link
CN | CN112735394B

Citations (9)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
JP2010233019A (en) * 2009-03-27 2010-10-14 Kddi Corp Caption shift correction device, reproduction device, and broadcast device
CN103077714A (en) * 2013-01-29 2013-05-01 华为终端有限公司 Information identification method and apparatus
US20170004824A1 (en) * 2015-06-30 2017-01-05 Samsung Electronics Co., Ltd. Speech recognition apparatus, speech recognition method, and electronic device
CN108986790A (en) * 2018-09-29 2018-12-11 百度在线网络技术(北京)有限公司 The method and apparatus of voice recognition of contact
CN109976702A (en) * 2019-03-20 2019-07-05 青岛海信电器股份有限公司 A kind of audio recognition method, device and terminal
CN110060662A (en) * 2019-04-12 2019-07-26 北京百度网讯科技有限公司 Audio recognition method and device
US10395640B1 (en) * 2014-07-23 2019-08-27 Nvoq Incorporated Systems and methods evaluating user audio profiles for continuous speech recognition
CN111883122A (en) * 2020-07-22 2020-11-03 海尔优家智能科技(北京)有限公司 Voice recognition method and device, storage medium and electronic equipment
CN111916088A (en) * 2020-08-12 2020-11-10 腾讯科技(深圳)有限公司 Voice corpus generation method and device and computer readable storage medium


Also Published As

Publication number | Publication date
CN112735394B | 2022-12-30

Similar Documents

Publication | Title
US9564127B2 (en) Speech recognition method and system based on user personalized information
CN101326572B (en) Speech recognition system with huge vocabulary
CN110797016B (en) Voice recognition method and device, electronic equipment and storage medium
EP2700071B1 (en) Speech recognition using multiple language models
CN108428446A (en) Audio recognition method and device
CN109949071A (en) Products Show method, apparatus, equipment and medium based on voice mood analysis
CN108447471A (en) Audio recognition method and speech recognition equipment
CN101567189A (en) Device, method and system for correcting voice recognition result
CN110807093A (en) Voice processing method and device and terminal equipment
CN110164416B (en) Voice recognition method and device, equipment and storage medium thereof
US11410685B1 (en) Method for detecting voice splicing points and storage medium
CN107112007A (en) Speech recognition equipment and audio recognition method
JP2010032865A (en) Speech recognizer, speech recognition system, and program
CN110942765B (en) Method, device, server and storage medium for constructing corpus
CN111640423A (en) Word boundary estimation method and device and electronic equipment
CN110570838B (en) Voice stream processing method and device
CN112735394B (en) Semantic parsing method and device for voice
CN111414748A (en) Traffic data processing method and device
CN111063337A (en) Large-scale voice recognition method and system capable of rapidly updating language model
CN113112992A (en) Voice recognition method and device, storage medium and server
CN113724698B (en) Training method, device, equipment and storage medium of voice recognition model
CN111680514A (en) Information processing and model training method, device, equipment and storage medium
CN114783424A (en) Text corpus screening method, device, equipment and storage medium
CN115691503A (en) Voice recognition method and device, electronic equipment and storage medium
CN111128127A (en) Voice recognition processing method and device

Legal Events

Code | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant