CN111429903A - Audio signal identification method, device, system, equipment and readable medium - Google Patents


Info

Publication number: CN111429903A
Authority: CN (China)
Prior art keywords: text, semantic, intention, recognition result, audio signal
Legal status: Granted
Application number: CN202010196311.5A
Other languages: Chinese (zh)
Other versions: CN111429903B (en)
Inventors: 徐晓明 (Xu Xiaoming), 刘鹏 (Liu Peng), 徐犇 (Xu Ben)
Current assignees: Beijing Baidu Netcom Science and Technology Co Ltd; Shanghai Xiaodu Technology Co Ltd
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd; priority to CN202010196311.5A
Publication of application CN111429903A; application granted and published as CN111429903B
Status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 - Speech to text systems
    • G10L2015/223 - Execution procedure of a spoken command

Abstract

The embodiments of the disclosure provide an audio signal identification method, apparatus, system, device, and readable medium, wherein the method comprises the following steps: judging, for a text recognition result of a received audio signal, whether the text recognition result matches text content with a specific intention; if the text recognition result matches the text content with the specific intention, performing intention recognition on the text recognition result to obtain a target intention of the text recognition result; determining the slot position information corresponding to the text recognition result by using the mapping relation between the matched text content and slot position information; and obtaining a semantic recognition result of the audio signal according to the target intention of the text recognition result and the corresponding slot position information.

Description

Audio signal identification method, device, system, equipment and readable medium
Technical Field
The disclosed embodiments relate to the field of computer technologies, and in particular, to an audio signal identification method, apparatus, system, device, and readable medium.
Background
With the continuous development of audio signal identification technology, it has been widely applied in fields such as automobile driving, smart home, and intelligent business systems, enabling intelligent devices to accurately execute the corresponding functions.
When an intelligent device processes user requests, semantic recognition of control-instruction requests (control instructions for short) is particularly important. Accurately and comprehensively recognizing the control instructions spoken by the user allows the intelligent device to better meet user expectations.
Disclosure of Invention
The embodiments of the disclosure provide an audio signal identification method, apparatus, system, device, and readable medium, which can improve the accuracy of semantic recognition of audio signals.
In a first aspect, an embodiment of the present disclosure provides an audio signal identification method, including: judging, for a text recognition result of a received audio signal, whether the text recognition result matches text content with a specific intention; if the text recognition result matches the text content with the specific intention, performing intention recognition on the text recognition result to obtain a target intention of the text recognition result; determining the slot position information corresponding to the text recognition result by using the mapping relation between the matched text content and slot position information; and obtaining a semantic recognition result of the audio signal according to the target intention of the text recognition result and the corresponding slot position information.
In a second aspect, an embodiment of the present disclosure provides an audio signal identification apparatus, including: a text matching module, configured to judge, for a text recognition result of a received audio signal, whether the text recognition result matches text content with a specific intention; an intention recognition module, configured to perform intention recognition on the text recognition result, if it matches the text content with the specific intention, to obtain a target intention of the text recognition result; a slot position mapping module, configured to determine the slot position information corresponding to the text recognition result by using the mapping relation between the matched text content and slot position information; and a result determining module, configured to obtain a semantic recognition result of the audio signal according to the target intention of the text recognition result and the corresponding slot position information.
In a third aspect, an embodiment of the present disclosure provides an audio signal identification system, including: a sound collecting device configured to receive an audio signal; and an audio signal identification device configured to apply the audio signal identification method of the first aspect to the received audio signal.
In a fourth aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; a memory on which one or more programs are stored, the one or more programs, when executed by the one or more processors, causing the one or more processors to perform the audio signal recognition method of the first aspect; and one or more I/O interfaces connected between the processor and the memory and configured to realize information interaction between the processor and the memory.
In a fifth aspect, the disclosed embodiments provide a computer-readable medium, on which a computer program is stored, which when executed by a processor implements the audio signal identification method of the first aspect.
According to the audio signal identification method, apparatus, system, device, and readable medium provided by the embodiments of the disclosure, before intention recognition and slot mapping are performed, the text content with a specific intention is used to judge whether the text recognition result of the audio signal matches it, and intention recognition and slot mapping are performed on the text recognition result only on the premise of a match. By adding this pre-matching step on top of intention recognition and slot mapping, the accuracy of the semantic recognition result can be greatly improved.
Drawings
The accompanying drawings are included to provide a further understanding of the embodiments of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the principles of the disclosure and not to limit the disclosure. The above and other features and advantages will become more apparent to those skilled in the art by describing in detail exemplary embodiments thereof with reference to the attached drawings, in which:
fig. 1 is a schematic view of an application scenario of an audio signal identification system according to an embodiment of the present disclosure;
fig. 2 is a flowchart of an audio signal identification method according to an embodiment of the present disclosure;
FIG. 3 is a schematic flow chart illustrating semantic recognition using Ngram vocabularies according to an embodiment of the present disclosure;
fig. 4 is a schematic flow chart illustrating semantic recognition by using a semantic collocation text set according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a processing flow for mapping semantic layer semantic recognition results to application layer semantic recognition results according to an embodiment of the present disclosure;
fig. 6 is a schematic diagram of an offline training process provided by the embodiment of the present disclosure;
fig. 7 is a schematic diagram illustrating another audio signal identification process provided by the embodiment of the present disclosure;
fig. 8 is a block diagram of an audio signal identification apparatus according to an embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of an audio signal identification system according to an embodiment of the present disclosure;
fig. 10 is a block diagram illustrating an audio signal recognition apparatus according to an embodiment of the present disclosure;
fig. 11 is a block diagram of a computer-readable medium according to an embodiment of the disclosure.
Detailed Description
In order to make those skilled in the art better understand the technical solutions of the present disclosure, the following describes in detail an audio signal identification method, apparatus, system, device and readable medium provided by the present disclosure with reference to the accompanying drawings.
Example embodiments will be described more fully hereinafter with reference to the accompanying drawings, but example embodiments may be embodied in different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art. Embodiments of the present disclosure and features of embodiments may be combined with each other without conflict.
Fig. 1 is a schematic view of an application scenario of an audio signal identification system according to an embodiment of the present disclosure. As shown in fig. 1, a practical application scenario of the audio signal recognition system may include a sound source 10 and an audio signal recognition system 20, and the audio signal recognition system 20 includes: a sound collecting device 21 and an audio signal identifying device 22, the sound collecting device 21 may comprise a microphone array. Specifically, the sound collecting device 21 may send the audio signal to the audio signal identifying device 22 after receiving the audio signal emitted by the sound source 10, so that the audio signal identifying device 22 can perform text identification and semantic identification on the audio signal.
In one embodiment, the audio signal may include a voice signal, or may be a meaningful audio signal played by a machine, as long as the audio signal can drive the audio interaction device to perform audio interaction; the sound source 10 may comprise a user and/or a machine capable of playing meaningful audio signals; the audio signal recognition device 22 may include a smart speaker, a smart vehicle operating system, a smart battery, and other smart devices that may perform audio interaction.
In the embodiment of the disclosure, text recognition is performed on a received audio signal to obtain a text recognition result of the audio signal, and the text recognition result is used as a user request (query) to perform a subsequent semantic recognition process.
In the embodiment of the present disclosure, when the smart device processes the user request, semantic recognition of a control instruction class request (which may be referred to as a control instruction for short below) is important. The control instructions may include device control instructions (e.g., power off, volume adjustment), player control instructions (e.g., stop, previous, next), function control instructions (e.g., open favorites, view play history), and the like.
Generally, when a template-matching method is adopted for control instructions, the templates used in matching are hand-written in advance based on prior knowledge, and their number is limited. The generalization capability of such a method is therefore poor: in practical applications, even a slight change in the form of the user's utterance cannot be recognized, and the recall rate is low.
When an audio recognition method based on text classification and sequence labeling is adopted, intention recognition is performed with a text classification method, and on the premise that the user request is recognized as the target intention, slot position recognition is performed with a sequence labeling method. Text classification and sequence labeling methods generally need large-scale training data, on the order of hundreds of thousands to millions of manually labeled examples, so the implementation cost is high; meanwhile, for control-instruction intentions with fuzzy boundaries, the overall accuracy is very limited, typically only about 90 percent.
Therefore, it is desirable to provide an audio signal identification method, which can accurately and comprehensively identify a control instruction spoken by a user, so that an intelligent device can better meet user experience.
Fig. 2 is a flowchart of an audio signal identification method according to an embodiment of the present disclosure. As shown in fig. 2, the method may include the following steps.
S110, for the text recognition result of the received audio signal, determining whether the text recognition result matches the text content with a specific intention.
And S120, if the text content is matched with the text content with the specific intention, performing intention recognition on the text recognition result to obtain the target intention of the text recognition result.
And S130, determining the slot position information corresponding to the text recognition result by using the mapping relation between the matched text content and the slot position information.
And S140, obtaining a semantic recognition result of the audio signal according to the target intention of the text recognition result and the corresponding slot position information.
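For ease of reference, the four steps can be summarized in a minimal Python sketch (an illustration only; every function and variable name here is an assumption of this description, not the patent's implementation):

    from typing import Callable, Dict, Optional

    def recognize(text: str,
                  intent_texts: Dict[str, Dict[str, str]],
                  classify_intent: Callable[[str], str]) -> Optional[dict]:
        # S110: judge whether the text recognition result matches any
        # text content with a specific intention.
        matched = [t for t in intent_texts if t in text]
        if not matched:
            return None  # no match: the flow ends here

        # S120: intention recognition on the text recognition result.
        target_intent = classify_intent(text)

        # S130: map matched text content to slot position information.
        slots = {}
        for t in matched:
            slots.update(intent_texts[t])

        # S140: combine intention and slots into the semantic result.
        return {"intent": target_intent, "slots": slots}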
According to the audio signal identification method of the embodiments of the disclosure, before intention recognition and slot mapping are performed, the text content with a specific intention is used to judge whether the text recognition result of the audio signal matches it, and intention recognition and slot mapping are performed on the text recognition result only on the premise of a match. By adding this pre-matching step on top of intention recognition and slot mapping, the accuracy of the semantic recognition result can be greatly improved.
In the embodiment of the present disclosure, the text content with a specific intention may be at least one of a corpus Ngram vocabulary and a semantic collocation text set obtained by pre-training.
In one embodiment, the textual content with a particular intent includes a corpus Ngram vocabulary; the step of performing intent recognition on the text recognition result in step S120 to obtain a target intent of the text recognition result may specifically include: and performing intention recognition on the text recognition result by using an intention classification model obtained by pre-training to obtain a target intention of the text recognition result.
In one embodiment, the slot position information in step S130 refers to slot position information of the semantic layer; the slot position information corresponding to the text recognition result is the slot position information that corresponds to the text recognition result at the semantic level.
For ease of understanding, the flow of semantic recognition using Ngram vocabularies is described below by an exemplary embodiment. FIG. 3 illustrates a flow diagram of semantic recognition with Ngram vocabularies of an exemplary embodiment. As shown in fig. 3, the flow of semantic recognition includes the following steps.
S21, as shown by "Ngram match" in the figure, for a received user request (query), determine whether the query matches an Ngram vocabulary with a specific intention.
In this step, the query may be the text recognition result obtained by performing text recognition on the audio signal. When judging whether the query matches an intention Ngram vocabulary: if so, it is judged that the query matches a semantic collocation with a specific intention; if not, the process ends.
Referring to FIG. 3, for example, if the query is "we want to go out of your power and do not want to turn off", then because the query contains "power" and "off" from the Ngram vocabulary, it is determined that the query matches a semantic collocation with a specific intention.
S22, if the query matches the Ngram vocabulary, look up the preset Ngram mapping table and map the query to a specific slot position and slot position value.
In the embodiments of the present disclosure, slot position information may be understood as the key information recognized from an audio signal, such as a user speech signal. The slot position information in the embodiments of the present disclosure includes slot position information of the semantic layer and slot position information of the application layer. The slot position information of the semantic layer is the slot position information corresponding to the specified text content at the semantic level; the slot position information of the application layer is the slot position information corresponding, in the application layer, to the slot position information of the semantic layer, and can be used to determine a control instruction in the application according to the application-layer intention and the application-layer slot position information.
In this step, the Ngram mapping table indicates the mapping relation between the matched text content and slot position information. This mapping relation is the mapping relation between text content and semantic-layer slot position information, i.e., the slot position information corresponding to the semantics. Referring to fig. 3, for the matched text contents "power" and "off", the preset Ngram mapping table is looked up respectively to obtain the corresponding semantic-layer slot position information; for "off", the slot position name is "op_type" and the slot position value is "power_off".
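The Ngram mapping table of this example could be organized as in the following sketch; note that only the "off" entry (op_type = power_off) is given by the figure, while the entry for "power" is an assumption of this illustration:

    # Hypothetical Ngram mapping table for the Fig. 3 example.
    NGRAM_SLOT_MAP = {
        "off": {"op_type": "power_off"},   # given in the example
        "power": {"object": "power"},      # assumed; not specified in the text
    }

    def ngram_slot_mapping(query: str) -> dict:
        slots = {}
        for ngram, slot in NGRAM_SLOT_MAP.items():
            if ngram in query:             # S21: Ngram match
                slots.update(slot)         # S22: look up and map slots
        return slots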
S23, as shown by "intention classification" in the figure, perform intention recognition on the query using the intention classification model to obtain the intention to which the query belongs.
In this step, the intention classification model may be obtained by performing model training in advance using a query labeled with intention classification. The type and training process of the intent classification model are not particularly limited by the disclosed embodiments.
Referring to FIG. 3, intention recognition is performed on the query "we want to go out of your power and do not want to turn off", and the intention of the query is "set_power".
In this embodiment, the target intention of the text recognition result and the corresponding slot position information are used as the semantic recognition result of the audio signal, where the slot position information corresponding to the text recognition result is the semantic-layer slot position information corresponding to the text recognition result. It should be understood that, in the flow of semantic recognition described with reference to fig. 3, the processing order of steps S22 and S23 is not specifically limited: in practical applications, the "intention classification" step S23 may be performed first and the "slot mapping" step S22 afterwards; the processing order does not affect the semantic recognition result of the audio signal.
According to the audio signal identification method of the embodiments of the disclosure, a pre-matching Ngram step can be added before intention classification with a text classification model, which can greatly improve the accuracy of the semantic recognition result.
In one embodiment, the text content with a specific intention includes a semantic collocation text set; the step of performing intention recognition on the text recognition result in step S120 to obtain the target intention of the text recognition result may specifically include: S20, obtaining, from the text recognition result, the semantic collocation text that matches the semantic collocation text set; and S21, determining the target intention of the text recognition result according to the matched semantic collocation text, by using the preset mapping relation between semantic collocation texts and target intentions.
For ease of understanding, the following describes a process for semantic recognition using semantic collocation text sets by way of an exemplary embodiment. FIG. 4 illustrates a flow diagram of semantic recognition using semantic collocation text sets in an exemplary embodiment. As shown in fig. 4, the flow of semantic recognition includes the following steps.
S31, as shown by "semantic collocation recognition" in the figure, for the input query, judge whether the query matches a semantic collocation with a specific intention; if the match succeeds, output the semantic collocation matched by the query; otherwise the recognition result is null and the process ends.
Referring to fig. 4, for example, if the query is "i do not want to listen to this playing-me-collected song", then because the query contains "playing" and "collected song" from the semantic collocation text set, the query is determined to match a semantic collocation with a specific intention.
S32, as shown by "intention matching" in the figure, look up the preset semantic collocation mapping table with the matched semantic collocation text to obtain the intention corresponding to the matched semantic collocation text.
Referring to fig. 4, the matched semantic collocation texts "play" and "favorite songs" correspond to the intention "music".
S33, as shown by "slot mapping" in the figure, look up the preset semantic collocation mapping table with the matched semantic collocation text to obtain the slot position information corresponding to the matched semantic collocation text.
Referring to fig. 4, the slot position information corresponding to the matched semantic collocation text "play" is "action = play", that is, the slot position name is "action" and the slot position value is "play"; the slot position information corresponding to "collected song" is "list_name = favorite", that is, the slot position name is "list_name" and the slot position value is "favorite".
In the semantic recognition flow described with reference to fig. 4, the processing order of steps S32 and S33 is not specifically limited: in practical applications, the "slot mapping" step S33 may be performed before the "intention matching" step S32; steps S32 and S33 may also be combined, that is, once the query matches a semantic collocation with a specific intention, the matched semantic collocation text is used to look up the preset semantic collocation mapping table once to obtain both the target intention and the corresponding slot position information. Adjusting the order of, or combining, steps S32 and S33 does not affect the semantic recognition result of the audio signal.
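A combined lookup in the sense of steps S32 and S33 might look like the following sketch (table contents adapted from the Fig. 4 example; the structure and names are assumptions of this description):

    # Hypothetical semantic collocation mapping table for the Fig. 4 example.
    COLLOCATION_MAP = {
        ("play", "collected song"): {
            "intent": "music",   # intention name as given (truncated) in the text
            "slots": {"action": "play", "list_name": "favorite"},
        },
    }

    def recognize_by_collocation(query: str):
        for collocation, result in COLLOCATION_MAP.items():
            # S31: every part of the collocation must appear in the query
            if all(part in query for part in collocation):
                return result                # S32 + S33 in a single lookup
        return None                          # recognition result is null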
According to the audio signal identification method of the embodiments of the disclosure, by using semantic collocations and adding a collocation-matching step before intention recognition and slot mapping, recognition of long queries and of queries containing speech recognition errors can be greatly improved, significantly raising the fault tolerance and recall rate of the semantic recognition result.
In one embodiment, the text content with a specific intention includes both the Ngram vocabulary and a semantic collocation text set; step S140 may specifically include the following steps.
And S41, if the text recognition result is only matched with the semantic collocation text set, determining the semantic recognition result of the audio signal according to the target intention of the text recognition result matched with the semantic collocation text set and the corresponding slot position information.
And S42, if the text recognition result is matched with the corpus Ngram vocabulary, determining the semantic recognition result of the audio signal according to the target intention of the text recognition result matched with the corpus Ngram vocabulary and the corresponding slot position information.
In one embodiment, the processing flow of matching the query against an Ngram vocabulary, performing intention recognition with a pre-trained intention classification model, and performing slot mapping on the query is called the processing method based on Ngram and the classification model; the processing flow of matching the query against the semantic collocation text set and then performing intention recognition and slot position mapping with the preset mapping relations between semantic collocation texts and target intentions and between semantic collocation texts and slot position information is called the knowledge-based processing method.
In this embodiment, among these preset mapping relations, the mapping relation between semantic collocation text and slot position information is the mapping relation between semantic collocation text and semantic-layer slot position information. The processing result generated by the method based on Ngram and the classification model is more accurate than that of the knowledge-based method; therefore, for a query, the semantic recognition result obtained by matching the Ngram vocabulary has higher priority than the one obtained from the semantic collocation text set.
In one embodiment, the audio recognition method further comprises: s51, judging whether the text recognition result is correct; s52, if the text recognition result is matched with the preset text template, determining the text template matched with the text recognition result; and S53, based on the corresponding relationship between the preset text template and the target intention and the slot, using the target intention and the slot information corresponding to the matched text template as the target intention and the corresponding slot information of the text recognition result.
In this embodiment, based on the preset correspondence between text templates and target intentions and slot positions, the semantic-layer target intention and semantic-layer slot position information corresponding to the matched text template can be obtained. Because matching against a preset text template can be literal matching based on regular expressions, if the query matches a preset text template successfully, the processing result obtained by template matching has higher priority than the result obtained by matching the query against the text content with a specific intention.
In one embodiment, the audio recognition method further comprises: s61, obtaining corresponding application intention and slot position information according to the target intention of the text recognition result and the corresponding slot position information by using the preset semantic intention and the mapping relation between the slot position and the application intention and the slot position; and S62, obtaining the application instruction identification result corresponding to the audio signal according to the corresponding application intention and the slot position information.
In this embodiment, the semantic intention and slot position represent the intention and slot position information of the semantic layer, while the application intention and slot position represent those of the application layer; the "application intention and slot position information" obtained in step S61 are the application-layer intention and the application-layer slot position information, the latter being the key information related to the application instruction.
For ease of understanding, the process flow of mapping semantic layer semantic recognition results to application layer semantic recognition results is described below by an exemplary embodiment. Fig. 5 is a schematic processing flow diagram of mapping a semantic layer semantic recognition result to an application layer semantic recognition result according to the embodiment of the present disclosure.
In the case where the query matches the semantic collocation text set as shown in FIG. 5, steps S71-S73 of the audio signal recognition method of FIG. 5 are substantially the same as steps S31-S33 of the audio signal recognition method of FIG. 4, except that the audio signal recognition method of FIG. 5 further includes the following steps.
S74, as shown by "application layer intention mapping" in fig. 5, perform application-layer intention mapping using the target intention and slot position information obtained in steps S71-S73, to obtain the application-layer target intention and slot position information.
In this embodiment, the intention and the slot position of the semantic layer are mapped to the intention and the slot position of the application layer, so that an application control instruction directly acting on the application of the intelligent device can be obtained, and the accuracy of application control on the intelligent device is improved.
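As a hedged sketch of step S74, the semantic-layer result could be translated to the application layer through a rule table like the following (all application intention names and slot keys here are illustrative assumptions, not taken from the patent):

    # Hypothetical semantic-layer -> application-layer mapping rules.
    APP_LAYER_MAP = {
        "music": {
            "app_intent": "player.play_list",         # assumed application intention
            "slot_map": {"list_name": "playlist_id"}  # semantic slot -> app slot
        },
    }

    def to_app_layer(semantic: dict):
        rule = APP_LAYER_MAP.get(semantic["intent"])
        if rule is None:
            return None
        app_slots = {rule["slot_map"].get(k, k): v
                     for k, v in semantic["slots"].items()}
        return {"intent": rule["app_intent"], "slots": app_slots}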
In one embodiment, if the text content with the specific intention is a corpus Ngram vocabulary, it is text content constructed from the predicate units and noun units meeting a first confidence requirement, obtained by performing text recognition training with user search texts labeled as having the target intention; if the text content with the specific intention is a semantic collocation text set, it is text content constructed from the predicate units and noun units meeting a second confidence requirement, obtained by the same training; and the first confidence requirement is lower than the second confidence requirement.
In this embodiment, the main steps of the offline training process of the corpus Ngram vocabulary are the same as those of the training process of the knowledge-based semantic collocation text set, except that the judgment criterion is broader: the criterion for an Ngram is whether a query containing the Ngram is likely to have the target intention (i.e., the first confidence requirement is met); the criterion for a semantic collocation text is that a query containing the collocation must have the intention (i.e., the second confidence requirement is met, and the second confidence requirement is stricter than the first).
In one embodiment, the corpus Ngram vocabulary is a vocabulary of n-grams; when the corpus Ngram vocabulary is constructed, the predicate units and noun units meeting the first confidence requirement are used as candidate corpora, and the candidate corpora are labeled to obtain the n-gram corpus Ngram vocabulary.
That is, when performing offline training of the corpus Ngram vocabulary, the predicate units and noun units meeting the first confidence requirement are used as Ngram candidate corpora, and a valid unigram or multi-gram Ngram vocabulary is obtained by manually labeling the candidates.
In one embodiment, the semantic collocation text set includes semantic collocation and a corresponding semantic value, and the semantic collocation and the corresponding semantic value are obtained by performing semantic annotation on a predicate unit and a noun unit which meet the second confidence requirement.
That is to say, when performing offline training of a semantic collocation text set, semantic annotation is performed on a predicate unit and a noun unit meeting a second confidence requirement, so as to obtain a semantic collocation and a semantic value (value) of the semantic collocation.
In one embodiment, the step of performing text recognition training by using the user search text labeled as having the target intent may specifically include the following steps.
And S81, generating word segmentation results corresponding to the target intention of the user search text based on the pre-established knowledge word list.
S82, selecting the first semantic unit meeting the preset credibility condition from the word segmentation result, and updating the knowledge word list by using the selected first semantic unit.
And S83, generating updated word segmentation results corresponding to the target intention based on the updated knowledge word list.
S84, screening from the updated word segmentation result the second semantic units meeting the confidence condition, and updating the knowledge word list again with the screened second semantic units, where the first semantic units and the second semantic units are different kinds of semantic units among predicate semantic units and noun semantic units.
And S85, generating a re-updated word segmentation result corresponding to the target intention based on the re-updated knowledge word list until the screened first semantic unit and the screened second semantic unit both meet the preset semantic unit set convergence condition, and obtaining a predicate semantic unit set and a noun semantic unit set.
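The alternating screening loop of steps S81-S85 can be summarized in the sketch below; the segmentation and screening helpers are placeholders, since the patent does not specify their implementations:

    def train_unit_sets(queries, knowledge_vocab,
                        segment, screen_predicates, screen_nouns):
        predicates, nouns = set(), set()
        while True:
            # S81 / S83 / S85: segment with the current knowledge word list
            segments = [segment(q, knowledge_vocab) for q in queries]

            # S82: screen first (e.g., predicate) semantic units, update vocab
            new_predicates = screen_predicates(segments)
            knowledge_vocab |= new_predicates

            # S84: re-segment, screen second (e.g., noun) units, update again
            segments = [segment(q, knowledge_vocab) for q in queries]
            new_nouns = screen_nouns(segments)
            knowledge_vocab |= new_nouns

            # Convergence condition: neither unit set grows any further
            if new_predicates <= predicates and new_nouns <= nouns:
                return predicates, nouns
            predicates |= new_predicates
            nouns |= new_nouns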
For ease of understanding, the training process of the Ngram vocabulary and the semantic collocation text set in the embodiment of the present disclosure is described below with reference to FIG. 6. Fig. 6 shows an offline training process schematic of an embodiment of the present disclosure. As shown in FIG. 6, in one embodiment, the training process may include the following steps.
S901, obtaining a labeled query under a target intention by means of mining or manual labeling, and performing basic word segmentation on the labeled query to obtain a plurality of word segmentation segments.
In this step, the annotation query under the target intent may contain a keyword with the target intent. As shown in fig. 6, the label query under the target intent may include, for example: "channel List," "View playlist," "View favorite List," "open playlist," "Play favorite List," "open favorite List," "Play List," "favorites List," "shuffle favorite List," and "open channel List," and the like.
And S902, searching a knowledge word list, and performing word segmentation and combination on the query basic word segmentation result to obtain a self-adaptive word segmentation result under the target intention.
In this step, the pre-established knowledge word list may include a phrase corresponding to the target intention text.
S903, predicate semantic units are screened from the ngram prefixes through context entropy calculation.
and S904, calculating the confidence of each semantic unit, and manually screening the units with higher confidence to obtain a predicate semantic unit list.
In this step, during training of the Ngram vocabulary, the criterion for a predicate semantic unit is whether queries containing the unit meet the first confidence requirement, for example a confidence higher than a first confidence threshold; during training of the semantic collocation text set, the criterion is whether queries containing the unit meet the second confidence requirement, for example a confidence higher than a second confidence threshold. The second confidence threshold is greater than the first confidence threshold.
Therefore, during training of the Ngram vocabulary, if the criterion for a predicate semantic unit is that queries containing the unit meet the first confidence requirement, such queries are judged to possibly have the target intention, which satisfies the confidence requirement of the trained Ngram vocabulary.
During training of the semantic collocation text set, if the criterion for a predicate semantic unit is that queries containing the unit meet the second confidence requirement, such queries are determined to have the target intention, which satisfies the confidence requirement of the trained semantic collocation text set.
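Numerically, the two-threshold screening of this step (and of S908 below) could be expressed as follows; the threshold values are purely illustrative assumptions:

    # Assumed thresholds: the second is stricter than the first.
    FIRST_CONFIDENCE_THRESHOLD = 0.6   # Ngram vocabulary ("likely the intention")
    SECOND_CONFIDENCE_THRESHOLD = 0.9  # collocations ("must be the intention")

    def screen_by_confidence(unit_confidences: dict, threshold: float) -> set:
        # Keep candidate units above the threshold; candidates are then
        # manually reviewed, as in steps S904/S908.
        return {u for u, c in unit_confidences.items() if c >= threshold}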
And S905, supplementing the predicate semantic unit list into the knowledge word list to obtain an updated knowledge word list.
And S906, searching the updated knowledge word list, and performing word segmentation and combination on the query basic word segmentation result to obtain a target intention self-adaptive word segmentation result.
In this step, the word segmentation and merging are the same as in S902, and the embodiment of the present disclosure does not describe them again.
And S907, screening noun semantic units from the ngram suffix through context entropy calculation.
S908, calculating the confidence of each semantic unit, and manually screening the units with higher confidence to obtain a noun semantic unit list.
In this step, similarly to the screening that produces the predicate semantic unit list: during training of the Ngram vocabulary, the criterion for a noun semantic unit is whether a query containing the unit may have the intention, i.e., whether the first confidence requirement is met; during training of the semantic collocation text set, the criterion is whether a query containing the unit must have the intention, i.e., whether the second confidence requirement is met. Details are not repeated here.
S909, the noun semantic unit list is supplemented into the knowledge word list to obtain the re-updated knowledge word list.
In this embodiment, the above steps S902-S909 are repeated, cycling through predicate semantic unit screening and noun semantic unit screening, until the two semantic unit lists substantially converge. The basic convergence condition of the two semantic unit lists is that the number of predicates in the predicate semantic unit set and the number of nouns in the noun semantic unit set no longer increase.
S910, semantics corresponding to the noun semantic unit and the predicate semantic unit are manually marked to obtain semantic collocation and corresponding semantics.
As shown in fig. 6, in the offline training process, noun unit screening may also be performed before predicate semantic unit screening. That is, within steps S902-S909, after step S901, S902-S905 may be executed first and then S906-S909, i.e., the predicate semantic units are screened first and then the noun semantic units; or S906-S909 may be executed first and then S902-S905, i.e., the noun semantic units are screened first and then the predicate semantic units. The screening order does not affect the final screening results of the predicate semantic units and noun semantic units.
As a specific example, referring to fig. 6, suppose the target intention is "open my music list". During training of the Ngram vocabulary, when screening yields the predicate semantic units "open" and "look at" and the noun semantic unit "list", then "open", "look at", and "list" can serve as candidate Ngrams for the target intention "open my music list". During training of the semantic collocation text set, the screening must yield the predicate semantic unit "open" and the noun semantic unit "my music list" before "open" plus "my music list" can serve as a semantic collocation of the target intention.
In order to better understand the audio signal identification method of the present disclosure, an audio signal identification method of an exemplary embodiment of the present disclosure is described below through fig. 7. Fig. 7 illustrates an identification flow of an audio signal of still another example of the present disclosure. As shown in fig. 7, the method for recognizing an audio signal includes the following steps.
S1001, template matching is carried out on the query input by the user; if the text recognition result matches a preset text template successfully, the target intention and slot position information corresponding to the matched text template are obtained.
In the following description, the target intention and slot information obtained by template matching in step S1001 may be referred to as a semantic recognition result of template matching.
S1002, the query is matched against an Ngram vocabulary; if the match succeeds, slot mapping is performed on the matched Ngrams to obtain the slot position information corresponding to the query, and intention recognition is performed on the query with the pre-trained intention classification model to obtain the target intention corresponding to the query.
In the following description, the target intention and slot information obtained in step S1002 may be referred to as a semantic recognition result based on an ngram matching and classification model.
S1003, matching the query with the semantic matching text set, and if matching is successful, mapping the matched semantic matching based on a semantic matching mapping table to obtain the target intention and slot position information corresponding to the query.
In the following description, the target intention and slot position information obtained in step S1003 may be referred to as a semantic recognition result obtained by a knowledge-based method.
S1004, the semantic recognition result of template matching, the semantic recognition result based on Ngram matching and the classification model, and the knowledge-based semantic recognition result are merged to obtain the semantic-layer intention and slot position information.
In this step, the combined processing result may be obtained according to the priority of the processing result. Specifically, the priority of the semantic recognition result obtained based on the template matching method is greater than the priority of the semantic recognition result obtained based on the ngram matching and classification model; and the priority of the semantic recognition result obtained based on the ngram matching and classification model is greater than that of the semantic recognition result obtained by the knowledge-based method.
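The merge of step S1004 therefore reduces to taking the highest-priority non-empty result, as in this sketch (under the priority order stated above; result shapes are assumptions):

    def merge_results(template_result, ngram_model_result, knowledge_result):
        # Priority: template match > Ngram + classification model > knowledge.
        for result in (template_result, ngram_model_result, knowledge_result):
            if result is not None:
                return result
        return None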
In this embodiment, combining the template-based recognition result with the recognition result of the Ngram + classification model method and the recognition result of the knowledge-based method can further improve recognition accuracy.
Step S1005, mapping the semantic layer intent and the slot position information into application layer intent and slot position information, and using the application layer intent and the slot position information as an application layer semantic recognition result of the audio signal.
FIG. 7 also shows the basic flows of training the intention classification model and of offline training of the semantic collocation text set. The intention classification model can be obtained by performing model training in advance with queries labeled with intention classes; the offline training of the semantic collocation text set is consistent with the training process of the Ngram vocabulary and semantic collocation text set described above with reference to FIG. 6, and is not repeated here.
According to the audio signal identification method of the embodiments of the disclosure, adding a pre-matching Ngram step before intention recognition with the text classification model can greatly improve recognition accuracy; and with the knowledge-based method, semantic recognition of the audio signal first matches against the semantic collocation text set and then looks up the semantic collocation mapping table according to the matching result to obtain the target intention and slot position, which improves the fault tolerance and recall rate of recognition.
Fig. 8 shows a block diagram of an audio signal identification apparatus according to an embodiment of the present disclosure. As shown in fig. 8, the apparatus includes the following modules.
A text matching module 1110, configured to determine, according to a text recognition result of the received audio signal, whether the text recognition result matches a text content with a specific intention.
And the intention identifying module 1120 is used for identifying the intention of the text identifying result if the text identifying result is matched with the text content with the specific intention, so as to obtain the target intention of the text identifying result.
The slot mapping module 1130 is configured to determine slot information corresponding to the text recognition result by using a mapping relationship between the matched text content and the slot information.
And the result determining module 1140 is configured to obtain a semantic recognition result of the audio signal according to the target intention of the text recognition result and the corresponding slot position information.
According to the audio signal recognition device of the embodiments of the disclosure, the text content with a specific intention is used to judge whether the text recognition result of the audio signal matches it, and intention recognition and slot mapping are performed on the text recognition result only on the premise of a match. By adding this pre-matching step on top of intention recognition and slot mapping, the accuracy of the semantic recognition result can be greatly improved.
In one embodiment, the textual content with a particular intent includes a corpus Ngram vocabulary; the intention identification module 1120 is specifically further configured to: and performing intention recognition on the text recognition result by using an intention classification model obtained by pre-training to obtain a target intention of the text recognition result.
In one embodiment, the textual content with a particular intent includes a set of semantic collocation texts; the intention identification module 1120 is specifically further configured to: obtaining a semantic collocation text matched with the semantic collocation text set from a text recognition result; and determining the target intention of the text recognition result according to the matched semantic collocation text by utilizing the preset mapping relation between the semantic collocation text and the target intention.
In one embodiment, the textual content with a particular intent further includes a set of semantic collocation texts; the result determination module 1140 further comprises: the first recognition result determining unit is used for determining the semantic recognition result of the audio signal according to the target intention and the corresponding slot position information of the text recognition result matched with the semantic collocation text set if the text recognition result is only matched with the semantic collocation text set; and the second identification result determining unit is used for determining the semantic identification result of the audio signal according to the target intention and the corresponding slot position information of the text identification result matched with the corpus Ngram vocabulary if the text identification result is matched with the corpus Ngram vocabulary.
In one embodiment, the audio signal recognition apparatus may further include: the template matching judging unit is used for judging whether the text recognition result is matched with a preset text template; the text model determining unit is used for determining a text template matched with the text recognition result if the text recognition result is matched with a preset text template; and the template matching result determining unit is used for taking the target intention and the slot position information corresponding to the matched text template as the target intention and the corresponding slot position information of the text recognition result based on the corresponding relation between the preset text template and the target intention and the slot position.
In one embodiment, the audio signal recognition apparatus may further include: the application layer identification result mapping module is used for obtaining corresponding application intention and slot position information according to the target intention of the text identification result and the corresponding slot position information by utilizing the preset semantic intention and the mapping relation between the slot position and the application intention; and obtaining an application instruction identification result corresponding to the audio signal according to the corresponding application intention and the slot position information.
In one embodiment, the text content with the specific intention is a corpus Ngram, and the corpus Ngram is the text content constructed by predicate units and noun units meeting the first confidence requirement, which are obtained by performing text recognition training by using a user search text labeled as having a target intention.
In one embodiment, the text content with the specific intention is a semantic collocation text set, and the semantic collocation text set is text content constructed by using a predicate unit and a noun unit which are obtained by performing text recognition training by using a user search text labeled as having a target intention and meet a second confidence requirement. The first confidence requirement is lower than the second confidence requirement.
In one embodiment, the corpus Ngram vocabulary is a vocabulary with n-grams; when a corpus Ngram word list is constructed, predicate units and noun units meeting the first confidence requirement are used as candidate corpuses, and the candidate corpuses are labeled to obtain the corpus Ngram word list with n-gram.
In one embodiment, the semantic collocation text set includes semantic collocation and a corresponding semantic value, and the semantic collocation and the corresponding semantic value are obtained by performing semantic annotation on a predicate unit and a noun unit which meet the second confidence requirement.
In one embodiment, the audio signal recognition apparatus further includes a text recognition training module, and the text recognition training module may specifically include: the word segmentation unit is used for generating word segmentation results of the user search texts, corresponding to the target intentions, based on a pre-established knowledge word list; the first semantic screening unit is used for screening a first semantic unit meeting a preset reliability condition from the word segmentation result and updating the knowledge word list by utilizing the screened first semantic unit; the word segmentation unit is also used for generating an updated word segmentation result corresponding to the target intention based on the updated knowledge word list; the second semantic screening unit is used for screening a second semantic unit meeting the confidence degree condition from the new word segmentation result and updating the knowledge word list again by using the screened second semantic unit, wherein the first semantic unit and the second semantic unit are different semantic units in a predicate semantic unit and a noun semantic unit; and the word segmentation unit is also used for generating a re-updated word segmentation result corresponding to the target intention based on the re-updated knowledge word list until the screened first semantic unit and the screened second semantic unit both meet the preset semantic unit set convergence condition, so as to obtain a predicate semantic unit set and a noun semantic unit set.
According to the audio signal identification device of the embodiments of the disclosure, adding a pre-matching Ngram step before intention recognition with the text classification model can greatly improve recognition accuracy; and the knowledge-based method can be used to first match the semantic recognition of the audio signal against the semantic collocation text set and look up the semantic collocation mapping table according to the matching result to obtain the target intention and slot position, improving the fault tolerance and recall rate of recognition.
Fig. 9 shows a schematic structural diagram of an audio signal identification system according to an embodiment of the present disclosure. In one embodiment, the audio signal identification system comprises: a sound collecting device 1210 for receiving an audio signal; and an audio signal recognition device 1220 for implementing any one of the audio signal identification methods described in the above embodiments.
Fig. 10 is a block diagram illustrating an audio signal recognition apparatus provided in an embodiment of the present disclosure. As shown in fig. 10, an embodiment of the present disclosure provides an audio signal recognition apparatus including: one or more processors 1301; a memory 1302 on which one or more programs are stored which, when executed by the one or more processors, cause the one or more processors to implement the audio signal recognition method of any of the above embodiments; and one or more I/O interfaces 1303 connected between the processors and the memory, and configured to enable information interaction between the processors and the memory.
The processor 1301 is a device with data processing capability, including but not limited to a central processing unit (CPU). The memory 1302 is a device with data storage capability, including but not limited to random access memory (RAM, more specifically SDRAM, DDR, etc.), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), and flash memory (FLASH). The I/O (read/write) interface 1303 is connected between the processor 1301 and the memory 1302, enables information interaction between the processor 1301 and the memory 1302, and includes but is not limited to a data bus (Bus).
In some embodiments, the processor 1301, the memory 1302, and the I/O interface 1303 are interconnected via a bus 1304, which in turn connects other components of the electronic device 1300.
FIG. 11 shows a block diagram of a computer-readable medium provided by an embodiment of the disclosure. As shown in fig. 11, an embodiment of the present disclosure provides a computer-readable medium, on which a computer program is stored, and the program, when executed by a processor, implements any one of the audio signal identification methods described above.
It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.
Example embodiments have been disclosed herein, and although specific terms are employed, they are used and should be interpreted in a generic and descriptive sense only and not for purposes of limitation. In some instances, features, characteristics and/or elements described in connection with a particular embodiment may be used alone or in combination with features, characteristics and/or elements described in connection with other embodiments, unless expressly stated otherwise, as would be apparent to one skilled in the art. Accordingly, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the disclosure as set forth in the appended claims.

Claims (14)

1. An audio signal identification method, comprising:
judging, according to a text recognition result of a received audio signal, whether the text recognition result matches text content with a specific intention;
if the text recognition result matches the text content with the specific intention, performing intention recognition on the text recognition result to obtain a target intention of the text recognition result;
determining slot position information corresponding to the text recognition result by using the mapping relation between the matched text content and the slot position information;
and obtaining a semantic recognition result of the audio signal according to the target intention of the text recognition result and the corresponding slot position information.
2. The method of claim 1, wherein the text content with the specific intention comprises a corpus Ngram word list; and the performing intention recognition on the text recognition result to obtain a target intention of the text recognition result comprises:
and performing intention recognition on the text recognition result by using an intention classification model obtained by pre-training to obtain a target intention of the text recognition result.
3. The method of claim 1, wherein the text content with the specific intention comprises a semantic collocation text set; and the performing intention recognition on the text recognition result to obtain a target intention of the text recognition result comprises:
obtaining a semantic collocation text matched with the semantic collocation text set from the text recognition result;
and determining the target intention of the text recognition result according to the matched semantic collocation text by utilizing the mapping relation between the preset semantic collocation text and the target intention.
4. The method of claim 2, wherein the text content with the specific intention further comprises a semantic collocation text set; and the obtaining of the semantic recognition result of the audio signal according to the target intention of the text recognition result and the corresponding slot position information comprises:
if the text recognition result matches only the semantic collocation text set, determining the semantic recognition result of the audio signal according to the target intention and the corresponding slot position information of the text recognition result matched with the semantic collocation text set;
and if the text recognition result matches the corpus Ngram word list, determining the semantic recognition result of the audio signal according to the target intention of the text recognition result matched with the corpus Ngram word list and the corresponding slot position information.
5. The method of claim 1, wherein the method further comprises:
judging whether the text recognition result is matched with a preset text template or not;
if the text recognition result is matched with a preset text template, determining the text template matched with the text recognition result;
and based on the corresponding relation between a preset text template and the target intention and the slot position, taking the target intention and the slot position information corresponding to the matched text template as the target intention and the corresponding slot position information of the text recognition result.
6. The method of claim 1, wherein the method further comprises:
obtaining corresponding application intention and slot position information according to the target intention of the text recognition result and the corresponding slot position information by using a preset semantic intention and a mapping relation between the slot position and the application intention and the slot position;
and obtaining an application instruction identification result corresponding to the audio signal according to the corresponding application intention and the slot position information.
7. The method of claim 1, wherein,
if the text content with the specific intention is a corpus Ngram word list, the corpus Ngram word list is text content constructed from predicate units and noun units that meet a first confidence requirement and are obtained by performing text recognition training using user search texts labeled as having a target intention;
if the text content with the specific intention is a semantic collocation text set, the semantic collocation text set is text content constructed from predicate units and noun units that meet a second confidence requirement and are obtained by performing the text recognition training;
and wherein the first confidence requirement is lower than the second confidence requirement.
8. The method of claim 7, wherein,
the corpus Ngram word list is a word list organized by n-gram grammar;
and when the corpus Ngram word list is constructed, predicate units and noun units meeting the first confidence requirement are used as candidate corpora, and the candidate corpora are labeled to obtain the word list.
9. The method of claim 1, wherein,
the semantic collocation text set comprises semantic collocation and corresponding semantic values;
the semantic collocation and the corresponding semantic value are obtained by performing semantic annotation on a predicate unit and a noun unit which meet the requirement of a second confidence degree.
10. The method of claim 7, wherein the performing text recognition training using user search text labeled as having a target intent comprises:
generating word segmentation results of the user search texts, corresponding to the target intentions, based on a pre-established knowledge word list;
screening a first semantic unit meeting a preset confidence condition from the word segmentation result, and updating the knowledge word list by using the screened first semantic unit;
generating an updated word segmentation result corresponding to the target intention based on the updated knowledge word list;
screening, from the updated word segmentation result, a second semantic unit meeting the confidence condition, and updating the knowledge word list again by using the screened second semantic unit, wherein the first semantic unit and the second semantic unit are different semantic units among a predicate semantic unit and a noun semantic unit;
and generating, based on the re-updated knowledge word list, a re-updated word segmentation result corresponding to the target intention, until the screened first semantic unit and the screened second semantic unit both meet a preset semantic unit set convergence condition, so as to obtain the predicate semantic unit set and the noun semantic unit set.
11. An audio signal identification apparatus comprising:
the text matching module is used for judging whether the text recognition result is matched with the text content with a specific intention or not according to the text recognition result of the received audio signal;
the intention identification module is used for carrying out intention identification on the text identification result if the text identification result is matched with the text content with the specific intention so as to obtain the target intention of the text identification result;
the slot position mapping module is used for determining slot position information corresponding to the text recognition result by utilizing the mapping relation between the matched text content and the slot position information;
and the result determining module is used for obtaining the semantic recognition result of the audio signal according to the target intention of the text recognition result and the corresponding slot position information.
12. An audio signal identification system comprising:
a sound collecting device for receiving an audio signal;
an audio signal identification device for implementing the audio signal identification method according to any one of claims 1-10 using the received audio signal.
13. An audio signal identification device comprising:
one or more processors;
a storage device having one or more programs stored thereon, which when executed by the one or more processors, cause the one or more processors to implement the audio signal identification method according to any one of claims 1 to 10;
one or more I/O interfaces connected between the processor and the storage device and configured to enable information interaction between the processor and the storage device.
14. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the audio signal recognition method according to any one of claims 1 to 10.
CN202010196311.5A 2020-03-19 2020-03-19 Audio signal identification method, device, system, equipment and readable medium Active CN111429903B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010196311.5A CN111429903B (en) 2020-03-19 2020-03-19 Audio signal identification method, device, system, equipment and readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010196311.5A CN111429903B (en) 2020-03-19 2020-03-19 Audio signal identification method, device, system, equipment and readable medium

Publications (2)

Publication Number Publication Date
CN111429903A (en) 2020-07-17
CN111429903B (en) 2021-02-05

Family

ID=71549610

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010196311.5A Active CN111429903B (en) 2020-03-19 2020-03-19 Audio signal identification method, device, system, equipment and readable medium

Country Status (1)

Country Link
CN (1) CN111429903B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9772994B2 (en) * 2013-07-25 2017-09-26 Intel Corporation Self-learning statistical natural language processing for automatic production of virtual personal assistants
US9911412B2 (en) * 2015-03-06 2018-03-06 Nuance Communications, Inc. Evidence-based natural language input recognition
CN108170859A (en) * 2018-01-22 2018-06-15 北京百度网讯科技有限公司 Method, apparatus, storage medium and the terminal device of speech polling
WO2019172878A1 (en) * 2018-03-05 2019-09-12 Google Llc Transitioning between prior dialog contexts with automated assistants
CN108628830A (en) * 2018-04-24 2018-10-09 北京京东尚科信息技术有限公司 A kind of method and apparatus of semantics recognition
CN109062977A (en) * 2018-06-29 2018-12-21 厦门快商通信息技术有限公司 A kind of automatic question answering text matching technique, automatic question-answering method and system based on semantic similarity
CN109101545A (en) * 2018-06-29 2018-12-28 北京百度网讯科技有限公司 Natural language processing method, apparatus, equipment and medium based on human-computer interaction
CN110175223A (en) * 2019-05-29 2019-08-27 北京搜狗科技发展有限公司 A kind of method and device that problem of implementation generates
CN110705267A (en) * 2019-09-29 2020-01-17 百度在线网络技术(北京)有限公司 Semantic parsing method, semantic parsing device and storage medium
CN110866090A (en) * 2019-11-14 2020-03-06 百度在线网络技术(北京)有限公司 Method, apparatus, electronic device and computer storage medium for voice interaction

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
AMBER NIGAM: "Intent Detection and Slots Prompt in a Closed-Domain Chatbot", 《2019 IEEE 13TH INTERNATIONAL CONFERENCE ON SEMANTIC COMPUTING (ICSC)》 *
YOUNG-BUM KIM: "ONENET: Joint domain, intent, slot prediction for spoken language understanding", 《2017 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU)》 *
LIN XIANHUI: "Research on Task-Oriented Dialogue Systems for the Travel Domain", 《China Masters' Theses Full-text Database》 *
HUANG YI: "Architecture and Algorithms of Intelligent Dialogue Systems", 《Journal of Beijing University of Posts and Telecommunications》 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112017647A (en) * 2020-09-04 2020-12-01 北京蓦然认知科技有限公司 Semantic-combined speech recognition method, device and system
CN112017647B (en) * 2020-09-04 2024-05-03 Shenzhen Haibing Technology Co., Ltd. Semantic-combined voice recognition method, device and system
CN112102832A (en) * 2020-09-18 2020-12-18 广州小鹏汽车科技有限公司 Speech recognition method, speech recognition device, server and computer-readable storage medium
CN112164400A (en) * 2020-09-18 2021-01-01 广州小鹏汽车科技有限公司 Voice interaction method, server and computer-readable storage medium
CN111930919A (en) * 2020-09-30 2020-11-13 知学云(北京)科技有限公司 Enterprise online education APP voice interaction implementation method
CN111930919B (en) * 2020-09-30 2021-01-05 知学云(北京)科技有限公司 Enterprise online education APP voice interaction implementation method
CN112562658A (en) * 2020-12-04 2021-03-26 广州橙行智动汽车科技有限公司 Groove filling method and device
CN113571049A (en) * 2021-07-22 2021-10-29 成都航盛智行科技有限公司 VR-based vehicle body control system and method
WO2023212993A1 (en) * 2022-05-05 2023-11-09 青岛海尔科技有限公司 Appliance control method, storage medium, and electronic device

Also Published As

Publication number Publication date
CN111429903B (en) 2021-02-05

Similar Documents

Publication Publication Date Title
CN111429903B (en) Audio signal identification method, device, system, equipment and readable medium
US11380330B2 (en) Conversational recovery for voice user interface
US9454957B1 (en) Named entity resolution in spoken language processing
US11545142B2 (en) Using context information with end-to-end models for speech recognition
EP3433761B1 (en) Fine-grained natural language understanding
US9646606B2 (en) Speech recognition using domain knowledge
CN111710333B (en) Method and system for generating speech transcription
US9201923B2 (en) Method and apparatus for organizing and optimizing content in dialog systems
US9953644B2 (en) Targeted clarification questions in speech recognition with concept presence score and concept correctness score
US20070208561A1 (en) Method and apparatus for searching multimedia data using speech recognition in mobile device
US11074280B2 (en) Cluster based search and recommendation method to rapidly on-board commands in personal assistants
US10366690B1 (en) Speech recognition entity resolution
CN101996631A (en) Method and device for aligning texts
US10970470B1 (en) Compression of machine learned models
KR20220004224A (en) Context biasing for speech recognition
US11069351B1 (en) Vehicle voice user interface
CN112825249A (en) Voice processing method and device
US10102845B1 (en) Interpreting nonstandard terms in language processing using text-based communications
Jeong et al. Multi-domain spoken language understanding with transfer learning
CN116343771A (en) Music on-demand voice instruction recognition method and device based on knowledge graph
CN113763947B (en) Voice intention recognition method and device, electronic equipment and storage medium
US11823671B1 (en) Architecture for context-augmented word embedding
US10304454B2 (en) Persistent training and pronunciation improvements through radio broadcast
CN112017647B (en) Semantic-combined voice recognition method, device and system
US20230143110A1 (en) System and metohd of performing data training on morpheme processing rules

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20210514

Address after: 100085 Baidu Building, 10 Shangdi Tenth Street, Haidian District, Beijing

Patentee after: BEIJING BAIDU NETCOM SCIENCE AND TECHNOLOGY Co.,Ltd.

Patentee after: Shanghai Xiaodu Technology Co.,Ltd.

Address before: 100085 Baidu Building, 10 Shangdi Tenth Street, Haidian District, Beijing

Patentee before: BEIJING BAIDU NETCOM SCIENCE AND TECHNOLOGY Co.,Ltd.