CN111178055B

CN111178055B - Corpus identification method, apparatus, terminal device and medium

Info

Publication number: CN111178055B
Application number: CN201911307187.9A
Authority: CN
Inventors: 刘志强; 李前国; 叶筠
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2019-12-18
Filing date: 2019-12-18
Publication date: 2022-07-29
Anticipated expiration: 2039-12-18
Also published as: CN111178055A; WO2021120876A1

Abstract

The embodiments of the present application relate to the technical field of information and provide a corpus identification method, a corpus identification device, a terminal device and a medium. The method comprises the following steps: acquiring an original corpus to be identified; identifying the original corpus with a plurality of Natural Language Understanding (NLU) engines to obtain the intention category corresponding to each NLU engine; determining the intention credibility of the original corpus according to the intention category of each NLU engine; and identifying the original corpus according to the intention credibility. The embodiments can perform fine-grained credibility processing on external NLU services per field or intention, so as to identify massive numbers of corpora and thereby generate a corpus. When the terminal identifies a corpus item based on the generated corpus, the accuracy of corpus identification can be improved while the service recall rate is effectively improved. The method can be widely applied in fields such as artificial intelligence, in particular in application scenarios that need to improve the service recall rate based on natural language understanding.

Description

Corpus identification method, apparatus, terminal device and medium

Technical Field

The application belongs to the technical field of information, and particularly relates to a corpus identification method, a corpus identification device, a terminal device and a medium.

Background

Recall ratio, also known as recall rate, refers to the ratio of the amount of relevant information retrieved from a database to the total amount of relevant information. In fields such as Artificial Intelligence (AI), improving the service recall rate helps to enhance the service experience of users. For example, when a user uses a voice assistant service on a terminal such as a mobile phone, whether the voice assistant can accurately determine what the user said and complete the corresponding task or return the corresponding information greatly affects the user's normal use.

At present, in order to increase the service recall rate, a terminal manufacturer connects a plurality of third-party Content Providers (CPs) with Natural Language Understanding (NLU) capability to a terminal at the same time; the NLU system of each CP identifies the corpus input by the user, and one of the results is then selected and returned to the user. However, the NLU systems of the CPs accessed by the terminal differ in capability, and the CPs do not define fields or intentions according to a unified standard, so it is difficult to compare the CPs accurately with one another. Moreover, although introducing a plurality of NLU systems improves the recall rate, it also increases awkward chit-chat responses and reduces the accuracy of corpus identification.

Disclosure of Invention

The embodiment of the application provides a corpus identification method, a corpus identification device, terminal equipment and a medium, which can improve service recall rate and improve corpus identification accuracy rate by performing fine-grained credibility processing on the corresponding fields or intents of the corpus identified by a plurality of external NLU systems.

In a first aspect, an embodiment of the present application provides a corpus identification method, including:

acquiring an original corpus to be identified;

adopting a plurality of Natural Language Understanding (NLU) engines to identify the original corpus and respectively obtaining intention categories corresponding to the NLU engines;

determining the intention credibility of the original corpus according to the intention category of each NLU engine;

and identifying the original corpus according to the intention credibility.

Illustratively, the recognizing the original corpus by using a plurality of natural language understanding NLU engines to obtain an intention category corresponding to each NLU engine respectively includes:

calling processing interfaces of a plurality of NLU engines;

respectively inputting the original corpus into the processing interface of each NLU engine to instruct each NLU engine to identify the original corpus;

and receiving the intention category output by each NLU engine.

Illustratively, the determining the intention credibility of the original corpus according to the intention category of each NLU engine includes:

determining an intention score corresponding to the intention category of each NLU engine, wherein the intention score corresponding to each intention category is obtained by testing a sample corpus by adopting each NLU engine;

and calculating the intention credibility of the original corpus according to each intention category and the corresponding intention score.

Illustratively, the calculating the intention reliability of the original corpus according to each intention category and the intention score corresponding to the intention category includes:

determining a weight value for each of the intent categories;

and weighting and summing the intention scores corresponding to the intention categories by adopting the weight values to obtain the intention credibility of the original corpus.

Illustratively, the identifying the original corpus according to the intention reliability includes:

if the intention credibility is greater than or equal to a preset credibility threshold, identifying the original corpus as an effective corpus;

and if the intention credibility is smaller than the credibility threshold, identifying the original corpus as an invalid corpus.

Illustratively, after identifying the original corpus as an invalid corpus, further comprising:

judging whether a plurality of intention categories corresponding to the invalid corpus are all empty or not;

if the plurality of intention categories corresponding to the invalid corpus are all empty, deleting the invalid corpus;

if at least one of the intention categories corresponding to the invalid corpus is not empty, dividing the invalid corpus into a plurality of corpus classes according to the intention categories, adopting the NLU engines to identify the invalid corpus in each corpus class again, and identifying the invalid corpus in the corpus classes as the valid corpus if the intention categories identified by each NLU engine are unchanged.

After identifying the original corpus as the valid corpus, the method further includes:

acquiring an initial category of the effective corpus;

and storing the effective corpus, the initial category of the effective corpus and the intention category identified by each NLU engine into a corpus in an associated manner.

Illustratively, the method further comprises:

dividing a plurality of effective corpora into a plurality of identification classes according to the stored initial classes and intention classes of the effective corpora;

counting the number of effective corpora contained in each identification class;

and generating a white list of the corpus according to the number of the effective corpora contained in each identification class.

Illustratively, the dividing the plurality of valid corpora into a plurality of recognition classes according to the stored initial categories and intention categories of the plurality of valid corpora includes:

and dividing the effective corpora with the same initial category and intention category into the same identification category.

Illustratively, the generating a white list of the corpus according to the number of the valid corpora included in each recognition class includes:

sorting the identification classes according to the number of effective corpora contained in each identification class;

and extracting the recognition class in a preset sorting interval to be used as a white list of the corpus.
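The white list generation steps above can be sketched as follows. This is only an illustrative sketch: the function name `build_whitelist`, the representation of a stored effective corpus as an `(initial_category, intent_categories)` pair, and the interpretation of the "preset sorting interval" as a slice `[start, end)` of the ranking by descending count are all assumptions, and the category names are made up.

```python
from collections import Counter
from typing import List, Sequence, Tuple

# One stored effective corpus is reduced to the pair that defines its
# identification class: (initial category, intent categories of all engines).
IdentificationClass = Tuple[str, Tuple[str, ...]]


def build_whitelist(
    records: Sequence[IdentificationClass], start: int = 0, end: int = 2
) -> List[IdentificationClass]:
    """Group by identification class, count, sort, and keep a ranking interval."""
    # Corpora with the same initial category and intent categories share a class.
    counts = Counter(records)
    # most_common() returns the classes sorted by descending corpus count.
    ranked = [cls for cls, _ in counts.most_common()]
    # Assumption: the preset sorting interval is a slice of that ranking.
    return ranked[start:end]


# Hypothetical stored corpora (category names are illustrative only).
records = [
    ("weather", ("w1", "w2")),
    ("weather", ("w1", "w2")),
    ("weather", ("w1", "w2")),
    ("music", ("m1", "m2")),
    ("music", ("m1", "m2")),
    ("weather", ("w1", "x")),
]
whitelist = build_whitelist(records, start=0, end=2)
print(whitelist)
```

With the sample data, the two most populated identification classes are kept and the class that occurs only once is left out of the white list.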

In a second aspect, an embodiment of the present application provides a corpus identification method, including:

when a target corpus to be identified is received, identifying the target corpus by adopting a plurality of NLU engines to respectively obtain intention categories corresponding to each NLU engine;

extracting a white list matched with the intention category from a preset corpus;

and identifying the target corpus according to the intention category contained in the white list.

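Illustratively, the white list is generated by:

dividing a plurality of effective corpora into a plurality of identification classes according to initial classes and intention classes of the plurality of effective corpora stored in the corpus;

counting the number of effective corpora contained in each identification class;

and generating the white list of the corpus according to the number of effective corpora contained in each identification class.

Illustratively, the dividing the plurality of effective corpora into a plurality of identification classes according to the initial categories and the intention categories of the plurality of effective corpora stored in the corpus includes:

dividing the effective corpora with the same initial category and intention category into the same identification class.

Illustratively, the generating the white list of the corpus according to the number of effective corpora contained in each identification class includes:

sorting the identification classes according to the number of effective corpora contained in each identification class;

and extracting the identification classes within a preset sorting interval as the white list of the corpus.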

Illustratively, the extracting a whitelist matching the intention category from a preset corpus includes:

acquiring an initial category of the target corpus;

and extracting a white list containing the initial category and the intention category identified by the at least one NLU engine from the corpus.

Illustratively, the white list contains a plurality of intention categories, and the identifying the target corpus according to the intention categories contained in the white list includes:

determining an intention score corresponding to each intention category contained in the white list, wherein the intention score corresponding to each intention category is obtained by testing a sample corpus by adopting each NLU engine;

and identifying the intention category corresponding to the maximum intention score as the target intention category of the target corpus.
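The second-aspect flow (match a white list entry, then take the intention category with the maximum intention score) can be sketched as follows. The names `match_whitelist` and `pick_target_intent`, the white list layout, the "weather" initial category, and the choice to key scores by intention category name are assumptions made for illustration; the three scores are the ones used in the worked example of the detailed description.

```python
from typing import Dict, Optional, Sequence, Tuple

# Assumed layout of a white list entry: (initial category, intent categories).
WhitelistEntry = Tuple[str, Tuple[str, ...]]


def match_whitelist(
    whitelist: Sequence[WhitelistEntry],
    initial_category: str,
    engine_intents: Sequence[Optional[str]],
) -> Optional[Tuple[str, ...]]:
    """Return the intent categories of the first entry that has the same initial
    category and shares at least one intent category with the engines' outputs."""
    for entry_initial, entry_intents in whitelist:
        if entry_initial == initial_category and set(entry_intents) & set(engine_intents):
            return entry_intents
    return None


def pick_target_intent(intents: Sequence[str], scores: Dict[str, float]) -> str:
    """Pick the intent category with the maximum intent score (missing score -> 0)."""
    return max(intents, key=lambda intent: scores.get(intent, 0.0))


whitelist = [("weather", ("sub-intention 1.1", "sub-intention 2.2", "sub-intention 3.1"))]
scores = {"sub-intention 1.1": 0.5, "sub-intention 2.2": 1.5, "sub-intention 3.1": 0.5}

# None stands for an engine that could not identify the target corpus.
matched = match_whitelist(whitelist, "weather", ["sub-intention 1.1", "sub-intention 2.2", None])
target = pick_target_intent(matched, scores) if matched else None
print(target)
```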

In a third aspect, an embodiment of the present application provides a corpus identification device, including:

the original corpus acquiring module is used for acquiring an original corpus to be identified;

the intention category identification module is used for identifying the original corpus by adopting a plurality of Natural Language Understanding (NLU) engines and respectively obtaining intention categories corresponding to the NLU engines;

an intention credibility determining module, configured to determine an intention credibility of the original corpus according to the intention category of each NLU engine;

and an original corpus identification module, configured to identify the original corpus according to the intention credibility.

In a fourth aspect, an embodiment of the present application provides a corpus identification device, including:

an intention category identification module, configured to, when a target corpus to be identified is received, identify the target corpus by adopting a plurality of NLU engines and respectively obtain the intention categories corresponding to the NLU engines;

a white list extraction module, configured to extract a white list matched with the intention category from a preset corpus;

and a target corpus identification module, configured to identify the target corpus according to the intention category contained in the white list.

In a fifth aspect, an embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the corpus identification method according to any one of the first aspect or the second aspect when executing the computer program.

In a sixth aspect, an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored, and the computer program, when executed by a processor of a terminal device, implements the corpus identification method according to any one of the first aspect or the second aspect.

In a seventh aspect, an embodiment of the present application provides a computer program product, which, when running on a terminal device, causes the terminal device to execute the corpus identification method according to any one of the first aspect or the second aspect.

Compared with the prior art, the embodiment of the application has the following beneficial effects:

According to the embodiments of the present application, a plurality of NLU engines are adopted to identify the original corpus to be identified, the intention category corresponding to each NLU engine is obtained, and the intention credibility of the original corpus is then determined according to the intention category of each NLU engine, so that the original corpus can be identified according to the intention credibility. The embodiments can perform fine-grained scoring and credibility processing on external NLU services per field or intention, so as to identify massive numbers of corpora and thereby generate a corpus. When the terminal identifies a corpus item based on the generated corpus, the accuracy of corpus identification can be improved while the service recall rate is effectively improved. The embodiments can be widely applied in fields such as artificial intelligence, in particular in application scenarios that need to improve the service recall rate based on natural language understanding.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the embodiments or the description of the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.

FIG. 1 is a flowchart illustrating exemplary steps of a corpus identification method according to an embodiment of the present application;

FIG. 2 is a flowchart illustrating exemplary steps of a corpus identification method according to another embodiment of the present application;

FIG. 3 is a diagram illustrating a corpus tagging process according to an embodiment of the present application;

FIG. 4 is a schematic diagram illustrating a white list generation process of a corpus according to an embodiment of the present application;

FIG. 5 is a flowchart illustrating exemplary steps of a corpus identification method according to another embodiment of the present application;

FIG. 6 is a schematic diagram of a hardware structure of a mobile phone to which a corpus identification method according to an embodiment of the present application is applied;

FIG. 7 is a schematic diagram of a software structure of a mobile phone to which a corpus identification method according to an embodiment of the present application is applied;

FIG. 8 is a flowchart illustrating exemplary steps for generating a corpus whitelist according to an embodiment of the present disclosure;

FIG. 9 is a schematic diagram illustrating a process of applying white lists to a corpus according to an embodiment of the present application;

FIG. 10 is a diagram illustrating a relationship between an original corpus and an intent category according to an embodiment of the present application;

FIG. 11 is a block diagram illustrating a corpus identification apparatus according to an embodiment of the present application;

FIG. 12 is a block diagram illustrating a corpus identification apparatus according to another embodiment of the present application;

FIG. 13 is a schematic structural diagram of a terminal device according to an embodiment of the present application.

Detailed Description

In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. However, it will be apparent to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.

The terminology used in the following embodiments is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in the specification of this application and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, such as "one or more", unless the context clearly indicates otherwise. It should also be understood that, in the embodiments of the present application, "one or more" means one, two, or more than two; "and/or" describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent: A alone, both A and B, and B alone, where A and B may each be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.

Referring to fig. 1, a schematic step flow chart of a corpus identification method provided in an embodiment of the present application is shown, where the method may specifically include the following steps:

S101, obtaining an original corpus to be identified;

The method can be applied in fields such as artificial intelligence, particularly in scenarios that involve natural language understanding and need to recognize speech and text. By adopting the corpus identification method provided by this embodiment, the service recall rate can be significantly improved.

It should be noted that this embodiment is an introduction to the method from the perspective of labeling the original corpus.

In this embodiment, the original corpus to be recognized may be obtained by capturing text information input by users over the network, or by converting users' speech into text. For example, when the user uses the voice assistant on the terminal, the text obtained by converting the words or sentences spoken by the user can be used as the original corpus. The source of the original corpus is not limited in this embodiment.

Typically, the amount of raw corpora is very large, perhaps on the order of millions. Since the processing process of each original corpus is basically similar according to the method provided by this embodiment, for convenience of understanding, this embodiment only describes the method by referring to the processing process of one original corpus.

S102, recognizing the original corpus by adopting a plurality of Natural Language Understanding (NLU) engines, and respectively obtaining intention categories corresponding to the NLU engines;

In this embodiment, the NLU engines may refer to the NLU systems provided by multiple CPs for recognizing corpora; each NLU system separately recognizes the received original corpus and outputs a corresponding recognition result.

As an example of this embodiment, not less than three CPs providing NLU services can be accessed, so that for the same original corpus, at least the recognition results given by three NLU systems can be obtained.

The recognition result in this embodiment may refer to information of a domain or an intention to which the original corpus recognized by each NLU system belongs, that is, an intention category. For example, for a certain original corpus, after recognition, it can be first determined whether the corpus belongs to general encyclopedia or a utility tool.

In general, there may be differences in the areas in which different NLU systems are good. For example, one NLU system may have a higher recognition rate for sub-areas such as food, take-out, etc., while another NLU system may be better at knowledge in sub-areas such as maps, travel, etc.

Therefore, in the present embodiment, when an intention category is recognized by a different NLU system, the degree of reliability of the intention category can also be determined. For example, when performing corpus recognition using an NLU system having a high recognition rate for a sub-field such as food or take-out, if the output intention category belongs to the sub-field such as food or take-out, it can be considered that the reliability of the recognized intention category is relatively high; on the contrary, if the identified intention category belongs to the sub-fields of maps, travel, and the like, the reliability is relatively low.

Of course, the judgment of the reliability of the identified different intention categories may be given by the CP providing the NLU service, or the terminal manufacturer accessing each CP may test each NLU system in advance to obtain the evaluation information of each NLU system, or the evaluation information may be obtained by combining the evaluations of both the CP and the terminal manufacturer, which is not limited in this embodiment.

S103, determining the intention credibility of the original corpus according to the intention category of each NLU engine;

In this embodiment, the intention credibility of the original corpus can be determined jointly from the intention categories recognized by the NLU systems. The intention credibility of the original corpus indicates whether the original corpus is recognizable, that is, whether it can be understood by the NLU systems.

In a specific implementation, if the intention categories recognized by each NLU system are relatively consistent, for example, the recognition result of each NLU system for a certain original corpus is a general encyclopedia, it can be considered that the credibility of the corpus is relatively high. If the recognition results of each NLU system for a certain original corpus are different, the credibility of the corpus can be considered to be low.

S104, identifying the original corpus according to the intention credibility.

When the original corpora are identified according to the intention credibility, the original corpora with higher intention credibility can be marked as valid corpora, and the original corpora with lower intention credibility can be marked as invalid corpora.

The above describes the process of identifying a single original corpus. After a large number of corpora have been identified according to this process, their labeling is complete and a corpus is thereby generated. When the terminal identifies a corpus item based on the generated corpus, the accuracy of corpus identification can be improved while the service recall rate is effectively improved.

Referring to fig. 2, a schematic step flow chart of a corpus identification method according to another embodiment of the present application is shown, where the method specifically includes the following steps:

S201, obtaining an original corpus to be identified;

It should be noted that this embodiment introduces the method from the perspective of constructing a white list of the corpus, on the basis of labeling original corpora and obtaining a corpus containing a large number of effective corpora.

Similar to the foregoing embodiment, the original corpus to be recognized in this embodiment may also be obtained by collecting a huge amount of texts converted from speech of users in the current network and manually input texts. The original corpus to be recognized can reach millions or more.

S202, recognizing the original corpus by adopting a plurality of Natural Language Understanding (NLU) engines, and respectively obtaining intention categories corresponding to the NLU engines;

In this embodiment, the NLU services provided by different CPs may be implemented by different NLU engines, that is, the NLU systems corresponding to the CPs. The terminal can exchange data with each NLU engine through a corresponding processing interface.

Therefore, in a specific implementation, when an original corpus to be recognized needs to be processed, the processing interfaces of a plurality of NLU engines may be called first, and then the original corpus is input into the processing interface of each NLU engine, so as to instruct each NLU engine to recognize the original corpus. After the original corpus is identified, each NLU engine can return a corresponding result through the processing interface. The terminal can receive the intention category output by each NLU engine as the recognition result of each NLU system on the original corpus.
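As a minimal sketch of this step, the same original corpus can be passed to every engine's processing interface and the returned intention categories collected per engine. The interface shape (a callable that takes text and returns a category or `None`) and the stub engines below are assumptions for illustration; the actual processing interfaces of the CPs are not specified in this description.

```python
from typing import Callable, Dict, Optional

# Hypothetical processing interface: takes the corpus text and returns an
# intent category, or None when the engine cannot identify the corpus.
EngineFn = Callable[[str], Optional[str]]


def recognize_with_engines(
    corpus: str, engines: Dict[str, EngineFn]
) -> Dict[str, Optional[str]]:
    """Input the same corpus into every engine and collect each intent category."""
    return {name: engine(corpus) for name, engine in engines.items()}


# Stub engines standing in for the third-party NLU services (illustrative only).
engines: Dict[str, EngineFn] = {
    "CP1": lambda text: "sub-intention 1.1" if "weather" in text else None,
    "CP2": lambda text: "sub-intention 2.2" if "weather" in text else None,
    "CP3": lambda text: "sub-intention 3.1",
}

results = recognize_with_engines("what is the weather today", engines)
print(results)
```

In a real deployment the callables would wrap network requests to each CP, which is outside the scope of this sketch.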

S203, determining an intention score corresponding to the intention category of each NLU engine, wherein the intention score corresponding to each intention category is obtained by testing the sample corpus by adopting each NLU engine;

After the intention category identified by each NLU engine is obtained, the intention score corresponding to that intention category can be determined. In this embodiment, the intention score corresponding to each intention category of an NLU engine may be obtained by testing sample corpora with that NLU engine.

In a specific implementation, a set of corpora can be collected in advance to serve as sample corpora for testing. Each NLU engine accessed by the terminal is then used to identify the sample corpora, and each intention category of each NLU engine is scored by manually analyzing the identification results. When scoring intention categories, an intention category that an NLU engine excels at may be assigned a higher score, while an intention category that the engine is not good at, or for which its recognition accuracy is relatively low, is assigned a relatively lower score.

Table one gives an example of the intention scores corresponding to the intention categories of each CP in this embodiment, namely the scores for the intention categories of the NLU engines provided by three CPs: CP1, CP2 and CP3. Taking CP1 as an example, the intention category it is strong at is sub-intention 1.2, so the corresponding intention score is relatively high, at 1.5; sub-intention 1.1 belongs to a field that the NLU engine provided by CP1 is not good at, so its intention score is relatively low, at 0.5; for other categories, such as sub-intention 1.3 and sub-intention 1.4, the NLU engine provided by CP1 performs at an average level, and the corresponding intention score is 1.0.

Table one:

S204, calculating the intention credibility of the original corpus according to each intention category and the corresponding intention score;

In this embodiment, the intention credibility of the original corpus to be recognized may be calculated according to each intention category and its corresponding intention score.

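In a specific implementation, the intention credibility of the original corpus can be calculated according to the following formula:

$$C = \sum_{i=1}^{n} s_i$$

where C is the intention credibility of the original corpus, s_i is the intention score corresponding to the intention category output by the NLU engine of the i-th CP, and n is the number of CPs accessed by the terminal. Generally, n ≥ 3, that is, the NLU systems provided by at least three CPs are accessed.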

For a certain original corpus, if the intention category output by CP1 is sub-intention 1.1, the intention category output by CP2 is sub-intention 2.2, and the intention category output by CP3 is sub-intention 3.1, then, according to the intention scores shown in table one above, the intention credibility of the original corpus is 0.5 + 1.5 + 0.5 = 2.5.
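The worked example can be reproduced with a short sketch that sums the intention score of the category returned by each engine. Only the scores stated in the text are included in the lookup; the nested-dictionary layout, the function name `intent_credibility`, and the choice to count an unidentified or unscored category as 0 are assumptions.

```python
from typing import Dict, Optional

# Intent scores per CP, limited to the values quoted in the text.
INTENT_SCORES: Dict[str, Dict[str, float]] = {
    "CP1": {
        "sub-intention 1.1": 0.5,
        "sub-intention 1.2": 1.5,
        "sub-intention 1.3": 1.0,
        "sub-intention 1.4": 1.0,
    },
    "CP2": {"sub-intention 2.2": 1.5},
    "CP3": {"sub-intention 3.1": 0.5},
}


def intent_credibility(
    categories: Dict[str, Optional[str]],
    scores: Dict[str, Dict[str, float]] = INTENT_SCORES,
) -> float:
    """Sum the score of the category each engine returned.

    Assumption: a category that is None or has no score contributes 0.
    """
    total = 0.0
    for cp, category in categories.items():
        if category is not None:
            total += scores.get(cp, {}).get(category, 0.0)
    return total


credibility = intent_credibility(
    {"CP1": "sub-intention 1.1", "CP2": "sub-intention 2.2", "CP3": "sub-intention 3.1"}
)
print(credibility)  # 2.5
```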

Of course, as an example of this embodiment, when calculating the intention reliability of the original corpus, a weight value of each intention category may be determined first, and then the above weight values are adopted to perform weighted summation on the intention scores corresponding to each intention category, so as to obtain the intention reliability of the original corpus. The above-mentioned weight value may be set when scoring each intention category, for example, a higher weight is set for an intention category that the NLU system is good at. The embodiment is not limited to the specific manner of calculating the intention reliability of the original corpus according to each intention category and the corresponding intention score.
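The weighted variant can be sketched as below. The weight values are purely hypothetical, since the description does not give concrete weights; only the three intention scores come from the worked example.

```python
from typing import Iterable, Tuple


def weighted_credibility(scored: Iterable[Tuple[float, float]]) -> float:
    """Weighted sum over (intent_score, weight) pairs, one pair per engine."""
    return sum(score * weight for score, weight in scored)


# Scores from the worked example; the weights are hypothetical and give the
# category the engine is good at (score 1.5) a higher weight.
pairs = [(0.5, 0.5), (1.5, 2.0), (0.5, 1.0)]
value = weighted_credibility(pairs)
print(value)  # 3.75
```

Setting every weight to 1.0 reduces this to the plain sum of intention scores.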

S205, if the intention credibility is greater than or equal to a preset credibility threshold, identifying the original corpus as an effective corpus;

When the original corpus is identified according to the intention credibility, the intention credibility is compared with a preset credibility threshold. If the intention credibility is greater than or equal to the credibility threshold, the original corpus is identified as an effective corpus; if the intention credibility is less than the credibility threshold, the original corpus is identified as an invalid corpus.

In a specific implementation, the confidence threshold may be set according to actual needs. For example, the threshold may be determined according to the number of accessed CPs, and the reliability threshold is selected to be equal to half of the number of accessed CPs, that is, the reliability threshold is n × 50%, which is not limited in this embodiment.
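A minimal sketch of the threshold comparison, using the example threshold of n × 50%. The function name `label_corpus` and the string labels are assumptions; the ratio is exposed as a parameter because the description leaves the threshold open.

```python
def label_corpus(credibility: float, num_engines: int, ratio: float = 0.5) -> str:
    """Label a corpus by comparing its credibility with threshold = n * ratio."""
    threshold = num_engines * ratio
    return "valid" if credibility >= threshold else "invalid"


# With three CPs the example threshold is 3 * 50% = 1.5.
print(label_corpus(2.5, 3))  # valid
print(label_corpus(1.0, 3))  # invalid
```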

It should be noted that, for a corpus currently identified as invalid, it may be further determined whether the plurality of intention categories corresponding to the invalid corpus are all empty, that is, whether none of the NLU engines can identify the corpus. If the plurality of intention categories corresponding to the invalid corpus are all empty, none of the NLU engines can identify the corpus, and the invalid corpus may be deleted.

If at least one of the intention categories corresponding to the invalid corpus is not empty, at least one NLU engine can identify the corpus, but the NLU engines classify it inconsistently. In this case, all such invalid corpora can be divided into a plurality of corpus classes according to the identified intention categories, and the plurality of NLU engines are used again to identify the invalid corpora in each corpus class. If the intention category identified by each NLU engine remains unchanged, the classification results obtained by the NLU engines for the invalid corpora in that corpus class are stable, so the invalid corpora in that corpus class can be identified as valid corpora.

For example, if the intention credibility calculated for some original corpora after identification by the NLU engines provided by CP1, CP2 and CP3 is less than the credibility threshold, these original corpora are marked as invalid corpora according to the above steps. However, if an individual NLU engine can identify these original corpora and output a corresponding classification result, the invalid corpora with the same classification result can be divided into the same corpus class. For example, all corpora that CP1 and CP2 both fail to identify but that CP3 identifies as sub-intention 3.3 are divided into the same corpus class. Each corpus in this corpus class is then identified again by the three NLU engines. If the result is still that CP1 and CP2 both fail to identify the corpus while CP3 identifies sub-intention 3.3, the invalid corpora in this corpus class are marked as valid corpora and added to the corpus, with sub-intention 3.3 identified by CP3 taken as the intention category corresponding to these corpora.

In this embodiment, the original corpora whose intention credibility is below the credibility threshold are aggregated in batches and then identified multiple times; if the result identified by each NLU engine is stable, the corpora can be marked as valid corpora and stored in the corpus. By re-identifying the corpora initially marked as invalid through this auxiliary means, the embodiment avoids deleting a large number of original corpora merely because the NLU engines differ in identification accuracy, and thus effectively expands the number of corpora.
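The re-identification of invalid corpora described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the `Engine` type, the function name `recover_invalid`, and the choice to check stability per corpus with a single second pass are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical engine type: maps a text to an intention category,
# or to None when the engine cannot understand the text.
Engine = Callable[[str], Optional[str]]


def recover_invalid(corpora: List[str],
                    engines: Dict[str, Engine]) -> Dict[str, Tuple[Optional[str], ...]]:
    """Return {text: per-engine categories} for invalid corpora with stable results."""
    names = sorted(engines)

    # First pass: group invalid corpora into corpus classes by identified categories.
    classes = defaultdict(list)
    for text in corpora:
        result = tuple(engines[name](text) for name in names)
        if all(category is None for category in result):
            continue  # no engine can identify it, so it is dropped
        classes[result].append(text)

    # Second pass: keep a corpus only if every engine returns the same category again.
    recovered = {}
    for result, texts in classes.items():
        for text in texts:
            again = tuple(engines[name](text) for name in names)
            if again == result:
                recovered[text] = result
    return recovered
```

With two engines that never answer and a third that returns sub-intention "3.3" for weather-related text, only the weather sentence would be recovered, with "3.3" as its intention category.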

S206, acquiring the initial category of the effective corpus, and storing the effective corpus, the initial category of the effective corpus and the intention category identified by each NLU engine into a corpus in an associated manner;

in this embodiment, when storing the effective corpus into the corpus, the intent categories of the corpus may be stored together.

In a specific implementation, an initial category of the valid corpus may be further obtained, where the initial category is obtained by roughly identifying an original corpus when the original corpus is collected.

For example, when a large amount of original corpora are collected, the NLU system for rough classification may be used to perform preliminary screening on the original corpora to obtain a preliminary screening intention classification of each original corpus, which is used as an initial category of the original corpora.

The corpora marked as valid are then stored into the corpus along with their initial categories and the intention categories output by each NLU engine.

S207, dividing the plurality of effective linguistic data into a plurality of identification classes according to the stored initial classes and intention classes of the plurality of effective linguistic data;

After the corpus containing a large number of valid corpora is obtained, this embodiment may further perform aggregation statistics on the valid corpora in the corpus to construct a corpus white list.

In a specific implementation, when aggregation statistics are performed on a large number of valid corpora, the valid corpora whose initial categories are the same and whose intention categories are the same may first be grouped into the same recognition class.

For example, a category string may be generated for each valid corpus from its initial category and intention categories, such as [initial category = 0.1, CP1 intention category = 1.1, CP2 intention category = 2.1, CP3 intention category = 3.2]. This category string indicates that the initial category of the valid corpus is sub-intention 0.1, and that the intention categories obtained by identification with three different NLU engines are sub-intention 1.1, sub-intention 2.1 and sub-intention 3.2, respectively.

After the category character string of each valid corpus is identified in the above manner, all corpora having the same category character string may be aggregated into the same identification class. The initial classes of the effective corpora in each recognition class are the same, and the intention classes obtained by recognizing the corpora in the recognition class by using three different NLU engines are also the same respectively.
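The aggregation by category string can be sketched as below. The record layout, the function names, and the exact string format are assumptions chosen for illustration; the description only requires that corpora with identical initial and intention categories end up in the same recognition class.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical record layout: (text, initial category, {CP name: intention category}).
Record = Tuple[str, str, Dict[str, str]]


def category_string(initial: str, intents: Dict[str, str]) -> str:
    """Build a canonical key; CP names are sorted so that key order never matters."""
    parts = [f"initial={initial}"]
    parts += [f"{name}={intents[name]}" for name in sorted(intents)]
    return "[" + ", ".join(parts) + "]"


def group_into_classes(records: List[Record]) -> Dict[str, List[str]]:
    """Aggregate corpora that share the same category string into one recognition class."""
    classes = defaultdict(list)
    for text, initial, intents in records:
        classes[category_string(initial, intents)].append(text)
    return dict(classes)
```

Sorting the CP names before building the key is a design choice that makes two records with the same categories collide even if their dictionaries were filled in a different order.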

And S208, counting the number of the effective corpora contained in each identification class, and generating a white list of the corpus according to the number of the effective corpora contained in each identification class.

In this embodiment, after the massive corpora are aggregated according to the foregoing steps and divided into different recognition classes, the number of corpora contained in each recognition class may be counted separately. For example, one recognition class contains 10000 corpora, another contains 500 corpora, and so on.

Then, a corpus white list can be constructed according to the number of contained corpora.

In a particular implementation, the recognition classes whose corpus count exceeds a certain threshold may be selected for the corpus white list. For example, among all the recognition classes obtained by aggregation, those containing more than 2000 corpora may be placed in the corpus white list. The threshold may be determined according to actual needs, which is not limited in this embodiment.
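A threshold-based selection of this kind might look like the following sketch. The function name and the mapping from recognition-class key to corpus count are hypothetical; the default of 2000 simply mirrors the example figure above.

```python
from typing import Dict, List


def whitelist_by_threshold(class_sizes: Dict[str, int], min_count: int = 2000) -> List[str]:
    """Keep the recognition classes that contain more than min_count corpora."""
    return [key for key, size in class_sizes.items() if size > min_count]
```

A class holding exactly 2000 corpora is excluded here because the example speaks of classes containing over 2000 corpora; an implementation could equally choose an inclusive bound.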

As an example of this embodiment, after the number of valid corpora contained in each recognition class is obtained through statistics, the recognition classes may be sorted by that number. For example, all recognition classes may be sorted in descending order of the number of corpora they contain. The recognition classes within a preset sorting interval are then extracted and used as the white list of the corpus. For example, the top 80% of the ranked recognition classes may be selected as the white list of the corpus.
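The ranking variant can be sketched in the same style. The function name is hypothetical, and rounding the cut-off down with `math.floor` is an assumption, since the description does not say how a fractional count of classes is handled.

```python
import math
from typing import Dict, List


def whitelist_by_rank(class_sizes: Dict[str, int], keep_fraction: float = 0.8) -> List[str]:
    """Sort recognition classes by corpus count, largest first, and keep the top fraction."""
    ranked = sorted(class_sizes, key=lambda key: class_sizes[key], reverse=True)
    keep = math.floor(len(ranked) * keep_fraction)  # assumed rounding rule
    return ranked[:keep]
```

With five recognition classes and the default fraction, the four largest classes are kept and the smallest one is left out of the white list.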

In some cases, the white list of the corpus may also be corrected by manual labeling, which is not limited in this embodiment.

In the embodiment of the application, a plurality of NLU engines are used to identify the original corpora and output the intention category corresponding to each NLU engine; the intention credibility of each original corpus is then obtained from the intention categories and their corresponding intention scores, so that original corpora whose intention credibility exceeds the preset credibility threshold can be marked as valid corpora, generating a corpus containing a large number of valid corpora. On this basis, the corpora in the corpus are aggregated and counted to obtain a corresponding corpus white list for subsequent corpus identification. By performing fine-grained credibility processing on the fields or intentions understood by external NLU services, this embodiment identifies each original corpus into sub-categories of the different NLU systems and then processes each sub-category according to a unified standard, so that the intention category that best matches the original corpus is identified. The service recall rate can thus be effectively improved while a certain identification accuracy is maintained, improving both the efficiency and the accuracy of corpus identification. At the same time, this embodiment can label massive corpora automatically, which removes the dependence on manual sorting and labeling when generating a massive corpus and improves the efficiency of corpus generation. This helps obtain a richer corpus, and the generated corpus in turn informs subsequent corpus identification by providing more corpora for comparison and matching, further improving the service recall rate.

For ease of understanding, the corpus identification method of the present application is described below with reference to a specific example.

Fig. 3 is a schematic diagram of a corpus tagging process in this embodiment. According to the labeling process shown in fig. 3, a first NLU preliminary screening may initially be performed on the collected text to obtain the original corpora, such as encyclopedia and chat corpora, that need to be further identified and processed by external NLUs, together with a preliminary screening intention classification of each original corpus. It should be noted that the first NLU preliminary screening may process the text only coarsely, and the intention classification it yields may be a category with a broader scope.

For the original corpora obtained by the preliminary screening, the NLU processing interfaces of n CPs can be called with the original corpus as input; each NLU performs recognition and outputs a corresponding intention category and intention score. In this embodiment, the number of CPs called should be no less than 3.

From the intention categories and intention scores output by the NLUs, the corresponding intention credibility can be calculated according to a set formula, and the original corpus can be marked as a valid corpus or an invalid corpus by comparing the intention credibility with a credibility threshold. In this embodiment, the credibility threshold may be half of the number of CPs called, that is, n × 50%.
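The comparison against the n × 50% threshold can be illustrated as follows. The set formula for intention credibility is not restated in this passage, so the sketch substitutes a stand-in: it assumes each intention score lies in [0, 1] and sums the scores of the engines that returned a category. Both that formula and the names used are assumptions for illustration only.

```python
from typing import List, Optional, Tuple

# One engine's output: (intention category or None, intention score assumed in [0, 1]).
EngineResult = Tuple[Optional[str], float]


def intention_credibility(results: List[EngineResult]) -> float:
    """Stand-in formula (an assumption): sum the scores of engines that returned a category."""
    return sum(score for category, score in results if category is not None)


def is_valid(results: List[EngineResult]) -> bool:
    """Mark a corpus valid when its credibility exceeds n x 50%, n being the number of CPs."""
    threshold = len(results) * 0.5
    return intention_credibility(results) > threshold
```

Under this stand-in, three engines give a threshold of 1.5, so a corpus understood by two engines with scores 0.9 and 0.8 would be marked valid, while one understood by a single engine could not be.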

For the original corpus marked as the effective corpus, the primary screening intention classification of the corpus and the intention category output by each NLU can be recorded at the same time, and the marking of the original corpus is finished.

On the basis of fig. 3, referring to fig. 4, a schematic diagram of a corpus whitelist generation process of this embodiment is shown. For a large amount of original corpora, the labeling process shown in fig. 3 may be repeatedly performed to obtain the primary screening intent classification of the large amount of corpora and the intent classification output by each NLU. Then, aggregation statistics can be performed on the mass corpora according to the primary screening intention classification and the intention category output by each NLU, and the top 80% of recognition classes obtained through statistics are taken as a corpus white list. Meanwhile, the white list can be corrected to a certain extent in a manual mode.

The above embodiment describes the original corpus labeling, corpus generation and white list construction in detail. The following describes the process of identifying corpora based on the white list constructed in the foregoing embodiment, that is, the application process of the white list.

Referring to fig. 5, a flowchart illustrating schematic steps of a corpus identification method according to another embodiment of the present application is shown, where the method specifically includes the following steps:

S501, when a target corpus to be identified is received, identifying the target corpus by adopting a plurality of NLU engines to respectively obtain intention categories corresponding to the NLU engines;

it should be noted that the corpus identification method provided in this embodiment may be applied to a mobile phone, a tablet computer, a wearable device, a vehicle-mounted device, an Augmented Reality (AR)/Virtual Reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a Personal Digital Assistant (PDA), and other terminal devices, and the specific type of the terminal device is not limited in this application embodiment.

Take the terminal device as a mobile phone as an example. Fig. 6 is a block diagram illustrating a partial structure of a mobile phone according to an embodiment of the present disclosure. Referring to fig. 6, the handset includes: radio Frequency (RF) circuit 610, memory 620, input unit 630, display unit 640, sensor 650, audio circuit 660, wireless fidelity (Wi-Fi) module 670, processor 680, and power supply 690. Those skilled in the art will appreciate that the handset configuration shown in fig. 6 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.

The following describes each component of the mobile phone in detail with reference to fig. 6:

The RF circuit 610 may be used for receiving and transmitting signals during information transmission and reception or during a call. In particular, it receives downlink information from a base station and passes it to the processor 680 for processing; in addition, it transmits uplink data to the base station. Typically, the RF circuitry includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuitry 610 may also communicate with networks and other devices via wireless communications. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), e-mail, Short Messaging Service (SMS), and the like.

The memory 620 may be used to store software programs and modules, and the processor 680 may execute various functional applications and data processing of the mobile phone by operating the software programs and modules stored in the memory 620. The memory 620 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. Further, the memory 620 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.

The input unit 630 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the cellular phone 600. Specifically, the input unit 630 may include a touch panel 631 and other input devices 632. The touch panel 631, also referred to as a touch screen, may collect touch operations of a user (e.g., operations of the user on the touch panel 631 or near the touch panel 631 by using any suitable object or accessory such as a finger or a stylus) thereon or nearby, and drive the corresponding connection device according to a preset program. Alternatively, the touch panel 631 may include two parts of a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 680, and can receive and execute commands sent by the processor 680. In addition, the touch panel 631 may be implemented using various types, such as resistive, capacitive, infrared, and surface acoustic wave. The input unit 630 may include other input devices 632 in addition to the touch panel 631. In particular, other input devices 632 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.

The display unit 640 may be used to display information input by a user or information provided to the user and various menus of the mobile phone. The display unit 640 may include a display panel 641; optionally, the display panel 641 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch panel 631 can cover the display panel 641; when the touch panel 631 detects a touch operation on or near it, the touch operation is transmitted to the processor 680 to determine the type of the touch event, and the processor 680 then provides a corresponding visual output on the display panel 641 according to the type of the touch event. Although in fig. 6 the touch panel 631 and the display panel 641 are two independent components that implement the input and output functions of the mobile phone, in some embodiments the touch panel 631 and the display panel 641 may be integrated to implement the input and output functions of the mobile phone.

The handset 600 may also include at least one sensor 650, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor that adjusts the brightness of the display panel 641 according to the brightness of ambient light, and a proximity sensor that turns off the display panel 641 and/or the backlight when the mobile phone is moved to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications of recognizing the posture of a mobile phone (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured on the mobile phone, further description is omitted here.

The audio circuit 660, speaker 661, and microphone 662 can provide an audio interface between a user and the mobile phone. The audio circuit 660 may transmit the electrical signal converted from the received audio data to the speaker 661, which converts the electrical signal into a sound signal for output; on the other hand, the microphone 662 converts collected sound signals into electrical signals, which are received by the audio circuit 660 and converted into audio data. The audio data is then output to the processor 680 for processing and transmitted via the RF circuit 610 to, for example, another mobile phone, or output to the memory 620 for further processing.

Wi-Fi belongs to short-distance wireless transmission technology, and a mobile phone can help a user to receive and send emails, browse webpages, access streaming media and the like through a Wi-Fi module 670, and provides wireless broadband internet access for the user. Although fig. 6 shows a Wi-Fi module 670, it is understood that it does not belong to the essential constitution of the handset 600 and can be omitted entirely as needed within the scope not changing the essence of the invention.

The processor 680 is a control center of the mobile phone, and connects various parts of the entire mobile phone by using various interfaces and lines, and performs various functions of the mobile phone and processes data by operating or executing software programs and/or modules stored in the memory 620 and calling data stored in the memory 620, thereby performing overall monitoring of the mobile phone. Optionally, processor 680 may include one or more processing units; preferably, the processor 680 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into processor 680.

The handset 600 also includes a power supply 690 (e.g., a battery) for powering the various components, which may preferably be logically coupled to the processor 680 via a power management system, such that the power management system may be used to manage charging, discharging, and power consumption.

Although not shown, the handset 600 may also include a camera. Optionally, the position of the camera on the mobile phone 600 may be front-located or rear-located, which is not limited in this embodiment of the application.

Optionally, the mobile phone 600 may include a single camera, a dual camera, or a triple camera, which is not limited in this embodiment of the present application.

For example, the cell phone 600 may include three cameras, one being a main camera, one being a wide camera, and one being a tele camera.

Optionally, when the mobile phone 600 includes a plurality of cameras, all of the plurality of cameras may be arranged in front of the mobile phone, or all of the plurality of cameras may be arranged in back of the mobile phone, or a part of the plurality of cameras may be arranged in front of the mobile phone, and another part of the plurality of cameras may be arranged in back of the mobile phone, which is not limited in this embodiment of the application.

In addition, although not shown, the mobile phone 600 may further include a bluetooth module, etc., which will not be described herein.

Fig. 7 is a schematic diagram of a software structure of a mobile phone 600 according to an embodiment of the present application. Taking the mobile phone 600 operating system as an Android system as an example, in some embodiments, the Android system is divided into four layers, which are an application layer, an application Framework (FWK) layer, a system layer and a hardware abstraction layer, and the layers communicate with each other through a software interface.

As shown in fig. 7, the application layer may include a series of application packages, which may include short message, calendar, camera, video, navigation, gallery, call, and other applications.

The application framework layer provides an Application Programming Interface (API) and a programming framework for the application program of the application layer. The application framework layer may include some predefined functions, such as functions for receiving events sent by the application framework layer.

As shown in fig. 7, the application framework layer may include a window manager, a resource manager, and a notification manager, among others.

The window manager is used for managing window programs. The window manager can obtain the size of the display screen, judge whether a status bar exists, lock the screen, intercept the screen and the like. The content provider is used to store and retrieve data and make it accessible to applications. The data may include video, images, audio, calls made and received, browsing history and bookmarks, phone books, etc.

The resource manager provides various resources for the application, such as localized strings, icons, pictures, layout files, video files, and the like.

The notification manager enables the application to display notification information in the status bar, can be used to convey notification-type messages, can disappear automatically after a short dwell, and does not require user interaction. Such as a notification manager used to inform download completion, message alerts, etc. The notification manager may also be a notification that appears in the form of a chart or scroll bar text at the top status bar of the system, such as a notification of a background running application, or a notification that appears on the screen in the form of a dialog window. For example, prompting text information in the status bar, sounding a prompt tone, vibrating the electronic device, flashing an indicator light, etc.

The application framework layer may further include:

a view system that includes visual controls, such as controls to display text, controls to display pictures, and the like. The view system may be used to build applications. The display interface may be composed of one or more views. For example, the display interface including the short message notification icon may include a view for displaying text and a view for displaying pictures.

The phone manager is used to provide the communication functions of the handset 600. Such as management of call status (including on, off, etc.).

The system layer may include a plurality of functional modules. For example: a sensor service module, a physical state identification module, a three-dimensional graphics processing library (such as OpenGL ES), and the like.

The sensor service module is used for monitoring sensor data uploaded by various sensors in a hardware layer and determining the physical state of the mobile phone 600;

the physical state recognition module is used for analyzing and recognizing user gestures, human faces and the like;

the three-dimensional graphic processing library is used for realizing three-dimensional graphic drawing, image rendering, synthesis, layer processing and the like.

The system layer may further include:

the surface manager is used to manage the display subsystem and provide fusion of 2D and 3D layers for multiple applications.

The media library supports a variety of commonly used audio, video format playback and recording, and still image files, among others. The media library may support a variety of audio-video encoding formats, such as MPEG4, h.264, MP3, AAC, AMR, JPG, PNG, and the like.

The hardware abstraction layer is a layer between hardware and software. The hardware abstraction layer may include a display driver, a camera driver, a sensor driver, etc. for driving the relevant hardware of the hardware layer, such as a display screen, a camera, a sensor, etc.

The following embodiments may be implemented on the handset 600 having the above-described hardware/software architecture. The following embodiment will take the mobile phone 600 as an example to explain the corpus identification method provided in this embodiment.

In this embodiment, the target corpus to be recognized may refer to words or sentences spoken by the user using the voice service on the mobile phone. For example, a user, when invoking the intelligent voice assistant on a cell phone, may speak a sentence into the voice assistant instructing the voice assistant to perform a task or output some information.

For example, the user may speak "XXX is who" to the voice assistant, which may convert the utterance into text, and the resulting text information is the target corpus to be recognized.

Of course, the user may also input the target corpus into the mobile phone by directly inputting a text, which is not limited in this embodiment.

After receiving the target corpus, the mobile phone can call NLU services provided by the plurality of CPs to respectively identify the target corpus and output corresponding intention categories.

S502, extracting a white list matched with the intention type from a preset corpus;

in this embodiment, the corpus may be obtained by labeling a large amount of original corpora. Besides the corpus itself, information such as intent classification obtained when the corpus is identified by a plurality of NLU systems can be stored in the corpus.

After a plurality of NLUs are adopted to identify the target corpus and corresponding intention categories are obtained, a corpus white list matched with the intention categories can be extracted from the corpus.

Fig. 8 is a flowchart illustrating steps of generating a corpus white list according to this embodiment. The corpus whitelist may be generated by:

s801, dividing a plurality of effective corpora into a plurality of identification classes according to initial classes and intention classes of the plurality of effective corpora stored in the corpus;

in this embodiment, the initial category of the valid corpus stored in the corpus may be obtained when the original corpus is initially screened, and the intention category thereof may be obtained by respectively identifying a plurality of NLU systems.

It should be noted that, for a corpus to be marked as valid and stored in the corpus, at least one of the plurality of NLU systems used for recognition should be able to understand the corpus and output a corresponding intention category. Thus, each valid corpus stored in the corpus should generally carry its initial category and the intention category output by at least one NLU system.

In order to construct a corpus white list, the corresponding effective corpora with the same initial category and intention category can be firstly divided into the same recognition class.

After the category character string of each valid corpus is identified in the above manner, all corpora having the same category character string may be aggregated into the same identification class.

S802, counting the number of effective linguistic data contained in each identification class;

After the massive valid corpora are aggregated according to the above steps and divided into different recognition classes, the number of corpora contained in each recognition class can be counted separately. For example, one recognition class may contain 10000 corpora, another may contain 500 corpora, and so on.

And S803, generating a white list of the corpus according to the number of the effective corpora contained in each identification class.

As an example of this embodiment, after the number of valid corpora contained in each recognition class is obtained through statistics, the recognition classes may be sorted by that number. For example, all recognition classes may be sorted in descending order of the number of corpora they contain. The recognition classes within a preset sorting interval are then extracted and used as the white list of the corpus. For example, the top 80% of the ranked recognition classes may be selected as the white list of the corpus.

It should be noted that, since the process of generating the corpus white list in steps S801 to S803 of this embodiment is similar to steps S207 to S208 of the foregoing embodiment, the two descriptions may be read together. The description here is kept brief, and the relevant details can be found in the foregoing embodiment.

In this embodiment, when a white list matching the intention categories of the target corpus is extracted from the corpus, matching may be performed on the initial category and on some of the intention categories, so as to find a white list with the same initial category and the same partial set of intention categories.

In a specific implementation, an initial class of the target corpus may be first obtained, and then a white list including the initial class and at least one intention class identified by the NLU engine may be extracted from the corpus.

For example, for the target corpus "XXX is who", if the intention categories identified by three NLUs are sub-intention 1.2, sub-intention 2.1 and sub-intention 3.1, respectively, then when extracting the white list, the initial category of the corpus may be determined first, and a white list that includes the initial category and some of sub-intention 1.2, sub-intention 2.1 and sub-intention 3.1 may then be found in the corpus.

On the basis of fig. 3, referring to fig. 9, a schematic diagram of the corpus white list application process of this embodiment is shown. For the target corpus to be identified, matching can be performed in the corpus according to the preliminary screening intention classification of the target corpus and the intention category output by each NLU. If the preliminary screening intention classification and the intention categories output by some of the NLUs match a certain white list, the most appropriate identification result is returned according to the intention categories in the matched white list and a set ranking rule.

Matching on the intention categories output by some of the NLUs works as follows: if a white list is defined as preliminary screening intention category (initial category) = C, CP1 intention category = C1, and CP2 intention category = C2, then a target corpus whose identification result is preliminary screening intention category = C, CP1 intention category = C1, CP2 intention category = C2, and CP3 intention category = C3 may be considered to conform to the white list.
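The partial match in this example can be sketched as below. The dictionary layout of a white list entry, the reserved key "initial", and the function names are assumptions for illustration; the point shown is that only the CPs named in the entry constrain the match, so an extra CP3 result does not prevent it.

```python
from typing import Dict, List, Optional

# Hypothetical white list entry: the initial category under the key "initial",
# plus the intention category required for each CP that the entry names.
Entry = Dict[str, str]


def matches(entry: Entry, initial: str, intents: Dict[str, str]) -> bool:
    """True if the initial category and every CP category named in the entry agree."""
    if entry.get("initial") != initial:
        return False
    required = {cp: cat for cp, cat in entry.items() if cp != "initial"}
    # An entry must name at least one CP category to count as a match.
    return bool(required) and all(intents.get(cp) == cat for cp, cat in required.items())


def find_whitelist(entries: List[Entry], initial: str,
                   intents: Dict[str, str]) -> Optional[Entry]:
    """Return the first white list entry that the target corpus conforms to, if any."""
    for entry in entries:
        if matches(entry, initial, intents):
            return entry
    return None
```

Returning the first conforming entry is a simplification; an implementation could instead collect all conforming entries and apply the ranking rule across them.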

In this embodiment, the set ranking rule may be to preferentially return to the user the intention category with the highest intention score among those output by the different NLUs; alternatively, according to a pre-designed rule, a certain CP may be preferentially routed for each intention category; or, without distinguishing the specific intention category, the ranking priorities of the CPs may be compared directly and the intention category of the CP with the higher priority returned to the user. This embodiment does not limit the rule.
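As one illustration of these alternatives, the rule that compares CP priorities directly can be sketched as follows. The function name, the priority list, and the error raised when no prioritized CP answered are assumptions made for the example.

```python
from typing import Dict, List


def pick_by_cp_priority(intents: Dict[str, str], cp_priority: List[str]) -> str:
    """Return the intention category of the highest-priority CP that produced one."""
    for cp in cp_priority:  # cp_priority is ordered from highest to lowest priority
        if cp in intents:
            return intents[cp]
    raise ValueError("no prioritized CP produced an intention category")
```

If CP3 ranks first but produced no category for the target corpus, the rule falls through to the next CP in the list.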

S503, identifying the target corpus according to the intention categories contained in the white list.

For the extracted white list, an intention score corresponding to each intention category included in the white list may be first determined. It should be noted that the intention score corresponding to each intention category can be obtained by testing the sample corpus with each NLU engine. For the obtaining of the intention score, reference may be made to the description of step S203 in the foregoing embodiment, which is not described in detail in this embodiment.

Then, the intention category corresponding to the maximum intention score in the white list can be identified as the target intention category of the target corpus to be identified currently.

For example, in the above example, the white list matched by the target corpus contains the intention category recognition results C1 and C2 of CP1 and CP2. The intention scores of C1 and C2 may be compared, and the intention category with the larger score selected as the final recognition result and returned to the user.

In the embodiment of the application, a plurality of NLU engines are used to identify the received target corpus. After the intention category corresponding to each NLU engine is obtained, a white list matching the intention categories can be extracted from a preset corpus, and the target corpus can then be identified according to the intention categories contained in the white list. In this embodiment, once the original corpora have been labeled and the corpus and corresponding white list generated, a target corpus can be identified according to the labeled category information and the white list, which helps improve the service recall rate for terminal users.

It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.

For convenience of understanding, the corpus identification method of the present embodiment is fully described below with reference to specific examples, which may specifically include the following steps:

1. Acquire the original corpus. The original corpus in this embodiment may be obtained by capturing various text information input by users in the network, or by converting users' voice into text. The number of original corpora is very large, possibly on the order of millions.

2. Select n (n ≥ 3) CPs providing NLU services as convergence objects, and identify each original corpus with the NLU system of each CP to obtain the corresponding intention category. Fig. 10 is a schematic diagram illustrating the relationship between the original corpus and the intention categories in this embodiment. As shown in fig. 10, each original corpus is identified by the NLU systems of three CPs, namely CP1, CP2 and CP3, and each NLU system outputs a corresponding identification result.

2.1 If the classifications of all CPs are consistent, take the classification result of any CP and label the original corpus with that result.

2.2 If none of the CPs can identify or classify the original corpus, judge it to be an invalid corpus and remove it from the original corpora.

2.3 If the intention credibility corresponding to the intention categories identified by the CPs is high (the credibility threshold may be set to more than half of the number of CPs), automatically label the current corpus, and record the primary screening intention classification of the corpus and the intention category output by each CP. The intention credibility can be calculated by the following formula.

2.4 If the intention credibility is lower than the credibility threshold, cluster the corpora below the threshold in batches and then identify them multiple times. If the result identified by each CP remains fixed, mark the corpora as effective corpora, and map the results identified by the CPs to the same overall classification (many-to-one).

Table two shows an example of the recognition results obtained by identifying the original corpora. The example includes both corpora that are not recognized and corpora for which the recognition results of multiple CPs are inconsistent.

Table two:

3. On the basis of the above steps, a corpus can be formed from the large number of effective corpora obtained, and a corpus white list can be output by performing aggregation statistics on the corpora it contains. The massive effective corpora in the corpus and the white list can then be used for subsequent identification of target corpora.

4. The user inputs a target corpus "who XXX is". Through primary screening, the primary screening intention of the corpus is classified as "general encyclopedia". After identification by the NLU systems of the multiple CPs, CP1 returns sub-intention 1.2, CP2 returns sub-intention 2.1, and CP3 returns sub-intention 3.1. Matching against the white list shown in table three shows that sub-intention 3.1 returned by CP3 is not in the white list.

Table three:

5. Referring to the intention scores shown in table one, the intention score of sub-intention 1.2 is 1.5, and the intention score of sub-intention 2.1 is 0.5. The target corpus "who XXX is" therefore matches both sub-intention 1.2 and sub-intention 2.1 in the white list, but since sub-intention 1.2 has the higher score, the terminal returns the recognition result of CP1, i.e. sub-intention 1.2, to the user.
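Steps 4 and 5 can be traced with a short sketch that uses the values quoted above. It is illustrative only: the function name `select_result` and the data layout are assumptions, and scores are keyed by sub-intention name on the assumption that these names are unique across CPs.

```python
from typing import Dict, Set, Tuple

# Intention scores quoted in step 5 (taken from table one of the original).
INTENT_SCORES: Dict[str, float] = {
    "sub-intention 1.2": 1.5,
    "sub-intention 2.1": 0.5,
}

# Output of each CP for the target corpus in step 4.
cp_outputs: Dict[str, str] = {
    "CP1": "sub-intention 1.2",
    "CP2": "sub-intention 2.1",
    "CP3": "sub-intention 3.1",
}

# Intention categories in the matched white list; sub-intention 3.1 is absent.
whitelisted: Set[str] = {"sub-intention 1.2", "sub-intention 2.1"}


def select_result(outputs: Dict[str, str], whitelist: Set[str],
                  scores: Dict[str, float]) -> Tuple[str, str]:
    """Return the (CP, intention) pair with the highest score among whitelisted outputs."""
    candidates = {cp: intent for cp, intent in outputs.items() if intent in whitelist}
    if not candidates:
        raise ValueError("no CP output is covered by the white list")
    best_cp = max(candidates, key=lambda cp: scores[candidates[cp]])
    return best_cp, candidates[best_cp]


best = select_result(cp_outputs, whitelisted, INTENT_SCORES)
print(best)  # ('CP1', 'sub-intention 1.2')
```

The output of CP3 is dropped before scoring because it is not in the white list, so only the scores 1.5 and 0.5 are compared.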

After the plurality of NLU services are processed at fine granularity and scored for credibility according to field or intention, they are integrated and converged for access by the terminal device. This can effectively improve the service recall rate for terminal users, and automatic identification and labeling of massive original corpora are realized in the process. In experiments, the corpus identification method provided by this embodiment increased the service recall rate of the terminal from 59.5% to 81.3% without an obvious reduction in accuracy.

Corresponding to the corpus identification method described in the foregoing embodiments, fig. 11 shows a block diagram of a corpus identification device according to an embodiment of the present application. For convenience of description, only the portions related to the embodiment of the present application are shown.

Referring to fig. 11, the apparatus may be applied to a terminal device, and specifically may include the following modules:

an original corpus acquiring module 1101, configured to acquire an original corpus to be identified;

an intention category identifying module 1102, configured to identify the original corpus by using multiple natural language understanding NLU engines, and obtain an intention category corresponding to each NLU engine respectively;

an intention reliability determining module 1103, configured to determine an intention reliability of the original corpus according to the intention category of each NLU engine;

and an original corpus identifying module 1104, configured to identify the original corpus according to the intention reliability.

In this embodiment of the application, the intention category identifying module 1102 may specifically include the following sub-modules:

the processing interface calling submodule is used for calling the processing interfaces of the NLU engines;

the original corpus input submodule is used for respectively inputting the original corpus into a processing interface of each NLU engine so as to indicate each NLU engine to identify the original corpus;

and the intention category receiving submodule is used for receiving the intention category output by each NLU engine.
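The three submodules above amount to a fan-out call over the engines. The following sketch shows one possible shape, assuming each engine is exposed as a plain callable that takes the corpus text and returns an intention category or `None`; the type alias `NluEngine`, the function `recognize_with_engines`, and the stub engines are all hypothetical and stand in for real CP services.

```python
from typing import Callable, Dict, Optional

# Hypothetical interface: text in, intention category out (None = not recognized).
NluEngine = Callable[[str], Optional[str]]


def recognize_with_engines(corpus: str,
                           engines: Dict[str, NluEngine]) -> Dict[str, Optional[str]]:
    """Send the same corpus to every engine and collect one category per engine."""
    results: Dict[str, Optional[str]] = {}
    for name, engine in engines.items():
        try:
            results[name] = engine(corpus)
        except Exception:
            # Assumption: an engine that fails is treated as "not recognized".
            results[name] = None
    return results


# Stub engines used purely for illustration.
engines: Dict[str, NluEngine] = {
    "CP1": lambda text: "weather" if "rain" in text else None,
    "CP2": lambda text: "weather_query" if "rain" in text else None,
    "CP3": lambda text: None,
}
outputs = recognize_with_engines("will it rain tomorrow", engines)
```

Keeping an explicit `None` for engines that return nothing preserves the "all categories empty" case that later decides whether an invalid corpus is deleted.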

In this embodiment of the present application, the intention reliability determining module 1103 may specifically include the following sub-modules:

an intention score determining submodule, configured to determine an intention score corresponding to an intention category of each NLU engine, where the intention score corresponding to each intention category is obtained by testing a sample corpus with each NLU engine;

and the intention credibility calculation submodule is used for calculating the intention credibility of the original corpus according to each intention category and the corresponding intention score.

In this embodiment of the present application, the intention credibility calculation submodule may specifically include the following units:

a weight value determination unit for determining a weight value of each intention category;

and the intention credibility calculating unit is used for weighting and summing the intention scores corresponding to each intention category by adopting the weight values to obtain the intention credibility of the original corpus.
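The weighted sum performed by these two units can be sketched as follows. This is an assumption-laden illustration rather than the formula of the original: scores and weights are keyed by (engine, intention category) pairs, engines that return no category contribute nothing, and missing entries default to a score of 0.0 and a weight of 1.0. The function name `intent_credibility` and the sample values are hypothetical.

```python
from typing import Dict, Optional, Tuple

Key = Tuple[str, str]  # (engine name, intention category)


def intent_credibility(categories: Dict[str, Optional[str]],
                       scores: Dict[Key, float],
                       weights: Dict[Key, float]) -> float:
    """Weighted sum of the intention scores of the categories the engines returned."""
    total = 0.0
    for engine, category in categories.items():
        if category is None:
            continue  # this engine did not recognize the corpus
        key = (engine, category)
        # Defaults are assumptions for pairs that have no recorded score or weight.
        total += weights.get(key, 1.0) * scores.get(key, 0.0)
    return total


# Illustrative inputs only.
categories = {"CP1": "c1", "CP2": "c2", "CP3": None}
scores = {("CP1", "c1"): 1.5, ("CP2", "c2"): 0.5}
weights = {("CP1", "c1"): 1.0, ("CP2", "c2"): 1.0}
credibility = intent_credibility(categories, scores, weights)  # 1.0*1.5 + 1.0*0.5
```

The resulting value is what gets compared against the preset credibility threshold when an original corpus is classified as effective or invalid.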

In this embodiment of the present application, the original corpus identifying module 1104 may specifically include the following sub-modules:

the effective corpus identification submodule is used for identifying the original corpus as an effective corpus if the intention credibility is greater than or equal to a preset credibility threshold;

and the invalid corpus identification submodule is used for identifying the original corpus as an invalid corpus if the intention credibility is smaller than the credibility threshold.

In this embodiment, the original corpus identifying module 1104 may further include the following sub-modules:

the intention category judgment submodule is used for judging whether a plurality of intention categories corresponding to the invalid corpus are all empty or not;

an invalid corpus deleting submodule, configured to delete the invalid corpus if a plurality of intention categories corresponding to the invalid corpus are all empty;

and the invalid corpus re-identification submodule is used for dividing the invalid corpus into a plurality of corpus classes according to the intention classes if at least one of the intention classes corresponding to the invalid corpus is not empty, adopting the NLU engines to identify the invalid corpus in each corpus class again, and identifying the invalid corpus in the corpus classes as the valid corpus if the intention classes identified by each NLU engine remain unchanged.
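The handling of invalid corpora described by these three submodules can be sketched as below. It is a simplified illustration under stated assumptions: each corpus carries a tuple with one category per engine (`None` meaning not recognized), `recognize` is a hypothetical callable that re-runs every engine on a text, and "remain unchanged" is taken to mean that a fixed number of repeated runs (`rounds`) reproduce the original tuple.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple

Categories = Tuple[Optional[str], ...]    # one entry per engine
Recognizer = Callable[[str], Categories]  # hypothetical: runs all engines on a text


def triage_invalid(invalid: Dict[str, Categories], recognize: Recognizer,
                   rounds: int = 2) -> Tuple[List[str], List[str], List[str]]:
    """Split invalid corpora into (deleted, promoted_to_valid, still_invalid)."""
    deleted: List[str] = []
    classes: Dict[Categories, List[str]] = defaultdict(list)
    for text, cats in invalid.items():
        if all(c is None for c in cats):
            deleted.append(text)        # every intention category is empty
        else:
            classes[cats].append(text)  # group by identical category tuple

    promoted: List[str] = []
    remaining: List[str] = []
    for cats, texts in classes.items():
        for text in texts:
            stable = all(recognize(text) == cats for _ in range(rounds))
            (promoted if stable else remaining).append(text)
    return deleted, promoted, remaining


# Stub recognizer: "a" is stable, "c" changes on its second run.
stored = {"a": ("x", None), "b": (None, None), "c": ("y", "z")}
calls = {"c": 0}


def recognize(text: str) -> Categories:
    if text == "c":
        calls["c"] += 1
        return ("y", "z") if calls["c"] == 1 else ("y", "q")
    return stored[text]


deleted, promoted, remaining = triage_invalid(stored, recognize)
```

Grouping by the full category tuple is one simple way to form the corpus classes; a clustering step, as mentioned in step 2.4, could be substituted without changing the rest of the flow.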

In this embodiment, the apparatus may further include the following modules:

an initial category obtaining module, configured to obtain an initial category of the effective corpus;

and the effective corpus association storage module is used for storing the effective corpus, the initial categories of the effective corpus and the intention categories identified by each NLU engine into a corpus in an association manner.

In an embodiment of the present application, the apparatus may further include the following modules:

the recognition class dividing module is used for dividing the plurality of effective linguistic data into a plurality of recognition classes according to the stored initial classes and intention classes of the plurality of effective linguistic data;

the corpus quantity counting module is used for counting the quantity of the effective corpuses contained in each identification class;

and the white list generation module is used for generating a white list of the corpus according to the number of the effective corpora contained in each identification class.

In this embodiment of the present application, the identification class dividing module may specifically include the following sub-modules:

and the identification class division submodule is used for dividing the corresponding effective linguistic data with the same initial class and intention class into the same identification class.

In this embodiment of the present application, the white list generating module may specifically include the following sub-modules:

the recognition class sorting submodule is used for sorting each recognition class according to the number of the effective linguistic data contained in each recognition class;

and the white list generation submodule is used for extracting the identification class in the preset sorting interval and using the identification class as the white list of the corpus.
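White list generation as described above is essentially a count-and-rank over recognition classes. The sketch below assumes that each effective corpus is stored as a label made of its initial category and the tuple of intention categories in a fixed engine order, and that the "preset sorting interval" is the top `top_k` classes by count; `build_whitelist`, `top_k`, and the sample labels are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

# (initial category, intention category from each engine in a fixed engine order)
Label = Tuple[str, Tuple[str, ...]]


def build_whitelist(labels: List[Label], top_k: int) -> List[Label]:
    """Count corpora per recognition class and keep the top_k most populated classes."""
    counts = Counter(labels)  # identical labels fall into the same recognition class
    return [label for label, _ in counts.most_common(top_k)]


# Illustrative labels for six effective corpora.
labels: List[Label] = [
    ("encyclopedia", ("1.2", "2.1", "3.1")),
    ("encyclopedia", ("1.2", "2.1", "3.1")),
    ("encyclopedia", ("1.2", "2.1", "3.1")),
    ("weather", ("1.1", "2.3", "3.2")),
    ("weather", ("1.1", "2.3", "3.2")),
    ("music", ("1.4", "2.2", "3.5")),
]
whitelist = build_whitelist(labels, top_k=2)
```

With these sample labels the class seen only once is left out of the white list, which reflects the idea that rarely observed combinations are less trustworthy.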

Referring to fig. 12, a block diagram of a corpus identifying device according to another embodiment of the present application is shown, where the corpus identifying device may be applied to a terminal device, and specifically includes the following modules:

An intention category identifying module 1201, configured to identify a target corpus by using a plurality of NLU engines when the target corpus to be identified is received, and obtain an intention category corresponding to each NLU engine respectively;

a white list extraction module 1202, configured to extract a white list matched with the intention category from a preset corpus;

and a target corpus identifying module 1203, configured to identify the target corpus according to the intention category included in the white list.

In an embodiment of the present application, the white list may be generated by:

the recognition class dividing module is used for dividing the effective linguistic data into a plurality of recognition classes according to the initial classes and the intention classes of the effective linguistic data stored in the corpus;

In this embodiment of the present application, the white list extraction module 1202 may specifically include the following sub-modules:

an initial category obtaining sub-module, configured to obtain an initial category of the target corpus;

and the white list extraction submodule is used for extracting a white list containing the initial category and at least one intention category identified by the NLU engine from the corpus.

In this embodiment, the intention categories included in the white list include a plurality of categories, and the target corpus identifying module 1203 may specifically include the following sub-modules:

an intention score determining submodule, configured to determine an intention score corresponding to each intention category included in the white list, where the intention score corresponding to each intention category is obtained by testing the sample corpus using each NLU engine;

and the target intention category identification submodule is used for identifying the intention category corresponding to the maximum intention score as the target intention category of the target corpus.

For the apparatus embodiment, since it is substantially similar to the method embodiment, it is described relatively simply, and reference may be made to the description of the method embodiment section for relevant points.

Referring to fig. 13, a schematic diagram of a terminal device according to an embodiment of the present application is shown. As shown in fig. 13, the terminal device 1300 of the present embodiment includes: a processor 1310, a memory 1320, and a computer program 1321 stored in the memory 1320 and operable on the processor 1310. The processor 1310, when executing the computer program 1321, implements the steps of the aforementioned corpus identification method in various embodiments, such as the steps S101 to S104 shown in fig. 1. Alternatively, the processor 1310, when executing the computer program 1321, implements the functions of the modules/units in the device embodiments, such as the modules 1101 to 1104 shown in fig. 11.

Illustratively, the computer program 1321 may be partitioned into one or more modules/units that are stored in the memory 1320 and executed by the processor 1310 to accomplish the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing certain functions, which may be used to describe the execution of the computer program 1321 in the terminal device 1300. For example, the computer program 1321 may be divided into an original corpus acquiring module, an intention category identifying module, an intention reliability determining module, and an original corpus identifying module, and the specific functions of each module are as follows:

The original corpus acquiring module is used for acquiring original corpora to be identified;

an intention reliability determining module, configured to determine an intention reliability of the original corpus according to the intention category of each NLU engine;

Alternatively, the computer program 1321 may be further divided into an intention category identification module, a white list extraction module, and a target corpus identification module, and the specific functions of each module are as follows:

The terminal device 1300 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and the like. The terminal device 1300 may include, but is not limited to, a processor 1310 and a memory 1320. Those skilled in the art will appreciate that fig. 13 is only one example of the terminal device 1300 and does not limit it; the terminal device 1300 may include more or fewer components than those shown, combine some components, or use different components. For example, it may also include input and output devices, network access devices, buses, etc.

The processor 1310 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.

The memory 1320 may be an internal storage unit of the terminal device 1300, such as a hard disk or internal memory of the terminal device 1300. The memory 1320 may also be an external storage device of the terminal device 1300, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card (Flash Card) provided on the terminal device 1300. Further, the memory 1320 may include both an internal storage unit and an external storage device of the terminal device 1300. The memory 1320 is used for storing the computer program 1321 and other programs and data required by the terminal device 1300. The memory 1320 may also be used to temporarily store data that has been output or is to be output.

In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

In the embodiments provided in the present application, it should be understood that the disclosed corpus identification method, apparatus, terminal device and medium may be implemented in other ways. For example, the division of the modules or units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the processes in the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and which, when executed by a processor, implements the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, etc. The computer-readable medium may include at least: any entity or device capable of carrying computer program code to the corpus identification device or terminal device, a recording medium, computer memory, Read-Only Memory (ROM), Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, such as a USB flash drive, a removable hard disk, a magnetic disk, or an optical disk. In certain jurisdictions, in accordance with legislation and patent practice, computer-readable media may not include electrical carrier signals or telecommunications signals.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same. Although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims

1. A corpus identification method is characterized by comprising the following steps:

acquiring an original corpus to be identified;

adopting a plurality of Natural Language Understanding (NLU) engines to identify the original corpus, and respectively obtaining intention categories corresponding to each NLU engine, wherein each intention category of each NLU engine has a corresponding intention score, and the intention scores are positively correlated with the accuracy of identifying the sample corpus of the corresponding intention category by each NLU engine;

determining the intention credibility of the original corpus according to the intention category of each NLU engine, wherein the intention credibility is used for representing that the original corpus is an effective corpus or an invalid corpus;

Identifying the original corpus according to the intention credibility;

judging whether a plurality of intention categories corresponding to the invalid corpus are all empty or not for the identified invalid corpus; if at least one of the intention categories corresponding to the invalid corpus is not empty, dividing the invalid corpus into a plurality of corpus classes according to the intention categories, identifying the invalid corpus in each corpus class by adopting a plurality of NLU engines again, and identifying the invalid corpus in the corpus classes as the valid corpus if the intention categories identified by each NLU engine are unchanged.

2. The method of claim 1, wherein the identifying the raw corpus using a plurality of Natural Language Understanding (NLU) engines to obtain an intention category corresponding to each NLU engine respectively comprises:

calling processing interfaces of a plurality of NLU engines;

and receiving the intention category output by each NLU engine.

3. The method according to claim 1, wherein said determining an intention confidence level of the raw corpus according to the intention category of each NLU engine comprises:

and calculating the intention credibility of the original corpus according to each intention category and the corresponding intention score thereof.

4. The method according to claim 3, wherein said calculating an intention confidence level of the original corpus according to each intention category and its corresponding intention score comprises:

determining a weight value for each of the intent categories;

5. The method according to claim 1, wherein said identifying said original corpus according to said intention credibility comprises:

6. The method of claim 5, further comprising:

And if the plurality of intention categories corresponding to the invalid corpus are all null, deleting the invalid corpus.

7. The method according to claim 5 or 6, further comprising, after identifying the original corpus as valid corpus:

acquiring an initial category of the effective corpus;

8. The method of claim 7, further comprising:

9. The method according to claim 8, wherein said dividing the plurality of valid corpora into a plurality of recognition classes according to the stored initial classes and the intention classes of the plurality of valid corpora comprises:

10. The method according to claim 8, wherein the generating a white list of the corpus according to the number of the valid corpora included in each recognition class comprises:

11. A corpus identification method is characterized by comprising the following steps:

when a target corpus to be identified is received, identifying the target corpus by adopting a plurality of NLU engines to respectively obtain intention categories corresponding to each NLU engine, wherein each intention category of each NLU engine has a corresponding intention score, and the intention scores are positively correlated with the accuracy of identifying the sample corpus of the corresponding intention category by each NLU engine;

extracting a white list matched with the intention category from a preset corpus, wherein effective corpora stored in the white list are marked with an initial category and the intention category identified by each NLU engine;

identifying the target corpus according to the intention category contained in the white list, wherein the intention category of the target corpus is the intention category corresponding to the maximum intention score contained in the white list;

The extracting of the whitelist matched with the intention category from the preset corpus includes:

acquiring an initial category of the target corpus;

12. The method of claim 11, wherein the white list is generated by:

13. The method according to claim 12, wherein said dividing the plurality of valid corpora into a plurality of recognition classes according to the initial classes and the intention classes of the plurality of valid corpora stored in the corpus comprises:

14. The method according to claim 12, wherein the generating a white list of the corpus according to the number of the valid corpora included in each recognition class comprises:

15. The method according to claim 14, wherein the intention categories included in the white list include a plurality of categories, and the identifying the corpus according to the intention categories included in the white list includes:

16. A corpus recognition apparatus, comprising:

the intention category identification module is used for identifying the original linguistic data by adopting a plurality of Natural Language Understanding (NLU) engines to respectively obtain intention categories corresponding to the NLU engines, each intention category of each NLU engine has a corresponding intention score, and the intention scores are positively correlated with the accuracy of identifying the sample linguistic data of the corresponding intention category by each NLU engine;

An intention credibility determining module, configured to determine intention credibility of the original corpus according to the intention category of each NLU engine, where the intention credibility is used to characterize the original corpus as an effective corpus or an invalid corpus;

the original corpus identification module is used for identifying the original corpus according to the intention reliability and judging whether a plurality of intention categories corresponding to the invalid corpus are empty or not for the identified invalid corpus; if at least one of the intention categories corresponding to the invalid corpus is not empty, dividing the invalid corpus into a plurality of corpus classes according to the intention categories, identifying the invalid corpus in each corpus class by adopting a plurality of NLU engines again, and identifying the invalid corpus in the corpus classes as the valid corpus if the intention categories identified by each NLU engine are unchanged.

17. A corpus recognition apparatus, comprising:

an intention category identification module, configured to, when a target corpus to be identified is received, identify the target corpus by using a plurality of NLU engines and respectively obtain the intention category corresponding to each NLU engine, wherein each intention category of each NLU engine has a corresponding intention score, and the intention scores are positively correlated with the accuracy with which each NLU engine identifies sample corpora of the corresponding intention category;

A white list extraction module, configured to extract a white list matching the intention category from a preset corpus, where effective corpora stored in the white list are labeled with an initial category and the intention category identified by each NLU engine;

a target corpus identification module, configured to identify the corpus according to an intention category included in the white list, where the intention category of the target corpus is an intention category corresponding to a maximum intention score value included in the white list;

wherein, the white list extraction module comprises:

18. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the corpus identification method according to any one of claims 1 to 15 when executing the computer program.

19. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, implements the corpus recognition method according to any one of claims 1 to 15.