CN108597503B

CN108597503B - Test corpus generation method, device and equipment and readable and writable storage medium

Info

Publication number: CN108597503B
Application number: CN201810437036.4A
Authority: CN
Inventors: 杨博昌; 黄燕; 施展
Original assignee: iFlytek Co Ltd
Current assignee: iFlytek Co Ltd
Priority date: 2018-05-09
Filing date: 2018-05-09
Publication date: 2021-04-30
Anticipated expiration: 2038-05-09
Also published as: CN108597503A

Abstract

The application discloses a test corpus generation method, apparatus, device, and readable and writable storage medium. The method comprises: acquiring historical user interaction corpora in a human-computer interaction scene; performing semantic analysis on each historical user interaction corpus to determine its composition mode; determining the occurrence probability of each composition mode according to the composition modes of the historical user interaction corpora; and generating test corpora one by one with reference to the occurrence probability of each composition mode. Because the occurrence probability of each corpus composition mode is determined from historical user interaction corpora, and the test corpora are generated one by one based on these probabilities, the interaction process between a user and a machine can be simulated realistically and a sufficient number of test corpora can be generated, which helps ensure the accuracy and reliability of the test results of a human-computer interaction system.

Description

Test corpus generation method, device and equipment and readable and writable storage medium

Technical Field

The present application relates to the field of natural language understanding technologies, and in particular, to a test corpus generating method, device, apparatus, and readable/writable storage medium.

Background

With the continuous improvement of artificial intelligence technologies, the ways of understanding and interacting through natural language are becoming increasingly complex. To bring more convenience to users, human-computer interaction systems have been introduced in various service scenes, such as vehicle-mounted service scenes and music service scenes.

Taking a vehicle-mounted service scene as an example, the man-machine interaction process is as follows:

User: Navigate to iFlytek

Machine: Where do you intend to depart from?

User: From Sanlian

Through this interaction, the vehicle-mounted map terminal can automatically provide the user with navigation from Sanlian to iFlytek. The user does not need to operate the terminal to select the starting position, the target position, and so on, which greatly facilitates use.

Before a human-computer interaction system goes online, it must be tested to check whether its semantic understanding of user input corpora is accurate. However, the number of real corpora that can be collected from users is limited and their coverage is insufficient, so the test results of the human-computer interaction system are distorted and unreliable.

Disclosure of Invention

In view of the above, the present application provides a method, an apparatus, a device and a readable and writable storage medium for generating a test corpus, which are used to solve the problem that the test result of a human-computer interaction system is distorted and unreliable due to insufficient test corpus.

In order to achieve the above object, the following solutions are proposed:

a test corpus generation method comprises the following steps:

acquiring historical user interaction corpora in a human-computer interaction scene;

performing semantic analysis on each historical user interaction corpus, and determining the composition mode of each historical user interaction corpus;

determining the occurrence probability of each composition mode according to the composition mode of each historical user interaction corpus;

and generating test corpora one by one according to the occurrence probability of each composition mode.

Preferably, the performing semantic analysis on each historical user interaction corpus to determine a composition manner of each historical user interaction corpus includes:

performing semantic analysis on each historical user interaction corpus, and determining the service to which each historical user interaction corpus belongs;

the determining the occurrence probability of each composition mode according to the composition modes of the historical user interaction corpora comprises the following steps:

and determining the occurrence probability of each service according to the service to which each historical user interaction corpus belongs.

Preferably, the determining the occurrence probability of each service according to the service to which each historical user interaction corpus belongs includes:

and calculating the ratio of the number of the historical user interaction corpora belonging to each service to the total number of the historical user interaction corpora as the occurrence probability of the service.

Preferably, the performing semantic analysis on each historical user interaction corpus to determine a composition manner of each historical user interaction corpus further includes:

performing semantic analysis on each historical user interaction corpus, and determining operation corresponding to each historical user interaction corpus;

the determining the occurrence probability of each composition mode according to the composition modes of the historical user interaction corpora further comprises:

and determining the occurrence probability of each operation under the same service according to the service to which each historical user interaction corpus belongs and the corresponding operation.

Preferably, the determining, according to the service to which each historical user interaction corpus belongs and the corresponding operation, the occurrence probability of each operation under the same service includes:

in the historical user interaction corpora of the same service, for each operation, calculating the ratio of the number of historical user interaction corpora corresponding to the operation to the total number of historical user interaction corpora of the same service, and taking the ratio as the occurrence probability of the operation under the same service.

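Preferably, the performing semantic analysis on each historical user interaction corpus to determine a composition manner of each historical user interaction corpus further includes:

performing semantic analysis on each historical user interaction corpus, and determining a semantic slot and a semantic slot value contained in each historical user interaction corpus;

the determining the occurrence probability of each composition mode according to the composition modes of the historical user interaction corpora further comprises:

and determining the occurrence probability of each semantic slot under the same service according to the service to which each historical user interaction corpus belongs and the semantic slots contained therein.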

Preferably, the determining, according to the service to which each historical user interaction corpus belongs and the included semantic slots, the occurrence probability of each semantic slot under the same service includes:

in the historical user interaction corpora of the same service, aiming at each semantic slot, calculating the ratio of the number of the historical user interaction corpora containing the semantic slot to the total number of the historical user interaction corpora of the same service, and taking the ratio as the occurrence probability of the semantic slot under the same service.

Preferably, the method further comprises the following steps:

and performing word expansion on the semantic slot value of each semantic slot to obtain an expanded semantic slot value.

Preferably, the performing semantic analysis on each historical user interaction corpus to determine a composition manner of each historical user interaction corpus further includes:

performing semantic analysis on each historical user interaction corpus, and determining the words of a specified type contained in each historical user interaction corpus;

the determining the occurrence probability of each composition mode according to the composition modes of the historical user interaction corpora further comprises:

and determining the occurrence probability of each word of the specified type under the same service according to the service to which each historical user interaction corpus belongs and the words of the specified type contained therein.

Preferably, the determining, according to the service to which each historical user interaction corpus belongs and the words of the specified type contained therein, the occurrence probability of each word of the specified type under the same service includes:

in the historical user interaction corpora of the same service, for each word of the specified type, calculating the ratio of the number of historical user interaction corpora containing the word to the total number of historical user interaction corpora of the same service, and taking the ratio as the occurrence probability of the word of the specified type under the same service.

Preferably, the generating the test corpus one by referring to the occurrence probability of each composition manner includes:

determining, by referring to the occurrence probability of each service, the target service to which each test corpus in the one-pass test corpus currently to be generated belongs;

randomly selecting a user intention under the target service in a preset user intention library, wherein the user intention records the target service to which the corresponding historical user interaction corpus belongs, the corresponding target operation, and the target semantic slot contained therein;

determining a first inclusion condition of each test corpus to be generated for the target operation and the target semantic slot by referring to the occurrence probability of the target operation under the target service and the occurrence probability of the target semantic slot under the target service;

generating test corpora one by one at least according to the first inclusion condition until the generated test corpora include the target operation and the target semantic slot;

and combining the generated test corpora into a one-pass test corpus.

Preferably, the generating the test corpus one by referring to the occurrence probability of each composition manner further includes:

determining a second inclusion condition of each test corpus to be generated for each specified type of word by referring to the occurrence probability of each specified type of word under the target service;

the generating test corpora one by one at least according to the first inclusion condition until the generated test corpora include the target operation and the target semantic slot includes:

generating test corpora one by one according to the first inclusion condition and the second inclusion condition until the generated test corpora include the target operation and the target semantic slot.

Preferably, the generating test corpora one by one according to at least the first inclusion condition includes:

if the first inclusion condition indicates that the target semantic slot is included, randomly selecting a semantic slot value from the semantic slot values of the target semantic slot or the expanded semantic slot values, and generating, by using the selected semantic slot value, one test corpus corresponding to the first inclusion condition.

Preferably, the combining the generated test corpora into a one-pass test corpus includes:

and organizing the test corpora into a one-pass test corpus in the reverse of the order in which they were generated.

Preferably, the method further comprises the following steps:

and testing the human-computer interaction system by using the generated test corpus.

A test corpus generation apparatus, comprising:

the history corpus acquiring unit is used for acquiring history user interaction corpus in a human-computer interaction scene;

the semantic analysis unit is used for carrying out semantic analysis on each historical user interaction corpus and determining the composition mode of each historical user interaction corpus;

the probability determining unit is used for determining the occurrence probability of each composition mode according to the composition modes of the historical user interaction corpora;

and the test corpus generating unit is used for generating the test corpus one by referring to the occurrence probability of each composition mode.

A test corpus generating device comprises a memory and a processor;

the memory is used for storing programs;

the processor is configured to execute the program to implement the steps of the test corpus generating method described above.

A readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the test corpus generation method as introduced above.

According to the above technical solution, the test corpus generation method provided by the embodiments of the application acquires historical user interaction corpora in a human-computer interaction scene; performs semantic analysis on each historical user interaction corpus to determine its composition mode; determines the occurrence probability of each composition mode according to the composition modes of the historical user interaction corpora; and generates test corpora one by one with reference to the occurrence probability of each composition mode. Because the occurrence probability of each corpus composition mode is determined from historical user interaction corpora, and the test corpora are generated one by one based on these probabilities, the interaction process between a user and a machine can be simulated realistically and a sufficient number of test corpora can be generated, which helps ensure the accuracy and reliability of the test results of a human-computer interaction system.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

FIG. 1 is a flowchart illustrating a method for generating test corpus according to an embodiment of the present application;

FIG. 2 is a flowchart of another test corpus generation method disclosed in the present application;

FIG. 3 is a schematic structural diagram of a test corpus generating device according to an embodiment of the present application;

FIG. 4 is a block diagram of a hardware structure of a test corpus generating device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

FIG. 1 illustrates a test corpus generation method disclosed in an embodiment of the present application. As shown in FIG. 1, the method includes:

Step S100, acquiring historical user interaction corpora in a human-computer interaction scene.

Specifically, for a service scenario applicable to the human-computer interaction system to be tested, in this step, a historical user interaction corpus of human-computer interaction in the corresponding service scenario may be obtained.

The obtained historical user interaction corpus can be text data or voice data, and when the historical user interaction corpus is the voice data, the historical user interaction corpus can be transcribed into the text data by adopting a voice transcription method.

Step S110, performing semantic analysis on each historical user interaction corpus, and determining the composition mode of each historical user interaction corpus.

Specifically, user interaction corpora may be composed in various ways. For example, in different service scenes the semantic slots contained in a user interaction corpus differ, and the semantic slot values may differ as well. Taking a vehicle-mounted service scene as an example, the first user interaction corpus is "Navigate to iFlytek" and the second is "From Sanlian". The first user interaction corpus contains only the destination semantic slot, whose value is "iFlytek"; the second contains only the origin semantic slot, whose value is "Sanlian".

For each historical user interaction corpus obtained in the previous step, semantic analysis is performed in this step to determine its composition mode.

Step S120, determining the occurrence probability of each composition mode according to the composition modes of the historical user interaction corpora.

Specifically, user interaction corpora have multiple composition modes, and in this step the occurrence probability of each composition mode may be counted according to the composition mode determined for each historical user interaction corpus. Counting the occurrence probabilities of the various composition modes makes it possible to simulate how users interact in a real scene.

Step S130, generating test corpora one by one with reference to the occurrence probability of each composition mode.

Specifically, since the foregoing steps have determined the occurrence probability of each composition mode, test corpora may be generated one by one in a way that realistically simulates the interaction process between a user and a machine, so as to obtain a sufficient number of test corpora.

The test corpus generation method provided by the embodiments of the application acquires historical user interaction corpora in a human-computer interaction scene; performs semantic analysis on each historical user interaction corpus to determine its composition mode; determines the occurrence probability of each composition mode according to the composition modes of the historical user interaction corpora; and generates test corpora one by one with reference to the occurrence probability of each composition mode. Because the occurrence probability of each corpus composition mode is determined from historical user interaction corpora, and the test corpora are generated one by one based on these probabilities, the interaction process between a user and a machine can be simulated realistically and a sufficient number of test corpora can be generated, which helps ensure the accuracy and reliability of the test results of a human-computer interaction system.

It should be noted that the interaction between a user and a machine can be a multi-turn dialog or a single-turn dialog. One continuous interaction process between a user and a machine is called a one-pass interaction, and each individual exchange between the user and the machine within a one-pass interaction is called a single interaction. A single interaction corresponds to one user interaction corpus, and a one-pass interaction corresponds to one one-pass user interaction corpus. It will be appreciated that a one-pass user interaction corpus may include multiple user interaction corpora.

Referring to Table 1 below, a multi-turn user interaction process with a machine is illustrated:

First pass:

User: Navigate to iFlytek

Machine: Where do you intend to depart from?

User: From Sanlian

Machine: Do you need to avoid congestion?

User: Avoid congestion

Second pass:

User: Play "Forgetting Water" by Liu Dehua

Machine: Playing "Forgetting Water" by Liu Dehua for you

User: Switch to the version sung by Li Ming

Machine: Playing "Forgetting Water" by Li Ming for you

Third pass:

User: Search for nearby restaurants

Machine: Searching for nearby restaurants for you

TABLE 1

It can be understood that, when test corpora are generated one by one with reference to the occurrence probability of each composition mode, generation may proceed pass by pass: the user interaction corpora contained in one one-pass user interaction corpus are generated one by one, with one pass as the unit.
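As a minimal sketch of this pass/corpus structure, the user turns of Table 1 can be held as a list of passes, each pass being a list of user interaction corpora. The variable names and the English renderings of the utterances are illustrative assumptions, not part of the original disclosure.

```python
# Each inner list is one pass (one continuous interaction); each string is one
# user interaction corpus. Machine replies are omitted in this sketch.
# Variable names and English wording are illustrative assumptions.
passes = [
    ["Navigate to iFlytek", "From Sanlian", "Avoid congestion"],
    ["Play Forgetting Water by Liu Dehua", "Switch to the version sung by Li Ming"],
    ["Search for nearby restaurants"],
]

# Flatten the passes to obtain the individual user interaction corpora.
all_corpora = [corpus for one_pass in passes for corpus in one_pass]

print(len(passes))       # number of passes
print(len(all_corpora))  # number of individual user interaction corpora
```

Keeping the pass boundary explicit matters because generation proceeds with one pass as the unit, while the probabilities are counted over individual corpora.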

Another embodiment of the present application describes step S110 above, that is, the process of performing semantic analysis on each historical user interaction corpus and determining the composition manner of each historical user interaction corpus.

In an optional case, in this embodiment, semantic parsing may be performed on each historical user interaction corpus to determine a service to which each historical user interaction corpus belongs.

When historical user interaction corpora belong to different services, their composition modes also differ.

The historical user interaction corpora obtained by the application may belong to various services, and in this step the service to which each historical user interaction corpus belongs is determined.

Still taking the user interaction corpora in Table 1 as an example, the service to which each user interaction corpus belongs is shown in Table 2 below:

TABLE 2

Based on this, in step S120, the process of determining the occurrence probability of each composition manner may include:

and determining the occurrence probability P1 of each service according to the services to which the historical user interaction corpora belong.

Specifically, for each service, a ratio of the number of the historical user interaction corpuses belonging to the service to the total number of the historical user interaction corpuses may be calculated as the occurrence probability P1 of the service.

As shown in the example of table 2 above, there are 4 corpora in total belonging to the map service, 2 corpora in total belonging to the music service, and 6 corpora in total. The occurrence probability of the map service is 4/6 and the occurrence probability of the music service is 2/6.
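The ratio described above can be sketched as follows. This is only an illustration: the function name `service_probabilities` is hypothetical, and the per-corpus service labels are assumed values chosen to match the stated counts (4 map, 2 music), since Table 2 itself is not reproduced here.

```python
from collections import Counter
from fractions import Fraction

# Assumed service label of each historical user interaction corpus,
# consistent with the counts stated in the text (4 map, 2 music).
services = ["map", "map", "map", "music", "music", "map"]

def service_probabilities(labels):
    """P1(service) = corpora belonging to the service / total corpora."""
    counts = Counter(labels)
    total = len(labels)
    return {service: Fraction(n, total) for service, n in counts.items()}

p1 = service_probabilities(services)
print(p1["map"], p1["music"])  # 2/3 1/3, i.e. 4/6 and 2/6 in lowest terms
```

`Fraction` is used only to keep the ratios exact; plain floating-point division would serve equally well for sampling.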

On this basis, the process in step S130 of generating test corpora one by one with reference to the occurrence probability of each composition manner may specifically include:

1) Determining, by referring to the occurrence probability of each service, the target service to which each test corpus to be generated belongs.

Specifically, a target service to which each test corpus to be generated belongs may be determined in a random number manner by combining the occurrence probability of each service.

Examples are as follows:

There are two kinds of services, a map service and a music service, where the occurrence probability of the map service is 4/6 and that of the music service is 2/6. Before each test corpus is generated, a number in the range [0,1] may be randomly generated. If the generated random number falls in [0,4/6), the target service to which the test corpus belongs is determined to be the map service; otherwise, if it falls in [4/6,1], the target service is determined to be the music service.

2) And generating test corpora corresponding to the target services one by one according to the determined target services to which each test corpus to be generated belongs.
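The random-number selection in step 1) might be sketched as below, assuming the service probabilities are already known. The function name `pick_service` is hypothetical. Note that Python's `random.random()` draws from [0, 1) rather than [0, 1], which does not change the interval logic described above.

```python
import random

def pick_service(p1, r):
    """Map a random number r to a service using cumulative intervals.

    With p1 = {"map": 4/6, "music": 2/6}, r in [0, 4/6) selects "map"
    and r in [4/6, 1] selects "music".
    """
    upper = 0.0
    chosen = None
    for service, prob in p1.items():
        chosen = service
        upper += prob
        if r < upper:
            break
    # If r equals the top of the range (or round-off leaves a tiny gap),
    # the last service is returned.
    return chosen

p1 = {"map": 4 / 6, "music": 2 / 6}
rng = random.Random(0)  # seeded only to make the sketch repeatable
targets = [pick_service(p1, rng.random()) for _ in range(10)]
print(targets)
```

Passing the random number in as an argument, rather than drawing it inside the function, keeps the interval logic easy to check against the worked example.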

In another optional case, in the process of determining the composition manner of each historical user interaction corpus, the present embodiment may further perform semantic analysis on each historical user interaction corpus to determine an operation corresponding to each historical user interaction corpus.

When the operations corresponding to historical user interaction corpora differ, the corpus composition modes also differ.

The operation corresponding to a user interaction corpus is the operation that the user wants executed, as expressed by that corpus. The historical user interaction corpora obtained by the application may correspond to various operations, and in this step the corresponding operation is determined for each historical user interaction corpus.

Still taking the user interaction corpora in Table 1 as an example, the operation corresponding to each user interaction corpus is shown in Table 3 below:

TABLE 3

Based on this, in step S120, the process of determining the occurrence probability of each composition manner may further include:

and determining the occurrence probability P2 of each operation under the same service according to the service to which each historical user interaction corpus belongs and the corresponding operation.

Specifically, in the historical user interaction corpus of the same service, for each operation, a ratio of the number of the historical user interaction corpus corresponding to the operation to the total number of the historical user interaction corpus of the same service is calculated as an occurrence probability P2 of the operation under the same service.

As shown in the example in table 3 above, for the map service, the number of the corpuses corresponding to the navigation operation is 2, the number of the corpuses corresponding to the route planning operation is 1, the number of the corpuses corresponding to the POI search operation is 1, and the number of the corpuses corresponding to the map service is 4 in total. The probability of occurrence of the navigation operation under the map service is 2/4, the probability of occurrence of the path planning operation under the map service is 1/4, and the probability of occurrence of the POI search operation under the map service is 1/4.

For the music service, where the number of corpora corresponding to the playing operation is 2, and the number of corpora corresponding to the music service is 2 in total, the occurrence probability of the playing operation under the music service is 2/2.
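A minimal sketch of the P2 calculation follows. The function name `operation_probabilities` is hypothetical, and the (service, operation) pairs are assumed values chosen to match the counts stated above, since Table 3 itself is not reproduced here.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# Assumed (service, operation) label of each historical corpus, consistent
# with the stated counts: map has navigation x2, route planning x1,
# POI search x1; music has play x2.
corpora = [
    ("map", "navigation"),
    ("map", "navigation"),
    ("map", "route_planning"),
    ("music", "play"),
    ("music", "play"),
    ("map", "poi_search"),
]

def operation_probabilities(pairs):
    """P2(operation | service) = corpora with the operation / corpora of the service."""
    per_service = defaultdict(Counter)
    for service, operation in pairs:
        per_service[service][operation] += 1
    return {
        service: {op: Fraction(n, sum(ops.values())) for op, n in ops.items()}
        for service, ops in per_service.items()
    }

p2 = operation_probabilities(corpora)
print(p2["map"]["navigation"], p2["music"]["play"])  # 1/2 1
```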

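On this basis, the process in step S130 of generating test corpora one by one with reference to the occurrence probability of each composition manner may include:

1) Determining, by referring to the occurrence probability of each service, the target service to which each test corpus to be generated belongs.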

2) And randomly selecting a user intention under the target service in a preset user intention library, wherein the user intention records the target service to which the corresponding historical user interaction corpus belongs and the corresponding target operation.

It will be appreciated that for a single round of interaction process, a one-pass interaction contains only one interaction corpus that contains the user's full intent. Therefore, the method and the device can collect single-round historical user interaction corpora in advance, determine the user intention according to the single-round historical user interaction corpora, and store the determined user intention into the user intention library.

Taking the single-round historical user interaction corpus "Navigate from iFlytek to Sanlian" as an example, the target service corresponding to the user intention can be determined to be the map service, and the target operation to be the navigation operation.

In this step, a user intention under the target service is randomly selected from the user intention library, and the target operation corresponding to the test corpus to be generated is determined according to the user intention.

3) And determining the inclusion condition of each test corpus to be generated on the target operation by referring to the occurrence probability of the target operation under the target service.

Wherein the inclusion case comprises: target operation is included, target operation is not included.

Specifically, the inclusion condition of each test corpus to be generated for the target operation may be determined in a random number manner by combining the occurrence probability of the target operation under the target service.

4) And generating test corpora one by one at least according to the inclusion condition until the generated test corpora include the target operation.

Specifically, starting from a first test corpus, if it is determined that the first test corpus does not contain the target operation, the first test corpus is generated without containing the target operation, and the process is repeated to generate a next test corpus until the generated test corpus contains the target operation.

5) And combining the generated test corpora into a one-pass test corpus.
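The steps above can be sketched roughly as follows, under several assumptions: the intention library contents, the `P2` values, and the function name `generate_one_pass` are hypothetical; each test corpus is represented abstractly as a record of what it contains instead of as generated text; and the optional reverse ordering follows the preferred variant stated in the summary.

```python
import random

# Hypothetical user intention library and P2 table (assumed values).
INTENT_LIBRARY = {
    "map": [
        {"service": "map", "operation": "navigation"},
        {"service": "map", "operation": "poi_search"},
    ],
    "music": [{"service": "music", "operation": "play"}],
}
P2 = {
    "map": {"navigation": 2 / 4, "route_planning": 1 / 4, "poi_search": 1 / 4},
    "music": {"play": 2 / 2},
}

def generate_one_pass(service, rng, reverse_order=True):
    """Generate corpora for one pass until one of them contains the target operation."""
    intent = rng.choice(INTENT_LIBRARY[service])   # step 2: pick a user intention
    operation = intent["operation"]
    prob = P2[service][operation]
    if prob <= 0:
        raise ValueError("target operation has zero probability; loop would not end")
    generated = []
    while True:
        has_operation = rng.random() < prob        # step 3: inclusion condition
        generated.append(                          # step 4: generate one corpus
            {"service": service, "operation": operation if has_operation else None}
        )
        if has_operation:
            break
    # Step 5: combine into a one-pass test corpus, optionally in reverse order.
    return generated[::-1] if reverse_order else generated

one_pass = generate_one_pass("map", random.Random(7))
print(one_pass)
```

With reverse ordering, the corpus that carries the target operation ends up first in the pass, which mirrors dialogs such as Table 1 where the opening utterance states the operation.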

In another optional case, in the process of determining the composition manner of each historical user interaction corpus, the embodiment may further perform semantic analysis on each historical user interaction corpus to determine a semantic slot and a semantic slot value included in each historical user interaction corpus.

When the semantic slots and/or semantic slot values contained in historical user interaction corpora differ, the corpus composition modes also differ.

The historical user interaction corpora obtained by the application may contain various semantic slots and/or semantic slot values, and in this step the semantic slots and semantic slot values contained in each historical user interaction corpus are determined.

Still taking the user interaction corpora in Table 1 as an example, the semantic slots and semantic slot values contained in each user interaction corpus are shown in Table 4 below:

TABLE 4

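Based on this, in step S120, the process of determining the occurrence probability of each composition manner may further include:

determining the occurrence probability P3 of each semantic slot under the same service according to the service to which each historical user interaction corpus belongs and the semantic slots contained therein.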

Specifically, in the historical user interaction corpus of the same service, for each semantic slot, the ratio of the number of the historical user interaction corpus containing the semantic slot to the total number of the historical user interaction corpus of the same service is calculated as the occurrence probability P3 of the semantic slot under the same service.

As shown in the example in Table 4 above, for the map service, the destination semantic slot, the origin semantic slot, the condition semantic slot, and the POI semantic slot are each contained in 1 corpus, and the map service has 4 corpora in total. The occurrence probability of each of these four semantic slots under the map service is therefore 1/4.

For the music service, 2 corpora contain the singer semantic slot, 1 corpus contains the song semantic slot, and the music service has 2 corpora in total. The occurrence probability of the singer semantic slot under the music service is therefore 2/2, and the occurrence probability of the song semantic slot under the music service is 1/2.
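A minimal sketch of the P3 calculation follows. The function name `slot_probabilities` is hypothetical, and the per-corpus slot sets are assumed values chosen to match the counts stated above, since Table 4 itself is not reproduced here.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# Assumed (service, contained slots) for each historical corpus, consistent
# with the stated counts for the map and music services.
corpora = [
    ("map", {"destination"}),
    ("map", {"origin"}),
    ("map", {"condition"}),
    ("music", {"singer", "song"}),
    ("music", {"singer"}),
    ("map", {"poi"}),
]

def slot_probabilities(records):
    """P3(slot | service) = corpora containing the slot / corpora of the service."""
    totals = Counter()
    containing = defaultdict(Counter)
    for service, slots in records:
        totals[service] += 1
        for slot in slots:
            containing[service][slot] += 1
    return {
        service: {slot: Fraction(n, totals[service]) for slot, n in slots.items()}
        for service, slots in containing.items()
    }

p3 = slot_probabilities(corpora)
print(p3["map"]["destination"], p3["music"]["singer"], p3["music"]["song"])  # 1/4 1 1/2
```

Unlike P1 and P2, the P3 values of one service need not sum to 1, because a single corpus may contain several semantic slots (here the music values sum to 3/2).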

Further, after semantic analysis is performed on each historical user interaction corpus to obtain the semantic slots and semantic slot values it contains, the embodiment of the application may further include:

performing word expansion on the semantic slot value of each semantic slot to obtain expanded semantic slot values.

Taking the destination semantic slot as an example, the corresponding semantic slot value is iFlytek. On this basis, word expansion may further yield: Meya Photoelectricity, City Square, etc.

Taking the condition semantic slot as a further example, the corresponding semantic slot value is "avoid congestion". On this basis, word expansion may further yield: highway first, no highway, elevated road, etc.

During word expansion, words of the same type as the semantic slot value of each semantic slot can be obtained from sources such as an input method lexicon and user logs, and these words form the expanded semantic slot values.

2) Randomly selecting a user intention under the target service from a preset user intention library, wherein the user intention records the target service to which the corresponding historical user interaction corpus belongs, the corresponding target operation, and the target semantic slots it contains.

Taking the single-round historical user interaction corpus "from iFlytek to Sanfom" as an example, the target service corresponding to the user intention can be determined to be the map service, the corresponding target operation is the navigation operation, and the target semantic slots contained are the departure place and the destination.

In this step, a user intention under the target service is randomly selected from the user intention library, and the target operation corresponding to the test corpus to be generated and the target semantic slot included therein are determined according to the user intention.

3) Determining, with reference to the occurrence probability of the target operation under the target service and the occurrence probability of the target semantic slot under the target service, the first inclusion condition of each test corpus to be generated with respect to the target operation and the target semantic slot.

The first inclusion condition comprises: whether the test corpus contains the target operation, and whether it contains the target semantic slot.

Specifically, a random number manner may be adopted, and in combination with the occurrence probability of the target operation under the target service and the occurrence probability of the target semantic slot under the target service, a first inclusion condition of each test corpus to be generated for the target operation and the target semantic slot is determined.

Taking the target semantic slot as an example, two random numbers can be used to indicate whether the target semantic slot appears in the test corpus currently being generated: for example, 1 indicates that it appears and 0 indicates that it does not. The generation probability of the random number 1 is set to the occurrence probability of the target semantic slot under the target service, and the generation probabilities of the random numbers 0 and 1 sum to 1. For example, if the occurrence probability of the target semantic slot under the target service is 1/3, the generation probability of the random number 1 is 1/3 and that of the random number 0 is 2/3. A random number is then generated automatically according to these two generation probabilities: if the generated random number is 1, the test corpus currently being generated contains the target semantic slot; if it is 0, the test corpus does not contain the target semantic slot.
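The random-number decision described above amounts to a single biased draw. The following is a minimal sketch under that reading; the helper name `includes_element` is hypothetical, and the seed is used only so that the run is repeatable.

```python
import random

def includes_element(probability, rng=random):
    """Return 1 with the given occurrence probability, otherwise 0."""
    # rng.random() is uniform on [0, 1), so the comparison succeeds
    # with probability equal to `probability`.
    return 1 if rng.random() < probability else 0

rng = random.Random(0)  # seeded only to make the sketch repeatable
draws = [includes_element(1 / 3, rng) for _ in range(10_000)]
frequency = sum(draws) / len(draws)
print(frequency)  # close to 1/3 over many draws
```

The same helper could be reused for the target operation and for specified-type words, since each is decided by its own occurrence probability.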

4) Generating test corpora one by one at least according to the first inclusion condition, until the generated test corpora together contain the target operation and the target semantic slots.

Specifically, starting from the first test corpus, test corpora are generated one by one according to the inclusion of the target operation and the target semantic slot indicated by the first inclusion condition, until the generated test corpora contain the target operation and the target semantic slots.

It should be noted that, if the first inclusion condition indicates that the target semantic slot is contained, a semantic slot value is randomly selected from the semantic slot values of the target semantic slot or from its expanded semantic slot values, and the selected value is used to generate the test corpus to which that inclusion condition corresponds.

For example, if the first inclusion condition of the Nth test corpus to be generated indicates that the departure semantic slot is to be contained, one of the expanded semantic slot values corresponding to the departure semantic slot may be randomly selected, for example "Meiya Photoelectricity", and used to generate the Nth test corpus: "starting from Meiya Photoelectricity".
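As a rough illustration of this step, the sketch below picks a random value for a slot and renders it into a sentence. The value list, the sentence template and the function name `generate_item` are all hypothetical; the source does not specify how the selected value is turned into a sentence.

```python
import random

# Hypothetical data: original plus expanded values per semantic slot,
# and one illustrative sentence template per slot.
SLOT_VALUES = {
    "departure": ["Meiya Photoelectricity", "City Square"],
}
TEMPLATES = {
    "departure": "starting from {value}",
}

def generate_item(slot, rng=random):
    """Pick a random value for the slot and render it into a test corpus item."""
    value = rng.choice(SLOT_VALUES[slot])
    text = TEMPLATES[slot].format(value=value)
    # The chosen slot and value double as the expected parsing result.
    return {"text": text, "slots": {slot: value}}

item = generate_item("departure", random.Random(1))
print(item["text"])
```

Keeping the chosen slot and value alongside the generated text means the expected parsing result is available later without any manual labeling.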

5) Combining the generated test corpora into a one-pass test corpus.

In another optional case, in the process of determining the composition manner of each historical user interaction corpus, the embodiment may further perform semantic analysis on each historical user interaction corpus to determine a word of a specified type included in each historical user interaction corpus.

User interaction corpora that contain different specified-type words have different composition modes.

A historical user interaction corpus may contain several specified-type words, and this step determines the specified-type words contained in each historical user interaction corpus. Specified-type words may be reference resolution (anaphoric) words, user-personalized words, and the like.

The occurrence probability P4 of each specified-type word under the same service is then determined according to the service to which each historical user interaction corpus belongs and the specified-type words it contains.

Specifically, in the historical user interaction corpus of the same service, for each specified type of word, the ratio of the number of the historical user interaction corpus including the specified type of word to the total number of the historical user interaction corpus of the same service is calculated as the probability P4 of occurrence of the specified type of word in the same service.

For example, a one-pass interaction corpus includes:

User: Open the map

Machine: Opening the map for you

User: Go to Shi Li Temple

Machine: Where would you like to start from?

User: From Meiya Photoelectricity to there

It can be seen that the above one-pass interaction corpus includes three user interaction corpora, of which only the last contains a reference resolution word, "there", and the occurrence probability of this word under the map service is 1/4.

It will be appreciated that for a single round of interaction, a one-pass interaction comprises only one corpus of user interactions that contains the user's full intent. Therefore, the method and the device can collect single-round historical user interaction corpora in advance, determine the user intention according to the single-round historical user interaction corpora, and store the determined user intention into the user intention library.

4) Determining, with reference to the occurrence probability of each specified-type word under the target service, the second inclusion condition of each test corpus to be generated with respect to each specified-type word.

The second inclusion condition comprises: whether the test corpus contains the specified-type word.

Specifically, a random number manner may be adopted, and a second inclusion condition of each test corpus to be generated for each specified type of word may be determined in combination with an occurrence probability of each specified type of word under the target service. The specific implementation process may refer to the process of determining whether the target semantic slot appears as described in the foregoing embodiment.

5) Generating test corpora one by one according to the first inclusion condition and the second inclusion condition, until the generated test corpora together contain the target operation and the target semantic slots.

Specifically, starting from the first test corpus, test corpora are generated one by one according to the inclusion of the target operation and the target semantic slot indicated by the first inclusion condition and the inclusion of specified-type words indicated by the second inclusion condition, until the generated test corpora contain the target operation and the target semantic slots.

For example, if the first inclusion condition of the Nth test corpus to be generated indicates that the departure semantic slot is to be contained, one of the expanded semantic slot values corresponding to the departure semantic slot may be randomly selected, for example "Meiya Photoelectricity", and used to generate the Nth test corpus: "starting from Meiya Photoelectricity".

6) Combining the generated test corpora into a one-pass test corpus.

Optionally, the test corpora may be organized into a one-pass test corpus in the reverse of the order in which they were generated.

That is, the first generated test corpus is used as the last corpus of the one-pass test corpus, and the last generated test corpus is used as the first corpus of the one-pass test corpus.

Because the one-pass test corpus is generated according to the user intention, the test corpora generated earlier can reflect that intention, so it is more reasonable to place them later in the one-pass test corpus.

Optionally, before the test corpora are generated, the method may further include a step of determining the number of test corpora in the one-pass test corpus to be generated. Specifically, according to the determined target service to which the test corpora to be generated belong, the average number of corpora per one-pass historical user interaction corpus belonging to the target service may be used as the number of test corpora in the one-pass test corpus to be generated. Alternatively, this number may be determined randomly. The number so determined serves as a reference number, and the one-pass test corpus is generated according to the above process.
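Both options for choosing the reference number can be sketched briefly. The function name `reference_item_count`, the rounding of the average to a whole number, and the choice of drawing the random count between the shortest and longest observed session are assumptions of this sketch, not details given in the source.

```python
import random
from statistics import mean

def reference_item_count(sessions, rng=None):
    """sessions: historical one-pass corpora of the target service,
    each given as a list of user utterances.

    Without rng: average utterance count per session, rounded (at least 1).
    With rng: a random count between the shortest and longest session.
    """
    lengths = [len(session) for session in sessions]
    if rng is not None:
        return rng.randint(min(lengths), max(lengths))
    return max(1, round(mean(lengths)))

# Hypothetical history for the map service.
history = [
    ["open the map", "go to Shi Li Temple", "from Meiya Photoelectricity to there"],
    ["navigate to City Square"],
]
count = reference_item_count(history)
random_count = reference_item_count(history, random.Random(0))
print(count, random_count)
```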

The present solution is described below by way of a specific example.

1. Determining, according to the historical user interaction corpora, the occurrence probability of each service, the occurrence probability of each operation under the same service, the occurrence probability of each semantic slot under the same service, and the occurrence probability of each specified-type word under the same service.

2. Determining, according to the occurrence probability P1 of each service, that the target service to which the one-pass test corpus to be generated belongs is the map service.

3. Randomly selecting a user intention under the map service from the user intention library, wherein the target service recorded by the user intention is the map service, the target operation is navigation, and the target semantic slots contained are the destination and the departure place.

4. Determining each test corpus in the one-pass test corpus to be generated according to the occurrence probability P2 of the navigation operation under the map service, the occurrence probability P31 of the destination semantic slot under the map service, the occurrence probability P32 of the departure semantic slot under the map service, and the occurrence probability P4 of the specified-type words under the map service.

4a. Creating the tail sentence of the one-pass test corpus:

According to P2, P31, P32 and P4, it is determined that the tail sentence contains the navigation operation, the departure semantic slot, and a reference resolution word: "there". One value is randomly selected from the expanded semantic slot values corresponding to the departure semantic slot: Meiya Photoelectricity.

On this basis, the tail sentence is determined to contain the words "navigate", "Meiya Photoelectricity" and "there", and these words are composed into a sentence to obtain the tail sentence: "navigate from Meiya Photoelectricity to there".

4b. Creating the sentence before the tail sentence of the one-pass test corpus:

According to P2, P31, P32 and P4, it is determined that the sentence before the tail sentence contains the destination semantic slot. One value is randomly selected from the expanded semantic slot values corresponding to the destination semantic slot: Shi Li Temple.

On this basis, the sentence before the tail sentence is determined to contain the word "Shi Li Temple", and this word is composed into a sentence to obtain the sentence before the tail sentence: "go to Shi Li Temple".

The two generated test corpora already contain the navigation operation, the destination semantic slot and the departure semantic slot, so generation can stop.

The one-pass test corpus and the corresponding semantic parsing results are shown in the following table 5:

TABLE 5
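The generation loop of this example can be sketched as follows. This is a simplified illustration rather than the patented procedure itself: it only draws elements that are not yet covered, it omits specified-type words (P4) and sentence rendering, and the probability values, value lists and the name `generate_one_pass` are hypothetical.

```python
import random

def generate_one_pass(intent, probabilities, slot_values, rng):
    """Generate items until the target operation and all target slots are covered.

    intent: {"operation": str, "slots": [str, ...]}
    probabilities: occurrence probability per operation or slot name (each > 0)
    slot_values: candidate (expanded) values per slot
    """
    remaining = [intent["operation"], *intent["slots"]]
    generated = []
    while remaining:
        # One biased draw per element that is still uncovered.
        chosen = [e for e in remaining if rng.random() < probabilities[e]]
        if not chosen:
            continue  # nothing was drawn for this item; draw again
        item = {"operation": None, "slots": {}}
        for element in chosen:
            if element == intent["operation"]:
                item["operation"] = element
            else:
                item["slots"][element] = rng.choice(slot_values[element])
            remaining.remove(element)
        generated.append(item)
    # The first generated item becomes the tail sentence of the one-pass corpus.
    return list(reversed(generated))

intent = {"operation": "navigation", "slots": ["destination", "departure"]}
probabilities = {"navigation": 0.5, "destination": 0.5, "departure": 0.5}  # assumed values
slot_values = {
    "destination": ["Shi Li Temple", "City Square"],
    "departure": ["Meiya Photoelectricity"],
}
session = generate_one_pass(intent, probabilities, slot_values, random.Random(42))
for item in session:
    print(item)
```

Each returned item records the operation and slot values it was built from, which is the information a later test needs as the expected parsing result.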

In another embodiment of the present application, another test corpus generating method is disclosed, and as shown in fig. 2, the method may include:

Step S200, acquiring historical user interaction corpora in a human-computer interaction scene.

Step S210, performing semantic analysis on each historical user interaction corpus, and determining the composition mode of each historical user interaction corpus.

Step S220, determining the occurrence probability of each composition mode according to the composition modes of the historical user interaction corpora.

Step S230, generating test corpora one by one with reference to the occurrence probability of each composition mode.

In this embodiment, steps S200 to S230 correspond to steps S100 to S130 in the previous embodiment one to one, and reference is made to the foregoing description for details, which are not repeated herein.

Step S240, testing the human-computer interaction system by using the generated test corpora.

Compared with the foregoing embodiments, the present embodiment further increases a process of testing the human-computer interaction system by using the generated test corpus.

The test of the human-computer interaction system mainly checks whether the system's parsing result for each test corpus is accurate. The test accuracy can be used to measure whether the human-computer interaction system is usable: if the accuracy exceeds a preset threshold, the system is considered usable; otherwise it is considered unusable. In a specific test, each generated test corpus is input into the human-computer interaction system, and the semantic parsing result that the system produces for the input test corpus is compared with the semantic parsing result recorded when the test corpus was generated (the real semantic parsing result). If the two are consistent, the parsing result is judged accurate; otherwise it is judged inaccurate.

It should be noted that, when any test corpus in a one-pass test corpus is parsed incorrectly, the whole one-pass test corpus is considered to be parsed incorrectly.

After all test corpora have been tested, the semantic parsing accuracy of the human-computer interaction system can be calculated with each one-pass test corpus as a unit, that is, as the ratio of the number of correctly parsed one-pass interactions to the number of all one-pass interactions. It can also be calculated with each test corpus as a unit, that is, as the ratio of the number of correctly parsed corpora to the number of all corpora.

For example, in the test result of the one-pass test corpus generated in the example of Table 5 above, 1 corpus is parsed correctly and 1 corpus is parsed incorrectly; calculated in units of corpora, the parsing accuracy is 1/2 = 0.5.

TABLE 6
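The two accuracy measures can be illustrated with a short sketch. The function name `parse_accuracy` and the parse dictionaries are hypothetical; they only mirror the case of one correctly parsed and one incorrectly parsed corpus in a single one-pass test corpus.

```python
def parse_accuracy(sessions):
    """sessions: list of one-pass corpora, each a list of
    (predicted_parse, expected_parse) pairs.

    Returns (per_session_accuracy, per_corpus_accuracy).
    """
    per_corpus = [pred == exp for session in sessions for pred, exp in session]
    # A session counts as correct only if every corpus in it is parsed correctly.
    per_session = [all(pred == exp for pred, exp in session) for session in sessions]
    return sum(per_session) / len(per_session), sum(per_corpus) / len(per_corpus)

# Hypothetical parses for a two-sentence session: one correct, one incorrect.
session = [
    ({"slots": {"destination": "Shi Li Temple"}},
     {"slots": {"destination": "Shi Li Temple"}}),
    ({"operation": "navigation"},
     {"operation": "navigation", "slots": {"departure": "Meiya Photoelectricity"}}),
]
by_session, by_corpus = parse_accuracy([session])
print(by_session, by_corpus)
```

Under the rule stated above, the single incorrect corpus makes the whole one-pass test corpus count as incorrect, so the per-session figure is lower than the per-corpus figure in this case.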

The following describes the test corpus generating device provided in the embodiment of the present application, and the test corpus generating device described below and the test corpus generating method described above may be referred to in a corresponding manner.

Referring to fig. 3, fig. 3 is a schematic structural diagram of a test corpus generating device disclosed in the embodiment of the present application. As shown in fig. 3, the apparatus may include:

a historical corpus obtaining unit 11, configured to obtain historical user interaction corpuses in a human-computer interaction scene;

the semantic analysis unit 12 is configured to perform semantic analysis on each historical user interaction corpus and determine a composition manner of each historical user interaction corpus;

a probability determination unit 13, configured to determine an occurrence probability of each composition mode according to the composition modes of each historical user interaction corpus;

and the test corpus generating unit 14 is configured to generate test corpora one by one with reference to the occurrence probability of each composition mode.

Optionally, the semantic parsing unit may include:

and the service analysis unit is used for performing semantic analysis on each historical user interaction corpus and determining the service to which each historical user interaction corpus belongs. Based on this, the probability determination unit may include:

and the service probability calculation unit is used for determining the occurrence probability of each service according to the service to which each historical user interaction corpus belongs.

Optionally, the service probability calculating unit may include:

and the service probability calculating subunit is used for calculating the ratio of the number of the historical user interaction corpora belonging to each service to the total number of the historical user interaction corpora as the occurrence probability of the service.

Optionally, the semantic parsing unit may further include:

and the operation analysis unit is used for performing semantic analysis on each historical user interaction corpus and determining the operation corresponding to each historical user interaction corpus. Based on this, the probability determination unit may further include:

and the operation probability calculation unit is used for determining the occurrence probability of each operation under the same service according to the service to which each historical user interaction corpus belongs and the corresponding operation.

Optionally, the operation probability calculating unit may include:

and the operation probability calculating subunit is configured to calculate, for each operation in the historical user interaction corpora of the same service, the ratio of the number of historical user interaction corpora corresponding to the operation to the total number of historical user interaction corpora of the same service, as the occurrence probability of the operation under the same service.

Optionally, the semantic parsing unit may further include:

and the semantic slot analyzing unit is used for performing semantic analysis on each historical user interaction corpus and determining a semantic slot and a semantic slot value contained in each historical user interaction corpus. Based on this, the probability determination unit may further include:

and the semantic slot probability calculating unit is used for determining the occurrence probability of each semantic slot under the same service according to the service to which each historical user interaction corpus belongs and the semantic slots contained in the historical user interaction corpus.

Optionally, the semantic slot probability calculating unit may include:

and the semantic slot calculating subunit is configured to calculate, for each semantic slot in the historical user interaction corpora of the same service, the ratio of the number of historical user interaction corpora containing the semantic slot to the total number of historical user interaction corpora of the same service, as the occurrence probability of the semantic slot under the same service.

Optionally, the apparatus of the present application may further include:

and the word expansion unit is used for performing word expansion on the semantic slot value of each semantic slot to obtain the expanded semantic slot value.

Optionally, the semantic parsing unit may further include:

and the appointed word analysis unit is used for performing semantic analysis on each historical user interaction corpus and determining appointed type words contained in each historical user interaction corpus. Based on this, the probability determination unit may further include:

and the appointed word probability calculating unit is used for determining the occurrence probability of each appointed type of word under the same service according to the service to which each historical user interaction corpus belongs and the included appointed type of words.

Optionally, the unit for calculating probability of specified word may include:

and the specified word probability calculating subunit is configured to calculate, for each specified-type word in the historical user interaction corpora of the same service, the ratio of the number of historical user interaction corpora containing the specified-type word to the total number of historical user interaction corpora of the same service, as the occurrence probability of the specified-type word under the same service.

Optionally, the test corpus generating unit may include:

a target service determining unit, configured to determine, in a one-pass test corpus to be generated currently, a target service to which each test corpus belongs, with reference to an occurrence probability of each service;

the user intention selecting unit is used for randomly selecting a user intention under the target service in a preset user intention library, and the user intention records the target service to which the corresponding historical user interaction corpus belongs, the corresponding target operation and a target semantic slot contained in the target service;

a first inclusion condition determining unit, configured to determine, with reference to the occurrence probability of the target operation under the target service and the occurrence probability of the target semantic slot under the target service, a first inclusion condition of each test corpus to be generated with respect to the target operation and the target semantic slot;

a test corpus item-by-item generating unit, configured to generate test corpora one by one at least according to the first inclusion condition, until the generated test corpora together contain the target operation and the target semantic slots;

and the test corpus organizing unit is used for combining the generated test corpuses into a universal test corpus.

Optionally, the test corpus generating unit may further include:

and the second inclusion condition determining unit is configured to determine, with reference to the occurrence probability of each specified-type word under the target service, the second inclusion condition of each test corpus to be generated with respect to each specified-type word. Based on this, the test corpus item-by-item generating unit may include:

and the test corpus generating subunit is configured to generate test corpora one by one according to the first inclusion condition and the second inclusion condition, until the generated test corpora together contain the target operation and the target semantic slots.

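Optionally, the process by which the test corpus item-by-item generating unit generates test corpora one by one at least according to the first inclusion condition may specifically include:

if the first inclusion condition indicates that the target semantic slot is contained, randomly selecting a semantic slot value from the semantic slot values of the target semantic slot or from its expanded semantic slot values, and generating the corresponding test corpus by using the selected semantic slot value.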

Optionally, the test corpus organizing unit may include:

and the reverse order organization unit is used for organizing the test corpora into one-pass test corpora according to the reverse order of the generation order of the test corpora.

Optionally, the apparatus of the present application may further include:

and the system testing unit is used for testing the human-computer interaction system by utilizing the generated testing corpus.

The test corpus generating device provided by the embodiment of the application can be applied to test corpus generating equipment, such as a PC terminal, a cloud platform, a server cluster and the like. Optionally, fig. 4 shows a block diagram of a hardware structure of the test corpus generating device, and referring to fig. 4, the hardware structure of the test corpus generating device may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;

in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete mutual communication through the communication bus 4;

the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention, etc.;

the memory 3 may include a high-speed RAM memory, and may further include a non-volatile memory, such as at least one disk memory;

wherein the memory stores a program and the processor can call the program stored in the memory, the program for:

Optionally, the detailed functions and extended functions of the program may be as described above.

Embodiments of the present application further provide a storage medium, where a program suitable for execution by a processor may be stored, where the program is configured to:

Finally, it should also be noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A test corpus generation method is characterized by comprising the following steps:

and generating test corpora one by one with reference to the occurrence probability of each composition mode, to obtain a sufficient number of test corpora.

2. The method according to claim 1, wherein said semantically parsing each of said historical user interaction corpus to determine a composition manner of each of said historical user interaction corpus comprises:

3. The method according to claim 2, wherein said semantically parsing each of said historical user interaction corpus to determine a composition manner of each of said historical user interaction corpus further comprises:

4. The method according to claim 3, wherein said semantically parsing each of said historical user interaction corpus to determine a composition manner of each of said historical user interaction corpus further comprises:

5. The method of claim 4, further comprising:

6. The method according to claim 4, wherein said semantically parsing each of said historical user interaction corpus to determine a composition manner of each of said historical user interaction corpus further comprises:

7. The method according to claim 6, wherein said generating the test corpus item by item with reference to the occurrence probability of each composition manner comprises:

and combining the generated test corpora into a one-pass test corpus.

8. The method according to claim 7, wherein the generating test corpus item by item with reference to the occurrence probability of each composition manner further comprises:

9. The method according to claim 7, wherein the step of combining the generated test corpora into a one-pass test corpus comprises:

10. The method according to any one of claims 1-9, further comprising:

11. A test corpus generating device, comprising:

and the test corpus generating unit is configured to generate test corpora one by one with reference to the occurrence probability of each composition mode, so as to obtain a sufficient number of test corpora.

12. A test corpus generating equipment, characterized by comprising a memory and a processor;

the memory is used for storing programs;

the processor is configured to execute the program to implement the steps of the test corpus generating method according to any one of claims 1 to 10.

13. A readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the test corpus generation method according to any one of claims 1-10.