CN113449092A

CN113449092A - Corpus obtaining method and device, electronic equipment and storage medium

Info

Publication number: CN113449092A
Application number: CN202110772111.4A
Authority: CN
Inventors: 任文墨; 李博; 张晓�
Original assignee: Jingdong Technology Holding Co Ltd
Current assignee: Jingdong Technology Holding Co Ltd
Priority date: 2021-07-08
Filing date: 2021-07-08
Publication date: 2021-09-28

Abstract

The disclosure provides a corpus acquiring method, a corpus acquiring device, an electronic device, and a storage medium. The scheme is as follows: receiving a dialogue sentence and determining a response sentence corresponding to the dialogue sentence; processing the dialogue sentence and the response sentence respectively to obtain first sentence information corresponding to the dialogue sentence and second sentence information corresponding to the response sentence; and extracting a target corpus corresponding to a preset corpus tag from the first sentence information and the second sentence information. The disclosure can effectively improve the efficiency and effect of corpus acquisition, and the acquired target corpus can effectively meet the personalized requirements of actual service scenarios.

Description

Corpus obtaining method and device, electronic equipment and storage medium

Technical Field

The present disclosure relates to the field of natural language processing technologies, and in particular, to a corpus obtaining method, an apparatus, an electronic device, and a storage medium.

Background

With the continuous development of artificial intelligence technology, model algorithms in the field of Natural Language Processing (NLP) have become increasingly advanced. Model training depends on corpora: the higher the corpus quality, the larger the corpus volume, and the closer the corpus is to the current real service scenario and to the language habits of real users, the better the trained model generalizes in the real service scenario.

In the related art, rich corpora are usually obtained through external purchase.

With this approach, corpus acquisition is inefficient and costly, the personalized requirements of the real service scenario cannot be met, and the corpus source cannot be controlled, so the acquired corpus may not match the real service scenario and the corpus quality may be poor.

Disclosure of Invention

The present disclosure is directed to solving, at least to some extent, one of the technical problems in the related art.

Therefore, the present disclosure aims to provide a corpus acquiring method, device, electronic device and storage medium, which can effectively improve corpus acquiring efficiency and acquiring effect, and enable the acquired corpus to effectively meet personalized requirements of actual service scenes.

In order to achieve the above object, an embodiment of the first aspect of the present disclosure provides a corpus acquiring method, including: receiving a dialogue sentence and determining a response sentence corresponding to the dialogue sentence; processing the dialogue sentence and the response sentence respectively to obtain first sentence information corresponding to the dialogue sentence and second sentence information corresponding to the response sentence; and extracting a target corpus corresponding to a preset corpus tag from the first sentence information and the second sentence information.

According to the corpus acquiring method provided by the embodiment of the first aspect of the disclosure, the dialogue statement is received, the response statement corresponding to the dialogue statement is determined, the dialogue statement and the response statement are respectively processed to obtain the first statement information corresponding to the dialogue statement and the second statement information corresponding to the response statement, and the target corpus corresponding to the preset corpus tag is extracted from the first statement information and the second statement information, so that corpus acquiring efficiency and acquiring effect can be effectively improved, and the acquired corpus can effectively meet the personalized requirement of an actual service scene.

In order to achieve the above object, an embodiment of a second aspect of the present disclosure provides a corpus acquiring device, including: a first receiving module, configured to receive a dialogue sentence and determine a response sentence corresponding to the dialogue sentence; a first processing module, configured to process the dialogue sentence and the response sentence respectively to obtain first sentence information corresponding to the dialogue sentence and second sentence information corresponding to the response sentence; and an extraction module, configured to extract a target corpus corresponding to a preset corpus tag from the first sentence information and the second sentence information.

The corpus acquiring device provided by the embodiment of the second aspect of the disclosure receives the dialogue statement, determines the response statement corresponding to the dialogue statement, processes the dialogue statement and the response statement respectively to obtain the first statement information corresponding to the dialogue statement and the second statement information corresponding to the response statement, and extracts the target corpus corresponding to the preset corpus tag from the first statement information and the second statement information, so that corpus acquiring efficiency and acquiring effect can be effectively improved, and the acquired corpus can effectively meet the personalized requirement of an actual service scene.

An embodiment of a third aspect of the present disclosure provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the corpus acquiring method as set forth in the embodiment of the first aspect of the present disclosure.

A fourth aspect of the present disclosure provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the corpus acquiring method as set forth in the first aspect of the present disclosure.

An embodiment of a fifth aspect of the present disclosure provides a computer program product; when instructions in the computer program product are executed by a processor, the corpus acquiring method as set forth in the embodiment of the first aspect of the present disclosure is performed.

Additional aspects and advantages of the disclosure will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the disclosure.

Drawings

The foregoing and/or additional aspects and advantages of the present disclosure will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a schematic flow chart of a corpus acquiring method according to an embodiment of the present disclosure;

FIG. 2 is a schematic diagram of an architecture of a corpus acquiring device according to an embodiment of the present disclosure;

FIG. 3 is a flow chart illustrating a corpus acquiring method according to another embodiment of the present disclosure;

FIG. 4 is a flowchart illustrating a corpus acquiring method according to another embodiment of the present disclosure;

FIG. 5 is a diagram of a corpus acquiring interface according to an embodiment of the present disclosure;

FIG. 6 is a corpus acquisition record interface diagram according to an embodiment of the present disclosure;

FIG. 7 is a flow chart illustrating a corpus retrieval method according to an embodiment of the present disclosure;

FIG. 8 is a schematic structural diagram of a corpus acquiring device according to an embodiment of the present disclosure;

FIG. 9 is a schematic structural diagram of a corpus acquiring device according to another embodiment of the present disclosure;

FIG. 10 illustrates a block diagram of an exemplary electronic device suitable for use in implementing embodiments of the present disclosure.

Detailed Description

Reference will now be made in detail to the embodiments of the present disclosure, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of illustrating the present disclosure and should not be construed as limiting the same. On the contrary, the embodiments of the disclosure include all changes, modifications and equivalents coming within the spirit and terms of the claims appended hereto.

Fig. 1 is a schematic flow chart of a corpus acquiring method according to an embodiment of the present disclosure.

It should be noted that the main execution body of the corpus acquiring method of this embodiment is a corpus acquiring device, the device may be implemented by software and/or hardware, the device may be configured in an electronic device, and the electronic device may include, but is not limited to, a terminal, a server, and the like.

As shown in fig. 1, the corpus acquiring method includes:

S101: a dialogue sentence is received and a response sentence corresponding to the dialogue sentence is determined.

A user submits a consultation sentence through a client of an intelligent dialogue system; this consultation sentence may be called a dialogue sentence. Correspondingly, the server of the intelligent dialogue system receives the user's consultation sentence and may generate a corresponding reply sentence from it; this reply sentence may be called a response sentence.

The intelligent dialogue system is a computer system capable of holding a coherent dialogue with people. A user can initiate a dialogue and input consultation sentences through the client of the intelligent dialogue system, and the server of the intelligent dialogue system can respond to the consultation from the client and reply accordingly, thereby realizing human-computer interaction.

Optionally, the corpus acquiring method of the embodiments of the present disclosure may be applied to a human-computer interaction system that implements intelligent dialogue (that is, the intelligent dialogue system described above), and may of course also be applied to any other possible human-computer interaction system, which is not limited herein.

That is to say, the embodiments of the present disclosure support collecting corpora directly from an online human-computer interaction system that supports artificial intelligence technology. Because such a system usually operates in an actual service scenario, the collected target corpus can effectively meet the personalized requirements of that scenario.

After the dialogue sentence is received and the response sentence corresponding to the dialogue sentence is determined, execution of the subsequent steps may be triggered.

S102: the dialogue sentence and the response sentence are processed respectively to obtain first sentence information corresponding to the dialogue sentence and second sentence information corresponding to the response sentence.

After receiving the dialogue statement and determining the response statement corresponding to the dialogue statement, the dialogue statement and the response statement may be processed respectively to obtain first statement information corresponding to the dialogue statement and second statement information corresponding to the response statement.

The dialogue sentence may have some associated sentence information, which may be referred to as first sentence information, and the response sentence may have some associated sentence information, which may be referred to as second sentence information. Sentence information describes information related to a sentence, for example context information, semantics, scenario information related to the sentence, keywords contained in the sentence, and so on.

For example, the first sentence information or the second sentence information may specifically be dialogue time information, dialogue content information, and the like related to the dialogue sentence or the response sentence, which is not limited herein.

For example, if the content of the dialogue sentence is "query today's weather in Beijing" and the time corresponding to the dialogue sentence is t1, the semantic content of the dialogue sentence and the time t1 may together be used as the first sentence information, which is not limited herein.

Accordingly, if the content of the response sentence is "Beijing is sunny today" and the time corresponding to the response sentence is t2, the semantic content of the response sentence and the time t2 may together be used as the second sentence information, which is not limited herein.

For example, after receiving a dialogue sentence and determining a response sentence corresponding to the dialogue sentence, the server may parse the dialogue sentence and the response sentence to obtain the first sentence information corresponding to the dialogue sentence and the second sentence information corresponding to the response sentence.

The parsing process may be, for example, a model parsing process, a text analysis process, or the like, which is not limited thereto.
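As a minimal sketch of the processing in S102, the following Python code builds sentence information from a sentence and its time. The `SentenceInfo` class, the `parse_sentence` function, and the whitespace keyword split are assumptions made for illustration only; the disclosure does not specify a data structure or a parsing technique beyond model parsing or text analysis.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class SentenceInfo:
    """Hypothetical container for the information associated with one sentence."""
    content: str                 # semantic content of the sentence
    time: datetime               # time corresponding to the sentence
    keywords: List[str] = field(default_factory=list)


def parse_sentence(text: str, time: datetime) -> SentenceInfo:
    """Toy stand-in for model parsing or text analysis (assumption):
    lowercases the text and splits it on whitespace to obtain keywords."""
    words = [w.strip(".,?!").lower() for w in text.split()]
    return SentenceInfo(content=text, time=time, keywords=[w for w in words if w])


# First sentence information (dialogue sentence at t1) and
# second sentence information (response sentence at t2).
t1 = datetime(2021, 7, 8, 9, 0, 0)
t2 = datetime(2021, 7, 8, 9, 0, 1)
first_info = parse_sentence("Query today's weather in Beijing", t1)
second_info = parse_sentence("Beijing is sunny today", t2)
```

In a real deployment the keyword split would presumably be replaced by whichever parsing model or text analysis the service already uses.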

S103: a target corpus corresponding to the preset corpus tag is extracted from the first sentence information and the second sentence information.

A corpus tag may be used to describe the category of a corpus. A preset corpus tag is a corpus tag pre-configured by the human-computer interaction system, for example a preset dialogue type tag, dialogue scene tag, or dialogue keyword tag. Corpus tags may also be configured adaptively according to actual corpus acquisition requirements.

That is, after the dialogue statement and the response statement are processed respectively to obtain the first statement information corresponding to the dialogue statement and the second statement information corresponding to the response statement, the corpus corresponding to the preset corpus tag may be extracted from the first statement information and the second statement information, and the corpus may be referred to as a target corpus.

For example, if the preset corpus tag is a dialogue scene tag, the corpus corresponding to the dialogue scene tag may be extracted from the obtained first sentence information corresponding to the dialogue sentence and the obtained second sentence information corresponding to the response sentence, and the corpus corresponding to the dialogue scene tag is the target corpus.
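The extraction in S103 can be sketched as follows. The tag names, the `PRESET_TAG_RULES` table, and the substring rules are hypothetical; the disclosure leaves open how a preset corpus tag is evaluated against sentence information.

```python
from typing import Callable, Dict, Optional

# Hypothetical preset corpus tags, each paired with a toy rule that decides
# whether a (dialogue, response) pair belongs under that tag.
PRESET_TAG_RULES: Dict[str, Callable[[str, str], bool]] = {
    "scene:weather": lambda dialogue, response: "weather" in dialogue.lower(),
    "scene:airport": lambda dialogue, response: "airport" in dialogue.lower(),
}


def extract_target_corpus(first_info: str, second_info: str,
                          preset_tag: str) -> Optional[dict]:
    """Return the corpus for preset_tag, or None when the tag is not
    configured or its rule does not match the sentence information."""
    rule = PRESET_TAG_RULES.get(preset_tag)
    if rule is None or not rule(first_info, second_info):
        return None
    return {"tag": preset_tag, "dialogue": first_info, "response": second_info}
```

For instance, calling `extract_target_corpus` with the weather dialogue above and the tag `"scene:weather"` yields a tagged dialogue-response pair, while the tag `"scene:airport"` yields `None`.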

In this embodiment, by receiving a dialog sentence, determining a response sentence corresponding to the dialog sentence, processing the dialog sentence and the response sentence respectively to obtain first sentence information corresponding to the dialog sentence and second sentence information corresponding to the response sentence, and extracting a target corpus corresponding to a preset corpus tag from the first sentence information and the second sentence information, corpus acquisition efficiency and acquisition effect can be effectively improved, and the acquired corpus can effectively meet personalized requirements of an actual service scene.

Fig. 2 is a schematic diagram of an architecture of a corpus acquiring device according to an embodiment of the present disclosure. The following embodiments of the present disclosure may be described in conjunction with fig. 2, taking an intelligent dialogue system as an example of an online human-computer interaction system supporting artificial intelligence technology, which is not limited by the present disclosure.

After a consulting user initiates a consultation session with the intelligent dialogue system, the target corpus can be extracted from the consultation session according to the corpus acquiring method provided by the embodiments of the present disclosure, and the extracted target corpus can be pushed to the big data platform for storage so that a corpus acquisition user can obtain it. Accordingly, the corpus acquiring device may include:

An artificial intelligence model algorithm platform: the corpus acquisition user can input a corpus acquisition request to the artificial intelligence model algorithm platform, and the artificial intelligence model algorithm platform can convert the user's corpus acquisition request into a corpus acquisition request adapted to the corpus acquiring device.

The query engine Presto: Presto is an open-source distributed Structured Query Language (SQL) query engine suited to interactive analytical queries. It is used to pass the corpus acquisition request converted by the artificial intelligence model algorithm platform to the big data processing platform, after which the big data processing platform can feed the corpus acquisition result back to the artificial intelligence model algorithm platform so that the corpus acquisition user can download the corpus.

Fig. 3 is a schematic flow chart of a corpus acquiring method according to another embodiment of the present disclosure, and the description of fig. 3 may be combined with fig. 2.

As shown in fig. 3, the corpus acquiring method includes:

S301: a dialogue sentence is received and a response sentence corresponding to the dialogue sentence is determined.

S302: the dialogue sentence and the response sentence are processed respectively to obtain first sentence information corresponding to the dialogue sentence and second sentence information corresponding to the response sentence.

For the description of S301 to S302, reference may be made to the above embodiments, which are not described herein again.

S303: dialogue description information is determined according to the first sentence information and the second sentence information.

After the dialogue statement and the response statement are processed respectively to obtain the first statement information corresponding to the dialogue statement and the second statement information corresponding to the response statement, the dialogue description information can be determined according to the first statement information and the second statement information.

Information that summarizes the dialogue content may be referred to as dialogue description information. Whereas the sentence information described above relates to a single sentence, the dialogue description information describes the dialogue as a whole, for example the dialogue scene, dialogue semantics, dialogue result information, dialogue feedback information, and the like, which is not limited herein.

For example, if the first sentence information is "query today's passenger flow at Xinzheng Airport" and the second sentence information is "today's passenger flow at Xinzheng Airport is 20,000 passengers", summary information on the whole dialogue may be determined as the dialogue description information (for example: airport passenger flow query, query time "today", airport "Xinzheng Airport", query result "20,000 passengers") by combining "query", "Xinzheng Airport", and "today's passenger flow" in the first sentence information with "today", "passenger flow", and "20,000 passengers" in the second sentence information, which is not limited herein.

S304: an initial corpus tag is determined according to the dialogue description information.

After the dialogue description information is obtained, the corpora that may be related to it can be pre-classified with reference to the dialogue description information and based on a classification rule, so as to determine an initial corpus tag. That is, the initial corpus tag is obtained by a preliminary analysis of the dialogue description information, whereas the preset corpus tag is set in the intelligent dialogue system in advance as a buried-point tag (that is, a tag instrumented into the system ahead of time).

For example, after the dialogue description information is obtained (for example: airport passenger flow query, query time "today", airport "Xinzheng Airport", query result "20,000 passengers"), it may be determined that the initial corpus tag involves a dialogue type tag, a dialogue scene tag, a dialogue time tag, and a dialogue keyword tag. That is, the tags related to the dialogue description information can be determined and used as the initial corpus tag.
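The determination of initial corpus tags in S304, and the comparison with preset corpus tags that follows, can be sketched as below. The field names, the `FIELD_TO_TAG` classification rule, and both function names are assumptions for illustration; the disclosure does not fix a classification rule.

```python
from typing import Dict, Set

# Assumed classification rule: each non-empty field of the dialogue
# description information implies one initial corpus tag.
FIELD_TO_TAG: Dict[str, str] = {
    "type": "dialogue_type",
    "scene": "dialogue_scene",
    "time": "dialogue_time",
    "keywords": "dialogue_keyword",
}


def initial_corpus_tags(description: Dict[str, object]) -> Set[str]:
    """Derive the initial corpus tags from dialogue description information."""
    return {tag for name, tag in FIELD_TO_TAG.items() if description.get(name)}


def matches_preset(initial_tags: Set[str], preset_tags: Set[str]) -> bool:
    """Treat the tags as matched when at least one initial tag is preset."""
    return bool(initial_tags & preset_tags)


description = {
    "type": "query",
    "scene": "airport passenger flow",
    "time": "today",
    "keywords": ["Xinzheng Airport"],
    "result": "20,000 passengers",
}
tags = initial_corpus_tags(description)
```

Here a match is taken to mean a non-empty intersection of the two tag sets, which is only one possible reading of "matched".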

S305: if the initial corpus tag matches the preset corpus tag, it is determined that a mapping relationship exists between the dialogue description information and the preset corpus tag.

After the initial corpus tag is determined according to the dialogue description information, it can be compared with the preset corpus tag. If the initial corpus tag matches the preset corpus tag, it can be determined that a mapping relationship exists between the dialogue description information and the preset corpus tag, which indicates that the content of the dialogue description information very probably contains a target corpus matching the preset corpus tag. This helps to acquire the target corpus matching the preset corpus tag effectively, helps to improve the accuracy of the target corpus, and safeguards the corpus acquisition effect.

Optionally, in some embodiments, a similarity value between the dialog description information and the tag content of the preset corpus tag may be further determined, and if the similarity value is greater than a set threshold, it may be determined that a mapping relationship exists between the dialog description information and the preset corpus tag, so that it can be accurately and flexibly determined whether the content included in the dialog description information includes a target corpus matched with the preset corpus tag, and thus, it can effectively assist in obtaining the target corpus from the dialog description information.

The value for measuring the similarity between the dialog description information and the tag content of the preset corpus tag may be referred to as a similarity value.

For example, feature extraction may be performed on the dialogue description information and on the tag content of the preset corpus tag respectively, to obtain a feature vector corresponding to the dialogue description information and a feature vector corresponding to the tag content of the preset corpus tag. A similarity value between the dialogue description information and the tag content of the preset corpus tag may then be obtained by calculating the cosine between the two feature vectors, and the calculated similarity value may be compared with a set threshold. If the calculated similarity value is greater than the set threshold, it may be determined that a mapping relationship exists between the dialogue description information and the preset corpus tag.

In this embodiment, the set threshold may be set according to the real-time service requirements of the corpus acquisition task, or may be set adaptively, which is not limited herein.
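The cosine-based similarity check can be sketched as follows. Bag-of-words counts stand in for the feature extraction, and the threshold value 0.5 is an arbitrary assumption; the disclosure specifies neither the feature extractor nor the threshold.

```python
import math
from collections import Counter


def features(text: str) -> Counter:
    """Assumed feature extraction: bag-of-words term counts."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def has_mapping(description: str, tag_content: str, threshold: float = 0.5) -> bool:
    """A mapping relationship is assumed to exist when the similarity value
    is strictly greater than the set threshold."""
    return cosine_similarity(features(description), features(tag_content)) > threshold
```

With these assumptions, "airport passenger flow query" and the tag content "airport passenger flow" share three of four terms and exceed the threshold, while "weather in beijing" shares none and does not.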

S306: if a mapping relationship exists between the dialogue description information and the preset corpus tag, the target corpus corresponding to the preset corpus tag is extracted from the first sentence information and the second sentence information.

If a mapping relationship exists between the dialog description information and the preset corpus tag, it is indicated that the content included in the dialog description information includes a target corpus matched with the preset corpus tag with a high probability, and at this time, the target corpus corresponding to the preset corpus tag may be triggered to be extracted from the dialog description information, or the target corpus corresponding to the preset corpus tag may be extracted from the first sentence information and the second sentence information, which is not limited to this.

In this way, the timing for extracting the target corpus is determined according to whether the initial corpus tag actually matches the preset corpus tag, so the extracted target corpus fits the preset corpus tag more closely. Because the preset corpus tag is configured in advance as a buried-point tag, the required target corpus can be extracted while the online human-computer interaction system supporting artificial intelligence technology is in use, which ensures the efficiency and accuracy of corpus acquisition and ensures that the acquired target corpus fits the actual service scenario.

In this embodiment, by receiving a dialogue sentence, determining a response sentence corresponding to the dialogue sentence, processing the dialogue sentence and the response sentence respectively to obtain first sentence information corresponding to the dialogue sentence and second sentence information corresponding to the response sentence, and extracting a target corpus corresponding to a preset corpus tag from the first sentence information and the second sentence information, corpus acquisition efficiency and acquisition effect can be effectively improved, and the acquired corpus can effectively meet the personalized requirements of an actual service scenario. In addition, the timing for extracting the target corpus is determined according to whether the initial corpus tag actually matches the preset corpus tag, so the extracted target corpus fits the preset corpus tag more closely. Because the preset corpus tag is configured in advance as a buried-point tag, the required target corpus can be extracted while the online human-computer interaction system supporting artificial intelligence technology is in use, which ensures the efficiency and accuracy of corpus acquisition and ensures that the acquired target corpus fits the actual service scenario.

Fig. 4 is a flowchart illustrating a corpus acquiring method according to another embodiment of the present disclosure.

As shown in fig. 4, the corpus acquiring method includes:

S401: a dialogue sentence is received and a response sentence corresponding to the dialogue sentence is determined.

S402: the dialogue sentence and the response sentence are processed respectively to obtain first sentence information corresponding to the dialogue sentence and second sentence information corresponding to the response sentence.

S403: a target corpus corresponding to the preset corpus tag is extracted from the first sentence information and the second sentence information.

For the description of S401 to S403, reference may be made to the above embodiments, which are not described herein again.

S404: the target corpus corresponding to the preset corpus tag is stored in a big data processing platform, and the preset corpus tag is configured as an index item corresponding to the target corpus.

After the target corpus corresponding to the preset corpus tag is extracted from the first sentence information and the second sentence information, the target corpus corresponding to the preset corpus tag may be stored in the big data processing platform, and the preset corpus tag is configured as an index item corresponding to the target corpus.

The big data processing platform is a data processing platform that can meet various service requirements by handling tasks such as mass data storage, computation, and real-time computation over streaming data.

Optionally, to store the target corpus in this embodiment, the big data processing platform may be a big data platform built on the data warehouse tool Hive; of course, a self-built data storage platform may also be used to store the target corpus, which is not limited herein.

Hive is a data warehouse tool based on Hadoop that is used for data extraction, transformation, and loading, and provides a mechanism for storing, querying, and analyzing large-scale data stored in Hadoop. Hive can map structured data files to database tables, provide Structured Query Language query functionality, and convert query statements into MapReduce jobs for execution.

An index provides pointers to the corpora stored in specified columns of a table; an index item is equivalent to a list of pointers to a set of corpora in one or more columns of a table. Configuring the preset corpus tag as the corresponding index item helps to locate the target corpus matching the preset corpus tag quickly, so that the target corpus can be obtained quickly from the big data platform.

In this embodiment, the target corpus corresponding to the preset corpus tag is stored in the big data processing platform, and the preset corpus tag is configured as the index item corresponding to the target corpus. The stock of target corpora can thus be expanded effectively, so that a corpus acquisition user can obtain higher-quality corpora, and the corpus acquisition user can index and locate the matching target corpus based on the preset corpus tag, which improves both the quality and the convenience of corpus acquisition.
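The idea of storing target corpora and using the preset corpus tag as an index item can be sketched with an in-memory SQLite database from the Python standard library. SQLite is used here purely as a local stand-in for the Hive-based platform, and the table name, column names, and function names are hypothetical.

```python
import sqlite3
from typing import List, Tuple


def make_store() -> sqlite3.Connection:
    """Create an in-memory corpus table with an index on the tag column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE corpus (tag TEXT, dialogue TEXT, response TEXT, created_at TEXT)"
    )
    # The preset corpus tag acts as the index item for locating corpora.
    conn.execute("CREATE INDEX idx_corpus_tag ON corpus (tag)")
    return conn


def store_corpus(conn: sqlite3.Connection, tag: str, dialogue: str,
                 response: str, created_at: str) -> None:
    """Store one target corpus under its preset corpus tag."""
    conn.execute(
        "INSERT INTO corpus VALUES (?, ?, ?, ?)",
        (tag, dialogue, response, created_at),
    )


def lookup_by_tag(conn: sqlite3.Connection, tag: str) -> List[Tuple[str, str]]:
    """Locate the stored corpora for one tag, oldest first."""
    cursor = conn.execute(
        "SELECT dialogue, response FROM corpus WHERE tag = ? ORDER BY created_at",
        (tag,),
    )
    return cursor.fetchall()
```

The sketch only shows the lookup pattern; how the production platform physically organizes tag-based access is not described in the source.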

S405: a corpus acquiring interface is provided, the corpus acquiring interface including: at least one candidate corpus tag, a corpus acquisition time range, and a corpus quantity.

The corpus acquiring device may provide an operation interface, the operation interface may be referred to as a corpus acquiring interface, some corpus tags selectable by a user are displayed in the corpus acquiring interface, and the corpus tags selectable by the user may be referred to as candidate corpus tags.

As shown in fig. 5, fig. 5 is a corpus access interface diagram according to an embodiment of the disclosure.

The above-mentioned target corpus that will correspond with preset corpus label is saved to big data processing platform to after will presetting the corpus label configuration and the index item that the target corpus corresponds, can provide the corpus and acquire the interface, the corpus acquires the interface and includes: the method comprises the steps of selecting at least one candidate corpus tag, the corpus acquiring time range and the corpus quantity, then supporting a corpus acquiring user, and creating a corresponding corpus acquiring request in a corpus acquiring interface so as to realize a corresponding corpus acquiring task.

Optionally, in some embodiments, data communication between the corpus acquiring device and the artificial intelligence model algorithm platform may be configured in advance. The corpus acquiring device may provide the corpus acquisition interface to the corpus acquiring user and, when a corpus acquisition request input by the user is received through the interface, send the request to the artificial intelligence model algorithm platform to acquire the required target corpus.

S406: receiving a corpus configuration request based on the corpus acquisition interface, taking the candidate corpus tag selected by the corpus configuration request as a to-be-processed corpus tag, and determining the target time range and the target corpus number configured by the corpus configuration request.

After the corpus obtaining interface is provided, the user may select one or more candidate corpus tags from the corpus obtaining interface as the corpus tags to be processed, and determine the target time range and the target corpus quantity of the corpus configuration request.

The request by which the user selects candidate corpus tags from the corpus acquisition interface as the to-be-processed corpus tags may be referred to as a corpus configuration request.

The to-be-processed corpus tag is obtained by configuring according to actual corpus acquisition requirements of a user.

For example, if the user needs to extract the corpus corresponding to the dialog type tag at present, the user may set the dialog type tag as the to-be-processed corpus tag, and the corpus acquiring device may refer to the to-be-processed corpus tag selected by the user to match with a configured corpus tag (a preset corpus tag) in the big data processing platform, so as to trigger acquiring the corpus indexed by the matched preset corpus tag as the target corpus.

S407: generating a corpus acquisition request according to the to-be-processed corpus tag, the target time range and the target corpus quantity.

The target time range and the target corpus number are configured by the user according to actual requirements; the target time range expresses the time range covered by the corpora that the user needs to extract.

For example, if the corpus tag to be processed currently selected by the user is a dialog scene tag, the time range corresponding to the dialog scene corpus to be extracted is from t1 to t2, and the corpus number corresponding to the dialog scene corpus to be extracted is 30000, the corresponding corpus acquisition request may be generated according to the corpus extraction requirement of the user.

Optionally, in some embodiments, after receiving the corpus configuration request based on the corpus acquisition interface, the corpus acquisition request may be generated according to the to-be-processed corpus tag, the target time range, and the target corpus quantity.

For example, the corpus acquiring device may generate a corpus acquiring request, and then the corpus acquiring request is sent to the artificial intelligence model algorithm platform, and the artificial intelligence model algorithm platform analyzes the corpus acquiring request to obtain the corpus tag to be processed, the target time range, the target corpus number, and the like, and generates a corresponding corpus query sentence by referring to the corpus tag to be processed, the target time range, and the target corpus number, so as to obtain the target corpus by querying from the artificial intelligence model algorithm platform based on the corpus query sentence.
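The step of turning a corpus acquisition request into a corpus query sentence can be sketched as follows. This is a minimal illustration under stated assumptions: the table name `corpus`, the columns `text`, `tag` and `ts`, and the names `CorpusAcquisitionRequest` and `build_query` are hypothetical, and Python's built-in `sqlite3` module stands in for the query engine so that the sketch runs on its own. Earliest-first ordering is also an assumption.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusAcquisitionRequest:
    tag: str         # to-be-processed corpus tag
    start_time: int  # target time range, inclusive start (t1)
    end_time: int    # target time range, inclusive end (t2)
    quantity: int    # target corpus number


def build_query(req: CorpusAcquisitionRequest) -> tuple[str, tuple]:
    """Return a parameterized query statement and its parameters."""
    if req.start_time > req.end_time:
        raise ValueError("start_time must not be later than end_time")
    if req.quantity <= 0:
        raise ValueError("quantity must be positive")
    sql = (
        "SELECT text FROM corpus "
        "WHERE tag = ? AND ts BETWEEN ? AND ? "
        "ORDER BY ts LIMIT ?"
    )
    return sql, (req.tag, req.start_time, req.end_time, req.quantity)


# Stand-in table with a handful of rows (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE corpus (text TEXT, tag TEXT, ts INTEGER)")
conn.executemany(
    "INSERT INTO corpus VALUES (?, ?, ?)",
    [
        ("a", "dialog_scene", 90),
        ("b", "dialog_scene", 100),
        ("c", "dialog_scene", 150),
        ("d", "dialog_scene", 200),
        ("e", "dialog_type", 150),
    ],
)

sql, params = build_query(CorpusAcquisitionRequest("dialog_scene", 100, 200, 2))
rows = [row[0] for row in conn.execute(sql, params)]
```

Passing the tag, time range and quantity as bound parameters rather than formatting them into the statement keeps user input out of the query text.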

Thus, a corpus acquisition interface is provided that includes at least one candidate corpus tag, a corpus acquisition time range and a corpus number; a corpus configuration request is received based on the corpus acquisition interface; the candidate corpus tag selected by the corpus configuration request is used as the to-be-processed corpus tag; the target time range and the target corpus number configured by the corpus configuration request are determined; and the corpus acquisition request is generated according to the to-be-processed corpus tag, the target time range and the target corpus number. This makes corpus acquisition more convenient for the corpus acquiring user, allows the corpus acquiring method to be applied to application scenarios with different real-time requirements, improves the applicability of the method, expands its application scenarios, and improves the user's corpus acquisition experience.

S408: sending the corpus acquisition request to the artificial intelligence model algorithm platform.

S409: determining, by the artificial intelligence model algorithm platform, the preset corpus tag corresponding to the to-be-processed corpus tag.

After the corpus acquisition request is sent to the artificial intelligence model algorithm platform, the artificial intelligence model algorithm platform can determine the preset corpus tag corresponding to the corpus tag to be processed.

Optionally, in some embodiments, a similarity value between the to-be-processed corpus tag and the preset corpus tag may be determined, and if the similarity value between the to-be-processed corpus tag and the preset corpus tag is greater than a preset similarity threshold, it may be determined that a mapping relationship exists between the to-be-processed corpus tag and the preset corpus tag, and then the preset corpus tag having a mapping relationship with the to-be-processed corpus tag may be determined as the corresponding preset corpus tag.
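The threshold test described above can be sketched as follows. The disclosure does not specify how the similarity value is computed; this sketch assumes a character-level ratio from Python's standard `difflib.SequenceMatcher`, and the function name `match_preset_tag` and the default threshold of 0.8 are illustrative choices only.

```python
from difflib import SequenceMatcher
from typing import Optional


def match_preset_tag(
    pending_tag: str, preset_tags: list[str], threshold: float = 0.8
) -> Optional[str]:
    """Return the most similar preset tag whose similarity exceeds the threshold."""
    best_tag: Optional[str] = None
    best_score = threshold
    for preset in preset_tags:
        # ratio() lies in [0, 1]; 1.0 means the two strings are identical.
        score = SequenceMatcher(None, pending_tag.lower(), preset.lower()).ratio()
        if score > best_score:  # strictly greater, as in the text above
            best_tag, best_score = preset, score
    return best_tag


presets = ["dialog type tag", "dialog scene tag", "dialog keyword tag"]
exact = match_preset_tag("dialog scene tag", presets)
close = match_preset_tag("dialog scene", presets)
unrelated = match_preset_tag("weather", presets)
```

An identical tag and a near match both map to "dialog scene tag", while an unrelated string maps to no preset tag, so no mapping relation is established for it.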

S410: acquiring a plurality of target corpora indexed by the corresponding preset corpus tag, and selecting part of the target corpora from the plurality of target corpora according to the target time range and the target corpus number.

After the preset corpus tag corresponding to the corpus tag to be processed is determined, the target corpora obtained by indexing the corresponding preset corpus tag can be obtained, and part of the target corpus can be selected from the target corpora according to the target time range and the target corpus number.

For example, a plurality of target corpora pre-stored in the big data platform can be obtained through a preset corpus tag index, and then a part of the target corpora can be selected from the plurality of target corpora according to a target time range and the target corpus number.
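Selecting part of the target corpora by time range and quantity might look like the sketch below. The disclosure does not say which corpora are kept when more than the target number fall within the range; keeping the earliest ones is an assumption, as are the names `IndexedCorpus` and `select_partial`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedCorpus:
    text: str
    timestamp: int  # time associated with the corpus (assumed field)


def select_partial(
    corpora: list[IndexedCorpus], start_time: int, end_time: int, quantity: int
) -> list[IndexedCorpus]:
    """Keep corpora inside [start_time, end_time], earliest first, up to quantity."""
    in_range = [c for c in corpora if start_time <= c.timestamp <= end_time]
    in_range.sort(key=lambda c: c.timestamp)
    return in_range[:quantity]


indexed = [
    IndexedCorpus("late", 300),
    IndexedCorpus("second", 150),
    IndexedCorpus("first", 100),
    IndexedCorpus("third", 200),
]
selected = select_partial(indexed, start_time=100, end_time=200, quantity=2)
```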

S411: taking the part of the target corpora as the corpus acquired based on the corpus acquisition request.

After the plurality of target corpora indexed by the corresponding preset corpus tag are acquired and part of them is selected according to the target time range and the target corpus number, the selected part can be used as the corpus acquired based on the corpus acquisition request.

Optionally, in some embodiments, after part of the target corpora is obtained from the plurality of target corpora according to the corpus acquisition request, the obtained corpora may be written into a record file and a corresponding file download link may be generated; the file download link may then be written back into the corpus acquisition record. Fig. 6 is a schematic diagram of a corpus acquisition record interface according to an embodiment of the present disclosure. As shown in fig. 6, after a corpus acquisition task is completed, the user may download the record file corresponding to the corpus acquisition request from the corpus acquisition record interface.
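The write-back of a download link into the acquisition record can be sketched as below. The JSON Lines file format, the file naming scheme, the placeholder base URL `https://example.invalid/downloads/` and the shape of `task_record` are all assumptions for illustration; the disclosure does not specify any of them.

```python
import json
import tempfile
from pathlib import Path
from urllib.parse import quote


def write_record_file(corpora: list[str], task_id: str, out_dir: Path) -> Path:
    """Write one JSON object per line, one line per acquired corpus."""
    path = out_dir / f"corpus_{task_id}.jsonl"
    with path.open("w", encoding="utf-8") as fh:
        for text in corpora:
            fh.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
    return path


def make_download_link(
    path: Path, base_url: str = "https://example.invalid/downloads/"
) -> str:
    """Build a link from a placeholder base URL and the file name."""
    return base_url + quote(path.name)


# Hypothetical acquisition record that the link is written back into.
task_record = {"task_id": "task-001", "status": "running", "download_link": None}

with tempfile.TemporaryDirectory() as tmp:
    record_path = write_record_file(
        ["hello", "how can I help?"], task_record["task_id"], Path(tmp)
    )
    line_count = len(record_path.read_text(encoding="utf-8").splitlines())
    task_record.update(status="done", download_link=make_download_link(record_path))
```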

Fig. 7 is a schematic flow chart of a corpus acquiring method according to an embodiment of the present disclosure. As shown in fig. 7, the corpus acquiring method may consist of two tasks: a target corpus storage task and a corpus acquiring task. When the target corpus storage task is executed, a user provides a dialogue sentence through the client of an intelligent dialogue system; on receiving the dialogue sentence, the server of the intelligent dialogue system generates a corresponding answer sentence; the dialogue sentence and the answer sentence are then processed to obtain a target corpus, which is pushed to the big data platform for storage, completing the target corpus storage task.

When the corpus acquiring task is executed, a user inputs corpus extraction conditions through the corpus acquisition interface, and the artificial intelligence model algorithm platform converts the conditions into a corpus acquisition request. The platform creates a corpus acquiring task, records the user's query conditions, and accesses the big data platform through a query engine (Presto) to execute the corpus acquisition request. After the task is completed, the acquired corpora are written into a record file, and a download link is generated and written back into the task record so that the user can download the corpus file, completing the corpus acquiring task.

Thus, in this embodiment, a corpus acquisition request including a to-be-processed corpus tag, a target time range and a target corpus quantity is received; the preset corpus tag corresponding to the to-be-processed corpus tag is determined; a plurality of target corpora indexed by the corresponding preset corpus tag are obtained; part of the target corpora is selected from them according to the target time range and the target corpus quantity; and that part is used as the corpus acquired based on the corpus acquisition request. The part of the target corpora required by the user can therefore be accurately selected from the plurality of target corpora, which effectively improves the efficiency of corpus acquisition and makes the acquired target corpora fit the user's current actual service scenario more closely.

In this embodiment, a dialogue sentence is received and a response sentence corresponding to the dialogue sentence is determined; the dialogue sentence and the response sentence are processed respectively to obtain first sentence information corresponding to the dialogue sentence and second sentence information corresponding to the response sentence; and a target corpus corresponding to a preset corpus tag is extracted from the first sentence information and the second sentence information. The target corpus is stored in the big data processing platform and the preset corpus tag is configured as the index item corresponding to the target corpus, so that the stock of target corpora is effectively expanded, a corpus acquiring user can obtain higher-quality corpora, and the user can locate the matching target corpora by indexing on the preset corpus tag, which improves both the quality and the convenience of corpus acquisition. In addition, a corpus acquisition interface is provided, a corpus configuration request is received based on the interface, the candidate corpus tag selected by the corpus configuration request is used as the to-be-processed corpus tag, the target time range and the target corpus quantity configured by the corpus configuration request are determined, and the corpus acquisition request is generated according to the to-be-processed corpus tag, the target time range and the target corpus quantity.

Fig. 8 is a schematic structural diagram of a corpus acquiring device according to an embodiment of the present disclosure.

As shown in fig. 8, the corpus acquiring device 80 includes:

a first receiving module 801, configured to receive a dialogue statement and determine a response statement corresponding to the dialogue statement;

a first processing module 802, configured to process the dialogue statement and the response statement respectively to obtain first statement information corresponding to the dialogue statement and second statement information corresponding to the response statement;

the extracting module 803 is configured to extract a target corpus corresponding to a preset corpus tag from the first sentence information and the second sentence information.

In some embodiments of the present disclosure, as shown in fig. 9, the extracting module 803 is specifically configured to:

determining dialog description information according to the first statement information and the second statement information;

and if a mapping relation exists between the conversation description information and the preset corpus tag, extracting a target corpus corresponding to the preset corpus tag from the first sentence information and the second sentence information.

In some embodiments of the present disclosure, the extracting module 803 is specifically configured to:

after determining dialog description information according to the first statement information and the second statement information, determining an initial corpus tag according to the dialog description information;

and if the initial corpus tag is matched with the preset corpus tag, determining that a mapping relation exists between the conversation description information and the preset corpus tag.

In some embodiments of the present disclosure, the extracting module 803 is specifically configured to:

after determining dialog description information according to the first statement information and the second statement information, determining a similarity value between the dialog description information and tag contents of the preset corpus tag;

and if the similarity value is larger than a set threshold value, determining that a mapping relation exists between the conversation description information and the preset corpus tag.
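The behaviour of the extracting module can be illustrated with a toy sketch. How dialog description information is actually derived is not specified in this passage; the sketch assumes it is simply the set of lower-cased words of both statements, and that a mapping relation exists when a preset keyword tag appears in that set. The names `PRESET_TAGS`, `describe_dialog` and `extract_target_corpus` are hypothetical.

```python
from typing import Optional

# Hypothetical conversation keyword tags used as preset corpus tags.
PRESET_TAGS = {"refund", "delivery"}


def describe_dialog(first_info: str, second_info: str) -> set[str]:
    """Toy dialog description: the lower-cased words of both statements."""
    return set((first_info + " " + second_info).lower().split())


def extract_target_corpus(first_info: str, second_info: str) -> Optional[dict]:
    """Extract a target corpus only when a preset tag maps to the dialog."""
    description = describe_dialog(first_info, second_info)
    matched = sorted(PRESET_TAGS & description)
    if not matched:
        return None  # no mapping relation, so nothing is extracted
    return {"tags": matched, "dialogue": first_info, "response": second_info}


hit = extract_target_corpus("I want a refund", "Your refund is on the way")
miss = extract_target_corpus("Where is my order", "Let me check")
```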

In some embodiments of the present disclosure, the corpus acquiring device 80 further includes:

a storage module 804, configured to, after extracting the target corpus corresponding to the preset corpus tag from the first sentence information and the second sentence information, store the target corpus corresponding to the preset corpus tag to the big data processing platform, and configure the preset corpus tag as an index item corresponding to the target corpus.

In some embodiments of the present disclosure, the number of the preset corpus tags is multiple, wherein the corpus acquiring device 80 further includes:

a second receiving module 805, configured to receive a corpus obtaining request, where the corpus obtaining request includes: a to-be-processed corpus tag, a target time range and a target corpus quantity;

a determining module 806, configured to determine a preset corpus tag corresponding to the corpus tag to be processed;

an obtaining module 807, configured to obtain a plurality of target corpora obtained by indexing corresponding preset corpus tags, and select a part of the target corpora from the plurality of target corpora according to a target time range and a target corpus number;

the second processing module 808 is configured to use the part of the target corpora as the corpus acquired based on the corpus acquisition request.

In some embodiments of the present disclosure, the corpus acquiring device 80 further includes:

a providing module 809, configured to provide a corpus obtaining interface, where the corpus obtaining interface includes: at least one candidate corpus tag, a corpus acquisition time range and a corpus number;

a third receiving module 810, configured to receive a corpus configuration request based on the corpus obtaining interface, use the candidate corpus tag selected by the corpus configuration request as a to-be-processed corpus tag, and determine the target time range and the target corpus number configured by the corpus configuration request;

the generating module 811 is configured to generate a corpus acquiring request according to the to-be-processed corpus tag, the target time range, and the target corpus quantity.

In some embodiments of the present disclosure, the preset corpus tag comprises any one or a combination of:

a conversation type tag, a conversation scene tag, a conversation keyword tag.

The corpus acquiring device provided in the embodiments of the present disclosure corresponds to the corpus acquiring method provided in the embodiments of figs. 1 to 7, so the implementations of the corpus acquiring method also apply to the corpus acquiring device and are not described in detail again here.

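In order to implement the above embodiments, the present disclosure also provides an electronic device, including: at least one processor, and a memory communicatively coupled to the at least one processor, where the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the corpus acquiring method proposed by the foregoing embodiments of the present disclosure.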

In order to achieve the above embodiments, the present disclosure also proposes a non-transitory computer-readable storage medium on which a computer program is stored, which when executed by a processor implements the corpus acquisition method as proposed by the foregoing embodiments of the present disclosure.

In order to implement the foregoing embodiments, the present disclosure further provides a computer program product; when instructions in the computer program product are executed by a processor, the corpus acquiring method according to the foregoing embodiments of the present disclosure is performed.

FIG. 10 illustrates a block diagram of an exemplary electronic device suitable for use in implementing embodiments of the present disclosure. The electronic device 12 shown in fig. 10 is only an example and should not bring any limitations to the function and scope of use of the disclosed embodiments.

As shown in FIG. 10, electronic device 12 is embodied in the form of a general purpose computing device. The components of electronic device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.

Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. These architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, enhanced ISA bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus, to name a few.

Electronic device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by electronic device 12 and includes both volatile and nonvolatile media, removable and non-removable media.

Memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. The electronic device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 10, and commonly referred to as a "hard drive").

Although not shown in FIG. 10, a disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a Compact disk Read Only Memory (CD-ROM), a Digital versatile disk Read Only Memory (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the disclosure.

A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally perform the functions and/or methodologies of the embodiments described in this disclosure.

Electronic device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with electronic device 12, and/or with any devices (e.g., network card, modem, etc.) that enable electronic device 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Also, the electronic device 12 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public Network such as the Internet) via the Network adapter 20. As shown, the network adapter 20 communicates with other modules of the electronic device 12 via the bus 18. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with electronic device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.

The processing unit 16 executes various functional applications and data processing, such as implementing the corpus acquisition method mentioned in the foregoing embodiments, by running a program stored in the system memory 28.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

It should be noted that, in the description of the present disclosure, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Further, in the description of the present disclosure, "a plurality" means two or more unless otherwise specified.

Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of the preferred embodiments of the present disclosure includes other implementations in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present disclosure.

It should be understood that portions of the present disclosure may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.

In addition, functional units in the embodiments of the present disclosure may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.

In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present disclosure. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present disclosure have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present disclosure, and that changes, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present disclosure.

Claims

1. A corpus acquiring method is characterized by comprising the following steps:

receiving a dialogue statement and determining a response statement corresponding to the dialogue statement;

processing the dialogue statement and the response statement respectively to obtain first statement information corresponding to the dialogue statement and second statement information corresponding to the response statement;

and extracting a target corpus corresponding to a preset corpus tag from the first statement information and the second statement information.

2. The method of claim 1, wherein the extracting the target corpus corresponding to the preset corpus tag from the first sentence information and the second sentence information comprises:

3. The method of claim 2, after said determining dialog description information from the first statement information and the second statement information, further comprising:

determining an initial corpus tag according to the conversation description information;

4. The method of claim 2, after said determining dialog description information from the first statement information and the second statement information, further comprising:

determining a similarity value between the conversation description information and the label content of the preset corpus label;

5. The method according to claim 1, further comprising, after said extracting target corpora corresponding to preset corpus tags from the first sentence information and the second sentence information:

and storing the target corpus corresponding to the preset corpus tag to a big data processing platform, and configuring the preset corpus tag as an index item corresponding to the target corpus.

6. The method as claimed in claim 5, wherein there are a plurality of preset corpus tags, the method further comprising:

receiving a corpus acquiring request, wherein the corpus acquiring request comprises: a to-be-processed corpus tag, a target time range and a target corpus quantity;

determining a preset corpus tag corresponding to the corpus tag to be processed;

obtaining a plurality of target corpora indexed by the corresponding preset corpus tag, and selecting part of the target corpora from the plurality of target corpora according to the target time range and the target corpus quantity;

and taking the part of the target corpus as the corpus acquired based on the corpus acquisition request.

7. The method of claim 6, further comprising:

providing a corpus acquiring interface, wherein the corpus acquiring interface comprises: at least one candidate corpus tag, a corpus acquisition time range and a corpus number;

receiving a corpus configuration request based on the corpus acquisition interface, taking a candidate corpus tag selected by the corpus configuration request as the to-be-processed corpus tag, and determining a target time range and a target corpus number configured by the corpus configuration request;

and generating the corpus acquiring request according to the to-be-processed corpus tag, the target time range and the target corpus quantity.

8. The method according to any one of claims 1-7, wherein the preset corpus tag comprises any one or a combination of more than one of:

a conversation type tag, a conversation scene tag, a conversation keyword tag.

9. A corpus acquiring apparatus, comprising:

the first receiving module is used for receiving a conversation statement and determining a response statement corresponding to the conversation statement;

the first processing module is used for respectively processing the conversation statement and the response statement to obtain first statement information corresponding to the conversation statement and second statement information corresponding to the response statement;

and the extraction module is used for extracting a target corpus corresponding to a preset corpus tag from the first statement information and the second statement information.

10. The apparatus of claim 9, wherein the extraction module is specifically configured to:

11. The apparatus of claim 10, wherein the extraction module is further configured to:

12. The apparatus of claim 10, wherein the extraction module is further configured to:

13. The apparatus of claim 9, further comprising:

and the storage module is used for storing the target corpus corresponding to the preset corpus tag to a big data processing platform after the target corpus corresponding to the preset corpus tag is extracted from the first sentence information and the second sentence information, and configuring the preset corpus tag as an index item corresponding to the target corpus.

14. The apparatus of claim 13, wherein there are a plurality of preset corpus tags, the apparatus further comprising:

a second receiving module, configured to receive a corpus acquiring request, where the corpus acquiring request includes: a to-be-processed corpus tag, a target time range and a target corpus quantity;

a determining module, configured to determine the preset corpus tag corresponding to the to-be-processed corpus tag;

an obtaining module, configured to obtain a plurality of target corpora indexed by the corresponding preset corpus tag, and select a part of the target corpora from the plurality of target corpora according to the target time range and the target corpus quantity;

and a second processing module, configured to take the selected part of the target corpora as the corpus acquired based on the corpus acquiring request.
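The selection step of claim 14 could be sketched roughly as below. The function name `select_corpora`, the `(timestamp, text)` tuple layout, and the tag strings are assumptions made for this example. The claims do not say which corpora are kept when more than the target quantity fall inside the time range, so preferring the most recent ones is likewise an assumption.

```python
from datetime import datetime


def select_corpora(indexed, tag, start, end, quantity):
    """Pick at most `quantity` corpora indexed by `tag` within [start, end].

    `indexed` maps a preset corpus tag to a list of (timestamp, text) pairs.
    Assumption: the most recent corpora are preferred when there are too many.
    """
    in_range = [(ts, text) for ts, text in indexed.get(tag, []) if start <= ts <= end]
    in_range.sort(key=lambda item: item[0], reverse=True)
    return [text for _, text in in_range[:quantity]]


indexed = {
    "scene:after_sales": [
        (datetime(2021, 6, 1), "a"),
        (datetime(2021, 7, 1), "b"),
        (datetime(2021, 7, 5), "c"),
        (datetime(2021, 8, 1), "d"),
    ]
}

# Only "b" and "c" fall inside the target time range; the newer one is kept.
picked = select_corpora(
    indexed, "scene:after_sales", datetime(2021, 6, 15), datetime(2021, 7, 31), 1
)
```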

15. The apparatus of claim 14, further comprising:

a providing module, configured to provide a corpus acquisition interface, where the corpus acquisition interface includes: at least one candidate corpus tag, a corpus acquisition time range and a corpus quantity;

a third receiving module, configured to receive a corpus configuration request submitted through the corpus acquisition interface, use the candidate corpus tag selected in the corpus configuration request as the to-be-processed corpus tag, and determine the target time range and the target corpus quantity specified in the corpus configuration request;

and a generating module, configured to generate the corpus acquiring request according to the to-be-processed corpus tag, the target time range and the target corpus quantity.
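One possible reading of how claim 15 turns a configuration submitted through the interface into a corpus acquiring request is sketched here. The `CorpusAcquiringRequest` type, the `build_request` function, and the configuration keys (`selected_tag`, `time_range`, `quantity`) are hypothetical; the validation checks are added for the sketch and are not recited in the claims.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CorpusAcquiringRequest:
    """Hypothetical request carrying the three fields named in the claims."""
    pending_tag: str
    start: datetime
    end: datetime
    quantity: int


def build_request(config, candidate_tags):
    """Build a corpus acquiring request from a corpus configuration request."""
    tag = config["selected_tag"]
    # The selected tag must be one of the candidates offered by the interface.
    if tag not in candidate_tags:
        raise ValueError(f"unknown candidate corpus tag: {tag!r}")
    start, end = config["time_range"]
    if start > end:
        raise ValueError("target time range is empty")
    if config["quantity"] <= 0:
        raise ValueError("target corpus quantity must be positive")
    return CorpusAcquiringRequest(tag, start, end, config["quantity"])


candidate_tags = {"scene:after_sales", "type:question"}
request = build_request(
    {
        "selected_tag": "type:question",
        "time_range": (datetime(2021, 7, 1), datetime(2021, 7, 8)),
        "quantity": 100,
    },
    candidate_tags,
)
```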

16. The apparatus according to any one of claims 9-15, wherein the preset corpus tag comprises any one or a combination of:

a conversation type tag, a conversation scene tag, a conversation keyword tag.

17. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.

18. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-8.