CN110209804B

CN110209804B - Target corpus determining method and device, storage medium and electronic device

Info

Publication number: CN110209804B
Application number: CN201810361798.0A
Authority: CN
Inventors: 周辉阳
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2018-04-20
Filing date: 2018-04-20
Publication date: 2023-11-21
Anticipated expiration: 2038-04-20
Also published as: CN110209804A

Abstract

The invention discloses a method and a device for determining a target corpus, a storage medium and an electronic device. The method comprises the following steps: acquiring query corpus received in a time period, wherein the query corpus comprises query information and an access uniform resource locator (URL) accessed in response to the query information; acquiring a first query corpus from the query corpus, wherein the first query corpus contains target keywords corresponding to a target field, the access URLs included in the first query corpus contain at least one of the target URLs, and the target URLs are URLs corresponding to the target field; and determining a target corpus in the first query corpus, wherein the target corpus is corpus which cannot be read by an existing template in the target field. The invention solves the technical problem in the related art that corpus which cannot be read by the existing templates is determined with low accuracy.

Description

Target corpus determining method and device, storage medium and electronic device

Technical Field

The present invention relates to the field of computers, and in particular, to a method and apparatus for determining a target corpus, a storage medium, and an electronic apparatus.

Background

When determining corpus, the corpus that is not supported by the existing templates used to read the meaning of corpus in a specific field is usually determined according to keywords. Specifically, manually specified keywords belonging to the specific field are first used to roughly recall all corpus that may belong to that field; the corpus that truly belongs to the field is then screened out from the roughly recalled corpus, so that the corpus which cannot be read by the templates in the field can be determined.

In the related art, the corpus that truly belongs to a specific field is determined only from the corpus roughly recalled by keywords. Because only this single factor is considered in the determination, the accuracy of the determined corpus that cannot be read by the existing templates is low.

In view of the above problems, no effective solution has been proposed at present.

Disclosure of Invention

The embodiment of the invention provides a method and a device for determining target corpus, a storage medium and an electronic device, which are used for at least solving the technical problem that the accuracy of the corpus which is determined by the related technology and cannot be read by the existing template is low.

According to an aspect of the embodiment of the present invention, there is provided a method for determining a target corpus, including: acquiring query corpus received in a time period, wherein the query corpus comprises query information and an access uniform resource locator (URL) accessed in response to the query information; acquiring a first query corpus from the query corpus, wherein the first query corpus contains target keywords corresponding to a target field, the access URLs included in the first query corpus contain at least one of the target URLs, and the target URLs are URLs corresponding to the target field; and determining a target corpus in the first query corpus, wherein the target corpus is corpus which cannot be read by the existing templates in the target field.

According to another aspect of the embodiment of the present invention, there is also provided a device for determining a target corpus, including: a first obtaining unit, configured to obtain query corpus received in a time period, where the query corpus includes query information and an access uniform resource locator (URL) accessed in response to the query information; a second obtaining unit, configured to obtain a first query corpus from the query corpus, where the first query corpus contains target keywords corresponding to the target field, the access URLs included in the first query corpus contain at least one of the target URLs, and the target URLs are URLs corresponding to the target field; and a determining unit, configured to determine a target corpus in the first query corpus, where the target corpus is corpus which cannot be read by the existing templates in the target field.

According to a further aspect of embodiments of the present invention, there is also provided a storage medium having stored therein a computer program, wherein the computer program is arranged to perform the above method when run.

According to still another aspect of the embodiments of the present invention, there is also provided an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the above method by the computer program.

In the embodiment of the application, the first query corpus is obtained from the query corpus according to the target URL and the target keyword, wherein the first query corpus contains the target keyword corresponding to the target domain, the access URL included in the first query corpus contains at least one of the target URLs, and the target URL is the URL corresponding to the target domain. The target corpus which cannot be read by the existing template in the target domain is then determined in the first query corpus. In this way, the first query corpus (namely, the corpus truly belonging to the domain) from which the target corpus is determined can be obtained according to the target URL in combination with the target keyword. This improves the accuracy of the target corpus determined in the first query corpus and solves the technical problem that the corpus which cannot be read by the existing template is determined with low accuracy in the related art.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:

FIG. 1 is a schematic diagram of an application environment of a method for determining a target corpus according to an embodiment of the present invention;

FIG. 2 is a flow chart of an alternative method of determining a target corpus according to an embodiment of the invention;

FIG. 3 is a schematic diagram of an application environment in which an alternative voice speaker recognizes speech according to an embodiment of the present invention;

FIG. 4 is a schematic illustration of an application environment of another alternative method for determining a target corpus according to an embodiment of the invention;

FIG. 5 is a schematic diagram of an alternative acquisition domain URL according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of another alternative method of determining a target corpus according to an embodiment of the invention;

FIG. 7 is a schematic diagram of the acquisition of an alternative target keyword according to an embodiment of the present invention;

FIG. 8 is a schematic structural diagram of an alternative target corpus determining apparatus according to an embodiment of the present invention;

FIG. 9 is a schematic structural diagram of an alternative electronic device according to an embodiment of the present invention.

Detailed Description

In order that those skilled in the art may better understand the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

To make the following embodiments easier to understand, several terms are explained below.

Entity: the basic unit representing a concept.

Template: a generic sentence pattern with expanded samples.

Model: a semantic classifier, that is, a classifier built with deep learning to predict the intent of corpus belonging to a certain field.

Trie: also called a word search tree or key tree; a tree structure that is a variant of the hash tree.

Aho-Corasick automaton: also called the AC automaton; it is obtained by applying the Knuth-Morris-Pratt algorithm (KMP algorithm for short) on a Trie, and can perform multi-pattern string matching.
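To make the last two definitions concrete, the following is a minimal Python sketch of an Aho-Corasick automaton: a Trie over the patterns plus KMP-style failure links. It is an illustration only, not the implementation used in the embodiments; the function names `build_automaton` and `find_all` are hypothetical.

```python
from collections import deque


def build_automaton(patterns):
    """Build a Trie with failure links over the given patterns."""
    goto = [{}]    # goto[state][char] -> next state (the Trie edges)
    fail = [0]     # fail[state] -> state of the longest proper suffix in the Trie
    output = [[]]  # output[state] -> patterns recognised when reaching this state

    # 1. Insert every pattern into the Trie.
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(pattern)

    # 2. Compute failure links breadth-first (the KMP idea applied to the Trie).
    queue = deque(goto[0].values())  # depth-1 states keep fail = 0
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit the patterns recognised at the fallback state.
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def find_all(text, automaton):
    """Return (start_index, pattern) for every occurrence of any pattern in text."""
    goto, fail, output = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in output[state]:
            hits.append((i - len(pattern) + 1, pattern))
    return hits


if __name__ == "__main__":
    automaton = build_automaton(["exchange rate", "rate", "dollar"])
    print(find_all("dollar exchange rate query", automaton))
```

The text is scanned once regardless of how many patterns there are, which is what makes the structure suitable for matching many keywords or entity names against a large query corpus.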

According to one aspect of the embodiment of the invention, a method for determining a target corpus is provided. Optionally, the method for determining the target corpus may be applied to, but is not limited to, the application environment shown in FIG. 1. As shown in FIG. 1, users A and B both want to obtain yesterday's weather through the voice assistant on the terminal 102. User A inputs "the weather of the past day" by voice, and user B inputs "yesterday's weather" by voice. After recognizing the speech as text, the voice assistant can only match a template for the text of user B, so it reads the question of user B according to that template and gives an answer. For the text of user A, the voice assistant cannot match a corresponding template and cannot know what user A means, so no corresponding answer can be given. After template matching, the voice assistant sends the corresponding text to the server 106, and the text of user B is sent to the server 106 as well. The server 106 then obtains the query corpus within a predetermined time, where the query corpus includes query information (e.g., "yesterday's weather", "last day's weather query") and the uniform resource locator (URL) accessed in response to the query information. It should be noted that the query information includes, but is not limited to, the above examples; for example, it may also include items not shown in FIG. 1, such as "weather of the past day" and "weather of the yesterday". The server 106 may obtain the first query corpus from the query corpus based on the target keywords (e.g., weather) and the target URL (e.g., XX weather office website) of the weather domain. Further, the server 106 determines, from the first query corpus, a target corpus which cannot be read by the existing templates in the weather field, and generates a template corresponding to the target corpus through the Aho-Corasick automaton, so that the next time the user inputs the corresponding voice, the voice speaker can recognize it and give an answer.

In the embodiment of the invention, the first query corpus is obtained from the query corpus according to the target URL and the target keyword, wherein the first query corpus contains the target keyword corresponding to the weather field, the access URL included in the first query corpus contains at least one of the target URLs, and the target URL is the URL corresponding to the weather field. The target corpus which cannot be read by the existing templates in the weather field is then determined in the first query corpus. In this way, the first query corpus (namely, the corpus truly belonging to the weather field) from which the target corpus is determined can be obtained according to the target URL in combination with the target keyword, rather than from manually specified keywords alone. This improves the accuracy of the target corpus determined in the first query corpus, and thus solves the technical problem that the corpus which cannot be read by the existing templates in the weather field is determined with low accuracy in the related art.

Optionally, in this embodiment, the above terminal may include, but is not limited to, at least one of: a mobile phone, a tablet computer, etc. The network may include, but is not limited to, a wireless network, wherein the wireless network includes Bluetooth, Wi-Fi, and other networks that enable wireless communication. The server may include, but is not limited to, at least one of: a PC and other devices used for computing services. The above is merely an example, and the present embodiment is not limited thereto.

Optionally, as an implementation of this embodiment, as shown in FIG. 2, the method for determining the target corpus may include:

S202, acquiring query corpus received in a time period, wherein the query corpus comprises query information and an access uniform resource locator (URL) accessed in response to the query information;

S204, acquiring a first query corpus from the query corpus, wherein the first query corpus contains target keywords corresponding to a target field, the access URLs contained in the first query corpus contain at least one of the target URLs, and the target URLs are URLs corresponding to the target field;

S206, determining a target corpus in the first query corpus, wherein the target corpus is corpus which cannot be read by an existing template in the target field.
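The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `QueryRecord` type, the substring-based keyword and URL matching, and the `can_read` predicate (standing in for the existing templates) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QueryRecord:
    query: str           # the query information
    accessed_urls: list  # the access URLs accessed in response to the query


def first_query_corpus(records, target_keywords, target_urls):
    """S204: keep records that contain a target keyword AND hit a target URL."""
    selected = []
    for rec in records:
        has_keyword = any(kw in rec.query for kw in target_keywords)
        has_url = any(t in url for url in rec.accessed_urls for t in target_urls)
        if has_keyword and has_url:
            selected.append(rec)
    return selected


def target_corpus(first_corpus, can_read):
    """S206: keep the queries that no existing template can read.

    `can_read` is a hypothetical predicate standing in for template matching.
    """
    return [rec.query for rec in first_corpus if not can_read(rec.query)]


if __name__ == "__main__":
    # S202: query corpus received in a time period (made-up sample data).
    records = [
        QueryRecord("yesterday's weather", ["weather.example/forecast"]),
        QueryRecord("weather of the past day", ["weather.example/history"]),
        QueryRecord("weather girl song", ["music.example/play"]),
    ]
    first = first_query_corpus(records, ["weather"], ["weather.example"])
    print(target_corpus(first, lambda q: q == "yesterday's weather"))
```

Requiring both conditions in S204 is what filters out queries such as the song title in the sample data, which contain the keyword but do not lead to a URL of the target field.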

It should be noted that the above embodiment may be applied to a voice assistant or a voice speaker. When the method is applied to a voice speaker, the voice speaker receives a voice input from the user (such as "play the next song"), recognizes the corresponding text through speech recognition, and tries to match the text against the corresponding templates. If the match succeeds, the voice speaker looks up the locally stored answer and plays the next song for the user (as shown in FIG. 3); if the match fails, the voice speaker sends the text to the server. The server obtains the query corpus within a predetermined time, where the query corpus includes query information (e.g., "play next song") and the URLs accessed in response to the query information. It should be noted that the query information includes, but is not limited to, the above example; it may further include, for example, "play last song" and "pause play". The server then determines, from the first query corpus, a target corpus which cannot be read by the existing templates in the field, and generates a template corresponding to the target corpus through the Aho-Corasick automaton, so that the next time the user inputs the voice, the voice speaker can recognize it and play the next song (as shown in FIG. 4).

In the embodiment of the invention, the first query corpus is obtained from the query corpus according to the target URL and the target keyword, wherein the first query corpus contains the target keyword corresponding to the target domain, the access URL included in the first query corpus contains at least one of the target URLs, and the target URL is the URL corresponding to the target domain. The target corpus which cannot be read by the existing template in the target domain is then determined in the first query corpus. In this way, the first query corpus (namely, the corpus truly belonging to the domain) from which the target corpus is determined can be obtained according to the target URL in combination with the target keyword. This improves the accuracy of the target corpus determined in the first query corpus and solves the technical problem that the corpus which cannot be read by the existing template in the target domain is determined with low accuracy.

It should be noted that the first query corpus may be obtained in, but not limited to, the following way: querying the query corpus for access URLs that contain the server name or the Internet Protocol (IP) address of a target URL, and then obtaining the first query corpus according to the access URLs found and the target keywords.

It should be noted that the target URL may be obtained in, but not limited to, the following way: determining a predetermined corpus that has been requested to be read more times than a first predetermined threshold but still cannot be read by the existing templates, and then obtaining the target URL corresponding to the field to which the predetermined corpus belongs.

It should be noted that the target keywords may be obtained in, but not limited to, the following way: querying the query corpus for access URLs that contain the server name or the IP address of a target URL, obtaining the second query corpus corresponding to those access URLs, segmenting the query information included in the second query corpus into words, counting the segmentation results, and taking the words whose number of occurrences is greater than a second predetermined threshold as the target keywords.

Optionally, the target keywords determined above may be further checked to determine more precisely how well they match the target domain. The check may be performed in, but not limited to, the following way: obtaining the phrases that the search engine displays after the target keyword is entered and that contain the target keyword, deleting the target keyword from those phrases, and determining whether the remaining words still belong to the target field; if they do, the target keyword is determined to match the target field well.

It should be noted that the target corpus that cannot be read by the existing templates in the target domain may be obtained in, but not limited to, the following way: obtaining the attributes of the target field (for example, in the air ticket field the attributes may include the boarding gate, the boarding time and the like), determining whether the current corpus in the first query corpus includes such an attribute, and, if it does not, determining that the current corpus is a target corpus which cannot be read by the existing templates.

It should be noted that, after the target corpus which cannot be read by the existing templates in the target domain is determined, a target template for reading the target corpus may also be generated by the Aho-Corasick automaton, so that the target template can be used to read subsequently received query corpus, for example, the target corpus that could not be read by the previously existing templates in the target domain.

As an alternative embodiment, obtaining the first query corpus from the query corpus includes:

S1, querying a first access URL in the query corpus, wherein the first access URL comprises the server name or the IP address of a target URL;

S2, acquiring a second query corpus from the query corpus, wherein the access URLs accessed in response to the query information in the second query corpus comprise the first access URL;

S3, acquiring a first query corpus from the second query corpus, wherein the first query corpus contains target keywords corresponding to the target field.

For example, the target URLs may be websites commonly used in the target domain. Assuming that the target domain is the financial exchange rate domain, the target URLs may include, but are not limited to, the following websites: "www.forex.hexun.com..", "www.boc.cn/sourcedeb." and the like. The leading "www" and the useless trailing suffix of each website found can then be removed, and the remaining part is used as the reference for querying access URLs. For example, the references for querying access URLs in the exchange rate domain may include, but are not limited to, the following: "forex.hexun.com", "boc.cn/sourcedeb", "usd-cny.com", "zhijinwang.com/huilv", "cngold.org/fx/huanscan". Any access URL containing one of the above can serve as the first access URL. The second query corpus belonging to the exchange rate field can then be roughly recalled from the query corpus according to the first access URL.

According to the embodiment of the invention, access URLs containing the server name or the IP address of the target URL are obtained, and the corpus corresponding to those access URLs is obtained from the query corpus, so that a more comprehensive corpus of the field corresponding to the target URL can be obtained.
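Steps S1 to S3 can be sketched in Python as below. The sketch assumes each record is a `(query, accessed_urls)` tuple and that "containing" a reference means plain substring containment; the helper names `url_reference`, `rough_recall` and `filter_by_keywords` are hypothetical, and stripping only the scheme, a leading "www." and a trailing slash is a simplification of the suffix removal described above.

```python
def url_reference(target_url):
    """Strip the scheme and a leading "www." so only the distinctive part remains."""
    ref = target_url.strip()
    for scheme in ("http://", "https://"):
        if ref.startswith(scheme):
            ref = ref[len(scheme):]
    if ref.startswith("www."):
        ref = ref[len("www."):]
    return ref.rstrip("/")


def rough_recall(records, target_urls):
    """S1-S2: second query corpus = records whose accessed URLs contain a reference."""
    refs = [url_reference(u) for u in target_urls]
    return [
        (query, urls)
        for query, urls in records
        if any(ref in url for url in urls for ref in refs)
    ]


def filter_by_keywords(second_corpus, target_keywords):
    """S3: first query corpus = recalled records that also contain a target keyword."""
    return [
        (query, urls)
        for query, urls in second_corpus
        if any(kw in query for kw in target_keywords)
    ]


if __name__ == "__main__":
    records = [  # made-up sample data
        ("dollar exchange rate", ["http://www.forex.hexun.com/rate"]),
        ("bank opening hours", ["https://www.boc.cn/sourcedeb/x"]),
        ("cat videos", ["https://video.example/cats"]),
    ]
    second = rough_recall(records, ["www.forex.hexun.com", "www.boc.cn/sourcedeb"])
    print(filter_by_keywords(second, ["exchange rate"]))
```

The URL step alone recalls the "bank opening hours" record in this sample, which is why the keyword filter of S3 is applied afterwards.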

As an alternative embodiment, before querying the first access URL in the query corpus, the method further includes:

S1, determining the target field to which a received predetermined corpus belongs, wherein the number of times the predetermined corpus has been requested to be read is greater than a first predetermined threshold and the predetermined corpus cannot be read by an existing template;

S2, obtaining the target URL corresponding to the target field.

For example, the first predetermined threshold may be set to 2. Suppose users A, B and C have each entered "exchange rate conversion" into the voice assistant, but no existing template can read it. After receiving "exchange rate conversion" three times, the voice assistant may segment the phrase into words and determine, according to the word "exchange rate" in the phrase, that "exchange rate conversion" belongs to the exchange rate domain (i.e., the target domain described above). Then, as shown in FIG. 5, "exchange rate" may be entered into a search engine, and the common websites for viewing exchange rates (i.e., the target URLs described above) are found from the search results, as completely as possible, for example "www.forex.hexun.com..", "www.boc.cn/sourcedeb." and so on.

According to the embodiment of the invention, the target field is determined from corpus that has been requested to be read more times than the first predetermined threshold and cannot be read by the existing templates, rather than from corpus received only once. It can therefore be established more reliably that the reading failed because no existing template can read the corpus, and not for some other reason.
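A minimal Python sketch of the threshold check in S1 follows. It assumes the input is simply a list of query texts that the existing templates failed to read, one entry per request; the function name `predetermined_corpora` is hypothetical.

```python
from collections import Counter


def predetermined_corpora(failed_queries, first_threshold=2):
    """Return the queries whose number of unreadable requests exceeds the threshold."""
    counts = Counter(failed_queries)
    return [query for query, n in counts.items() if n > first_threshold]


if __name__ == "__main__":
    # Users A, B and C each entered the same phrase; another phrase failed once.
    failed = ["exchange rate conversion"] * 3 + ["turn it up"]
    print(predetermined_corpora(failed, first_threshold=2))
```

With the threshold set to 2 as in the example above, a phrase that failed three times is kept, while a phrase that failed once (possibly because of a recognition error rather than a missing template) is ignored.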

As an alternative embodiment, before obtaining the first query corpus from the query corpus, the method further includes:

S1, acquiring a second query corpus from the query corpus, wherein the access URL accessed in response to the query information in the second query corpus comprises the server name or the IP address of a target URL;

S2, segmenting the query information included in the second query corpus into words to obtain target words;

S3, acquiring the target keywords from the target words, wherein the number of occurrences of each target keyword in the second query corpus is greater than a second predetermined threshold.

For example, as shown in fig. 5, the exchange rate field is again taken as an example. After the second query corpus is roughly recalled, the query information in the second query corpus is segmented into words, and word frequency statistics are performed on all segmentation results, that is, the number of occurrences of each segmented word in the second query corpus is counted. The 100 words with the highest number of occurrences can be selected as keywords; for example, if "exchange rate" occurs more than 100 times, "exchange rate" can be used as a target keyword. It should be noted that the word segmentation may use a pre-developed C++ word segmentation tool, which provides calling interfaces for other languages (such as Python and Java), segments words accurately, and supplies the related part-of-speech tags; the general-purpose jieba word segmentation tool may also be used.

In the related art, keywords are selected manually. Because the selection is limited by personal knowledge and ability, no one can fully master the keywords of a field, so the manually selected keywords are incomplete and a lot of useful corpus is missed; the process also requires manual participation and therefore a lot of manpower and time. According to the embodiment of the invention, the second query corpus is first roughly retrieved using the URL, and the words whose number of occurrences in the second query corpus is greater than the second predetermined threshold are then counted and used as the target keywords. The target keywords are thus obtained by combining domain-specific URLs with word frequency statistics, so that the determined target keywords are more comprehensive, no manual participation is needed in the process, and a large amount of manpower and time is saved.
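The word frequency statistics can be sketched in Python as follows. The sketch assumes whitespace tokenisation purely so that it runs without third-party packages; for Chinese queries a segmenter such as jieba's `lcut` could be passed as `segment` instead. The function name `candidate_keywords` and its parameter names are hypothetical.

```python
from collections import Counter


def candidate_keywords(queries, segment=str.split, second_threshold=100, top_n=100):
    """Count segmented words over the second query corpus and keep frequent ones.

    Returns at most `top_n` words, most frequent first, each occurring more
    than `second_threshold` times.
    """
    counts = Counter()
    for query in queries:
        counts.update(segment(query))
    return [word for word, n in counts.most_common(top_n) if n > second_threshold]


if __name__ == "__main__":
    # Made-up second query corpus; small thresholds for demonstration only.
    queries = ["dollar rate"] * 3 + ["euro rate"] * 2 + ["rate query"]
    print(candidate_keywords(queries, second_threshold=2))
```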

As an alternative embodiment, obtaining the target keyword in the target word includes:

S1, acquiring first keywords from the target words, wherein the number of occurrences of each first keyword in the second query corpus is greater than the second predetermined threshold;

S2, acquiring the hot phrases corresponding to each first keyword, wherein the hot phrases comprise the phrases that are displayed after the first keyword is entered in a search engine and that contain the first keyword;

S3, acquiring a target keyword from the first keyword, wherein a word obtained after deleting the target keyword from a hot phrase corresponding to the target keyword belongs to a target field, the target field is a target field to which a predetermined corpus belongs, and the number of times that the predetermined corpus is requested to be read is greater than a first predetermined threshold and cannot be read by an existing template.

For example, to determine whether a keyword actually belongs to the target domain, the determined keywords may be checked. The checking process is described below taking "exchange rate", the keyword with the highest number of occurrences, as an example. It should be noted that, in this embodiment, an interface of a search engine is used to determine the domain of a keyword and to filter out candidates that do not belong to the target domain. The interface is the website http://m.baidu.com/from=8625&ie=utf-8&action=opensearch&wd=query term. When a query term is entered at the interface, hot phrases related to the query term are returned. For example, entering "exchange rate" may return: "exchange rate conversion", "exchange rate dollar", "dollar exchange rate", "exchange rate query", "Hong Kong dollar exchange rate", "euro exchange rate", "Japanese yen exchange rate", "British pound exchange rate", "US dollar exchange rate", "Thai baht exchange rate", and the like. After the phrases are obtained, the entered "exchange rate" is removed from them, leaving "conversion", "dollar", "query", "Hong Kong dollar", "euro", "Japanese yen", "British pound", "US dollar" and "Thai baht". The remaining words include currency words such as dollar, euro and Japanese yen, which shows that "exchange rate" is indeed a target keyword of the exchange rate field in finance. Similarly, the same rule check is performed on all 100 keywords, and the domain keywords that satisfy the rule are kept as the target keywords.

According to the embodiment of the invention, under the condition that the words obtained after the keywords are deleted from the hot phrases corresponding to the keywords are determined to belong to the target field, the keywords can be used as the target keywords, so that the determined target keywords more accurately belong to the target field.
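The check can be sketched in Python as below. The hot phrases are passed in as a plain list rather than fetched from a search engine, so the sketch runs offline. The acceptance rule (at least `min_hits` residual words found in a domain vocabulary) and the names `residual_words`, `is_domain_keyword` and `domain_words` are assumptions made for illustration; the embodiment only states that the remaining words must belong to the target field.

```python
def residual_words(keyword, hot_phrases):
    """Delete the keyword from each hot phrase and return the non-empty remainders."""
    residuals = []
    for phrase in hot_phrases:
        rest = " ".join(phrase.replace(keyword, " ").split())
        if rest:
            residuals.append(rest)
    return residuals


def is_domain_keyword(keyword, hot_phrases, domain_words, min_hits=1):
    """Accept the keyword if enough residual words are in the domain vocabulary.

    `domain_words` and `min_hits` are illustrative assumptions, not part of
    the described method.
    """
    residuals = residual_words(keyword, hot_phrases)
    hits = sum(1 for word in residuals if word in domain_words)
    return hits >= min_hits


if __name__ == "__main__":
    hot = [
        "exchange rate conversion",
        "dollar exchange rate",
        "euro exchange rate",
        "exchange rate query",
    ]
    currency_words = {"dollar", "euro", "Japanese yen"}
    print(residual_words("exchange rate", hot))
    print(is_domain_keyword("exchange rate", hot, currency_words, min_hits=2))
```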

As an alternative embodiment, determining the target corpus in the first query corpus includes:

S1, determining whether the current corpus in the first query corpus comprises information belonging to a target attribute, wherein the target attribute is an attribute configured in the target field, the target field is the field to which a predetermined corpus belongs, and the number of times the predetermined corpus has been requested to be read is greater than the first predetermined threshold and the predetermined corpus cannot be read by an existing template;

S2, in the case that the current corpus does not include information belonging to the target attribute, determining that the current corpus is a target corpus which cannot be read by the existing template.

For example, the exchange rate field may include, but is not limited to, the following target attributes: US dollar exchange rate, euro exchange rate, Japanese yen exchange rate, British pound exchange rate, Thai baht exchange rate. If a corpus includes information of a target attribute, it is determined that the corpus can be read by an existing template. For example, "what is the dollar exchange rate in 2018" can be considered to include information of a target attribute, that is, the existing template can read the corpus and the client can know its meaning. If a corpus does not include information of any target attribute, it is determined that the corpus cannot be read by the existing templates. For example, "what is the conversion of the exchange rate in 2018" may be considered as not including information of a target attribute, that is, the existing templates cannot read the corpus and the client cannot know its meaning.

According to the embodiment of the invention, because the real corpus (i.e., the first query corpus) may be very large, whether the current corpus is a target corpus which cannot be read by the existing template is determined by checking whether the current corpus includes information belonging to a target attribute. In this way, whether the real corpus falls under an attribute of the target field can be determined conveniently and quickly based on the existing model.
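A minimal Python sketch of steps S1 and S2 follows. It assumes that "including information belonging to a target attribute" can be approximated by substring containment of an attribute name; the embodiment may instead rely on the existing model, so this is an illustration only, and the function name `unread_corpora` is hypothetical.

```python
def unread_corpora(first_corpus, target_attributes):
    """Return the queries that mention none of the target attributes."""
    return [
        query
        for query in first_corpus
        if not any(attr in query for attr in target_attributes)
    ]


if __name__ == "__main__":
    attributes = ["dollar exchange rate", "euro exchange rate"]
    corpus = [
        "what is the dollar exchange rate in 2018",
        "what is the conversion of the exchange rate in 2018",
    ]
    print(unread_corpora(corpus, attributes))
```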

As an optional implementation manner, after determining, in the first query corpus, the target corpus that cannot be read by an existing template in the target domain, the method further includes: generating a target template for reading the target corpus, wherein the target template is used for reading query corpus received after a target time point, the target time point is later than the time period, and the received query corpus comprises the target corpus.

For example, after the foregoing "exchange rate conversion" is determined as the target corpus, a corresponding template is generated, so that when the "exchange rate conversion" is read later, the meaning of the corpus corresponding to the phrase can be correctly read, and an answer is given, for example, an exchange rate conversion value between countries is given.

According to the embodiment of the invention, after the target corpus which cannot be read by the existing template in the target field is determined, a target template for reading the target corpus is generated, so that the target corpus can be successfully read when it is subsequently received, and the situation that the target corpus cannot be read by the existing template is avoided.

As an alternative embodiment, generating the target template for reading the target corpus includes: inputting the target corpus into an Aho-Corasick automaton to generate a target template for reading the target corpus.

In the related art, after the corpora not supported by the model are selected, product and technical staff summarize them manually to obtain a number of general templates. Template generation in the related art is therefore not intelligent: if the number of unsupported corpora is large, generating templates for them is difficult to handle manually, and many useful templates may be missed. In the embodiment of the invention, the target corpus is input into the Aho-Corasick automaton to generate the target template for reading the target corpus, so that manual participation can be avoided, template generation is more intelligent, and a large amount of manpower and material resources are saved.

The embodiment of the invention selects the keywords of a domain by combining domain-exclusive URLs with word frequency statistics, and then verifies the keywords to determine the keywords belonging to the domain. By combining the keywords with the domain-specific URLs in turn, the queries (the first query corpus) belonging to the domain can be accurately found, and these queries can be used for later in-depth work such as template mining. Finally, the unsupported corpus and related templates in the domain can be mined.

In order to facilitate understanding of the foregoing embodiments, this embodiment systematically introduces a method for mining domain-unsupported templates based on keyword mining. A flowchart of the method is shown in fig. 6, and the whole process includes the following parts: selecting a domain URL (namely a target URL), mining domain keywords (namely target keywords), selecting the domain real corpus, mining templates, and manual checking.

Step S601, obtaining a mass query (equivalent to the query corpus);

Step S602, obtaining a domain URL;

This step is the starting point of the whole process and is critical to it. For example, for the financial exchange rate field, the common official websites for viewing exchange rates are found first. As shown in fig. 5, "exchange rate" may be input in a search engine, and the common websites for viewing exchange rates (i.e., the target URLs) are then found from the search results, as completely as possible. For example, the target URLs may include, but are not limited to, the following websites: "www.forex.hexun.com..", "www.boc.cn/sourcedeb.", and the like. The "www" prefix and the useless suffix of each found web address can then be removed, and the remaining part is used as a reference for the query access URL. For example, the reference for a query access URL in the exchange rate domain may be selected from, but is not limited to, the following: "forex.hexun.com", "boc.cn/sourcedeb", "usd-cny.com", "zhijinwang.com/huilv", "cngold.org/fx/huanscan".
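The trimming of a found web address into a reference can be sketched as below. This is a hedged illustration: the function name `url_reference` is hypothetical, and treating the scheme, the query string and a trailing slash as the "useless" parts is an assumption, since the text does not define exactly which suffix is removed.

```python
from urllib.parse import urlsplit


def url_reference(url: str) -> str:
    """Reduce a web address to host plus path, dropping the scheme, a leading
    'www.', the query string and any trailing slash (assumed 'useless' parts)."""
    if "://" not in url:
        # urlsplit only recognises the host when a scheme is present.
        url = "http://" + url
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    path = parts.path.rstrip("/")
    return host + path


print(url_reference("www.usd-cny.com"))
print(url_reference("http://www.zhijinwang.com/huilv/?from=usd&to=cny&num=100"))
```

Under these assumptions the two calls yield "usd-cny.com" and "zhijinwang.com/huilv", matching two of the references listed in the text.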

Step S603, mining domain keywords (namely target keywords);

Step S604, obtaining a domain query, which may also be referred to as domain real corpus selection (i.e., the first query corpus);

In this step, the two constraints of domain URL and domain keyword are combined to screen the corpus, and the corpus satisfying both conditions is considered to be the real corpus belonging to the domain. That is, the real corpus must contain the above target keywords, and its access URL must contain the content of a target URL, such as: http://zhijinwang.com/huilv/?from=usd&to=cny&num=100 includes "zhijinwang.
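The two-condition screening in this step can be sketched as follows. The record type `QueryRecord`, the function `select_domain_corpus` and the sample data are hypothetical; substring containment is assumed for both the URL check and the keyword check.

```python
from typing import Iterable, List, NamedTuple


class QueryRecord(NamedTuple):
    query: str        # query information entered by the user
    access_url: str   # URL accessed in response to the query


def select_domain_corpus(records: Iterable[QueryRecord],
                         url_refs: List[str],
                         keywords: List[str]) -> List[QueryRecord]:
    """Keep only records whose access URL contains a domain URL reference
    AND whose query contains at least one domain keyword."""
    selected = []
    for rec in records:
        url_hit = any(ref in rec.access_url for ref in url_refs)
        keyword_hit = any(kw in rec.query for kw in keywords)
        if url_hit and keyword_hit:
            selected.append(rec)
    return selected


records = [
    QueryRecord("dollar exchange rate",
                "http://zhijinwang.com/huilv/?from=usd&to=cny&num=100"),
    QueryRecord("exchange rate news", "http://example.com/news"),
    QueryRecord("weather today", "http://usd-cny.com/"),
]
domain = select_domain_corpus(records,
                              ["zhijinwang.com/huilv", "usd-cny.com"],
                              ["exchange rate"])
print([r.query for r in domain])
```

Only the first record satisfies both constraints; the second fails the URL condition and the third fails the keyword condition.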

Step S605, determining the corpus not supported in the domain;

In the previous step, the real corpus of the field is selected. In this step, the real corpus is run through the real online model to test which corpora the online service currently supports and which it does not, and the corpora that cannot be supported (the corpora that cannot be read by the existing template) are output separately. It should be noted that the model may be a semantic classifier, that is, a classifier that uses deep learning to predict the field and intent to which a corpus belongs.
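The separation of supported and unsupported corpora can be sketched as below. The predicate `is_supported` is a stand-in for the online semantic classifier, whose interface is not described in the text; the keyword-based lambda used in the usage lines is purely an assumption for illustration.

```python
from typing import Callable, Iterable, List, Tuple


def split_by_support(corpus: Iterable[str],
                     is_supported: Callable[[str], bool]
                     ) -> Tuple[List[str], List[str]]:
    """Run every query through the model predicate and return
    (supported, unsupported) lists; the second list is output separately."""
    supported, unsupported = [], []
    for query in corpus:
        if is_supported(query):
            supported.append(query)
        else:
            unsupported.append(query)
    return supported, unsupported


# Stand-in for the online model: it only 'understands' queries naming a currency.
toy_model = lambda q: any(c in q for c in ("dollar", "euro", "pound"))
ok, missed = split_by_support(
    ["dollar exchange rate", "exchange rate conversion"], toy_model)
print(missed)
```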

Step S606, template mining;

The unsupported corpus is input into an Aho-Corasick automaton to generate a template for reading the corpus. It should be noted that the AC automaton in effect implements the Knuth-Morris-Pratt algorithm (KMP algorithm for short) on a Trie, and can perform multi-pattern string matching; the Trie, also called a word search tree or key tree, is a tree structure and a variant of the hash tree.
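For readers unfamiliar with the AC automaton, the following is a minimal sketch of its multi-pattern matching core: a Trie of patterns plus KMP-style failure links. It illustrates the matching technique only; how the embodiment turns matches into templates is not specified in the text, and the class and method names here are hypothetical.

```python
from collections import deque


class AhoCorasick:
    def __init__(self, patterns):
        self.goto = [{}]   # goto[node][char] -> child node (the Trie)
        self.fail = [0]    # failure link of each node
        self.out = [[]]    # patterns that end at each node
        for pattern in patterns:
            self._insert(pattern)
        self._build_failure_links()

    def _insert(self, pattern):
        node = 0
        for ch in pattern:
            if ch not in self.goto[node]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[node][ch] = len(self.goto) - 1
            node = self.goto[node][ch]
        self.out[node].append(pattern)

    def _build_failure_links(self):
        # Breadth-first: depth-1 nodes keep the root as their failure link.
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Inherit patterns that end at the failure target.
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def search(self, text):
        """Return (start_index, pattern) for every occurrence in text."""
        node, hits = 0, []
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for pattern in self.out[node]:
                hits.append((i - len(pattern) + 1, pattern))
        return hits


ac = AhoCorasick(["exchange rate", "rate conversion", "dollar"])
print(ac.search("dollar exchange rate conversion"))
```

All three patterns are found in a single pass over the text, including the overlapping pair "exchange rate" and "rate conversion", which is the property that makes the automaton suitable for matching many fragments at once.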

Step S607, manually verifying the templates and adding them to the relevant field;

Templates newly generated for the unsupported corpus in the field need manual verification; after the manual verification, the new unsupported corpora and templates are added to the field.

Step S608, training a new model;

A new deep learning model is trained according to the added unsupported corpus and templates, improving the semantic recognition capability of the whole field.

It should be noted that, the target keywords may be obtained in a manner as shown in fig. 7, and in the embodiment of the present invention, the obtaining of the target keywords may include the following aspects:

Step S701, obtaining a massive query (which is equivalent to the query corpus), wherein the massive query obtained in step S601 may be reused, or a massive query may be obtained again;

Step S702, selecting a domain URL;

The domain URL may be the URL determined in step S602, or the domain URL may be obtained again in the manner described above.

Step S703, determining a domain query (corresponding to the first query corpus) according to the domain URL;

Step S704, word segmentation and word frequency statistics are carried out;

According to the URL, the rough corpus of the target field is recalled from the massive query (namely the query corpus received in the time period). A word segmentation tool is used to segment the rough corpus (namely the second query corpus), word frequency statistics are carried out on all word segmentation results, and the 100 words with the highest occurrence frequency are selected. (The word segmentation tool may be a C++ word segmentation tool that also has calling interfaces for other languages (such as Python and Java), segments words accurately, and provides related part-of-speech tags.)
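The word frequency statistics of this step can be sketched as below. The function `top_words` is hypothetical, and whitespace splitting is only a stand-in for the word segmentation tool mentioned in the text (Chinese queries would need a real segmenter); the cut-off of 100 follows the text.

```python
from collections import Counter
from typing import Callable, Iterable, List


def top_words(queries: Iterable[str],
              tokenize: Callable[[str], List[str]] = str.split,
              n: int = 100) -> List[str]:
    """Segment every query, count all resulting words, and return the n words
    with the highest occurrence count."""
    counts = Counter()
    for query in queries:
        counts.update(tokenize(query))
    return [word for word, _ in counts.most_common(n)]


rough_corpus = [
    "dollar exchange rate",
    "euro exchange rate",
    "exchange rate conversion",
]
candidates = top_words(rough_corpus, n=2)
print(candidates)
```

In this toy corpus "exchange" and "rate" each occur three times and every other word once, so they are the two candidates returned.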

Step S705, rule checking;

An interface of a search engine is used to judge whether each of the words with the highest occurrence counts belongs to the target field, and the words that do not belong to the target field are filtered out. It should be noted that the interface is the following web address:

http://m.baidu.com/from=8625&ie=utf-8&action=opensearch&wd=query term

After a query term is input into the interface, hot phrases related to the query term are returned. For example, for the input "exchange rate", the interface may return the following hot phrases: ["exchange rate conversion", "exchange rate dollars", "dollar exchange rate", "exchange rate inquiry", "harbor currency exchange rate", "euro exchange rate", "japanese yen exchange rate", "pound exchange rate", "U.S. exchange rate"]. After the input "exchange rate" is removed, what remains is ["conversion", "dollars", "inquiry", "harbor currency", "euro", "japanese yen", "pound", "U.S. exchange rate"]. If currency names such as dollars, euro and japanese yen are found in the remainder, this indicates that the word "exchange rate" is indeed a keyword in the financial exchange rate field. Rule verification is carried out in this way on all 100 keywords, and the domain keywords that conform to the rule are retained as target keywords.
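The rule check described above can be sketched as follows. The hot phrases are passed in as plain data rather than fetched from the search interface, so no network call is made; the function name `passes_rule`, the domain term set, and the criterion that a single matching remainder is enough are all assumptions made for illustration.

```python
from typing import Iterable, Set


def passes_rule(keyword: str,
                hot_phrases: Iterable[str],
                domain_terms: Set[str]) -> bool:
    """Delete the keyword from each hot phrase and accept the keyword when
    at least one non-empty remainder is a known term of the target field."""
    remainders = [phrase.replace(keyword, "").strip() for phrase in hot_phrases]
    return any(rem in domain_terms for rem in remainders if rem)


# Hypothetical field vocabulary (currency names for the exchange rate field).
CURRENCY_TERMS = {"dollars", "euro", "japanese yen", "pound"}

hot = ["exchange rate conversion", "exchange rate dollars",
       "euro exchange rate", "japanese yen exchange rate"]
print(passes_rule("exchange rate", hot, CURRENCY_TERMS))
print(passes_rule("weather", ["weather forecast"], CURRENCY_TERMS))
```

Under these assumptions "exchange rate" passes because its remainders include currency names, while "weather" is filtered out.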

Step S706, determining domain keywords (i.e., target keywords);

It should be noted that the embodiment of the invention adopts a method for mining templates of domain-unsupported corpus based on keywords and domain URLs. First, the keywords of the embodiment of the invention are not formulated manually; they are mined by an algorithm (target URLs combined with word frequency statistics). Second, the corpus belonging to the field is selected by URL filtering and keyword filtering, so that the ways of asking questions that belong to the field are accurately found, and after selection and filtering by the model, templates are generated automatically by the AC automaton. Manual work is needed only in the last step, where templates are selected manually from the generated templates and added to the field, after which a new deep learning model is trained. The whole process is automated except for this final manual check, which safeguards the online model effect, so efficiency is improved.

It should be noted that the invention is applied to unsupported-template mining in the various fields of a voice intelligent assistant. When a field is newly created, a large amount of corpus for the field can be collected, but because users' ways of asking questions vary endlessly, a limited set of sentences cannot contain all of them. Templates are therefore very important to the function of a field: a good template can cover the variable questions of the field, but a limited set of templates cannot cover all of a user's variable questions, so there are many questions that belong to the field but cannot be identified by the templates and the model. Such questions are very important for improving the capability of the model and the templates, so the corpus that belongs to the field but cannot be successfully identified is valued and collected for training the related templates and models in the background.

Through the embodiment of the invention, the following beneficial effects can be obtained: 1. unsupported templates of a new field are mined, which is very important for building the new field; 2. support for uncovered semantics of an old field, and its template mining, are improved; 3. mining unsupported corpus and templates markedly improves the classification performance of the whole online model; 4. the mining of domain keywords is also valuable for keyword mining in other domains.

It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present invention is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present invention. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present invention.

From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method of the various embodiments of the present invention.

According to another aspect of the embodiment of the present invention, there is further provided a target corpus determining apparatus for implementing the method for determining a target corpus, as shown in fig. 8, the apparatus includes:

(1) A first obtaining unit 802, configured to obtain a query corpus received in a time period, where the query corpus includes query information and an access resource locator URL accessed in response to the query information;

(2) A second obtaining unit 804, configured to obtain a first query corpus from the query corpus, where the first query corpus includes target keywords corresponding to a target domain, and an access URL included in the first query corpus includes at least one of target URLs, and the target URL is a URL corresponding to the target domain;

(3) The determining unit 806 is configured to determine a target corpus from the first query corpus, where the target corpus is a corpus that cannot be read by an existing template in the target domain.

It should be noted that the above embodiment may be applied to a voice assistant or a voice speaker. When the method is applied to a voice speaker, the speaker receives a voice input from a user (such as "play the next song"), recognizes the corresponding text through a voice recognition technology, and matches the text against the corresponding templates. If the match succeeds, the speaker looks up the locally stored answer and plays the next song to the user; if the match fails, the speaker sends the text to the server. The server obtains the query corpus within a predetermined time, where the query corpus includes query information (e.g., "play next song") and the URLs accessed in response to the query information. It should be noted that the query information includes, but is not limited to, the above example; for example, it may further include "play last song", "pause play", etc. Then, a target corpus which cannot be read by the existing templates in the field is determined from the first query corpus, and a template corresponding to the target corpus is generated through an Aho-Corasick automaton, so that the next time the user inputs the voice, the speaker can recognize it and play the next song.

As an alternative embodiment, the second obtaining unit 804 includes:

(1) The query module is used for querying a first access URL in the query corpus, wherein the first access URL comprises a server name or an IP address in a target URL;

(2) The first acquisition module is used for acquiring a second query corpus from the query corpus, wherein the access URL accessed in response to the query information in the second query corpus comprises a first access URL;

(3) The second obtaining module is used for obtaining a first query corpus from a second query corpus, wherein the first query corpus contains target keywords corresponding to the target field.

As an alternative embodiment, the second obtaining unit 804 includes:

(1) The first determining module is used for determining the target field to which the received predetermined corpus belongs, wherein the number of times that the predetermined corpus is requested to be read is larger than a first predetermined threshold value and cannot be read by an existing template;

(2) And the third acquisition module is used for acquiring the target URL corresponding to the target field.

As an alternative embodiment, the apparatus further comprises:

(1) A third obtaining unit, configured to obtain a second query corpus from the query corpus, where an access URL that is accessed in response to query information in the second query corpus includes a server name or an IP address in a target URL;

(2) The word segmentation unit is used for segmenting query information included in the second query corpus to obtain target words;

(3) And a fourth obtaining unit, configured to obtain a target keyword from the target word, where the number of occurrences of the target keyword in the second query corpus is greater than a second predetermined threshold.

For example, as shown in fig. 4, the exchange rate field is taken as an example. After the second query corpus is roughly recalled, word segmentation is performed on the query information in the second query corpus, and word frequency statistics are performed on all word segmentation results, that is, the number of occurrences of each word segmentation result in the second query corpus is counted. The 100 words with the highest occurrence counts can be selected as keywords; for example, if "exchange rate" occurs more than 100 times, "exchange rate" can be used as a target keyword. It should be noted that the word segmentation tool may be a pre-developed C++ word segmentation tool that has calling interfaces for other languages (such as Python and Java), segments words accurately, and provides related part-of-speech tags; the general-purpose jieba word segmentation tool may also be used.

As an alternative embodiment, the fourth acquisition unit includes:

(1) A fourth obtaining module, configured to obtain a first keyword from the target term, where the number of occurrences of the first keyword in the second query corpus is greater than a second predetermined threshold;

(2) A fifth obtaining module, configured to obtain a hot phrase corresponding to the first keyword, where the hot phrase includes a phrase that contains the keyword and is displayed after the first keyword is input in the search engine;

(3) A sixth obtaining module, configured to obtain a target keyword from the first keyword, where a word obtained after deleting the target keyword in a popular phrase corresponding to the target keyword belongs to a target domain, where the target domain is a target domain to which a predetermined corpus belongs, and the number of times that the predetermined corpus is requested to be read is greater than a first predetermined threshold and cannot be read by an existing template.

As an alternative embodiment, the determining unit 806 includes:

(1) The second determining module is used for determining whether the current corpus in the first query corpus comprises information belonging to target attributes, wherein the target attributes are configured in target fields, the target fields are target fields to which the predetermined corpus belongs, and the number of times that the predetermined corpus is requested to be read is larger than a first predetermined threshold value and cannot be read by an existing template;

(2) And the third determining module is used for determining that the current corpus is the target corpus which cannot be read by the existing template under the condition that the current corpus does not comprise the information belonging to the target attribute.

As an alternative embodiment, the apparatus further comprises:

the generating unit is used for generating a target template for reading target corpus, wherein the target template is used for reading received query corpus after a target time point, and the target time point is later than a time period, and the received query corpus comprises the target corpus.

As an alternative embodiment, the generating unit includes:

the generating module is used for inputting the target corpus into the Aho-Corasick automaton and generating a target template for reading the target corpus.

According to a further aspect of embodiments of the present invention there is also provided a storage medium having stored therein a computer program, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.

Alternatively, in the present embodiment, the above-described storage medium may be configured to store a computer program for performing the steps of:

s1, acquiring query corpus received in a time period, wherein the query corpus comprises query information and access resource locator (URL) accessed in response to the query information;

s2, acquiring a first query corpus from the query corpus, wherein the first query corpus contains target keywords corresponding to a target field, and access URLs contained in the first query corpus contain at least one of target URLs, and the target URLs are URLs corresponding to the target field;

s3, determining target corpus in the first query corpus, wherein the target corpus is corpus which cannot be read by an existing template in the target field.

Alternatively, in the present embodiment, the above-described storage medium may be configured to store a computer program for performing the steps of: obtaining the first query corpus from the query corpus comprises:

Alternatively, in the present embodiment, the above-described storage medium may be configured to store a computer program for performing the steps of: before querying the first access URL in the query corpus, further comprising:

s2, obtaining a target URL corresponding to the target field.

Alternatively, in the present embodiment, the above-described storage medium may be configured to store a computer program for performing the steps of: before the first query corpus is obtained from the query corpus, the method further comprises:

Alternatively, in the present embodiment, the above-described storage medium may be configured to store a computer program for performing the steps of: the obtaining the target keyword in the target word comprises the following steps:

s2, acquiring hot phrases corresponding to the first keywords, wherein the hot phrases comprise phrases which are displayed after the first keywords are input in a search engine and comprise keywords;

Alternatively, in the present embodiment, the above-described storage medium may be configured to store a computer program for performing the steps of: determining the target corpus in the first query corpus comprises:

Alternatively, in the present embodiment, the above-described storage medium may be configured to store a computer program for performing the steps of: after determining the target corpus which cannot be read by the existing templates in the target field in the first query corpus, the method further comprises the following steps:

s2, generating a target template for reading target corpus, wherein the target template is used for reading received query corpus after a target time point, and the target time point is later than a time period, and the received query corpus comprises the target corpus.

Alternatively, in the present embodiment, the above-described storage medium may be configured to store a computer program for performing the steps of: generating a target template for reading a target corpus includes:

S1, inputting the target corpus into an Aho-Corasick automaton to generate a target template for reading the target corpus.

Alternatively, in this embodiment, it will be understood by those skilled in the art that all or part of the steps in the methods of the above embodiments may be performed by a program instructing the hardware of a terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.

According to still another aspect of the embodiment of the present invention, there is further provided an electronic device for implementing the method for determining a target corpus, as shown in fig. 9, where the electronic device includes: the processor 902, the memory 904, and optionally, the apparatus further comprises: a display 906, a user interface 908, a transmission device 910, a sensor 912, and the like. The memory has stored therein a computer program, the processor being arranged to perform the steps of any of the method embodiments described above by means of the computer program.

Alternatively, in this embodiment, the electronic apparatus may be located in at least one network device of a plurality of network devices of the computer network.

Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program:

Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program: obtaining the first query corpus from the query corpus comprises:

Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program: before querying the first access URL in the query corpus, further comprising:

s2, obtaining a target URL corresponding to the target field.

Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program: before the first query corpus is obtained from the query corpus, the method further comprises:

Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program: the obtaining the target keyword in the target word comprises the following steps:

Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program: determining the target corpus in the first query corpus comprises:

Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program: after determining the target corpus which cannot be read by the existing templates in the target field in the first query corpus, the method further comprises the following steps:

Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program: generating a target template for reading a target corpus includes:

Alternatively, it will be understood by those skilled in the art that the structure shown in fig. 9 is only schematic, and the electronic device may also be a terminal device such as a smart phone (e.g. an Android phone, an iOS phone, etc.), a tablet computer, a palm computer, a mobile internet device (Mobile Internet Devices, MID), a PAD, etc. Fig. 9 does not limit the structure of the electronic device. For example, the electronic device may also include more or fewer components (e.g., network interfaces, etc.) than shown in FIG. 9, or have a different configuration than shown in FIG. 9.

The memory 904 may be configured to store software programs and modules, such as program instructions/modules corresponding to the method and apparatus for determining a target corpus in the embodiment of the present invention, and the processor 902 executes the software programs and modules stored in the memory 904 to perform various functional applications and data processing, that is, implement the method for determining a target corpus. The memory 904 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 904 may further include memory located remotely from the processor 902, which may be connected to the terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device 910 is used to receive or transmit data via a network. Specific examples of the network described above may include wired networks and wireless networks. In one example, the transmission device 910 includes a network adapter (Network Interface Controller, NIC) that may be connected to other network devices and routers via a network cable to communicate with the internet or a local area network. In one example, the transmission device 910 is a Radio Frequency (RF) module, which is used to communicate with the internet wirelessly.

The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.

The integrated units in the above embodiments may be stored in the above-described computer-readable storage medium if implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, comprising several instructions for causing one or more computer devices (which may be personal computers, servers or network devices, etc.) to perform all or part of the steps of the method described in the embodiments of the present application.

In the foregoing embodiments of the present application, each embodiment is described with its own emphasis; for a portion that is not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.

In several embodiments provided by the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely exemplary; for example, the division of the units is merely a logical function division and may be implemented in another manner: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some interfaces, units or modules, or may be in electrical or other forms.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The foregoing is merely a preferred embodiment of the present invention and it should be noted that modifications and adaptations to those skilled in the art may be made without departing from the principles of the present invention, which are intended to be comprehended within the scope of the present invention.

Claims

1. The method for determining the target corpus is characterized by comprising the following steps of:

acquiring query information received within a time period and an access resource locator URL accessed in response to the query information;

Generating a query corpus based on the query information and an access resource locator URL accessed in response to the query information;

querying a first access URL in the query corpus, wherein the first access URL comprises a server name or an Internet Protocol (IP) address in target URLs, the target URLs are URLs of a target field, the target field is the target field to which a predetermined corpus belongs, the number of times the predetermined corpus is requested to be read is greater than a first predetermined threshold value, and the predetermined corpus cannot be read by templates existing in the target field;

acquiring a second query corpus from the query corpus, wherein access URLs accessed in response to query information in the second query corpus comprise the first access URL;

performing word segmentation on the query information included in the second query corpus to obtain target words;

acquiring a first keyword from the target words, wherein the number of occurrences of the first keyword in the second query corpus is greater than a second predetermined threshold;

acquiring a hot phrase corresponding to the first keyword, wherein the hot phrase comprises a phrase that contains the first keyword and is displayed after the first keyword is input into a search engine;

acquiring a target keyword from the first keyword, wherein a word obtained after deleting the target keyword from the hot phrase corresponding to the target keyword belongs to the target field;

determining that the corpus including the target keyword in the second query corpus is a first query corpus, wherein the access URL included in the first query corpus contains at least one of the target URLs;

determining target corpus which cannot be read by the existing templates in the target field from the first query corpus;

and generating a target template for reading the target corpus.
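The selection steps of claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation. The function and parameter names are hypothetical, whitespace splitting stands in for a real word segmenter, `hot_phrases` stands in for the phrases a search engine would display, and `field_words` stands in for whatever test decides that a word belongs to the target field. The sketch also assumes a keyword qualifies as soon as one of its hot phrases passes that test.

```python
from collections import Counter
from urllib.parse import urlparse


def host_of(url):
    """Return the server name or IP address part of a URL (None if absent)."""
    return urlparse(url).hostname


def select_first_query_corpus(corpus, target_urls, hot_phrases, field_words, min_count):
    """corpus is a list of (query, [access URLs]) pairs; other inputs are assumed."""
    target_hosts = {host_of(u) for u in target_urls}
    target_hosts.discard(None)

    # Second query corpus: entries with an access URL on a target host.
    second = [(q, urls) for q, urls in corpus
              if any(host_of(u) in target_hosts for u in urls)]

    # Word segmentation (whitespace split as a stand-in) and frequency count.
    counts = Counter(w for q, _ in second for w in q.split())
    first_keywords = [w for w, n in counts.items() if n > min_count]

    # A first keyword becomes a target keyword when a hot phrase, with the
    # keyword deleted, leaves a word that belongs to the target field.
    target_keywords = set()
    for kw in first_keywords:
        for phrase in hot_phrases.get(kw, []):
            if phrase.replace(kw, "").strip() in field_words:
                target_keywords.add(kw)
                break

    # First query corpus: second-corpus entries containing a target keyword.
    return [(q, urls) for q, urls in second
            if any(w in target_keywords for w in q.split())]
```

Filtering by URL host before counting keywords means that frequent words are measured only over queries that actually led to the target field's sites, which is what distinguishes this procedure from keyword-only recall.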

2. The method of claim 1, wherein prior to querying the first access URL in the query corpus, the method further comprises:

determining the target field to which the received predetermined corpus belongs;

and acquiring the target URL corresponding to the target field.

3. The method of claim 1, wherein the determining, from the first query corpus, a target corpus that cannot be read by the existing templates in the target field comprises:

determining whether the current corpus in the first query corpus comprises information belonging to target attributes, wherein the target attributes are configured in the target field;

and, in the case that the current corpus does not comprise information belonging to the target attribute, determining that the current corpus is a target corpus which cannot be read by the existing template.
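The check in claim 3 can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: `field_attributes` is a hypothetical mapping from each target attribute configured in the target field to a set of known values, and "comprises information belonging to the target attribute" is approximated by a substring match on those values.

```python
def find_unreadable(first_query_corpus, field_attributes):
    """Return the queries that carry no value of any configured target attribute."""
    known_values = {v for values in field_attributes.values() for v in values}
    return [query for query in first_query_corpus
            if not any(value in query for value in known_values)]
```

For example, with hypothetical attributes `{"singer": {"jay chou"}, "song": {"nocturne"}}`, the query "play nocturne" contains a song value and is skipped, while "play something relaxing" contains no attribute value and is returned as target corpus.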

4. The method of claim 1, wherein the target template is used to read query corpus received after a target point in time, the target point in time being later than the time period, and the received query corpus comprising the target corpus.

5. The method of claim 1, wherein the generating a target template for reading the target corpus comprises:

inputting the target corpus into an Aho-Corasick automaton, and generating a target template for reading the target corpus.
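Claim 5 names the Aho-Corasick automaton but does not spell out how the template is derived from its output. The following minimal sketch therefore shows only the automaton itself: it matches many patterns against a text in a single pass. The function names are hypothetical, and how the matches are turned into a template is left open.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure, and output tables for a set of patterns."""
    goto, fail, out = [{}], [0], [[]]
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)

    # Breadth-first pass: depth-1 states keep failure link 0, deeper states
    # inherit the longest proper suffix that is also a prefix of a pattern.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (start_index, pattern) for every pattern occurrence in text."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            hits.append((i - len(pattern) + 1, pattern))
    return hits
```

Because the automaton follows failure links instead of restarting, the scan time grows with the length of the text plus the number of matches, not with the number of patterns, which suits a large dictionary of field words.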

6. A device for determining a target corpus, characterized by comprising:

a first acquisition unit, configured to acquire query information received within a time period and an access uniform resource locator (URL) accessed in response to the query information, and to generate a query corpus based on the query information and the access URL accessed in response to the query information;

a second acquisition unit, configured to query a first access URL in the query corpus, wherein the first access URL comprises a server name or an Internet Protocol (IP) address in target URLs, the target URLs are URLs of a target field, the target field is the field to which a predetermined corpus belongs, the number of times the predetermined corpus has been read is greater than a first predetermined threshold, and the predetermined corpus cannot be read by the existing templates in the target field; and to acquire a second query corpus from the query corpus, wherein access URLs accessed in response to query information in the second query corpus comprise the first access URL;

a word segmentation unit, configured to perform word segmentation on the query information included in the second query corpus to obtain target words;

a fourth acquisition unit, configured to acquire a first keyword from the target words, wherein the number of occurrences of the first keyword in the second query corpus is greater than a second predetermined threshold; to acquire a hot phrase corresponding to the first keyword, wherein the hot phrase comprises a phrase that contains the first keyword and is displayed after the first keyword is input into a search engine; and to acquire a target keyword from the first keyword, wherein a word obtained after deleting the target keyword from the hot phrase corresponding to the target keyword belongs to the target field;

the second acquisition unit is further configured to determine that the corpus including the target keyword in the second query corpus is a first query corpus, wherein the access URL included in the first query corpus contains at least one of the target URLs;

a determining unit, configured to determine, from the first query corpus, target corpus which cannot be read by the existing templates in the target field;

and a generating unit, configured to generate a target template for reading the target corpus.

7. The apparatus of claim 6, wherein the second acquisition unit comprises:

the first determining module is used for determining the target field to which the received predetermined corpus belongs;

and the third acquisition module is used for acquiring the target URL corresponding to the target field.

8. The apparatus of claim 6, wherein the target template is configured to read query corpus received after a target point in time, the target point in time being later than the time period, and the received query corpus comprising the target corpus.

9. The apparatus of claim 6, wherein the generating unit comprises:

the generation module is used for inputting the target corpus into an Aho-Corasick automaton and generating a target template for reading the target corpus.

10. A storage medium having a computer program stored therein, wherein the computer program is arranged to perform the method of any of claims 1 to 5 when run.

11. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method according to any of the claims 1 to 5 by means of the computer program.