CN109033142B - Data processing method and device and server - Google Patents

Info

Publication number
CN109033142B
Authority
CN
China
Prior art keywords
data
term set
query
query data
associated term
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810593240.5A
Other languages
Chinese (zh)
Other versions
CN109033142A (en
Inventor
程晓虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201810593240.5A
Publication of CN109033142A
Application granted
Publication of CN109033142B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data processing method, a data processing device and a server. The method comprises the following steps: acquiring query data; matching, from the associated term sets of index data, an associated term set contained in the query data, wherein an associated term set comprises a plurality of associated terms that appear in association; determining the similarity between the associated term set contained in the query data and the query data; sorting the service data corresponding to the index data of the associated term set according to the similarity; and taking the sorted service data as the query result of the query data. The technical scheme provided by the invention can improve the accuracy of the matched query result.

Description

Data processing method and device and server
Technical Field
The present invention relates to the field of internet communication technologies, and in particular, to a data processing method, an apparatus, and a server.
Background
With the development of the internet and artificial intelligence, intelligent service systems are widely used in daily life. Users of such a system often need to ask questions or acquire knowledge. To meet this need, a service system usually provides a database containing common service-related data together with corresponding index data, so that a user can find the required data by searching the index data.
At present, a business system often needs to match a suitable query result according to the similarity between query data and index data. Specifically, a similarity model can be trained on a large corpus from the business system, and during training the similarity between words is measured from the semantic association between words in the corpus: each word is characterized by features of the words semantically associated with it, and the similarity between two words is calculated from the distance between their features. This semantic-association-based measure is often suitable for similarity calculation in general domains but not for specific vertical domains. For example, "how to handle credit cards" and "how to handle debit cards" are worded almost identically but express different questions. Because "debit card" and "credit card" have similar semantic associations with other words, the features extracted for them are highly similar, so an existing measure based on semantic association has difficulty distinguishing the two, which makes matching a suitable query result much harder. Therefore, there is a need for a more reliable or efficient solution.
Disclosure of Invention
The invention provides a data processing method, a data processing device and a server, which can improve the accuracy of matched query results.
In a first aspect, the present invention provides a data processing method, including:
acquiring query data;
matching an associated term set contained in the query data from an associated term set of index data, wherein the associated term set comprises a plurality of associated terms appearing in association;
determining similarity between an associated term set contained in the query data and the query data;
sorting the service data corresponding to the index data of the associated term set according to the similarity;
and taking the sorted service data as a query result of the query data.
A second aspect provides a data processing apparatus, the apparatus comprising:
a query data acquisition module, configured to acquire query data;
an associated term set matching module, configured to match an associated term set contained in the query data from an associated term set of index data, wherein the associated term set comprises a plurality of associated terms appearing in association;
a similarity determining module, configured to determine a similarity between the associated term set contained in the query data and the query data;
a sorting module, configured to sort the service data corresponding to the index data of the associated term set according to the similarity;
and a query result determining module, configured to take the sorted service data as a query result of the query data.
A third aspect provides a data processing server comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by the processor to implement the data processing method according to the first aspect.
A fourth aspect provides a computer readable storage medium having stored therein at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by a processor to implement the data processing method according to the first aspect.
The data processing method, the data processing device and the server have the following technical effects:
The invention directly matches the query data against associated term sets, which accurately reflect the questions that frequently occur in the service system, and can thereby determine the index data in the service system that matches the user's query data. In addition, the similarity between the associated term set of the matched index data and the query data is calculated, and the service data in the query result are sorted by this similarity, that is, by how well the index data corresponding to each piece of service data matches the query data. Service data that better meets the query requirement can therefore be pushed preferentially, and the accuracy of the matched query result is greatly improved.
Drawings
In order to more clearly illustrate the technical solutions and advantages of the embodiments of the present invention or of the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an application environment provided by an embodiment of the invention;
FIG. 2 is a flow diagram illustrating one embodiment of index data generation provided by the present invention;
FIG. 3 is a flowchart illustrating an embodiment of extracting an associated term set from corpus information based on an association rule according to the present invention;
FIG. 4 is a flow diagram illustrating another embodiment of index data generation provided by the present invention;
FIG. 5 is a flow chart illustrating an embodiment of a data processing method provided by the present invention;
FIG. 6 is a flow chart illustrating a data processing method according to another embodiment of the present invention;
FIG. 7 is a block diagram of the hardware structure of a server for a data processing method according to an embodiment of the present invention;
FIG. 8 is a block diagram of a data processing apparatus according to an embodiment of the present invention;
FIG. 9 is a block diagram of another data processing apparatus according to an embodiment of the present invention;
FIG. 10 is a flowchart illustrating an embodiment of index data mining and matching query results based on the mined index data.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Referring to FIG. 1, which is a schematic diagram of an application environment according to an embodiment of the present invention, the application environment may include a service server 01, an index data generation component 02, and a client 03.
In this embodiment of the present specification, the service server 01 may be configured to: obtain query data; match an associated term set contained in the query data from the associated term sets of the index data; determine the similarity between the associated term set contained in the query data and the query data; sort the service data corresponding to the index data of the associated term set according to the similarity; and take the sorted service data as the query result of the query data.
In this embodiment of the present specification, the index data may include a title of a text, question data in a question-and-answer system, and the like, that is, data that serves as an index for looking up a certain piece of data. Accordingly, the service data may include result data corresponding to the index data, for example, answer data for a question commonly asked in the service system.
In this embodiment, the service server 01 may be an electronic device with computing and network interaction functions, or software that runs in such an electronic device to support data processing and network interaction.
In the present embodiment, the number of servers constituting the service server 01 is not specifically limited. The service server 01 may be one server, several servers, or a server cluster formed by several servers.
In this embodiment, the service server 01 may be a website platform or a service server of an intelligent device. The client 03 can communicate with the service server 01 directly through the network: the client 03 sends the query data to the service server 01, and the service server 01 can send the obtained service data directly to the client 03.
In this embodiment, the client 03 may be an electronic device having functions of voice processing, displaying, computing, and network access. Specifically, for example, the client 03 may be a smart speaker, a desktop computer, a tablet computer, a laptop computer, a smart phone, a digital assistant, a smart wearable device, a shopping guide terminal, or a television with a network access function. Alternatively, the client 03 may be software that can run on the electronic device.
The index data generating component 02 may be configured to generate index data corresponding to the associated term set. The index data generating component 02 may be located in the service server 01 or in another service server. In the latter case, the service server 01 may access the other service server through a network or the like to obtain the index data corresponding to the associated term set that the index data generating component 02 has generated there.
In an embodiment of this specification, an index database may be maintained, where the index database may store index data, an associated term set corresponding to the index data, and a mapping relationship between the index data and the associated term set. In addition, the index database can also store business data corresponding to the index data and the mapping relation between the index data and the business data. Specifically, the index database may be in the service server 01, or in another service server. When the index database is located in another service server, the service server 01 may access the another service server through a network or the like to obtain data in the index database.
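As an illustration only, the mappings held by such an index database might be sketched with in-memory dictionaries. The class name `IndexDatabase`, its fields, and the sample entries are hypothetical and are not taken from the specification; a real deployment would more likely use a persistent store.

```python
from dataclasses import dataclass, field


@dataclass
class IndexDatabase:
    """Hypothetical in-memory stand-in for the index database."""
    # index data (e.g. a question) -> its associated term set
    term_sets: dict = field(default_factory=dict)
    # index data -> service data (e.g. the answer to the question)
    service_data: dict = field(default_factory=dict)

    def add(self, index_data, terms, answer):
        """Register one piece of index data with its term set and service data."""
        self.term_sets[index_data] = frozenset(terms)
        self.service_data[index_data] = answer


db = IndexDatabase()
db.add("how to transact a credit card", {"transact", "credit card"},
       "Credit card answer (illustrative)")
db.add("how to transact a debit card", {"transact", "debit card"},
       "Debit card answer (illustrative)")
```

Keeping both mappings keyed by the index data lets a matched associated term set lead directly to the service data that should be returned.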
Referring to FIG. 2, which is a schematic flow chart of an embodiment of index data generation provided by the present invention, the method may specifically include:
S201: Extract an associated term set from the corpus information based on an association rule.
In practical applications, extracting an associated term set based on an association rule means mining possible associations or connections between different words from the corpus information. For example, in banking, the word "transact" is usually followed by a service type, such as "debit card" or "financial product". Thus, if several words of different types frequently appear together in a large amount of corpus information, these words can be considered to constitute an associated term set.
Specifically, FIG. 3 is a flowchart illustrating an embodiment of extracting an associated term set from corpus information based on an association rule provided by the present invention. As shown in FIG. 3, the method may include:
S301: Determine a frequent term set from the corpus information.
In this embodiment of the present specification, the corpus information may include a large amount of corpora corresponding to a certain service system. In addition, extracting the associated term set based on the association rule depends on word frequency. Therefore, before the associated term set is extracted, high-frequency meaningless words, that is, common words such as "ask" and "hello", may be removed. Correspondingly, the corpus information may be the result of filtering high-frequency meaningless words out of a large amount of corpora in a certain service system.
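The filtering step can be sketched as follows, assuming the corpus has already been segmented into tokens. The stop list `STOP_WORDS` and the function name `filter_tokens` are hypothetical; only "ask" and "hello" come from the text, and "please" is added purely for illustration.

```python
# Hypothetical stop list of high-frequency words that carry no business meaning.
STOP_WORDS = {"ask", "hello", "please"}


def filter_tokens(tokens, stop_words=STOP_WORDS):
    """Drop stop words while preserving the order of the remaining tokens."""
    return [t for t in tokens if t not in stop_words]


cleaned = filter_tokens(["hello", "ask", "transact", "credit card"])
```

Removing such words first keeps them from dominating the frequency counts used in the following steps.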
Specifically, the frequent term set in the embodiment of the present specification may include a set of terms whose numbers of occurrences are all greater than a preset threshold. For example, in banking, if the numbers of occurrences of "transact" and "credit card" are both greater than the preset threshold, then "transact" and "credit card" may constitute a frequent term set.
In addition, it should be noted that, in the embodiment of the present specification, the preset threshold may be set according to how often terms actually occur in the application.
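A minimal sketch of frequent term set mining is given below. It assumes, as one possible reading of the description, that a pair of terms is frequent when the two terms co-occur in more than `threshold` utterances (Apriori-style support counting). The toy corpus and the function name `frequent_pairs` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical corpus: each entry is one tokenized utterance after filtering.
corpus = [
    ["transact", "credit card"],
    ["transact", "credit card", "fee"],
    ["transact", "debit card"],
    ["deposit", "interest"],
    ["deposit", "interest", "rate"],
    ["transact", "credit card"],
]


def frequent_pairs(sentences, threshold):
    """Return term pairs that co-occur in more than `threshold` sentences."""
    counts = Counter()
    for tokens in sentences:
        # Count each unordered pair once per sentence.
        for pair in combinations(sorted(set(tokens)), 2):
            counts[pair] += 1
    return {frozenset(pair): n for pair, n in counts.items() if n > threshold}


pairs = frequent_pairs(corpus, threshold=1)
```

With this toy corpus and a threshold of 1, only the pairs ("transact", "credit card") and ("deposit", "interest") survive.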
S303: a confidence of the association rule between the frequent terms in the set of frequent terms is determined.
In this embodiment, the confidence of the association rule between the frequent terms in the frequent term set may include a probability that the frequent terms in the frequent term set except for one or more frequent terms appear when the one or more frequent terms appear in the frequent term set. In a specific embodiment, assuming that the frequent term set includes "transaction" and "credit card", wherein the number of times "transaction" occurs in the corpus information is 100, and the number of times "credit card" also occurs in the corpus information is 30 in the case of "transaction" occurring in the corpus information, the confidence of the association rule between the frequent terms in the frequent term set including "transaction" and "credit card" may be 30/100 ═ 0.3.
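The confidence calculation in this example can be reproduced with a short sketch. The function name `confidence` and the synthetic corpus are hypothetical; the corpus is constructed only to mirror the 100 and 30 occurrence counts of the example.

```python
def confidence(sentences, antecedent, consequent):
    """Estimate P(consequent appears | antecedent appears) over sentences."""
    with_antecedent = [s for s in sentences if antecedent in s]
    if not with_antecedent:
        return 0.0
    both = sum(1 for s in with_antecedent if consequent in s)
    return both / len(with_antecedent)


# 100 utterances contain "transact"; 30 of them also contain "credit card".
corpus = [["transact", "credit card"]] * 30 + [["transact"]] * 70
conf = confidence(corpus, "transact", "credit card")
```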
S305: Take the frequent term set corresponding to an association rule whose confidence is greater than a preset minimum confidence as an associated term set.
In this embodiment of the present specification, the preset minimum confidence may be set according to how strongly the associated terms in an associated term set are required to co-occur in the actual application. In general, the larger the preset minimum confidence, the higher the probability that the associated terms in a determined associated term set appear in association; conversely, the smaller the preset minimum confidence, the lower that probability.
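The selection in step S305 amounts to a threshold filter, which might be sketched as follows. The confidence values and the name `select_associated_term_sets` are hypothetical and chosen only to show the effect of raising or lowering the minimum confidence.

```python
# Hypothetical frequent term sets with rule confidences computed beforehand.
rule_confidence = {
    frozenset({"transact", "credit card"}): 0.3,
    frozenset({"deposit", "interest"}): 0.6,
    frozenset({"transact", "fee"}): 0.05,
}


def select_associated_term_sets(confidences, min_confidence):
    """Keep term sets whose rule confidence is strictly greater than the minimum."""
    return [terms for terms, conf in confidences.items() if conf > min_confidence]


associated = select_associated_term_sets(rule_confidence, min_confidence=0.2)
```

Raising `min_confidence` to 0.5 in this toy data would keep only the ("deposit", "interest") set.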
Extracting the associated term set based on the association rule allows the set to be determined quickly and accurately, and provides data support for subsequently mining index data frequently used by users.
In addition, it should be noted that the plurality of associated terms appearing in association may be terms that occur frequently (their numbers of occurrences are greater than the preset threshold) and whose association rule has a confidence greater than the preset minimum confidence. Correspondingly, an associated term set comprising such terms can represent query data that frequently appears in the business system to which the corpus information belongs.
S203: Perform dependency syntax analysis on the associated terms in the associated term set to obtain index data comprising the associated term set.
In practical applications, the associated terms extracted by the association rule are usually frequently occurring terms. Taking banking as an example, an associated term set including "deposit" and "interest" is easily extracted, whereas a low-frequency modifier such as "2 years" is difficult to extract. Accordingly, in the embodiments of the present specification, dependency syntax analysis may be performed on the associated terms in the associated term set to add such modifiers. In addition, the dependency syntax analysis may determine word order based on subject-predicate-object analysis, so as to obtain index data comprising the associated term set, where one associated term set may correspond to one or more pieces of index data. Word order matters because the same terms in a different order can express completely different semantics: for example, "exchange rate of Renminbi to USD" and "exchange rate of USD to Renminbi" have opposite meanings.
Specifically, the dependency syntax analysis in the embodiments of the present specification may be implemented by, but is not limited to, LTP (Language Technology Platform).
In addition, to ensure that the index data covers different word forms with the same semantics, in this embodiment of the present specification, same-semantic expansion may further be performed on the associated terms in the associated term set of the index data. FIG. 4 is a schematic flow diagram of another embodiment of index data generation provided by the present invention. As shown in FIG. 4, the method may include:
S401: Extract an associated term set from the corpus information based on an association rule.
S403: Perform same-semantic expansion on the associated terms in the associated term set to obtain a plurality of associated term sets having the same semantics as the associated term set.
In this embodiment of the present specification, the same-semantic expansion may include synonym and related-term expansion; for example, "transact" may be expanded into "how to do", "apply", and the like. In general, the synonyms and related terms of each associated term in an associated term set can be determined by calculating the similarity between word vectors obtained with Word2vec or the like. Specifically, a term whose word vector has a similarity to the word vector of the associated term greater than a preset similarity threshold may be taken as a synonym or related term of that associated term.
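The threshold-based expansion can be sketched with cosine similarity over word vectors. The three-dimensional vectors below are toy values invented for the illustration, not the output of a trained model, and the names `cosine` and `expand` are hypothetical; in practice the vectors would come from a model such as Word2vec.

```python
import math

# Toy 3-dimensional vectors; real vectors would come from a trained model.
vectors = {
    "transact": [0.9, 0.1, 0.0],
    "apply": [0.8, 0.2, 0.1],
    "deposit": [0.0, 0.9, 0.3],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def expand(term, vocab, threshold):
    """Return other vocabulary terms whose similarity to `term` exceeds the threshold."""
    return [w for w, v in vocab.items()
            if w != term and cosine(vocab[term], v) > threshold]


synonyms = expand("transact", vectors, threshold=0.9)
```

With these toy vectors, "apply" is returned as an expansion of "transact" while "deposit" is not.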
S405: Perform dependency syntax analysis on the associated terms in the plurality of associated term sets to obtain a plurality of pieces of index data comprising the associated term sets.
In addition, in the embodiment of the present specification, the obtained index data may be reviewed manually: index data with wrong semantics is removed and fed back to the corpus information so that index data mining can be optimized iteratively, which better ensures the accuracy of the index data.
In addition, it should be noted that the step sequence is only one of the execution sequences of the steps, and does not represent a unique execution sequence, for example, the dependency parsing process of step S405 may be performed before the semantic expansion process of step S403.
According to the technical scheme of the embodiment of the specification, the query data frequently appearing in the service system can be accurately obtained by extracting the associated term set from the corpus information of the service based on the association rule and then applying dependency syntax analysis, so that query data can subsequently be identified accurately and an accurate query result determined.
The following describes an embodiment of data processing based on the index data, corresponding to the associated term set, generated by the index data generation component. FIG. 5 is a flow chart of an embodiment of the data processing method provided by the present invention. The present specification provides the method operation steps as described in the embodiment or the flow chart, but more or fewer operation steps may be included based on conventional or non-inventive labor. The order of steps recited in the embodiments is merely one of many possible execution orders and does not represent the only one. In practice, a system or server product may execute the steps sequentially or in parallel (for example, in a parallel-processor or multi-threaded environment) according to the embodiments or the methods shown in the figures. Specifically, as shown in FIG. 5, the method may include:
S501: Obtain query data.
In this embodiment, the service server may obtain the user's query data from the client, or may obtain the query data from a database or another service system.
When the query data is information that the user inputs through the client, in the embodiment of the present specification, the user may enter the query data as text in an input page provided by the client, or by voice through a voice input interface provided by the client. Accordingly, the service server may receive the user's query data sent by the client.
In addition, in the embodiment of the present specification, the form in which the user inputs the query data at the client is not limited to the form of text and voice, and may also include the form of picture and the like. Correspondingly, the client can perform voice recognition, image recognition and other processing to obtain query data and send the query data to the service server. In addition, the client can also directly send query data in the form of voice, pictures and the like input by the user to the service server, and correspondingly, the service server can obtain the query data after processing such as voice recognition, image recognition and the like.
S503: Match an associated term set contained in the query data from the associated term sets of the index data, where an associated term set comprises a plurality of associated terms that appear in association.
In this embodiment, an associated term set contained in the query data is an associated term set each of whose terms appears in the query data.
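Under this definition, matching reduces to a subset test. The sketch below assumes the query has already been segmented into terms; the mapping `index_term_sets` and the function name `match_term_sets` are hypothetical.

```python
# Hypothetical mapping: index data -> associated term set.
index_term_sets = {
    "how to transact a credit card": {"transact", "credit card"},
    "how to transact a debit card": {"transact", "debit card"},
    "deposit interest": {"deposit", "interest"},
}


def match_term_sets(query_terms, index):
    """Return index data whose associated terms all appear in the query."""
    query = set(query_terms)
    return [name for name, terms in index.items() if terms <= query]


matched = match_term_sets(["how", "transact", "credit card", "online"], index_term_sets)
```

Because "debit card" does not appear in the query, the debit card entry is not matched even though it shares the term "transact".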
In this embodiment of the present specification, for how the associated term set comprising a plurality of associated terms appearing in association and the index data of the associated term set are determined, refer to the foregoing steps for determining the associated term set and the index data, which are not described herein again.
The plurality of associated terms in an associated term set can represent a frequently occurring question of the business system (namely, the business system to which the corpus information of the associated term set belongs). Matching the user's query data directly against the associated term sets greatly increases the probability of matching accurate index data, which in turn improves the accuracy of the matched business data.
S505: Determine the similarity between the associated term set contained in the query data and the query data.
In this embodiment of the present specification, a similarity between the associated term set included in the query data and the query data may be determined, so that business data conforming to the query data may be determined as a query result.
In this specification, the similarity of the associated term set included in the query data may represent the matching degree between the index data of the associated term set included in the query data and the query data of the user. When the similarity between the associated term set contained in the query data and the query data is higher, the matching degree between the index data of the associated term set contained in the query data and the query data of the user is higher; on the contrary, when the similarity between the associated term set contained in the query data and the query data is lower, the matching degree between the index data of the associated term set contained in the query data and the query data of the user is lower.
Specifically, in the embodiments of the present specification, the similarity may include at least one of: word weight, word coverage.
Correspondingly, the determining the similarity between the associated term set contained in the query data and the query data at least includes one of the following:
calculating the word weight of the associated terms in the associated term set contained in the query data, and taking the word weight as the similarity between the associated term set contained in the query data and the query data;
and calculating the word coverage rate of the associated terms in the associated term set contained in the query data, and taking the word coverage rate as the similarity between the associated term set contained in the query data and the query data.
Specifically, in this embodiment, the word weight may be a value that quantifies how important the associated terms of the associated term set are within the query data. The more important the associated terms are in the query data, the higher their word weight and, correspondingly, the higher the similarity between the associated term set and the query data; conversely, the less important the associated terms are, the lower their word weight and the lower the similarity. Specifically, the word weight in the embodiment of the present disclosure may be calculated by, but is not limited to, the statistical method TF-IDF (term frequency-inverse document frequency), LTP-based subject-predicate-object analysis, and the like.
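One way a TF-IDF-based word weight might be computed is sketched below. The specification does not fix a formula, so everything here is an assumption: the smoothed IDF variant, the default weight of 1.0 for terms unseen in the corpus, and the choice to express the weight as the share of the query's total TF-IDF mass carried by the associated terms. The names `idf_table` and `word_weight` are hypothetical.

```python
import math
from collections import Counter


def idf_table(documents):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(documents)
    df = Counter(term for doc in documents for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def word_weight(term_set, query_terms, idf):
    """Share of the query's total TF-IDF weight carried by the associated terms."""
    tf = Counter(query_terms)
    # Assumption: terms unseen in the corpus get a neutral IDF of 1.0.
    weights = {t: tf[t] * idf.get(t, 1.0) for t in tf}
    total = sum(weights.values())
    return sum(weights.get(t, 0.0) for t in term_set) / total if total else 0.0


corpus = [
    ["how", "transact", "credit card"],
    ["how", "transact", "debit card"],
    ["how", "deposit", "interest"],
]
idf = idf_table(corpus)
query = ["how", "transact", "credit card"]
score = word_weight({"transact", "credit card"}, query, idf)
```

In this toy corpus, "credit card" is rarer than "transact", which is rarer than "how", so it contributes the most weight to the score.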
Specifically, the word coverage rate may include a ratio of the number of words of the associated term in the associated term set included in the query data to the number of words in the query data. Correspondingly, the word coverage rate of the associated term set contained in the query data is in direct proportion to the matching degree between the index data of the associated term set contained in the query data and the query data of the user.
In addition, it should be noted that, based on the word weight and the word coverage, the sentence length of the query data is inversely proportional to the similarity of the associated term set included in the query data. The longer the sentence length of the query data is, the lower the similarity between the associated term set contained in the query data and the query data is; conversely, the shorter the sentence length of the query data is, the higher the similarity between the associated term set contained in the query data and the query data is.
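The word coverage rate and its dependence on sentence length can be illustrated with a short sketch. It assumes the query is already segmented and counts each query word once per occurrence; the function name `word_coverage` is hypothetical.

```python
def word_coverage(term_set, query_terms):
    """Ratio of query words that belong to the associated term set."""
    if not query_terms:
        return 0.0
    covered = sum(1 for t in query_terms if t in term_set)
    return covered / len(query_terms)


terms = {"transact", "credit card"}
short_query = word_coverage(terms, ["transact", "credit card"])
long_query = word_coverage(terms, ["how", "transact", "credit card", "online"])
```

The same associated term set covers the whole two-word query but only half of the four-word query, which matches the observation that longer queries yield lower similarity.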
S507: sorting the service data corresponding to the index data of the associated term set according to the similarity.
In this embodiment of the present specification, the index data of the associated term sets contained in the query data may generally correspond to a plurality of pieces of service data; in that case, the service data corresponding to the index data of the associated term sets may be sorted by the magnitude of the similarity.
In the embodiment of the present specification, the greater the similarity between an associated term set and the query data, the higher the service data corresponding to the index data of that associated term set is ranked; conversely, the smaller the similarity, the lower that service data is ranked.
S509: taking the sorted service data as the query result of the query data.
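The ranking rule above amounts to a descending sort on the similarity. A minimal sketch, assuming each candidate is a hypothetical (service data, similarity) pair:

```python
def rank_service_data(candidates):
    """Sort (service_data, similarity) pairs so that higher similarity comes first.

    Python's sort is stable, so candidates with equal similarity keep their
    original relative order.
    """
    ordered = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    return [service_data for service_data, _ in ordered]
```

For example, rank_service_data([("answer A", 0.2), ("answer B", 0.9)]) places "answer B" first.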
In the embodiment of the present specification, the service server may store both the service data and the mapping relationship between the index data and the service data. Of course, the mapping relationship between the index data and the service data may also be stored on another service server or other device that the service server can access.
In addition, in the embodiment of the present specification, the service data may also be sent to a client so that it is pushed to the corresponding user. Specifically, the service server may send the query result to the client, or may send the service data in the query result to the client in sequence.
In this embodiment of the present specification, the form of the service data pushed to the user may include, but is not limited to, text, voice, picture, and the like.
In the embodiments of this specification, the service data is sorted according to the importance, within the query data, of the corresponding associated term set, so that the service data that better meets the user's needs can subsequently be pushed to the user first, which greatly improves the user experience.
As can be seen from the technical solutions provided in the embodiments of this specification, the query data is matched directly against associated term sets that accurately reflect frequently occurring questions in the business system, so that the index data in the business system that matches the user's query data can be determined. In addition, by calculating the similarity between each matched associated term set and the query data and sorting the corresponding business data by that similarity, the business data in the query result are ordered by how well their corresponding index data matches the query data. Therefore, the service data that better meets the query requirement can subsequently be pushed to the user first, the accuracy of the matched query result is greatly improved, and the user experience is effectively improved.
FIG. 6 is a flow diagram of another embodiment of a data processing method provided by the present invention. This specification provides the method steps as described in the embodiments or flowcharts, but more or fewer steps may be included based on conventional or non-inventive effort. The order of steps recited in the embodiments is only one of many possible execution orders and does not represent the only order of execution. In practice, a system or server product may execute the steps sequentially or in parallel (e.g., in a parallel-processor or multi-threaded environment) according to the embodiments or the methods shown in the figures. Specifically, as shown in FIG. 6, the method may include:
S601: acquiring query data.
S603: matching an associated term set contained in the query data from the associated term sets of index data, wherein the associated term set comprises a plurality of associated terms appearing in association.
S605: determining the similarity between the associated term set contained in the query data and the query data.
S607: performing intention analysis processing on the query data to determine the intention of the query data.
S609: performing intention analysis processing on the index data of the associated term set contained in the query data to determine the intention of that index data.
In this embodiment of the present specification, the intention analysis of the query data and of the index data of the associated term set contained in the query data may use, but is not limited to, text classification models such as fastText and SVM (Support Vector Machine).
S611: adjusting a similarity of the respective set of associated terms to the query data based on a degree of match between the intent of the index data and the intent of the query data.
In the embodiment of the present specification, the matching degree between intentions can be measured by calculating the similarity between the word vectors of the intentions, for example with word2vec.
In the embodiments of the present specification, a matching degree between an intention of index data of an associated term set included in the query data and an intention of the query data is proportional to a similarity between the corresponding associated term set and the query data. When the matching degree between the intention of the index data of the associated term set contained in the query data and the intention of the query data is higher, the similarity between the associated term set and the query data is higher; conversely, when the degree of matching between the intention of the index data of the associated term set contained in the query data and the intention of the query data is lower, the similarity between the associated term set and the query data is lower.
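One possible way to realize the adjustment in S611 is sketched below. It assumes that each intention has already been turned into a numeric vector (for example by a word-vector model); the cosine measure and the multiplicative adjustment are illustrative assumptions, since the text only requires that the adjusted similarity rise and fall with the intention matching degree. All names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def adjust_similarity(similarity, query_intent_vec, index_intent_vec):
    """Scale the similarity by the intention matching degree (assumed multiplicative)."""
    # Negative cosine values are clipped so the adjusted score stays non-negative.
    match = max(0.0, cosine(query_intent_vec, index_intent_vec))
    return similarity * match
```

Under this sketch, identical intention vectors leave the similarity unchanged, and unrelated (orthogonal) intention vectors reduce it to zero.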
S613: sorting the service data corresponding to the index data of the associated term set according to the adjusted similarity.
S615: taking the sorted service data as the query result of the query data.
In the embodiment of the present specification, performing intention analysis processing on the matched index data and on the query data further clarifies the intentions of both, and the similarity between the query data and the associated term set corresponding to the index data is adjusted according to the matching degree between the two intentions, so that the determined query result better meets the user's needs.
The method provided by the embodiments of the present invention may be executed in a mobile terminal, a computer terminal, a server, or a similar computing device. Taking execution on a server as an example, FIG. 7 is a block diagram of the hardware structure of a server for the data processing method provided by an embodiment of the present invention. As shown in FIG. 7, the server 700 may vary considerably depending on configuration or performance, and may include one or more Central Processing Units (CPUs) 710 (the processor 710 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 730 for storing data, and one or more storage media 720 (e.g., one or more mass storage devices) storing applications 723 or data 722. The memory 730 and the storage medium 720 may be transient storage or persistent storage. The program stored in the storage medium 720 may include one or more modules, each of which may include a series of instruction operations for the server. Further, the central processor 710 may be configured to communicate with the storage medium 720 and to execute, on the server 700, the series of instruction operations in the storage medium 720. The server 700 may also include one or more power supplies 760, one or more wired or wireless network interfaces 750, one or more input/output interfaces 740, and/or one or more operating systems 721, such as Windows Server, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
The input/output interface 740 may be used to receive or transmit data via a network. A specific example of the network may include a wireless network provided by a communication provider of the server 700. In one example, the input/output interface 740 includes a network adapter (Network Interface Controller, NIC) that can connect to other network devices through a base station so as to communicate with the Internet. In another example, the input/output interface 740 may be a Radio Frequency (RF) module used to communicate with the Internet wirelessly.
It will be understood by those skilled in the art that the structure shown in fig. 7 is only an illustration and is not intended to limit the structure of the electronic device. For example, server 700 may also include more or fewer components than shown in FIG. 7, or have a different configuration than shown in FIG. 7.
An embodiment of the present invention further provides a data processing apparatus, as shown in fig. 8, the apparatus may include:
a query data acquisition module 810, which may be configured to acquire query data;
an associated term set matching module 820, which may be configured to match an associated term set contained in the query data from the associated term sets of index data, wherein the associated term set includes a plurality of associated terms appearing in association;
a similarity determination module 830, which may be configured to determine the similarity between the associated term set contained in the query data and the query data;
a sorting module 840, which may be configured to sort, according to the similarity, the service data corresponding to the index data of the associated term set;
a query result determining module 850, which may be configured to take the sorted service data as the query result of the query data.
In another embodiment, the similarity may include at least one of: word weight, word coverage.
Correspondingly, determining the similarity between the associated term set contained in the query data and the query data includes at least one of the following:
calculating the word weight of the associated terms in the associated term set contained in the query data, and taking the word weight as the similarity between the associated term set contained in the query data and the query data;
calculating the word coverage rate of the associated terms in the associated term set contained in the query data, and taking the word coverage rate as the similarity between the associated term set contained in the query data and the query data.
In another embodiment, the index data may be determined using the following modules:
an associated term set extraction module, which may be configured to extract an associated term set from the corpus information based on an association rule;
a dependency syntax analysis processing module, which may be configured to perform dependency syntax analysis processing on the associated terms in the associated term set to obtain index data including the associated term set.
In another embodiment, the index data may further be determined using the following module:
a same-semantic expansion processing module, which may be configured to perform same-semantic expansion processing on the associated terms in the associated term set to obtain a plurality of associated term sets having the same semantics as the associated term set;
correspondingly, the dependency syntax analysis processing module may be further configured to perform dependency syntax analysis processing on the associated terms in the plurality of associated term sets to obtain a plurality of index data including the associated term sets.
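The same-semantic expansion step can be pictured with the following minimal sketch. It assumes a hypothetical synonym dictionary mapping a term to words with the same meaning; how such a dictionary is obtained is not specified here, and the function name is illustrative.

```python
from itertools import product


def expand_same_semantics(term_set, synonyms):
    """Return every term set obtained by swapping terms for same-meaning variants.

    term_set: ordered sequence of associated terms.
    synonyms: hypothetical dict mapping a term to a list of equivalent words.
    """
    # Each position may keep the original term or use any listed variant.
    options = [[term] + list(synonyms.get(term, [])) for term in term_set]
    return [tuple(choice) for choice in product(*options)]
```

For the term set ("reset", "password") with one variant per term, this yields four term sets, the original included, each of which could then go through dependency syntax analysis to produce its own index data.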
In another embodiment, the associated term set extraction module may include:
a frequent term set determining unit, which may be configured to determine a frequent term set from the corpus information;
a confidence determining unit, which may be configured to determine the confidence of the association rules among the frequent terms in the frequent term set;
an associated term set determining unit, which may be configured to take, as the associated term set, the frequent term set corresponding to an association rule whose confidence is greater than a preset minimum confidence.
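The three units above can be sketched together as follows. This is a simplified illustration, not the mining procedure of this specification: the minimum support threshold, the cap on term set size, and the reading that a term set is kept when at least one of its rules exceeds the minimum confidence are assumptions, and all names are hypothetical. The confidence of a rule A -> B is computed in the standard way as support(A and B) / support(A).

```python
from itertools import combinations


def frequent_term_sets(docs, min_support, max_size=3):
    """Return {frozenset(terms): support} for term sets meeting min_support."""
    doc_sets = [set(doc) for doc in docs]
    n = len(doc_sets)
    if n == 0:
        return {}

    def support(term_set):
        return sum(1 for doc in doc_sets if term_set <= doc) / n

    vocab = set().union(*doc_sets)
    # Only frequent single terms can be part of a larger frequent term set.
    singles = sorted(t for t in vocab if support(frozenset([t])) >= min_support)
    result = {}
    for size in range(1, max_size + 1):
        for combo in combinations(singles, size):
            term_set = frozenset(combo)
            s = support(term_set)
            if s >= min_support:
                result[term_set] = s
    return result


def associated_term_sets(docs, min_support, min_confidence, max_size=3):
    """Keep frequent term sets having a rule with confidence above min_confidence."""
    frequent = frequent_term_sets(docs, min_support, max_size)
    selected = []
    for term_set, sup in frequent.items():
        if len(term_set) < 2:
            continue
        # Every subset of a frequent set is itself frequent, so the lookup succeeds.
        best = max(
            sup / frequent[frozenset(antecedent)]
            for k in range(1, len(term_set))
            for antecedent in combinations(sorted(term_set), k)
        )
        if best > min_confidence:
            selected.append(term_set)
    return selected
```

For a small corpus in which "reset" and "password" always occur together, the pair is returned as an associated term set, while rare co-occurrences are dropped at the support stage.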
In another embodiment, the apparatus may further include:
a first query result sending module, which may be configured to send the query result to the client;
or,
a second query result sending module, which may be configured to send the service data in the query result to the client in sequence.
The apparatus embodiment described above and the method embodiments are based on the same inventive concept.
An embodiment of the present invention further provides another data processing apparatus, as shown in fig. 9, the apparatus may include:
query data obtaining module 910, configured to obtain query data;
an associated term set matching module 920, configured to match an associated term set included in the query data from an associated term set of index data, where the associated term set includes multiple associated terms occurring in association;
a similarity determination module 930, configured to determine a similarity between the associated term set included in the query data and the query data;
a first intention analysis processing module 940, which may be configured to perform intention analysis processing on the query data to determine an intention of the query data;
a second intention analysis processing module 950, configured to perform intention analysis processing on the index data of the associated term set included in the query data, and determine an intention of the index data of the associated term set included in the query data;
a similarity adjustment module 960 for adjusting the similarity of the corresponding associated term set and the query data based on the degree of matching between the intent of the index data and the intent of the query data;
the matching degree between the index data intention of the associated term set contained in the query data and the intention of the query data is in direct proportion to the similarity between the associated term set and the query data;
a sorting module 970, which may be configured to sort, according to the adjusted similarity, the service data corresponding to the index data of the associated term set;
a query result determining module 980, which may be configured to take the sorted service data as the query result of the query data.
The apparatus embodiment described above and the method embodiments are based on the same inventive concept.
An embodiment of mining index data by the service system and matching a query result based on the mined index data is described below based on the embodiments of fig. 4 and fig. 6. Specifically, please refer to fig. 10.
A service system (which may be a service server) may perform offline mining. Specifically, it may obtain corpus information of the service system (the corpus information may include corpus information that has undergone screening of high-frequency meaningless words) and extract associated term sets from the corpus information based on association rules. It may then perform LTP-based dependency syntax analysis processing on the associated terms in each associated term set to obtain index data comprising the associated term set. Next, same-semantic expansion processing may be performed on the index data so that the index data covers different word forms with the same semantics. The resulting index data is then reviewed manually, and the correct index data, the associated term set corresponding to the index data, and the mapping relationship between the index data and the associated term set are added to an index database. In addition, incorrect index data may be returned to the corpus information for iterative optimization of index data mining, which better ensures the accuracy of the index data.
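The screening of high-frequency meaningless words applied to the corpus information is not detailed in the text. One simple possibility, given purely as an assumption, is to drop words that occur in more than a chosen fraction of the corpus entries; the function name and the threshold are hypothetical.

```python
from collections import Counter


def screen_high_frequency_words(docs, max_doc_ratio=0.8):
    """Remove words that occur in more than max_doc_ratio of the corpus entries."""
    n = len(docs)
    # Count, for each word, the number of entries that contain it.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    noisy = {term for term, count in doc_freq.items() if count / n > max_doc_ratio}
    return [[term for term in doc if term not in noisy] for doc in docs]
```

A filler word such as "please" that appears in every entry would be removed before association rules are mined, while topic words that appear in only some entries are kept.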
Further, the business system may perform online recall calculation. Specifically, it may load the index database, match the acquired query data against the associated term sets in the index database, and recall the business data corresponding to the index data of the matched associated term sets. The similarity between each matched associated term set and the query data may then be calculated, and the recalled business data sorted by that similarity. Next, based on the matching degree between the intention of the matched index data and the intention of the query data, the similarity between the corresponding associated term set and the query data is adjusted and the recalled business data are re-sorted, so that business data that better matches the query data is ranked higher. Finally, the sorted business data are returned as the query result.
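The recall and first sorting pass of the online stage might look like the sketch below. The layout of the index database (a list of entries with hypothetical "terms" and "service_data" keys), the rule that a term set matches when all of its terms occur in the query, and the use of word coverage as the similarity are assumptions made for the example; the intention-based re-sorting is omitted.

```python
def recall_and_rank(query_terms, index_database):
    """Recall service data whose associated term set is contained in the query.

    index_database: list of dicts with hypothetical keys
        "terms"        - the associated term set of one piece of index data
        "service_data" - the service data mapped to that index data
    """
    query_set = set(query_terms)
    hits = []
    for entry in index_database:
        terms = set(entry["terms"])
        # A term set matches when every one of its terms occurs in the query.
        if terms and terms <= query_set:
            coverage = sum(1 for t in query_terms if t in terms) / len(query_terms)
            hits.append((coverage, entry["service_data"]))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return [service_data for _, service_data in hits]
```

Entries whose term sets cover more of the query are returned first, and entries with no matching term set are not recalled at all.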
In the embodiment provided by the present invention in which the business system mines index data and matches query results based on the mined index data, during index data mining, associated term sets are first extracted from the corpus information of the business based on association rules, and dependency syntax analysis, semantic expansion, and manual review are then performed, so that the index data that frequently appears in the business system can be obtained accurately. During matching of query results based on the mined index data, the query data is matched directly against associated term sets that accurately reflect frequently occurring questions in the service system, so that the index data in the service system that matches the user's query data can be determined. In addition, by calculating the similarity between each matched associated term set and the query data and sorting the results by that similarity, the business data in the query result are ordered by how well their corresponding index data matches the query data. Therefore, the service data that better meets the query requirement can be pushed first, and the accuracy of the matched query result is greatly improved.
An embodiment of the present invention provides a data processing server, where the data processing server includes a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the data processing method provided in the foregoing method embodiment.
The memory may be used to store software programs and modules, and the processor executes various functional applications and data processing by running the software programs and modules stored in the memory. The memory may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, application programs required by functions, and the like, and the data storage area may store data created according to use of the apparatus, and the like. Further, the memory may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory may also include a memory controller to provide the processor access to the memory.
Embodiments of the present invention also provide a storage medium, which may be disposed in a server to store at least one instruction, at least one program, a code set, or a set of instructions related to implementing a data processing method in the method embodiments, where the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by the processor to implement the data processing method provided by the above method embodiments.
Optionally, in this embodiment, the storage medium may be located in at least one of a plurality of network servers of a computer network. Optionally, in this embodiment, the storage medium may include, but is not limited to: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media capable of storing program code.
It can be seen from the above embodiments of the data processing method, apparatus, server, and storage medium provided by the present invention that the query data is matched directly against associated term sets that accurately reflect frequently occurring questions in the business system, so that the index data in the business system that matches the user's query data can be determined. In addition, by calculating the similarity between each matched associated term set and the query data and sorting the corresponding business data by that similarity, the business data in the query result are ordered by how well their corresponding index data matches the query data. Therefore, the service data that better meets the query requirement can subsequently be pushed to the user first, the accuracy of the matched query result is greatly improved, and the user experience is effectively improved.
It should be noted that the order of the above embodiments of the present invention is for description only and does not indicate the relative merits of the embodiments. Specific embodiments of this specification have been described above. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The embodiments in this specification are described in a progressive manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the apparatus and server embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the corresponding description of the method embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (14)

1. A method of data processing, the method comprising:
acquiring query data;
matching an associated term set contained in the query data from an associated term set of index data, wherein the associated term set comprises a plurality of associated terms appearing in association, and the index data is determined in advance by adopting the following method: extracting an associated term set from the corpus information based on an association rule; performing dependency syntax analysis processing on the associated terms in the associated term set to obtain index data comprising the associated term set, wherein the associated term set represents frequently-appearing query data in a business system corresponding to the corpus information;
determining similarity between an associated term set contained in the query data and the query data;
sequencing the service data corresponding to the index data of the associated term set according to the similarity;
and taking the sequenced service data as a query result of the query data.
2. The method of claim 1, wherein the similarity includes at least one of: word weight, word coverage;
correspondingly, the determining the similarity between the associated term set contained in the query data and the query data at least includes one of the following:
calculating the word weight of the associated terms in the associated term set contained in the query data, and taking the word weight as the similarity between the associated term set contained in the query data and the query data;
and calculating the word coverage rate of the associated terms in the associated term set contained in the query data, and taking the word coverage rate as the similarity between the associated term set contained in the query data and the query data.
3. The method of claim 1, further comprising:
performing intention analysis processing on the query data to determine the intention of the query data;
performing intention analysis processing on index data of the associated term set contained in the query data, and determining the intention of the index data of the associated term set contained in the query data;
adjusting the similarity of the respective associated term sets to the query data based on a degree of match between the intent of the index data and the intent of the query data;
the matching degree between the index data intention of the associated term set contained in the query data and the intention of the query data is in direct proportion to the similarity between the associated term set and the query data;
correspondingly, sorting the service data corresponding to the index data of the associated term set according to the similarity comprises sorting the service data corresponding to the index data of the associated term set according to the adjusted similarity.
4. The method of claim 1, further comprising:
performing same-semantic expansion processing on the associated terms in the associated term set to obtain a plurality of associated term sets with the semantics of the associated term sets;
correspondingly, the performing dependency parsing processing on the associated terms in the associated term set to obtain index data includes: and performing dependency syntax analysis processing on the associated terms in the associated term sets to obtain a plurality of index data comprising the associated term sets.
5. The method of claim 1, wherein extracting the set of associated terms from the corpus information based on the association rule comprises:
determining a frequent lexical item set from the corpus information;
determining confidence of association rules among frequent terms in the frequent term set;
and taking the frequent term set corresponding to the association rule with the confidence coefficient larger than the preset minimum confidence coefficient as an association term set.
6. The method of claim 1, further comprising:
sending the query result to a client;
or,
and sending the service data in the query result to the client in sequence.
7. A data processing apparatus, characterized in that the apparatus comprises:
the query data acquisition module is used for acquiring query data;
an associated term set matching module, configured to match an associated term set contained in the query data from the associated term sets of index data, wherein the associated term set comprises a plurality of associated terms appearing in association; the index data is determined using the following modules: an associated term set extraction module, configured to extract an associated term set from corpus information based on an association rule; and a dependency syntax analysis processing module, configured to perform dependency syntax analysis processing on the associated terms in the associated term set to obtain index data comprising the associated term set; wherein the associated term set represents frequently appearing query data in a service system corresponding to the corpus information;
a similarity determining module, configured to determine a similarity between an associated term set included in the query data and the query data;
the sorting module is used for sorting the service data corresponding to the index data of the associated term set according to the similarity;
and the query result determining module is used for taking the sequenced service data as the query result of the query data.
8. The apparatus of claim 7, wherein the similarity comprises at least one of: word weight, word coverage;
correspondingly, the determining the similarity between the associated term set contained in the query data and the query data at least includes one of the following:
calculating the word weight of the associated terms in the associated term set contained in the query data, and taking the word weight as the similarity between the associated term set contained in the query data and the query data;
and calculating the word coverage rate of the associated terms in the associated term set contained in the query data, and taking the word coverage rate as the similarity between the associated term set contained in the query data and the query data.
9. The apparatus of claim 7, further comprising:
the first intention analysis processing module is used for carrying out intention analysis processing on the query data and determining the intention of the query data;
the second intention analysis processing module is used for performing intention analysis processing on the index data of the associated term set contained in the query data and determining the intention of the index data of the associated term set contained in the query data;
a similarity adjustment module for adjusting the similarity between the corresponding associated term set and the query data based on the matching degree between the intention of the index data and the intention of the query data;
the matching degree between the index data intention of the associated term set contained in the query data and the intention of the query data is in direct proportion to the similarity between the associated term set and the query data;
correspondingly, the sorting module is further configured to sort the service data corresponding to the index data of the associated term set according to the adjusted similarity.
10. The apparatus of claim 7, wherein the index data is further determined using:
the semantic-similar expansion processing module is used for performing semantic-similar expansion processing on the associated terms in the associated term set to obtain a plurality of associated term sets with the semantics of the associated term sets;
correspondingly, the dependency parsing processing module is further configured to perform dependency parsing processing on the associated terms in the associated term sets to obtain a plurality of index data including the associated term sets.
11. The apparatus of claim 7, wherein the associated term set extraction module comprises:
a frequent lexical item set determining unit, configured to determine a frequent lexical item set from the corpus information;
the confidence coefficient determining unit is used for determining the confidence coefficient of the association rule among the frequent terms in the frequent term set;
and the association term set determining unit is used for taking the frequent term set corresponding to the association rule with the confidence coefficient greater than the preset minimum confidence coefficient as the association term set.
12. The apparatus of claim 7, further comprising:
a first query result sending module, configured to send the query result to the client;
or,
a second query result sending module, configured to send the service data in the query result to the client in sequence.
13. A data processing server, characterized in that the server comprises a processor and a memory, in which at least one instruction, at least one program, a set of codes, or a set of instructions is stored, which is loaded and executed by the processor to implement the data processing method according to any one of claims 1 to 6.
14. A computer readable storage medium having stored therein at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by a processor to implement the data processing method according to any one of claims 1 to 6.
CN201810593240.5A 2018-06-11 2018-06-11 Data processing method and device and server Active CN109033142B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810593240.5A CN109033142B (en) 2018-06-11 2018-06-11 Data processing method and device and server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810593240.5A CN109033142B (en) 2018-06-11 2018-06-11 Data processing method and device and server

Publications (2)

Publication Number Publication Date
CN109033142A (en) 2018-12-18
CN109033142B (en) 2021-02-12

Family

ID=64612551

Family Applications (1)

Application Number Status Publication Priority Date Filing Date Title
CN201810593240.5A Active CN109033142B (en) 2018-06-11 2018-06-11 Data processing method and device and server

Country Status (1)

Country Link
CN (1) CN109033142B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101201838A (en) * 2007-08-21 2008-06-18 新百丽鞋业(深圳)有限公司 Method for improving a keyword-index-based search engine using a phrase index technique
CN103235812B (en) * 2013-04-24 2015-04-01 中国科学院计算技术研究所 Method and system for identifying multiple query intents
CN106777957A (en) * 2016-12-12 2017-05-31 吉林大学 New method for biomedical multi-argument event extraction on imbalanced datasets
CN107885718A (en) * 2016-09-30 2018-04-06 腾讯科技(深圳)有限公司 Semantic determination method and device
CN107993724A (en) * 2017-11-09 2018-05-04 易保互联医疗信息科技(北京)有限公司 Method and device for medical intelligent question-answering data processing
CN108052583A (en) * 2017-11-17 2018-05-18 康成投资(中国)有限公司 E-commerce ontology construction method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6275758B2 (en) * 2016-03-01 2018-02-07 Necパーソナルコンピュータ株式会社 Information processing system, information processing method, and program
CN106897334B (en) * 2016-06-24 2020-07-14 阿里巴巴集团控股有限公司 Question pushing method and equipment
CN107766498B (en) * 2017-10-19 2022-01-07 北京百度网讯科技有限公司 Method and apparatus for generating information

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101201838A (en) * 2007-08-21 2008-06-18 新百丽鞋业(深圳)有限公司 Method for improving a keyword-index-based search engine using a phrase index technique
CN103235812B (en) * 2013-04-24 2015-04-01 中国科学院计算技术研究所 Method and system for identifying multiple query intents
CN107885718A (en) * 2016-09-30 2018-04-06 腾讯科技(深圳)有限公司 Semantic determination method and device
CN106777957A (en) * 2016-12-12 2017-05-31 吉林大学 New method for biomedical multi-argument event extraction on imbalanced datasets
CN107993724A (en) * 2017-11-09 2018-05-04 易保互联医疗信息科技(北京)有限公司 Method and device for medical intelligent question-answering data processing
CN108052583A (en) * 2017-11-17 2018-05-18 康成投资(中国)有限公司 E-commerce ontology construction method

Also Published As

Publication number Publication date
CN109033142A (en) 2018-12-18

Similar Documents

Publication Title
CN107436875B (en) Text classification method and device
US11093854B2 (en) Emoji recommendation method and device thereof
CN108153901B (en) Knowledge graph-based information pushing method and device
US11468342B2 (en) Systems and methods for generating and using knowledge graphs
CN110457708B (en) Vocabulary mining method and device based on artificial intelligence, server and storage medium
CN111104526A (en) Financial label extraction method and system based on keyword semantics
US20160189047A1 (en) Method and System for Entity Linking
CN110147425B (en) Keyword extraction method and device, computer equipment and storage medium
CN109086265B (en) Semantic training method and multi-semantic word disambiguation method in short text
CN110968684A (en) Information processing method, device, equipment and storage medium
CN113326420B (en) Question retrieval method, device, electronic equipment and medium
CN109002432B (en) Synonym mining method and device, computer readable medium and electronic equipment
CN113204621B (en) Document warehouse-in and document retrieval method, device, equipment and storage medium
CN106294505B (en) Answer feedback method and device
CN111310440A (en) Text error correction method, device and system
CN109885651B (en) Question pushing method and device
CN113836314B (en) Knowledge graph construction method, device, equipment and storage medium
CN113051380A (en) Information generation method and device, electronic equipment and storage medium
CN114756570A (en) Vertical search method, device and system for purchase scene
CN110489740B (en) Semantic analysis method and related product
CN111563361A (en) Text label extraction method and device and storage medium
CN109033142B (en) Data processing method and device and server
CN116383234A (en) Search statement generation method and device, computer equipment and storage medium
CN112926297B (en) Method, apparatus, device and storage medium for processing information
WO2018171499A1 (en) Information detection method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant