CN118052228A - Domain word determining method and device, electronic equipment and storage medium - Google Patents

Domain word determining method and device, electronic equipment and storage medium

Info

Publication number
CN118052228A
Authority
CN
China
Prior art keywords
word
words
domain
determining
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410318832.1A
Other languages
Chinese (zh)
Inventor
张海霞
李斌
谢鸣晓
刘峻杉
谷利峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
CCB Finetech Co Ltd
Original Assignee
China Construction Bank Corp
CCB Finetech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp, CCB Finetech Co Ltd filed Critical China Construction Bank Corp
Priority to CN202410318832.1A priority Critical patent/CN118052228A/en
Publication of CN118052228A publication Critical patent/CN118052228A/en
Pending legal-status Critical Current


Landscapes

  • Machine Translation (AREA)

Abstract

The present application relates to the field of natural language processing, and in particular to a method and apparatus for determining domain words, an electronic device, and a storage medium, which extract domain words from a document with high accuracy. The method comprises the following steps: obtaining a target document, wherein the target document comprises one or more paragraphs; performing word segmentation on any paragraph to obtain a first word segmentation result; determining a plurality of candidate domain words based on the first word segmentation result; determining a probability parameter for each candidate domain word using a pre-trained domain word classification model; and determining target domain words from the plurality of candidate domain words based on a first probability threshold and the probability parameter of each candidate domain word.

Description

Domain word determining method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of natural language processing, and in particular, to a method and apparatus for determining domain words, an electronic device, and a storage medium.
Background
As the financial sector evolves, enterprises accumulate large numbers of business documents, such as financial product risk disclosure statements, business description documents, and bond underwriting operating manuals. Business documents in the financial industry are numerous, varied, and rich in professional vocabulary, and most of them are long, some running to twenty or thirty pages.
When documents are numerous and long, searching and classifying them without computer assistance must be done manually, and understanding their content requires considerable manpower. If the words in a document can be extracted accurately, the accuracy of natural language processing (NLP) tasks such as document classification, retrieval, and knowledge point extraction can be improved.
A common word segmentation tool can only segment existing words, and it relies on word library updates to segment new words. For example, when the term "green finance" appears in the financial field, the word segmentation tool may split this new word into "green" and "finance", even though "green finance" is already a domain word in the financial domain with a specific meaning. If the domain words in a document can be extracted accurately, the accuracy of natural language processing (NLP) tasks such as document classification, retrieval, and knowledge point extraction can be greatly improved, helping users accurately extract useful information from large numbers of documents.
Disclosure of Invention
The embodiments of the present application aim to provide a method, an apparatus, an electronic device, and a storage medium for determining domain words, so as to extract domain words from a document with high accuracy.
In a first aspect, an embodiment of the present application provides a method for determining a domain word, where the method includes:
obtaining a target document, wherein the target document comprises one or more paragraphs;
performing word segmentation processing on any paragraph to obtain a first word segmentation result;
determining a plurality of candidate domain words based on the first word segmentation result;
Determining probability parameters of each candidate domain word by using a pre-trained domain word classification model;
and determining the target domain word from the plurality of candidate domain words based on the first probability threshold and the probability parameter of each candidate domain word.
In one possible implementation, the first word segmentation result includes a plurality of word segments, the dependency syntactic relationships between the word segments, and the positions of the word segments in the paragraph;
the determining a plurality of candidate domain words based on the first word segmentation result comprises:
for a first word segment among the plurality of word segments, if there are multiple second word segments having a dependency syntactic relationship with the first word segment, selecting two second word segments from them, and generating a standby word by combining the first word segment with the part of speech and position of each of the two second word segments;
determining a candidate probability of the standby word based on the positions of, and dependency syntactic relationships among, the word segments in the standby word;
and if the candidate probability of the standby word is greater than a second probability threshold, determining the standby word as a candidate domain word.
In one possible implementation manner, the word segmentation processing is performed on any paragraph to obtain a first word segmentation result, which includes:
And performing word segmentation on any paragraph by using a word segmentation tool and a custom dictionary to obtain the first word segmentation result, wherein the custom dictionary is determined based on the target document.
In one possible implementation, determining the custom dictionary includes:
traversing paragraphs in the target document, and performing word segmentation on each paragraph by using the word segmentation tool to obtain a second word segmentation result, wherein the second word segmentation result comprises the part of speech, the position and the dependency syntax relationship of each word;
Generating a plurality of phrases based on the second word segmentation result, each phrase comprising a plurality of third word segments;
generating a combination set corresponding to any phrase, wherein each element in the combination set comprises at least two third word segments of the phrase;
and for any combination set, determining an importance parameter for each element in the combination set, and adding a target element to the custom dictionary, wherein the target element is the element with the largest importance parameter among the elements.
In a possible implementation manner, the determining the probability parameter of each candidate domain word by using the pre-trained domain word classification model includes:
acquiring attribute parameters of the target document;
And inputting the attribute parameters of the target document and the candidate domain words into the domain word classification model, and outputting the probability parameters of the candidate domain words by the domain word classification model.
In a possible implementation, the attribute parameters of the target document include the business classification to which the target document belongs, and/or the source type of the target document.
In one possible implementation manner, after the attribute parameter of the target document and the candidate domain word are input into the domain word classification model, the method further includes:
The domain word classification model determines word features, semantic features and structural features of the candidate domain words;
The domain word classification model calculates the probability parameter of each candidate domain word based on the weight parameter of each type of feature and the word features, semantic features, and structural features of the candidate domain word.
In a second aspect, an embodiment of the present application provides a domain word determining apparatus, including:
An acquisition unit configured to acquire a target document, the target document including one or more paragraphs;
the word segmentation processing unit is used for carrying out word segmentation processing on any paragraph to obtain a first word segmentation result;
A candidate domain word determining unit, configured to determine a plurality of candidate domain words based on the first word segmentation result;
the probability parameter determining unit is used for determining probability parameters of each candidate domain word by utilizing a pre-trained domain word classification model;
and the domain word determining unit is used for determining target domain words from the plurality of candidate domain words based on the first probability threshold and the probability parameter of each candidate domain word.
In one possible implementation, the first word segmentation result includes a plurality of word segments, the dependency syntactic relationships between the word segments, and the positions of the word segments in the paragraph;
the candidate domain word determining unit is specifically configured to:
for a first word segment among the plurality of word segments, if there are multiple second word segments having a dependency syntactic relationship with the first word segment, select two second word segments from them, and generate a standby word by combining the first word segment with the part of speech and position of each of the two second word segments;
determine a candidate probability of the standby word based on the positions of, and dependency syntactic relationships among, the word segments in the standby word;
and if the candidate probability of the standby word is greater than a second probability threshold, determine the standby word as a candidate domain word.
In one possible implementation manner, the word segmentation processing unit is specifically configured to:
And performing word segmentation on any paragraph by using a word segmentation tool and a custom dictionary to obtain the first word segmentation result, wherein the custom dictionary is determined based on the target document.
In one possible implementation manner, the word segmentation processing unit is specifically configured to:
traversing paragraphs in the target document, and performing word segmentation on each paragraph by using the word segmentation tool to obtain a second word segmentation result, wherein the second word segmentation result comprises the part of speech, the position and the dependency syntax relationship of each word;
Generating a plurality of phrases based on the second word segmentation result, each phrase comprising a plurality of third word segments;
generating a combination set corresponding to any phrase, wherein each element in the combination set comprises at least two third word segments of the phrase;
and for any combination set, determining an importance parameter for each element in the combination set, and adding a target element to the custom dictionary, wherein the target element is the element with the largest importance parameter among the elements.
In a possible implementation manner, the probability parameter determining unit is specifically configured to:
acquiring attribute parameters of the target document;
And inputting the attribute parameters of the target document and the candidate domain words into the domain word classification model, and outputting the probability parameters of the candidate domain words by the domain word classification model.
In a possible implementation, the attribute parameters of the target document include the business classification to which the target document belongs, and/or the source type of the target document.
In a possible implementation, after the probability parameter determining unit inputs the attribute parameters of the target document and the candidate domain words into the domain word classification model, the domain word classification model determines the word features, semantic features, and structural features of the candidate domain words, and calculates the probability parameter of each candidate domain word based on the weight parameter of each type of feature and the word features, semantic features, and structural features of the candidate domain word.
In a third aspect, an embodiment of the present application provides an electronic device, including at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any one of the methods provided by the embodiments of the first aspect of the present application.
In a fourth aspect, embodiments of the present application provide a computer storage medium, where the computer readable storage medium stores a computer program for causing a computer to perform any of the methods provided by the embodiments of the first aspect of the present application.
In a fifth aspect, embodiments of the present application provide a computer program product comprising: computer program code which, when run on a computer, causes the computer to perform any of the methods provided by the embodiments of the first aspect.
The application has the following beneficial effects:
The embodiments of the present application provide a new and more accurate approach to determining domain words. The electronic device determines candidate domain words from the word segmentation result of any paragraph in the target document, then uses a pre-trained domain word classification model to determine the probability parameter that each candidate domain word is a target domain word, and determines whether a candidate domain word is a target domain word based on the first probability threshold and its probability parameter. In other words, a plurality of candidate domain words are first determined based on the word segmentation result, and the target domain words are then selected from those candidates rather than being taken directly from the word segmentation result. The method therefore does not depend on the capabilities of the word segmentation tool, and the determined target domain words are, or are close to, proprietary or unique terms of the domain.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments of the present application will be briefly described below, and it is obvious that the drawings described below are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of an application scenario in an embodiment of the present application;
FIG. 2 is a schematic flow chart of a method for determining domain words in an embodiment of the application;
FIG. 3 is a flowchart of a method for determining a first word segmentation result according to an embodiment of the present application;
FIG. 4 is a flowchart of a method for determining candidate domain words according to an embodiment of the present application;
FIG. 5 is a directed unweighted graph based on dependency syntax in an embodiment of the application;
FIG. 6 is a schematic diagram of a structure of a domain word determining apparatus according to an embodiment of the present application;
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application. Embodiments of the application and features of the embodiments may be combined with one another arbitrarily without conflict. Also, while a logical order is depicted in the flowchart, in some cases, the steps depicted or described may be performed in a different order than presented herein.
The term "comprising" and any variations thereof in the description of the application and in the claims is intended to cover non-exclusive protection. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
In the embodiments of the present application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate: A alone, both A and B, or B alone, where A and B may be singular or plural. The character "/" generally indicates that the associated objects are in an "or" relationship. "At least one of" and similar expressions refer to any combination of the items, including any combination of a single item or plural items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be single or plural.
And, unless otherwise indicated, the terms "first," "second," and the like according to the embodiments of the present application are used for distinguishing a plurality of objects, and are not used for limiting the size, content, order, timing, priority, importance, or the like of the plurality of objects. For example, the first word segmentation result and the second word segmentation result are only for distinguishing the word segmentation results, and are not different in content, size, priority, importance, or the like of the two word segmentation results.
In the technical solution of the present application, the collection, transmission, and use of data all comply with the relevant national laws and regulations.
A domain word is a word that clearly reflects the content characteristics of a text paragraph, i.e., what the content is about, such as its subject or theme. Extracting domain words from business documents in the Chinese financial domain is a subtask of NLP. Chinese NLP tasks differ from English ones in that Chinese words lack obvious boundaries. A Chinese NLP task therefore needs to perform word segmentation and part-of-speech tagging in sequence, and then process sentences on the basis of the segmented words. However, this analysis pipeline is prone to error propagation, and the knowledge shared among the three subtasks is not fully utilized.
In the word segmentation process, a common word segmentation tool can only segment existing words, and it relies on word library updates to segment new words. For example, when the term "green finance" appears in the financial field, the word segmentation tool may split this new word into "green" and "finance", even though "green finance" is already a domain word in the financial domain with a specific meaning.
If the domain words in a document can be extracted accurately, the accuracy of natural language processing (NLP) tasks such as document classification, retrieval, and knowledge point extraction can be greatly improved, helping users accurately extract useful information from large numbers of documents.
In view of the above, the present application provides a method for determining domain words. With this method, after obtaining a target document, an electronic device can perform word segmentation on any paragraph to obtain a first word segmentation result, determine a plurality of candidate domain words based on the first word segmentation result, determine the probability parameter of each candidate domain word using a pre-trained domain word classification model, and then determine target domain words from the plurality of candidate domain words based on a first probability threshold and the probability parameter of each candidate domain word. The method thus determines the domain words in the target document, and it can be applied to scenarios of determining domain words in any domain.
After the design idea of the embodiment of the present application is introduced, some simple descriptions are made below for application scenarios applicable to the technical solution of the embodiment of the present application, and it should be noted that the application scenarios described below are only used for illustrating the embodiment of the present application and are not limiting. In the specific implementation process, the technical scheme provided by the embodiment of the application can be flexibly applied according to actual needs.
The scheme provided by the embodiments of the present application can be applied to various business scenarios that require domain word extraction.
Referring to fig. 1, which is a schematic view of a scenario provided in an embodiment of the present application, the scenario may include a plurality of terminal devices 101 and a server 102, where the terminal devices 101-1, 101-2, …, 101-n may be used by different users, and each terminal device is provided with its own document parsing system.
In the embodiment of the application, a user can log in to the corresponding document parsing system on the terminal device 101, select the domain word extraction function on the document upload page of the document parsing system, and have the terminal device 101 upload a target document and send a domain word extraction request carrying the business classification and/or department classification to the server 102. The server 102 executes the domain word determining method provided by the application according to the received request and the target document, determines one or more domain words of the target document, and sends them to the terminal device 101, so that the terminal device 101 obtains the one or more domain words of the target document.
In the embodiment of the present application, the terminal device 101 may be, for example, a mobile phone, a tablet personal computer (PAD), a personal computer (Personal computer, PC), an intelligent television, an intelligent vehicle-mounted device, a wearable device, or the like, which is not limited in the embodiment of the present application. In the embodiment of the present application, the server 102 may be a cloud server for providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content distribution networks, basic cloud computing services such as big data and artificial intelligence platforms, or may be a physical server, but is not limited thereto.
The terminal devices 101 and the server 102, as well as the terminal devices 101 themselves, can be directly or indirectly connected through one or more networks 103. The network 103 may be a wired network or a wireless network, for example a mobile cellular network or a Wireless Fidelity (WIFI) network, or may be another possible network, which is not limited in the embodiments of the present application.
Of course, the method provided by the embodiment of the present application is not limited to the application scenario shown in fig. 1, but may be used in other possible application scenarios, for example, application scenarios where multiple terminal devices interact with multiple servers, and the embodiment of the present application is not limited. The functions that can be implemented by each device in the application scenario shown in fig. 1 will be described together in the following method embodiments, which are not described in detail herein.
In order to further explain the technical solution provided by the embodiments of the present application, the following details are described with reference to the accompanying drawings and the detailed description. Although embodiments of the present application provide the method operational steps shown in the following embodiments or figures, more or fewer operational steps may be included in the method, either on a routine or non-inventive basis. In steps where there is logically no necessary causal relationship, the execution order of the steps is not limited to the execution order provided by the embodiments of the present application. The methods may be performed sequentially or in parallel as shown in the embodiments or the drawings when the actual processing or the apparatus is performed.
Referring to fig. 2, fig. 2 is a flowchart of a method for determining domain words in an embodiment of the application. The flow of the method may be performed by an electronic device, which may be the server 102 in fig. 1, and the specific implementation flow of the method is as follows:
in step S201, a target document is obtained, the target document comprising one or more paragraphs.
In a practical scenario, an enterprise server is usually dedicated to business in one domain, so the server may treat the domain of the acquired target document as a default domain. Alternatively, when acquiring the target document, the server may also obtain an indication of the domain to which the target document belongs, or it may determine the domain of the target document after acquisition by means such as semantic recognition.
In embodiments of the application, the target document may include one or more paragraphs. It should be noted that a document may be understood as text, an article, or the like. The file type of the target document according to the embodiment of the application can be, but is not limited to, Word, PPT, PDF, and the like.
Step S202, performing word segmentation processing on any paragraph to obtain a first word segmentation result.
The electronic device may traverse the paragraphs, performing the operations in steps S202, S203, S204.
Step S203, determining a plurality of candidate domain words based on the first word segmentation result.
The electronic device may determine a plurality of candidate domain words present in any paragraph by using the first word segmentation result corresponding to that paragraph. Optionally, the electronic device may randomly combine the word segments in the paragraph and use the combined phrases as candidate domain words, or it may combine the word segments in the paragraph in a preset manner to obtain the candidate domain words.
Step S204, determining probability parameters of each candidate domain word by using a pre-trained domain word classification model.
The electronic device may input each candidate domain word into a pre-trained domain word classification model that may obtain probability parameters for each candidate domain word. The probability parameter of the candidate domain word may represent the probability that the candidate domain word is a domain word in the target document, that is, the probability that the candidate domain word is a target domain word of the target document.
The pre-trained domain word classification model may be trained based on one or more documents of the domain to which the target document corresponds. The training process of the domain word classification model will be described later.
Step S205, determining a target domain word from the plurality of candidate domain words based on the first probability threshold and the probability parameter of each candidate domain word.
After obtaining the probability parameter of each candidate domain word, the electronic device can determine the candidate domain word with the probability parameter being greater than or equal to the first probability threshold as the target domain word. Candidate domain words with probability parameters smaller than the first probability threshold are not used as target domain words.
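For illustration only, the following Python sketch outlines how steps S201 to S205 fit together. The helper callables, the threshold value, and all names are hypothetical placeholders rather than the patent's concrete implementation.

```python
# Minimal sketch of the flow in steps S201-S205; callables are placeholders.
from typing import Callable, Iterable


def determine_domain_words(
    paragraphs: Iterable[str],                                 # S201: the target document's paragraphs
    segment: Callable[[str], object],                          # S202: word segmentation (tool + custom dictionary)
    generate_candidates: Callable[[object], Iterable[str]],    # S203: candidate domain words
    probability: Callable[[str], float],                       # S204: pre-trained domain word classification model
    first_prob_threshold: float = 0.8,                         # hypothetical value; the patent fixes no number
) -> set[str]:
    """Illustrative flow only; each callable stands in for a step of the method."""
    target_domain_words: set[str] = set()
    for paragraph in paragraphs:
        first_result = segment(paragraph)                      # first word segmentation result
        for candidate in generate_candidates(first_result):
            if probability(candidate) >= first_prob_threshold: # S205: compare with first threshold
                target_domain_words.add(candidate)
    return target_domain_words
```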
In one possible implementation, to improve word segmentation efficiency and the accuracy of determining candidate domain words, in step S202 the electronic device may perform word segmentation on any paragraph by using a word segmentation tool together with a custom dictionary to obtain the first word segmentation result, where the custom dictionary is determined based on the target document.
In some examples, the custom dictionary may be a professional domain dictionary of the domain to which the target document corresponds. The custom dictionary needs to be updated manually and periodically.
In another possible embodiment, as shown in fig. 3, a method for determining a first word segmentation result may include the following steps:
Step S301, traversing paragraphs in the target document, and performing word segmentation processing on each paragraph by using the word segmentation tool to obtain a second word segmentation result, where the second word segmentation result includes the part of speech, the position and the dependency syntax relationship of each word segmentation.
The electronic device can use any existing word segmentation tool or word segmentation algorithm to segment each paragraph, obtaining the word segments and part-of-speech tags of each paragraph and performing dependency syntactic analysis.
Step S302, generating a plurality of phrases based on the second word segmentation result, wherein each phrase comprises a plurality of third word segmentation.
The electronic device may establish a mapping ph = f(c, r, i) that combines word segments into a phrase ph based on each word segment's part of speech c, dependency syntactic relationship r, and position i in the paragraph. The electronic device may thus obtain a plurality of phrases ph and generate a phrase set A corresponding to the target document, where the phrase set A includes a plurality of phrases, denoted ph0, ph1, …, phn.
Each phrase phi may contain j words, denoted w0, w1, …, wj. In this application, the words in a phrase are referred to as third word segments.
Step S303, generating a combination set corresponding to any phrase, wherein each element in the combination set comprises at least two third participles of the phrase.
For any phrase phi, a combination set Si corresponding to the phrase can be generated, where the combination set Si may include a plurality of elements and each element is composed of at least two third word segments of the phrase. For example, the combination set corresponding to phrase phi may be Si = {w0w1, w0w1w2, …, w0w1w2…wn}.
Step S304, determining importance degree parameters of elements in any combination set, and adding target elements into the custom dictionary, wherein the element with the largest importance degree parameter in the importance degree parameters of the elements is the target element.
To determine the boundary of any phrase phi, for the k-th element w_k in the combination set Si corresponding to the phrase, an importance degree Z(w_k) of the element is determined, where num denotes the number of paragraphs in the document that contain w_k and total denotes the total number of paragraphs in the document. The element with the largest importance degree Z among all elements of the combination set Si is taken as the target element corresponding to Si and added to the custom dictionary. Optionally, if the importance degree of every element in the combination set Si is 0, the set has no target element, and its elements do not need to be added to the custom dictionary.
Step S305, performing word segmentation processing on any paragraph by using a word segmentation tool and a custom dictionary, to obtain the first word segmentation result.
The words in the custom dictionary are new words discovered from the target document, usually technical terms and terminology. Building a custom dictionary for the target document helps improve the word segmentation quality for its paragraphs.
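As an illustration of steps S301 to S304, the following Python sketch builds a custom dictionary from the phrases of a target document. The exact importance function Z(w_k) is not reproduced in this text, so it is approximated here by the paragraph-frequency ratio num/total, and the prefix-style combination set is one possible reading of {w0w1, w0w1w2, …}; both are assumptions.

```python
# Illustrative sketch of building the custom dictionary (steps S302-S304).
def build_custom_dictionary(paragraphs: list[str],
                            phrase_sets: list[list[list[str]]]) -> set[str]:
    """paragraphs: the paragraphs of the target document (step S301 already done).
    phrase_sets: for each paragraph, the phrases produced in step S302,
    each phrase being a list of third word segments."""
    custom_dictionary: set[str] = set()
    total = len(paragraphs)
    for phrases in phrase_sets:
        for phrase in phrases:
            # S303: one possible reading of the combination set {w0w1, w0w1w2, ...}
            elements = ["".join(phrase[:k]) for k in range(2, len(phrase) + 1)]
            # S304: keep the element with the largest importance
            best, best_z = None, 0.0
            for element in elements:
                num = sum(1 for p in paragraphs if element in p)  # paragraphs containing the element
                z = num / total                                   # assumed stand-in for Z(w_k)
                if z > best_z:
                    best, best_z = element, z
            if best is not None:                                  # all-zero importance: nothing added
                custom_dictionary.add(best)
    return custom_dictionary
```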
Based on the domain word determining method provided by any of the foregoing embodiments, in one possible implementation, performing word segmentation on any paragraph yields the plurality of word segments in the paragraph, the dependency syntactic relationships between them, and the position of each word segment in the paragraph. That is, the first word segmentation result includes a plurality of word segments, the dependency syntactic relationships between the word segments, and the positions of the word segments in the paragraph. In the foregoing step S203, when the electronic device determines a plurality of candidate domain words based on the first word segmentation result, it may perform the candidate domain word determining method shown in fig. 4:
Step S401: for a first word segment among the plurality of word segments, if there are multiple second word segments having a dependency syntactic relationship with the first word segment, two second word segments are selected from them, and a standby word is generated by combining the part of speech and position of each of the two second word segments with the first word segment.
The electronic device may traverse the word segments in the first word segmentation result and mark the currently traversed word as the first word segment. In the first word segmentation result, a word segment having a dependency syntactic relationship with the first word segment is marked as a second word segment. Referring to the directed unweighted graph based on dependency syntax shown in fig. 5, the first word segment serves as the parent node; assuming there are two second word segments having a dependency syntactic relationship with the first word segment, the two second word segments serve as child nodes. The part of speech and paragraph position of each child node are combined with the parent node to obtain the standby word. For example, the parent node is "dimension", child node 1 is "bond", and child node 2 is "classification". Child node 1 is at position 1 in the paragraph and its part of speech is a noun (n); child node 2 is at position 2 and its part of speech is a verbal noun (vn). The dependency syntactic relationship between the parent node and child node 1 is an attributive (modifier-head) relationship, as is the relationship between the parent node and child node 2. Since the position of child node 1 precedes that of child node 2, child node 1 comes before child node 2 in the generated standby word. Combining the dependency syntactic relationship between the parent node and each child node, the generated standby word is determined to be "bond classification dimension".
Optionally, for a first word segment among the plurality of word segments, if there is only one second word segment having a dependency syntactic relationship with the first word segment, a standby word may be generated by combining the second word segment with the first word segment according to the dependency syntactic relationship between them.
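The following Python sketch illustrates step S401 using the "bond classification dimension" example above. The ordering rule (modifier children sorted by position and placed before the head) is a simplifying assumption for attributive relations; the patent combines parts of speech, positions, and dependency relations more generally, and the example positions are assumed.

```python
# Illustrative generation of standby words from a dependency subtree (step S401).
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Token:
    text: str
    pos: str        # part of speech, e.g. "n", "vn"
    position: int   # position of the word segment in the paragraph


def generate_standby_words(head: Token, children: list[Token]) -> list[str]:
    """head: the first word segment; children: second word segments that have a
    dependency syntactic relationship with it."""
    standby = []
    if len(children) == 1:
        # single second word segment: combine it directly with the first word segment
        standby.append(children[0].text + head.text)
    else:
        for pair in combinations(children, 2):            # pick two second word segments
            ordered = sorted(pair, key=lambda t: t.position)
            standby.append("".join(t.text for t in ordered) + head.text)
    return standby


# Example from the text: head "维度" (dimension), children "债券" (bond, n, position 1)
# and "分类" (classification, vn, position 2) -> "债券分类维度" (bond classification dimension).
print(generate_standby_words(Token("维度", "n", 3),
                             [Token("债券", "n", 1), Token("分类", "vn", 2)]))
```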
Step S402: determining the candidate probability of the standby word based on the positions of, and dependency syntactic relationships among, the word segments in the standby word.
This step draws on the idea of the 3-gram language model: the probability that a word occurs is related only to the two words that precede it. For ease of distinction, the word segments within a standby word are referred to in this application as fourth word segments. For a standby word, the distance coefficient f(poi)_m between any two fourth word segments depends on their positional distance: the larger the distance, the smaller the coefficient, where m denotes the m-th pair of fourth word segments in the standby word. For example, f(poi)_m = a^x, where 0 < a < 1 and x (x ≥ 1) is the absolute value of the positional distance between the two fourth word segments, so the farther apart the two fourth word segments are, the smaller f(poi)_m becomes.
The relationship coefficient f(relation)_m between two fourth word segments depends on whether the two fourth word segments are adjacent or have a dependency syntactic relationship, and its value range is {0, 1}. If the two fourth word segments are adjacent or have a dependency syntactic relationship, their relationship coefficient is 1. If the positions of the two fourth word segments are not adjacent and they have no dependency syntactic relationship, their relationship coefficient is 0.
If there are s pairs of fourth word segments in the standby word, the candidate probability of the standby word is determined from f(poi)f(relation), where f(poi)f(relation) = f(poi)_1*f(relation)_1 + f(poi)_2*f(relation)_2 + f(poi)_3*f(relation)_3 + … + f(poi)_s*f(relation)_s.
Step S403: if the candidate probability of the standby word is greater than the second probability threshold, the standby word is determined to be a candidate domain word.
After obtaining the candidate probability of each standby word, the electronic device can determine the standby words whose candidate probability is greater than or equal to the second probability threshold as candidate domain words. Standby words whose candidate probability is less than the second probability threshold are not used as candidate domain words.
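A minimal sketch of the candidate probability in steps S402 and S403 follows, under the stated assumptions: f(poi)_m = a^x with 0 < a < 1, f(relation)_m in {0, 1}, and the candidate probability taken as the sum over the s pairs of fourth word segments. The original formula may include additional normalization that is not reproduced here, and the second probability threshold value is hypothetical.

```python
# Sketch of the candidate probability for a standby word (steps S402-S403).
from itertools import combinations


def candidate_probability(tokens: list[tuple[str, int]],
                          dependency_edges: set[frozenset[int]],
                          a: float = 0.5) -> float:
    """tokens: (text, position) pairs for the fourth word segments of a standby word.
    dependency_edges: sets of positions that share a dependency syntactic relationship."""
    score = 0.0
    for (_w1, p1), (_w2, p2) in combinations(tokens, 2):   # the s pairs
        distance = abs(p1 - p2)                            # positional distance, >= 1
        f_poi = a ** distance                              # assumed form of f(poi)_m
        adjacent = distance == 1
        related = frozenset((p1, p2)) in dependency_edges
        f_relation = 1.0 if (adjacent or related) else 0.0
        score += f_poi * f_relation
    return score


def is_candidate_domain_word(tokens, dependency_edges,
                             second_prob_threshold: float = 0.3) -> bool:
    # threshold value is illustrative only
    return candidate_probability(tokens, dependency_edges) > second_prob_threshold
```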
Based on the domain word determining method provided in any one of the foregoing embodiments, in a possible implementation manner, in step S204, probability parameters of each candidate domain word are determined using a pre-trained domain word classification model.
In some examples, the electronic device may input each candidate domain word into a pre-trained domain word classification model that may derive probability parameters for each candidate domain word. The probability parameter of the candidate domain word may represent the probability that the candidate domain word is a domain word in the target document, that is, the probability that the candidate domain word is a target domain word of the target document.
In other examples, the electronic device may obtain the attribute parameters of the target document. The electronic device may input the attribute parameters of the target document and the candidate domain words into the domain word classification model, and the domain word classification model outputs the probability parameters of the candidate domain words. Optionally, the attribute parameter of the target document may include a service class to which the target document belongs, and/or a source type of the target document.
The domain word classification model may determine the word features, semantic features, and structural features of the candidate domain words, and calculates the probability parameter of each candidate domain word based on the weight parameter of each type of feature and the word features, semantic features, and structural features of the candidate domain word.
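One possible realization of this weighted combination is a weighted sum of the three feature scores squashed into a probability with a sigmoid, as sketched below. The patent does not fix the functional form or the weight values; both are assumptions for illustration.

```python
# Assumed weighted combination of word, semantic and structural features.
import math


def domain_word_probability(word_feat: float, semantic_feat: float, structural_feat: float,
                            w_word: float = 1.0, w_semantic: float = 1.0,
                            w_structural: float = 1.0, bias: float = 0.0) -> float:
    """Each *_feat is a scalar score already extracted for one candidate domain word."""
    score = (w_word * word_feat
             + w_semantic * semantic_feat
             + w_structural * structural_feat
             + bias)
    return 1.0 / (1.0 + math.exp(-score))   # probability parameter in (0, 1)
```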
In a possible implementation manner, in the process of constructing the classification model, the characteristics of the classification model mainly comprise three types of characteristics, namely word characteristics, semantic characteristics and structural characteristics. The characteristics of the classification model may also include attribute parameters of the document. The attribute parameters of the document may include the business class to which the document belongs and/or the source type of the document.
In the financial field, enterprises run different types of business, such as personal business, corporate business, and channel business, and the domain words of different businesses differ: a word may be a domain word in one business but not in another. Similarly, enterprises have different departments, such as deposit, loan, and personal banking departments, and the domain words in documents produced by different departments also differ; in this application, the source type of a document can characterize the department. Based on this phenomenon, introducing the attribute parameters of the document during the training of the classification model can improve the effect of determining the domain words of a document.
Optionally, the word features may include local features and/or global features, or may be determined from local features and global features. The local features may include term frequency-inverse document frequency (TF-IDF); the global features may include inverse document frequency (IDF), chi-square statistics, and the like.
Structural features may include document structure information, such as the title of the document or the position of the domain word within a paragraph of the document.
Semantic features are relatively common features and can be obtained with any existing method for determining semantic features.
When training the classification model, a training set prepared in advance is fed into the model, and the training process is then iterated through manual collection and feedback-based correction of the training set. Each training sample in the training set includes the business classification and source type of a document, candidate domain word labels, the title of the document containing the domain word, and the position of the domain word within its paragraph.
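A hedged sketch of such a training setup is shown below, with each training sample carrying the attribute parameters, word features, and structural features described above. The choice of scikit-learn, DictVectorizer, and logistic regression, as well as the sample field names, are assumptions for illustration; the patent does not name a specific model or library.

```python
# Illustrative training of a domain word classifier on the described sample fields.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression


def featurize(sample: dict) -> dict:
    """sample: dict with hypothetical keys reflecting the training-set description."""
    return {
        "business_class": sample["business_class"],              # attribute parameter
        "source_type": sample["source_type"],                    # attribute parameter
        "tf_idf": sample["tf_idf"],                               # local word feature
        "idf": sample["idf"],                                     # global word feature
        "in_title": int(sample["word"] in sample["title"]),       # structural feature
        "position_in_paragraph": sample["position_in_paragraph"], # structural feature
    }


def train_domain_word_classifier(samples: list[dict], labels: list[int]):
    """labels: 1 if the candidate is a domain word, else 0."""
    vectorizer = DictVectorizer()
    X = vectorizer.fit_transform([featurize(s) for s in samples])
    model = LogisticRegression(max_iter=1000)
    model.fit(X, labels)
    return vectorizer, model
```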
Based on the same inventive concept, an embodiment of the application further provides a device for determining domain words. Fig. 6 is a schematic structural diagram of the domain word determining apparatus 600, which may include:
An obtaining unit 601, configured to obtain a target document, where the target document includes one or more paragraphs;
The word segmentation processing unit 602 is configured to perform word segmentation processing on any paragraph to obtain a first word segmentation result;
A candidate domain word determining unit 603, configured to determine a plurality of candidate domain words based on the first word segmentation result;
a probability parameter determining unit 604, configured to determine a probability parameter of each candidate domain word by using a pre-trained domain word classification model;
The domain word determining unit 605 is configured to determine a target domain word from the plurality of candidate domain words based on the first probability threshold and the probability parameter of each candidate domain word.
In one possible implementation, the first word segmentation result includes a plurality of word segments, the dependency syntactic relationships between the word segments, and the positions of the word segments in the paragraph;
the candidate domain word determining unit 603 is specifically configured to:
for a first word segment among the plurality of word segments, if there are multiple second word segments having a dependency syntactic relationship with the first word segment, select two second word segments from them, and generate a standby word by combining the first word segment with the part of speech and position of each of the two second word segments;
determine a candidate probability of the standby word based on the positions of, and dependency syntactic relationships among, the word segments in the standby word;
and if the candidate probability of the standby word is greater than a second probability threshold, determine the standby word as a candidate domain word.
In one possible implementation manner, the word segmentation processing unit 602 is specifically configured to:
And performing word segmentation on any paragraph by using a word segmentation tool and a custom dictionary to obtain the first word segmentation result, wherein the custom dictionary is determined based on the target document.
In one possible implementation manner, the word segmentation processing unit 602 is specifically configured to:
traversing paragraphs in the target document, and performing word segmentation on each paragraph by using the word segmentation tool to obtain a second word segmentation result, wherein the second word segmentation result comprises the part of speech, the position and the dependency syntax relationship of each word;
Generating a plurality of phrases based on the second word segmentation result, each phrase comprising a plurality of third word segments;
generating a combination set corresponding to any phrase, wherein each element in the combination set comprises at least two third word segments of the phrase;
and for any combination set, determining an importance parameter for each element in the combination set, and adding a target element to the custom dictionary, wherein the target element is the element with the largest importance parameter among the elements.
In one possible implementation manner, the probability parameter determining unit 604 is specifically configured to:
acquiring attribute parameters of the target document;
And inputting the attribute parameters of the target document and the candidate domain words into the domain word classification model, and outputting the probability parameters of the candidate domain words by the domain word classification model.
In a possible implementation, the attribute parameters of the target document include the business classification to which the target document belongs, and/or the source type of the target document.
In a possible implementation, after the probability parameter determining unit 604 inputs the attribute parameters of the target document and the candidate domain words into the domain word classification model, the domain word classification model determines the word features, semantic features, and structural features of the candidate domain words, and calculates the probability parameter of each candidate domain word based on the weight parameter of each type of feature and the word features, semantic features, and structural features of the candidate domain word.
For convenience of description, the above parts are described as being divided into modules (or units) by function. Of course, when implementing the present application, the functions of each module (or unit) may be implemented in one or more pieces of software or hardware.
Having described the domain word determining method and apparatus of an exemplary embodiment of the present application, next, an electronic device according to another exemplary embodiment of the present application is described.
Those skilled in the art will appreciate that the various aspects of the application may be implemented as a system, a method, or a program product. Accordingly, aspects of the application may be embodied in the following forms: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein as a "circuit", "module", or "system".
Based on the same conception as the foregoing method embodiments, an embodiment of the application further provides an electronic device. The structure of the electronic device may be as shown in fig. 7; the electronic device is, for example, the server 102 in fig. 1. As shown in fig. 7, the electronic device in the embodiment of the present application includes at least one processor 701, and a memory 702 and a communication interface 703 connected to the at least one processor 701. The embodiment of the present application does not limit the specific connection medium between the processor 701 and the memory 702; in fig. 7, the processor 701 and the memory 702 are connected through the system bus 700 as an example. The system bus 700 is shown with a bold line in fig. 7, and the connections between other components are merely illustrative and not limiting. The system bus 700 may be divided into an address bus, a data bus, a control bus, and the like; for convenience of illustration it is represented by only one thick line in fig. 7, but this does not mean that there is only one bus or one type of bus.
In the embodiment of the present application, the memory 702 stores instructions executable by the at least one processor 701, and the at least one processor 701 can perform the steps included in the aforementioned domain word determining method by executing the instructions stored in the memory 702.
The processor 701 is the control center of the electronic device, and may connect various parts of the entire electronic device using various interfaces and lines, and may implement various functions of the electronic device by running or executing instructions stored in the memory 702 and invoking data stored in the memory 702. Optionally, the processor 701 may include one or more processing units, and may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, the user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It will be appreciated that the modem processor may not be integrated into the processor 701. In some embodiments, the processor 701 and the memory 702 may be implemented on the same chip; in some embodiments, they may also be implemented separately on separate chips.
The processor 701 may be a general purpose processor such as a Central Processing Unit (CPU), digital signal processor, application specific integrated circuit, field programmable gate array or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, and may implement or perform the methods, steps, and logic blocks disclosed in embodiments of the application. The general purpose processor may be a microprocessor or any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present application may be embodied directly in a hardware processor for execution, or in a combination of hardware and software modules in the processor for execution.
The memory 702 is a non-volatile computer-readable storage medium that can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 702 may include at least one type of storage medium, for example flash memory, hard disk, multimedia card, card memory, random access memory (RAM), static random access memory (SRAM), programmable read-only memory (PROM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), magnetic memory, magnetic disk, optical disk, and the like. The memory 702 may also be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 702 in the embodiments of the present application may also be a circuit or any other device capable of performing a storage function, for storing program instructions and/or data.
The communication interface 703 is a transmission interface that can be used for communication, and data can be received or transmitted through the communication interface 703.
The electronic device also includes a basic input/output system (I/O system) 704 that facilitates the transfer of information between the various components within the electronic device, and a mass storage device 708 for storing an operating system 705, application programs 706, and other program modules 707.
The basic input/output system 704 includes a display 709 for displaying information and an input device 710, such as a mouse, keyboard, etc., for a user to input information. In which a display 709 and an input device 710 are coupled to the processor 701 through a basic input/output system 704 coupled to the system bus 700. The basic input/output system 704 may also include an input/output controller for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input-output controller also provides output to a display screen, a printer, or other type of output device.
In particular, the mass storage device 708 is coupled to the processor 701 through a mass storage controller (not shown) coupled to the system bus 700. The mass storage device 708 and its associated computer-readable media provide non-volatile storage for the electronic device. That is, the mass storage device 708 may include a computer-readable medium (not shown), such as a hard disk or CD-ROM drive.
According to various embodiments of the present application, the electronic device may also operate through a remote computer connected to it over a network, such as the Internet. That is, the electronic device may be connected to the network 711 through the communication interface 703 coupled to the system bus 700, or the communication interface 703 may be used to connect to other types of networks or remote computer systems (not shown).
An embodiment of the application also provides a computer storage medium, where the computer-readable storage medium stores a computer program for causing a computer to execute the technical solution of the domain word determining method in the above embodiments.
Embodiments of the present application further provide a computer program product, comprising computer program code which, when run on a computer, causes the computer to implement the technical solution of the domain word determining method in the above embodiments.
Those skilled in the art will appreciate that all or part of the steps of the above method embodiments may be implemented by hardware related to program instructions; the program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product of the embodiments of the present application may take the form of a portable compact disc read-only memory (CD-ROM), include program code, and run on a computing device. However, the program product of the present application is not limited thereto; in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium, other than a readable storage medium, that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out the operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, via the Internet using an Internet service provider).
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (17)

1. A domain word determining method, comprising:
obtaining a target document, wherein the target document comprises one or more paragraphs;
performing word segmentation processing on any paragraph to obtain a first word segmentation result;
determining a plurality of candidate domain words based on the first word segmentation result;
determining probability parameters of each candidate domain word by using a pre-trained domain word classification model;
and determining the target domain word from the plurality of candidate domain words based on the first probability threshold and the probability parameter of each candidate domain word.
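By way of illustration only, the Python sketch below mirrors the flow of claim 1; it is not part of the claims. The callables segment, extract_candidates and score, together with the toy stand-ins in the usage example, are hypothetical placeholders for the word segmentation tool, the candidate generation step and the pre-trained domain word classification model.

```python
from typing import Callable, Iterable, List


def determine_domain_words(
    paragraphs: Iterable[str],
    segment: Callable[[str], List[str]],
    extract_candidates: Callable[[List[str]], List[str]],
    score: Callable[[str], float],
    first_probability_threshold: float = 0.8,
) -> List[str]:
    """Return the target domain words of a target document."""
    targets: List[str] = []
    for paragraph in paragraphs:                      # any paragraph of the target document
        tokens = segment(paragraph)                   # first word segmentation result
        for candidate in extract_candidates(tokens):  # plurality of candidate domain words
            probability = score(candidate)            # probability parameter of the candidate
            if probability > first_probability_threshold and candidate not in targets:
                targets.append(candidate)             # target domain word
    return targets


if __name__ == "__main__":
    # Toy stand-ins: whitespace segmentation, bigram candidates, length-based score.
    document = ["bond underwriting operating manual", "financial product risk disclosure book"]
    candidates = lambda tokens: [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    print(determine_domain_words(document, str.split, candidates, lambda w: len(w) / 30.0, 0.5))
```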
2. The method of claim 1, wherein the first word segmentation result comprises a plurality of words, dependency syntactic relationships between the words, and positions of the words in the any paragraph;
the determining a plurality of candidate domain words based on the first word segmentation result comprises:
for a first word in the plurality of words, if there are a plurality of second words having a dependency syntactic relationship with the first word, selecting two second words from the plurality of second words, and combining the two second words with the first word, according to the part of speech and the position of each of the two second words, to generate a standby word;
determining a candidate probability of the standby word based on the positions of the words in the standby word and the dependency syntactic relationships among them;
and if the candidate probability of the standby word is greater than a second probability threshold, determining the standby word as a candidate domain word.
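A minimal sketch of this candidate generation step, offered for illustration only: the Token layout, the part-of-speech filter on the second words and the adjacency-based candidate probability are assumptions, since claim 2 does not fix how the candidate probability is computed from the positions and dependency relations.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List


@dataclass
class Token:
    word: str
    pos: str        # part of speech
    position: int   # position of the word in the paragraph
    head: int       # position of the word this word depends on (-1 for the root)


def candidate_domain_words(tokens: List[Token],
                           second_probability_threshold: float = 0.5) -> List[str]:
    candidates: List[str] = []
    for first in tokens:
        # second words: words standing in a dependency relation with the first word
        # (restricting to nominal dependents is an assumed part-of-speech filter)
        seconds = [t for t in tokens if t.head == first.position and t.pos.startswith("n")]
        if len(seconds) < 2:
            continue
        for a, b in combinations(seconds, 2):          # select two second words
            # combine the two second words with the first word in textual order
            parts = sorted([first, a, b], key=lambda t: t.position)
            standby_word = "".join(t.word for t in parts)
            # assumed candidate probability: the closer the three words sit together
            # in the paragraph, the higher the score
            span = parts[-1].position - parts[0].position
            candidate_probability = 1.0 / (1.0 + max(span - 2, 0))
            if candidate_probability > second_probability_threshold:
                candidates.append(standby_word)
    return candidates
```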
3. The method of claim 1 or 2, wherein the performing word segmentation on any paragraph to obtain a first word segmentation result includes:
performing word segmentation on any paragraph by using a word segmentation tool and a custom dictionary to obtain the first word segmentation result, wherein the custom dictionary is determined based on the target document.
4. The method of claim 3, wherein determining the custom dictionary comprises:
traversing paragraphs in the target document, and performing word segmentation on each paragraph by using the word segmentation tool to obtain a second word segmentation result, wherein the second word segmentation result comprises the part of speech, the position and the dependency syntax relationship of each word;
generating a plurality of phrases based on the second word segmentation result, each phrase comprising a plurality of third word segments;
generating, for any phrase, a combination set corresponding to the phrase, wherein each element in the combination set comprises at least two third word segments of the phrase;
for any combination set, determining an importance degree parameter of each element in the combination set, and adding a target element to the custom dictionary, wherein the target element is the element with the largest importance degree parameter among the elements of the combination set.
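For illustration only, the following sketch walks through this dictionary construction; the sliding-window phrase grouping and the frequency-based importance degree parameter are assumptions, as claim 4 does not specify how phrases are formed from the second word segmentation result or how the importance degree parameter is measured.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Set


def build_custom_dictionary(paragraph_segments: Iterable[List[str]],
                            phrase_length: int = 3) -> Set[str]:
    """paragraph_segments: one list of third word segments per paragraph of the target document."""
    frequency: Counter = Counter()
    combination_sets: List[Set[str]] = []
    for segments in paragraph_segments:                      # traverse paragraphs
        # phrase: here simply a sliding window over consecutive word segments
        for start in range(max(len(segments) - 1, 0)):
            phrase = segments[start:start + phrase_length]
            # combination set: every element joins at least two third word segments
            combination_set = {
                "".join(combo)
                for n in range(2, len(phrase) + 1)
                for combo in combinations(phrase, n)
            }
            combination_sets.append(combination_set)
            frequency.update(combination_set)
    custom_dictionary: Set[str] = set()
    for combination_set in combination_sets:
        if not combination_set:
            continue
        # importance degree parameter: assumed here to be corpus frequency;
        # the element with the largest importance is the target element
        target_element = max(combination_set, key=lambda element: frequency[element])
        custom_dictionary.add(target_element)
    return custom_dictionary
```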
5. The method of claim 1, wherein determining probability parameters for each candidate domain word using a pre-trained domain word classification model comprises:
acquiring attribute parameters of the target document;
inputting the attribute parameters of the target document and the candidate domain words into the domain word classification model, and outputting, by the domain word classification model, the probability parameters of the candidate domain words.
6. The method of claim 5, wherein the attribute parameters of the target document include a business category to which the target document belongs and/or a source type of the target document.
7. The method of claim 5 or 6, wherein, after the inputting of the attribute parameters of the target document and the candidate domain words into the domain word classification model, the method further comprises:
the domain word classification model determines word features, semantic features and structural features of the candidate domain words;
the domain word classification model calculates the probability parameters of the candidate domain words based on a weight parameter of each type of feature and the word features, semantic features and structural features of the candidate domain words.
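This scoring step can be pictured, for illustration only, as the weighted combination below; the concrete features and the sigmoid mapping from the weighted sum to a probability parameter are assumptions rather than the claimed model itself.

```python
import math
from typing import Dict


def probability_parameter(word_features: Dict[str, float],
                          semantic_features: Dict[str, float],
                          structural_features: Dict[str, float],
                          weights: Dict[str, float]) -> float:
    """weights holds one weight parameter per type of feature."""
    weighted_sum = (
        weights["word"] * sum(word_features.values())
        + weights["semantic"] * sum(semantic_features.values())
        + weights["structural"] * sum(structural_features.values())
    )
    # squash the weighted sum into (0, 1) so it can be compared with the first probability threshold
    return 1.0 / (1.0 + math.exp(-weighted_sum))


# Example call with toy feature values for one candidate domain word.
print(probability_parameter(
    word_features={"length": 0.4, "term_frequency": 0.6},
    semantic_features={"embedding_similarity": 0.7},
    structural_features={"dependency_depth": 0.3},
    weights={"word": 0.5, "semantic": 1.0, "structural": 0.8},
))
```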
8. A domain word determining apparatus, comprising:
an acquisition unit, configured to acquire a target document, the target document comprising one or more paragraphs;
a word segmentation processing unit, configured to perform word segmentation processing on any paragraph to obtain a first word segmentation result;
a candidate domain word determining unit, configured to determine a plurality of candidate domain words based on the first word segmentation result;
a probability parameter determining unit, configured to determine probability parameters of each candidate domain word by using a pre-trained domain word classification model;
and a domain word determining unit, configured to determine a target domain word from the plurality of candidate domain words based on a first probability threshold and the probability parameter of each candidate domain word.
9. The apparatus of claim 8, wherein the first word segmentation result comprises a plurality of words, dependency syntactic relationships between the words, and positions of the words in the any paragraph;
the candidate domain word determining unit is specifically configured to:
for a first word in the plurality of words, if there are a plurality of second words having a dependency syntactic relationship with the first word, selecting two second words from the plurality of second words, and combining the two second words with the first word, according to the part of speech and the position of each of the two second words, to generate a standby word;
determining a candidate probability of the standby word based on the positions of the words in the standby word and the dependency syntactic relationships among them;
and if the candidate probability of the standby word is greater than a second probability threshold, determining the standby word as a candidate domain word.
10. The apparatus according to claim 8 or 9, wherein the word segmentation processing unit is specifically configured to:
performing word segmentation on any paragraph by using a word segmentation tool and a custom dictionary to obtain the first word segmentation result, wherein the custom dictionary is determined based on the target document.
11. The apparatus of claim 10, wherein the word segmentation processing unit is specifically configured to:
traversing paragraphs in the target document, and performing word segmentation on each paragraph by using the word segmentation tool to obtain a second word segmentation result, wherein the second word segmentation result comprises the part of speech, the position and the dependency syntax relationship of each word;
generating a plurality of phrases based on the second word segmentation result, each phrase comprising a plurality of third word segments;
generating, for any phrase, a combination set corresponding to the phrase, wherein each element in the combination set comprises at least two third word segments of the phrase;
for any combination set, determining an importance degree parameter of each element in the combination set, and adding a target element to the custom dictionary, wherein the target element is the element with the largest importance degree parameter among the elements of the combination set.
12. The apparatus of claim 8, wherein the probability parameter determining unit is specifically configured to:
acquiring attribute parameters of the target document;
inputting the attribute parameters of the target document and the candidate domain words into the domain word classification model, and outputting, by the domain word classification model, the probability parameters of the candidate domain words.
13. The apparatus of claim 12, wherein the attribute parameters of the target document include a business category to which the target document belongs and/or a source type of the target document.
14. The apparatus according to claim 12 or 13, wherein, after the attribute parameters of the target document and the candidate domain words are input into the domain word classification model, the domain word classification model in the probability parameter determining unit determines word features, semantic features and structural features of the candidate domain words; and the domain word classification model calculates the probability parameters of the candidate domain words based on a weight parameter of each type of feature and the word features, semantic features and structural features of the candidate domain words.
15. An electronic device, comprising:
a memory for storing program instructions;
a processor for invoking the program instructions stored in the memory, and performing, in accordance with the obtained program instructions, the steps included in the method according to any one of claims 1-7.
16. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program comprising program instructions which, when executed by a computer, cause the computer to perform the method of any of claims 1-7.
17. A computer program product, the computer program product comprising: computer program code which, when run on a computer, causes the computer to perform the method of any of the preceding claims 1-7.

Priority Applications (1)

Application Number: CN202410318832.1A; Priority Date: 2024-03-20; Filing Date: 2024-03-20; Title: Domain word determining method and device, electronic equipment and storage medium

Publications (1)

Publication Number: CN118052228A; Publication Date: 2024-05-17

Family

ID=91050362

Family Applications (1)

Application Number: CN202410318832.1A (pending); Priority Date: 2024-03-20; Filing Date: 2024-03-20; Title: Domain word determining method and device, electronic equipment and storage medium

Country Status (1)

Country: CN; Publication: CN118052228A

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination