CN115659968A - Professional term recognition method, device, computer equipment and storage medium

Info

Publication number: CN115659968A
Application number: CN202211300802.5A
Authority: CN
Inventors: 李少森; 罗捷; 黎珏强; 邱桂尧; 杜浩滔; 孙豪; 黄剑湘; 李�浩; 乔柱桥; 王宁; 陈图腾; 朱盛强; 王飞; 段春莹; 朱志俊; 李俊宇; 张哲�; 黄昌钰; 袁鑫; 朱燕青
Original assignee: Kunming Bureau of Extra High Voltage Power Transmission Co
Current assignee: Kunming Bureau of Extra High Voltage Power Transmission Co
Priority date: 2022-10-24
Filing date: 2022-10-24
Publication date: 2023-01-31

Abstract

The application relates to a professional term recognition method, apparatus, computer device, storage medium and computer program product. The method comprises: acquiring a document to be recognized, and splitting the text in the document to obtain at least one text word; for each text word, determining the weight of the current text word in the document based on the frequency of the current text word in the document, the first document number (the number of documents in the professional text library that contain the current text word) and the total number of documents in the professional text library; when the weight is smaller than a preset weight threshold, acquiring, for each text character in the current text word, a first probability that the current text character depends on the previous text character in the professional text library and a second probability that the current text character depends on the previous text character in a non-professional text library; and determining, based on the first probability and the second probability, whether the current text word is a professional term. The method can improve the efficiency of professional term recognition.

Description

Professional term recognition method, device, computer equipment and storage medium

Technical Field

The present application relates to the field of power technologies, and in particular to a professional term recognition method, apparatus, computer device, storage medium and computer program product.

Background

The field of extra-high voltage direct current transmission contains a large number of professional terms that remain to be recognized and mined from massive text data. Accurately recognized professional terms can be used in many settings, for example as retrieval keywords with which a network asset mapping engine checks whether sensitive information about a power system has leaked to the Internet. Accurate professional terms are the basis of unstructured data analysis and have a wide range of uses.

Current professional term recognition is usually rule-based: linguistic experts manually construct rule templates from features such as statistical information, punctuation marks, keywords, indicator words, direction words, position words and central words. Such methods rely mainly on pattern and string matching and mostly depend on building a knowledge base and a dictionary. The rules usually depend on the specific language, field and text style; writing them is time-consuming, it is difficult to cover all language types, and errors arise easily. Because linguistic experts must rewrite the rules for each different system, professional term recognition is inefficient.

Disclosure of Invention

In view of the low efficiency of conventional professional term recognition described above, it is necessary to provide a professional term recognition method, apparatus, computer device, computer-readable storage medium and computer program product capable of improving the efficiency of professional term recognition.

In a first aspect, the present application provides a professional term recognition method. The method comprises the following steps:

acquiring a document to be recognized, and splitting the text in the document to be recognized to obtain at least one text word, each text word comprising at least one text character;

for each text word, obtaining the frequency of the current text word in the document to be recognized, the first document number (the number of documents in the professional text library that contain the current text word) and the total number of documents in the professional text library; determining the weight of the current text word in the document to be recognized based on the frequency, the first document number and the total number of documents;

when the weight is smaller than a preset weight threshold, acquiring, for each text character in the current text word, a first probability that the current text character depends on the previous text character in the professional text library and a second probability that the current text character depends on the previous text character in a non-professional text library; and determining, based on the first probability and the second probability, whether the current text word is a professional term.

In one embodiment, splitting a text in a document to be recognized to obtain at least one text word includes:

performing sentence splitting on a text in a document to be identified to obtain at least one text sentence;

for each text sentence, acquiring the number of text characters in the current text sentence, and splitting the current text sentence into candidate words based on that number to obtain at least one text word.

In one embodiment, determining the weight of the current text word in the document to be identified based on the frequency, the first document number and the total number of documents comprises:

dividing the total number of documents by the first document number, and taking the logarithm of the obtained quotient as a first frequency;

multiplying the frequency by the first frequency, and determining the obtained product as the weight of the current text word in the document to be recognized.

In one embodiment, the obtaining of the frequency of the current text word in the document to be recognized, the first document number of the documents containing the current text word in the professional text library, and the total document number of the professional text library includes:

traversing the blacklisted words in a blacklist text library, and, if the current text word matches none of the blacklisted words, traversing the professional terms in the professional text library;

if the current text word matches none of the professional terms, acquiring the frequency of the current text word in the document to be recognized, the first document number of documents in the professional text library that contain the current text word, and the total number of documents in the professional text library.

In one embodiment, determining whether the current text word is a term of art based on the first probability and the second probability includes:

calculating a third probability of the current text word in the professional text library based on the first probability corresponding to each text character;

calculating a fourth probability of the current text word in the non-professional text library based on the second probability corresponding to each text character;

and if the third probability is greater than the fourth probability and the third probability is greater than or equal to the preset probability, determining that the current text word is the professional term.
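
As a rough illustration of this decision rule, the Python sketch below assumes that the third and fourth probabilities are the products of the per-character first and second probabilities. The text above does not state how they are combined, so that choice, the function name `is_professional_term` and the parameter names are all hypothetical.

```python
import math


def is_professional_term(first_probs, second_probs, preset_probability):
    """Apply the two-condition test described in the embodiment.

    first_probs / second_probs hold one probability per text character,
    measured in the professional and non-professional libraries.
    ASSUMPTION: the per-character values are combined by multiplication.
    """
    third = math.prod(first_probs)    # word-level score, professional library
    fourth = math.prod(second_probs)  # word-level score, non-professional library
    return third > fourth and third >= preset_probability


# Illustrative numbers only; they are not taken from the source.
decision = is_professional_term([0.9, 0.8], [0.2, 0.1], preset_probability=0.5)
```

With these made-up inputs the professional score is 0.72 and the non-professional score is 0.02, so both conditions hold.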

In one embodiment, the professional term recognition method further comprises:

when the current text word is determined to be a professional term, adding the current text word to the professional text library to obtain an updated professional text library.

In a second aspect, the present application also provides a professional term recognition apparatus. The apparatus comprises:

a data acquisition module, configured to acquire a document to be recognized and split the text in the document to be recognized to obtain at least one text word, each text word comprising at least one text character;

a weight acquisition module, configured to obtain, for each text word, the frequency of the current text word in the document to be recognized, the first document number of documents in the professional text library that contain the current text word, and the total number of documents in the professional text library, and to determine the weight of the current text word in the document to be recognized based on the frequency, the first document number and the total number of documents;

a determining module, configured to acquire, when the weight is smaller than a preset weight threshold and for each text character in the current text word, a first probability that the current text character depends on the previous text character in the professional text library and a second probability that the current text character depends on the previous text character in the non-professional text library, and to determine, based on the first probability and the second probability, whether the current text word is a professional term.

In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor implementing the following steps when executing the computer program:

when the weight is smaller than a preset weight threshold, acquiring, for each text character in the current text word, a first probability that the current text character depends on the previous text character in the professional text library and a second probability that the current text character depends on the previous text character in a non-professional text library; and determining, based on the first probability and the second probability, whether the current text word is a professional term.

In a fourth aspect, the present application further provides a computer-readable storage medium. The computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:

In a fifth aspect, the present application further provides a computer program product. The computer program product comprising a computer program which when executed by a processor performs the steps of:

With the above professional term recognition method, apparatus, computer device, storage medium and computer program product, the text in the document to be recognized is split to obtain at least one text word. For each text word, the weight of the current text word in the document to be recognized is determined based on the frequency of the current text word in that document, the first document number of documents in the professional text library that contain the current text word, and the total number of documents in the professional text library; the weight reflects the possibility that the current text word is a professional term in the document to be recognized. When the weight is smaller than the preset weight threshold, the possibility that the current text word is a professional term is high, and for each text character in the current text word a first probability that the current text character depends on the previous text character in the professional text library and a second probability that it depends on the previous text character in the non-professional text library are acquired. Because these probabilities are known for every text character in the current text word, whether the current text word is a professional term can be determined accurately from the first probability and the second probability, which improves the efficiency of professional term recognition.

Drawings

FIG. 1 is a diagram of the application environment of the professional term recognition method in one embodiment;

FIG. 2 is a flow diagram of the professional term recognition method in one embodiment;

FIG. 3 is a schematic diagram illustrating a sub-flow of S202 in one embodiment;

FIG. 4 is a schematic diagram illustrating a sub-flow of S204 in one embodiment;

FIG. 5 is a schematic view of a sub-flow of S204 in another embodiment;

FIG. 6 is a schematic sub-flow chart of S206 in one embodiment;

FIG. 7 is a block diagram showing the structure of a professional term recognition apparatus in one embodiment;

FIG. 8 is a diagram illustrating an internal structure of a computer device according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

The professional term recognition method provided by the embodiments of the application can be applied to the application environment shown in fig. 1, in which the terminal 102 communicates with the server 104 via a network. A data storage system may store the data that the server 104 needs to process. The data storage system may be integrated on the server 104, or may be located on the cloud or on another network server. The terminal 102 acquires a document to be recognized and splits the text in the document to be recognized to obtain at least one text word, each text word comprising at least one text character. For each text word, the terminal 102 obtains the frequency of the current text word in the document to be recognized, the first document number of documents in the professional text library that contain the current text word, and the total number of documents in the professional text library, and determines the weight of the current text word in the document to be recognized based on the frequency, the first document number and the total number of documents. When the weight is smaller than a preset weight threshold, the terminal 102 acquires, for each text character in the current text word, a first probability that the current text character depends on the previous text character in the professional text library and a second probability that the current text character depends on the previous text character in the non-professional text library, and determines, based on the first probability and the second probability, whether the current text word is a professional term. The terminal 102 may be, but is not limited to, a personal computer, a notebook computer, a smart phone, a tablet computer, an internet of things device or a portable wearable device. The internet of things device may be a smart speaker, a smart television, a smart air conditioner, a smart vehicle-mounted device, and the like. The portable wearable device may be a smart watch, a smart bracelet, a head-mounted device, and the like. The server 104 may be implemented as a stand-alone server or as a server cluster composed of multiple servers.

In one embodiment, as shown in fig. 2, a professional term recognition method is provided. The method is described by taking its application to the terminal 102 in fig. 1 as an example, and includes the following steps:

S202, acquiring a document to be recognized, and splitting the text in the document to be recognized to obtain at least one text word; each text word comprises at least one text character.

The document to be recognized is a document containing professional terms. A professional term is vocabulary commonly used in a particular professional field. In some embodiments, the professional terms may be common vocabulary in the electric power field; in other embodiments, they may be common vocabulary in the artificial intelligence field. A text word is composed of at least one text character. A text character may be a Chinese character, an English letter or a numeral. The text in the document to be recognized comprises text characters and punctuation marks. The text in the document to be recognized includes textual or non-textual content. Splitting the text means dividing the text characters and punctuation marks in the document to be recognized so as to obtain at least one text word.

The terminal obtains a document to be recognized and splits the text in the document to be recognized to obtain at least one text word. It should be noted that the professional term recognition method in the present application examines each text word in the document to be recognized in order to determine whether that text word is a professional term.

S204, for each text word, obtaining the frequency of the current text word in the document to be recognized, the first document number of documents in the professional text library that contain the current text word, and the total number of documents in the professional text library; and determining the weight of the current text word in the document to be recognized based on the frequency, the first document number and the total number of documents.

The frequency refers to the number of times that the current text word appears in the document to be recognized. The professional text library refers to a document library storing documents that contain a large number of professional terms. Illustratively, such documents include, but are not limited to, equipment specifications, technical specifications and operating procedures. Each document in the professional text library may or may not contain the current text word. The first document number is the number of documents in the professional text library that contain the current text word. The total number of documents is the total number of documents in the professional text library.

Specifically, for each text word, the terminal obtains the frequency of the current text word in the document to be recognized, the first document number of documents in the professional text library that contain the current text word, and the total number of documents in the professional text library.

Based on these three quantities, the terminal can determine the weight of the current text word in the document to be recognized. The weight reflects the likelihood that the current text word is a professional term in the document to be recognized. The higher the weight of the current text word in the document to be recognized, the larger the amount of information carried by the current text word, and the lower the possibility that the current text word is a professional term.

S206, when the weight is smaller than a preset weight threshold, acquiring, for each text character in the current text word, a first probability that the current text character depends on the previous text character in the professional text library and a second probability that the current text character depends on the previous text character in the non-professional text library; and determining, based on the first probability and the second probability, whether the current text word is a professional term.

When the weight of the current text word in the document to be recognized is smaller than the preset weight threshold, the probability that the current text word is a professional term is high. In that case, the terminal determines whether the current text word is a professional term based on each text character in the current text word. This helps improve the accuracy and efficiency of professional term recognition.

The previous text character refers to the text character that precedes the current text character in the current text word. The first probability refers to the probability that the current text character depends on the previous text character in the professional text library. The second probability refers to the probability that the current text character depends on the previous text character in the non-professional text library.

Specifically, for each text character in the current text word, the terminal acquires the previous text character of the current text character. The terminal searches the professional text library for a first number of times, which is the number of times the current text character occurs, and a second number of times, which is the number of times the previous text character occurs before the current text character. The terminal divides the second number by the first number, and the obtained quotient is the first probability that the current text character depends on the previous text character in the professional text library.

Likewise, the terminal searches the non-professional text library for a third number of times, which is the number of times the current text character occurs, and a fourth number of times, which is the number of times the previous text character occurs before the current text character. The terminal divides the fourth number by the third number, and the obtained quotient is the second probability that the current text character depends on the previous text character in the non-professional text library.

The first probability and the second probability reflect the probability that the current text character appears in dependence on the previous text character in the professional text library and in the non-professional text library, respectively. The terminal can determine whether the current text word is a professional term based on the first probability and the second probability.
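
The counting described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: it assumes that "before" means immediately before, that each library is available as a list of strings, and that a character absent from a library yields a probability of 0. The function name `dependency_probability` and the toy strings are hypothetical.

```python
from collections import Counter


def dependency_probability(library, previous_char, current_char):
    """Estimate how strongly current_char depends on previous_char.

    Returns (times previous_char directly precedes current_char)
    divided by (times current_char occurs) over all texts in library.
    """
    char_counts = Counter()
    pair_counts = Counter()
    for text in library:
        char_counts.update(text)                 # single-character counts
        pair_counts.update(zip(text, text[1:]))  # adjacent character pairs
    occurrences = char_counts[current_char]
    if occurrences == 0:
        return 0.0  # ASSUMPTION: unseen characters get probability 0
    return pair_counts[(previous_char, current_char)] / occurrences


# Toy libraries made of Latin letters, purely for illustration.
professional_library = ["abab", "cab"]
non_professional_library = ["ba", "cb"]
first_probability = dependency_probability(professional_library, "a", "b")
second_probability = dependency_probability(non_professional_library, "a", "b")
```

In the toy professional library every "b" follows an "a", so the first probability is 1.0, while in the toy non-professional library no "b" follows an "a", so the second probability is 0.0.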

With the above professional term recognition method, the text in the document to be recognized is split to obtain at least one text word. For each text word, the weight of the current text word in the document to be recognized is determined based on the frequency of the current text word in that document, the first document number of documents in the professional text library that contain the current text word, and the total number of documents in the professional text library; the weight reflects the possibility that the current text word is a professional term in the document to be recognized. When the weight is smaller than the preset weight threshold, the possibility that the current text word is a professional term is high, and for each text character in the current text word a first probability that the current text character depends on the previous text character in the professional text library and a second probability that it depends on the previous text character in the non-professional text library are acquired. Because these probabilities are known for every text character in the current text word, whether the current text word is a professional term can be determined accurately from the first probability and the second probability, which improves the efficiency of professional term recognition.

In one embodiment, as shown in fig. 3, splitting a text in a document to be recognized to obtain at least one text word includes:

S302, performing sentence splitting on the text in the document to be recognized to obtain at least one text sentence.

Sentence splitting refers to splitting the text at punctuation marks and carriage returns. The punctuation marks comprise at least one of a comma, a period, an exclamation point, or a space. The text in the document to be recognized is segmented into at least one text sentence at the punctuation marks and carriage returns. The terminal performs sentence splitting on the text in the document to be recognized to obtain at least one text sentence. In some embodiments, the terminal may determine, for each of the obtained at least one text sentence,

S304, for each text sentence, acquiring the number of text characters in the current text sentence, and splitting the current text sentence into candidate words based on that number to obtain at least one text word.

The number of characters refers to the number of text characters in a text sentence. For each text sentence, the terminal acquires the number of text characters in the current text sentence. In some embodiments, the terminal labels each text character in the current text sentence; for example, the first text character in the text sentence is labeled S, the last text character in the text sentence is labeled E, and the other text characters in the text sentence are labeled M. Labeling the text characters makes the starting and ending text characters of each text sentence clearly identifiable.

The terminal splits the current text sentence into candidate words based on the number of characters to obtain at least one text word. For example, the current text sentence is "low load reactive power optimization", which has 7 characters in the original Chinese. Let T denote the number of characters and N the length of a candidate word, with 2 ≤ N ≤ T. When N = 2, the terminal splits the current text sentence into at least one two-character word, which illustratively includes, but is not limited to, "low load", "no load", or "reactive". When N = 3, the terminal splits the current text sentence into at least one three-character word, illustratively including but not limited to "low load", "no load", "reactive optimization", or "work optimization". Likewise, when N = T, i.e. N = 7, the terminal splits the current text sentence into one seven-character word, which is illustratively "low load reactive optimization".
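
The enumeration of candidate words for every length N between 2 and T can be sketched in Python as below. The function name `candidate_words` is hypothetical, and the string "ABCDEFG" is only a stand-in for a seven-character sentence; it is assumed here that every contiguous run of N characters is taken as a candidate.

```python
def candidate_words(sentence, min_length=2):
    """Return every contiguous substring of length N, min_length <= N <= T."""
    total = len(sentence)  # T, the number of characters in the sentence
    candidates = []
    for n in range(min_length, total + 1):    # N = 2, 3, ..., T
        for start in range(total - n + 1):    # slide a window of width N
            candidates.append(sentence[start:start + n])
    return candidates


# Placeholder seven-character sentence, not taken from the source.
words = candidate_words("ABCDEFG")
```

For a seven-character sentence this yields 6 + 5 + 4 + 3 + 2 + 1 = 21 candidates, from "AB" up to the whole sentence "ABCDEFG".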

In this embodiment, the text in the document to be recognized is split into at least one text sentence, and each text sentence is split into candidate words based on the number of text characters it contains, so that at least one text word is obtained. The text in the document to be recognized can thus be split into at least one text word, and professional term recognition can be performed on each text word independently, which improves both the accuracy and the efficiency of professional term recognition.
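
The sentence-splitting step S302 might look like the following Python sketch. The delimiter set is an assumption: it covers the comma, period, exclamation point, space and line breaks named in this embodiment, in both ASCII and full-width forms, and the function name `split_sentences` is hypothetical.

```python
import re

# ASSUMED delimiters: ASCII comma/period/exclamation point, any whitespace
# (spaces, carriage returns, line feeds), and the full-width comma (U+FF0C),
# ideographic full stop (U+3002) and full-width exclamation mark (U+FF01).
_DELIMITERS = re.compile(r"[,.!\s\uff0c\u3002\uff01]+")


def split_sentences(text):
    """Split text at the delimiters and drop empty fragments."""
    return [part for part in _DELIMITERS.split(text) if part]


# Placeholder text, not taken from the source.
sentences = split_sentences("abc,def.ghi\r\njkl!")
```

Each resulting fragment would then be passed to the candidate-word step S304.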

In one embodiment, as shown in fig. 4, determining the weight of the current text word in the document to be recognized based on the frequency, the first document number and the total number of documents includes:

S402, dividing the total number of documents by the first document number, and taking the logarithm of the obtained quotient as a first frequency.

The terminal divides the total number of documents in the professional text library by the first document number, i.e. the number of documents in the professional text library that contain the current text word, takes the logarithm of the obtained quotient, and uses the result as the first frequency. The first frequency reflects how well documents containing the current text word are distinguished from other documents.

S404, multiplying the frequency by the first frequency, and determining the obtained product as the weight of the current text word in the document to be recognized.

The terminal multiplies the frequency of the current text word in the document to be recognized by the first frequency, and the obtained product is used as the weight of the current text word in the document to be recognized. The frequency of the current text word in the document to be recognized reflects the importance of the current text word to the document to be recognized. Illustratively, if the weight of the current text word in the document to be recognized is greater than that of another text word, the importance of the current text word to the document to be recognized is higher than that of the other text word.
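
Steps S402 and S404 can be sketched in Python as follows, under the assumption that the first frequency is the natural logarithm of the quotient, as in the usual inverse document frequency. The function name `term_weight` and the numbers in the usage line are hypothetical, and the source does not say how a first document number of zero is handled, so this sketch simply rejects it.

```python
import math


def term_weight(frequency, first_document_number, total_documents):
    """Weight = frequency * log(total_documents / first_document_number)."""
    if first_document_number <= 0:
        # ASSUMPTION: the source does not cover words found in no document.
        raise ValueError("first_document_number must be positive")
    first_frequency = math.log(total_documents / first_document_number)  # S402
    return frequency * first_frequency                                   # S404


# Illustrative numbers only: a word seen 3 times in the document and
# contained in 10 of the 1000 documents in the professional text library.
weight = term_weight(3, 10, 1000)
```

A word that appears in every document of the library gets a first frequency of log(1) = 0 and therefore a weight of 0, whatever its frequency in the document to be recognized.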

From the perspective of the entire professional text library, if documents containing the current text word occur frequently in the professional text library, the current text word contributes little to distinguishing the document to be recognized from the other documents in the library. Conversely, if documents containing the current text word are rare in the professional text library while the current text word occurs frequently in the document to be recognized, the current text word should be an important word that reflects the content of the document to be recognized. Therefore, when the frequency of the current text word in the document to be recognized is high but the first frequency of the document containing the current text word in the professional text library is low, the probability that the current text word is a professional term is high.

In this embodiment, the total number of documents is divided by the first document number, the logarithm of the obtained quotient is used as the first frequency, the frequency is multiplied by the first frequency, and the obtained product is determined as the weight of the current text word in the document to be recognized. The weight obtained in this way reflects the possibility that the current text word is a professional term in the document to be recognized, and recognizing professional terms based on the weight can improve both the accuracy and the efficiency of professional term recognition.

In one embodiment, as shown in fig. 5, acquiring the frequency of the current text word in the document to be recognized, the first document number of documents in the professional text library containing the current text word, and the total number of documents in the professional text library includes:

S502, traversing the blacklisted words in the blacklist text library, and, if the current text word matches none of the blacklisted words, traversing the professional terms in the professional text library.

The blacklist text library refers to a text library storing blacklisted words. Blacklisted words are non-professional terms that do not need to be recognized, for example "we", "you", or "he".

The terminal traverses the blacklisted words in the blacklist text library and compares the current text word with each of them. If the current text word matches any blacklisted word, the terminal discards the current text word without further processing and takes the next text word in the current text sentence as the current text word. If the current text word matches none of the blacklisted words, it is not a blacklisted word and is more likely to be a professional term; the terminal then traverses the professional terms in the professional text library and compares the current text word with each of them.

S504, if the current text word matches none of the professional terms, obtaining the frequency of the current text word in the document to be recognized, the first document number of documents in the professional text library that contain the current text word, and the total number of documents in the professional text library.

If the terminal determines that the current text word matches any professional term, the current text word is a professional term already stored in the professional text library and needs no further analysis; the terminal discards it and takes the next text word in the current text sentence as the current text word. If the terminal determines that the current text word matches none of the professional terms, the terminal obtains the frequency of the current text word in the document to be recognized, the first document number of documents in the professional text library that contain the current text word, and the total number of documents in the professional text library.

In this embodiment, the blacklist words in the blacklist text library are traversed; if the current text word is inconsistent with every blacklist word, the professional terms in the professional text library are traversed; and if the current text word is inconsistent with every professional term, the frequency of the current text word in the document to be recognized, the first document number of documents in the professional text library that contain the current text word, and the total document number of the professional text library are obtained. Excluding blacklist words improves the accuracy of professional term recognition. Excluding professional terms already known to the professional text library means that only text words not yet in the library proceed to the next recognition step, so that new professional terms unknown to the library can be recognized, which helps improve both the accuracy and the efficiency of professional term recognition.
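The two filtering steps S502 and S504 can be sketched as follows. This is a minimal illustration rather than the implementation described in the application: the function name `select_candidates` and its parameters are hypothetical, and set membership tests are used in place of the explicit traversal described above (the outcome is the same, only the lookup technique differs).

```python
from typing import Iterable, List, Set


def select_candidates(text_words: Iterable[str],
                      blacklist: Set[str],
                      known_terms: Set[str]) -> List[str]:
    """Return the text words that still need weight calculation.

    Hypothetical helper: a word is dropped if it is a blacklist word (S502)
    or an already stored professional term (S504).
    """
    candidates = []
    for word in text_words:
        if word in blacklist:
            # S502: consistent with a blacklist word, discard and move on
            continue
        if word in known_terms:
            # S504: already a known professional term, no further analysis
            continue
        # Neither blacklisted nor known: frequency and document counts
        # would be obtained for this word in the next step.
        candidates.append(word)
    return candidates
```

For instance, with a blacklist containing "we" and a professional text library already containing "reactive power", only words outside both collections are passed on.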

In one embodiment, as shown in fig. 6, determining whether the current text word is a term of art based on the first probability and the second probability includes:

S602, calculating a third probability of the current text word in the professional text library based on the first probability corresponding to each text character.

The terminal multiplies together the first probabilities corresponding to the text characters in the current text word, and the resulting product is taken as the third probability of the current text word in the professional text library.

S604, calculating a fourth probability of the current text word in the non-professional text library based on the second probability corresponding to each text character.

The terminal multiplies together the second probabilities corresponding to the text characters in the current text word, and the resulting product is taken as the fourth probability of the current text word in the non-professional text library.

S606, if the third probability is greater than the fourth probability and the third probability is greater than or equal to the preset probability, determining that the current text word is a professional term.

The third probability being greater than the fourth probability indicates that the current text word is more probable in the professional text library than in the non-professional text library, so it is likely to be a professional term. The third probability being greater than or equal to the preset probability indicates that the probability of the current text word in the professional text library reaches the preset level, which likewise suggests that it is a professional term. The terminal determines that the current text word is a professional term only when both conditions hold, which safeguards the accuracy of the determination.

In this embodiment, a third probability of the current text word in the professional text library is calculated based on the first probability corresponding to each text character, a fourth probability of the current text word in the non-professional text library is calculated based on the second probability corresponding to each text character, and the current text word is determined to be a professional term only when the third probability is both greater than the fourth probability and greater than or equal to the preset probability, which can improve the accuracy of professional term recognition. In addition, the third and fourth probabilities are obtained from the probability that each text character depends only on the immediately preceding text character. Compared with calculating them from probabilities that depend on all of the text characters preceding the current one, this improves calculation efficiency and thus the recognition efficiency of professional terms.
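The decision rule of S602 to S606 can be sketched as below. The function name `is_professional_term` and its parameter names are assumptions made for illustration; the per-unit first and second probabilities are taken as given inputs, since how they are obtained is outside this step.

```python
from math import prod
from typing import Sequence


def is_professional_term(first_probs: Sequence[float],
                         second_probs: Sequence[float],
                         preset_probability: float) -> bool:
    """Apply the two conditions of S606 (hypothetical helper).

    first_probs:  first probabilities (professional text library)
    second_probs: second probabilities (non-professional text library)
    """
    third = prod(first_probs)    # S602: product of the first probabilities
    fourth = prod(second_probs)  # S604: product of the second probabilities
    # S606: both conditions must hold at the same time
    return third > fourth and third >= preset_probability
```

A word whose product in the professional library is higher than in the non-professional library is still rejected if that product falls below the preset probability.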

In one embodiment, the professional term recognition method further comprises: in the case that the current text word is determined to be a professional term, adding the current text word to the professional text library to obtain an updated professional text library.

When the terminal determines that the current text word is a professional term, it adds the current text word to the professional text library to obtain an updated professional text library.

In this embodiment, the updated professional text library is obtained by adding the current text word to the professional text library when the current text word is determined to be a professional term. Adding the current text word to the library as a new professional term allows subsequent text words to be compared against it during later recognition, so that text words that are already known professional terms are excluded, which improves the accuracy and efficiency of professional term recognition.

To explain the professional term recognition method of the present embodiment and its effect in detail, the following description is given with reference to a detailed embodiment:

A document to be identified is acquired, and the text in the document is split into sentences to obtain at least one text sentence. For each text sentence, the number of text characters in the current text sentence is acquired, and the current text sentence is split into words based on that number to obtain at least one text word. Each text word comprises at least one text character. For example, the current text sentence is "low load reactive power optimization", whose number of characters (in the original Chinese) is 7. With the number of characters denoted by T and the word length N satisfying 2 ≤ N ≤ T, when N = 2 the terminal splits the current text sentence into at least one two-character word, illustratively including but not limited to "low load", "no load", or "reactive". When N = 3, the terminal splits the current text sentence into at least one three-character word, illustratively including but not limited to "low load", "no load", "reactive optimization", or "work optimization". Likewise, when N = T, i.e. N = 7, the terminal splits the current text sentence into one seven-character word, illustratively "low load reactive optimization".
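The splitting step can be sketched as follows, assuming that every contiguous run of N characters (a sliding window) is taken as a candidate text word for each N from 2 to T. The source gives only examples of the resulting words, so the exhaustive sliding window and the function name `split_text_words` are assumptions.

```python
from typing import List


def split_text_words(sentence: str, min_len: int = 2) -> List[str]:
    """Split a sentence into all contiguous character runs of length N,
    for min_len <= N <= T, where T is the number of characters."""
    total = len(sentence)  # T in the description
    words = []
    for n in range(min_len, total + 1):          # N = 2 .. T
        for start in range(0, total - n + 1):    # slide the window
            words.append(sentence[start:start + n])
    return words
```

Under this assumption a seven-character sentence yields 6 + 5 + 4 + 3 + 2 + 1 = 21 candidate text words, the last of which is the whole sentence.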

Traversing blacklist vocabularies in a blacklist text library aiming at each text word, and traversing professional terms in a professional text library if the current text word is inconsistent with each blacklist vocabulary; and if the current text word is inconsistent with each professional term, acquiring the frequency of the current text word in the document to be identified, the number of first documents containing the current text word in the documents in the professional text library and the total number of the documents in the professional text library.

The weight of the current text word in the document to be identified is determined based on the frequency, the first document number and the total document number. Specifically, the total document number is divided by the first document number, and the logarithm of the resulting quotient is taken as a first frequency; the frequency is multiplied by the first frequency, and the resulting product is determined as the weight of the current text word in the document to be recognized. Denoting the first frequency by idf, it is calculated as: idf = lg(n/ni), where n is the total document number of the professional text library and ni is the first document number of documents in the professional text library that contain the current text word. Denoting the weight of the current text word in the document to be recognized by tf-idf, the weight is calculated as: tf-idf = tf × idf, where tf is the frequency of the current text word in the document to be recognized. For example, if the total document number of the professional text library is 10000, the first document number of documents containing the current text word is 2000, and the frequency of the current text word in the document to be recognized is 20, then the weight of the current text word in the document to be recognized is: tf-idf = 20 × lg(10000/2000) = 13.98.
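The weight calculation can be checked with a short sketch. The function name `tf_idf_weight` and its parameter names are hypothetical; `lg` is read as the base-10 logarithm, which is consistent with the worked value 13.98 above.

```python
import math


def tf_idf_weight(tf: float, total_docs: int, docs_with_word: int) -> float:
    """Weight of a text word: tf multiplied by lg(n / ni).

    total_docs     -> n, total document number of the professional text library
    docs_with_word -> ni, first document number (documents containing the word)
    """
    if total_docs <= 0 or docs_with_word <= 0:
        # The quotient n / ni is undefined when ni is zero; how that case is
        # handled is not stated in the source, so it is rejected here.
        raise ValueError("document counts must be positive")
    idf = math.log10(total_docs / docs_with_word)  # first frequency
    return tf * idf


print(round(tf_idf_weight(20, 10000, 2000), 2))  # 13.98
```

A word that appears in every document of the professional text library gets a first frequency of lg(1) = 0 and therefore a weight of 0, whatever its frequency in the document to be recognized.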

In the case that the weight is smaller than a preset weight threshold, for each text character in the current text word, a first probability that the current text character depends on the previous text character in the professional text library and a second probability that the current text character depends on the previous text character in the non-professional text library are acquired. A third probability of the current text word in the professional text library is calculated based on the first probability corresponding to each text character; a fourth probability of the current text word in the non-professional text library is calculated based on the second probability corresponding to each text character; and if the third probability is greater than the fourth probability and the third probability is greater than or equal to the preset probability, the current text word is determined to be a professional term.

In the above professional term recognition method, the text in the document to be recognized is split to obtain at least one text word, and for each text word the weight of the current text word in the document to be recognized is determined based on the frequency of the current text word in that document, the first document number of documents in the professional text library that contain the current text word, and the total document number of the professional text library. The weight reflects the likelihood that the current text word is a professional term in the document to be recognized. In the case that the weight is smaller than the preset weight threshold, the current text word is likely to be a professional term; for each text character in the current text word, a first probability that the current text character depends on the previous text character in the professional text library and a second probability that it depends on the previous text character in the non-professional text library are then acquired. Because these dependency probabilities are known for every text character of the current text word in both libraries, whether the current text word is a professional term can be determined accurately from the first probability and the second probability, which improves the recognition efficiency of professional terms.
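The passage above does not state how the first and second probabilities are obtained from a text library. One common choice, assumed here and not confirmed by the source, is a count-based bigram estimate in which the probability of a unit given its predecessor is the number of times the pair occurs divided by the number of times the predecessor occurs as a left-hand context. The sketch treats each unit as a single character and skips the first character of the word, which has no predecessor; the function name `bigram_probabilities` is hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def bigram_probabilities(word: str, library: Iterable[str]) -> List[float]:
    """Estimate P(character | previous character) for each character of
    `word` after the first, from the texts in `library` (assumed estimator)."""
    contexts = Counter()  # how often a character appears with a successor
    pairs = Counter()     # how often an adjacent character pair appears
    for text in library:
        contexts.update(text[:-1])
        pairs.update(zip(text, text[1:]))
    probs = []
    for prev, cur in zip(word, word[1:]):
        seen = contexts[prev]
        # Unseen context: probability 0.0 (no smoothing in this sketch)
        probs.append(pairs[(prev, cur)] / seen if seen else 0.0)
    return probs
```

Calling the function once with the professional text library and once with the non-professional text library would give the two probability lists whose products are compared in the final decision.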

It should be understood that, although the steps in the flowcharts of the above embodiments are displayed sequentially as indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

Based on the same inventive concept, the embodiments of the present application further provide a professional term recognition apparatus for implementing the above professional term recognition method. The solution provided by the apparatus is similar to that described for the method, so for the specific limitations in the one or more apparatus embodiments below, reference may be made to the limitations of the professional term recognition method above, which are not repeated here.

In one embodiment, as shown in fig. 7, there is provided a term of expertise recognition apparatus 100, including: a data acquisition module 120, a weight acquisition module 140, and a determination module 160, wherein:

The data acquisition module 120 is configured to acquire a document to be identified, and split the text in the document to be identified to obtain at least one text word; each text word comprises at least one text character.

The weight obtaining module 140 is configured to obtain, for each text word, the frequency of the current text word in the document to be identified, the first document number of documents in the professional text library that contain the current text word, and the total document number of the professional text library; and to determine the weight of the current text word in the document to be identified based on the frequency, the first document number and the total document number.

The determining module 160 is configured to, in the case that the weight is smaller than a preset weight threshold, obtain, for each text character in the current text word, a first probability that the current text character depends on the previous text character in the professional text library and a second probability that the current text character depends on the previous text character in the non-professional text library; and to determine, based on the first probability and the second probability, whether the current text word is a professional term.

The professional term recognition apparatus splits the text in the document to be recognized to obtain at least one text word, and for each text word determines the weight of the current text word in the document to be recognized based on the frequency of the current text word in that document, the first document number of documents in the professional text library that contain the current text word, and the total document number of the professional text library. The weight reflects the likelihood that the current text word is a professional term in the document to be recognized. In the case that the weight is smaller than the preset weight threshold, the current text word is likely to be a professional term, and whether it is a professional term is determined, for each text character in the current text word, based on a first probability that the current text character depends on the previous text character in the professional text library and a second probability that it depends on the previous text character in the non-professional text library.

In an embodiment, in splitting the text in the document to be recognized to obtain at least one text word, the data obtaining module 120 is further configured to: performing sentence splitting on a text in a document to be identified to obtain at least one text sentence; and aiming at each text sentence, acquiring the number of characters of the text characters in the current text sentence, and splitting the vocabulary of the current text sentence based on the number of the characters to obtain at least one text word.

In one embodiment, in determining the weight of the current text word in the document to be identified based on the frequency, the first document number and the total document number, the weight obtaining module 140 is further configured to: divide the total document number by the first document number, and take the logarithm of the resulting quotient as a first frequency; and multiply the frequency by the first frequency, and determine the resulting product as the weight of the current text word in the document to be recognized.

In one embodiment, in obtaining the frequency of the current text word in the document to be recognized, the first document number of documents in the professional text base containing the current text word, and the total number of documents in the professional text base, the weight obtaining module 140 is further configured to: traversing blacklist vocabularies in a blacklist text base, and traversing professional terms in a professional text base if the current text word is inconsistent with each blacklist vocabulary; and if the current text word is inconsistent with each professional term, acquiring the frequency of the current text word in the document to be identified, the number of first documents containing the current text word in the documents in the professional text library and the total number of the documents in the professional text library.

In one embodiment, in determining whether the current text word is a professional term based on the first probability and the second probability, the determining module 160 is further configured to: calculate a third probability of the current text word in the professional text library based on the first probability corresponding to each text character; calculate a fourth probability of the current text word in the non-professional text library based on the second probability corresponding to each text character; and if the third probability is greater than the fourth probability and the third probability is greater than or equal to the preset probability, determine that the current text word is a professional term.

In one embodiment, the professional term recognition apparatus 100 is further configured to: in the case that the current text word is determined to be a professional term, add the current text word to the professional text library to obtain an updated professional text library.

The modules in the above professional term recognition apparatus may be implemented wholly or partially by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.

In one embodiment, a computer device is provided, which may be a terminal, and whose internal structure may be as shown in fig. 8. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input device. The processor, the memory and the input/output interface are connected by a system bus, and the communication interface, the display unit and the input device are connected to the system bus through the input/output interface. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The input/output interface of the computer device is used for exchanging information between the processor and external devices. The communication interface of the computer device is used for wired or wireless communication with an external terminal, and the wireless communication may be implemented through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program, when executed by the processor, implements a professional term recognition method.

Those skilled in the art will appreciate that the architecture shown in fig. 8 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer devices to which the solution applies; a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.

In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:

acquiring a document to be identified, and splitting the text in the document to be identified to obtain at least one text word, each text word comprising at least one text character; for each text word, acquiring the frequency of the current text word in the document to be identified, the first document number of documents in the professional text library that contain the current text word, and the total document number of the professional text library; determining the weight of the current text word in the document to be identified based on the frequency, the first document number and the total document number; in the case that the weight is smaller than a preset weight threshold, acquiring, for each text character in the current text word, a first probability that the current text character depends on the previous text character in the professional text library and a second probability that the current text character depends on the previous text character in a non-professional text library; and determining, based on the first probability and the second probability, whether the current text word is a professional term.

In one embodiment, the processor, when executing the computer program, further performs the steps of:

carrying out sentence splitting on a text in a document to be identified to obtain at least one text sentence; and aiming at each text sentence, acquiring the number of characters of the text characters in the current text sentence, and splitting the vocabulary of the current text sentence based on the number of the characters to obtain at least one text word.

dividing the total document number by the first document number, and taking the logarithm of the resulting quotient as a first frequency; and multiplying the frequency by the first frequency, and determining the resulting product as the weight of the current text word in the document to be recognized.

In one embodiment, the processor when executing the computer program further performs the steps of:

traversing blacklist vocabularies in the blacklist text library, and traversing professional terms in the professional text library if the current text words are inconsistent with each blacklist vocabulary; and if the current text word is inconsistent with each professional term, acquiring the frequency of the current text word in the document to be identified, the number of first documents containing the current text word in the documents in the professional text library and the total number of the documents in the professional text library.

calculating a third probability of the current text word in the professional text library based on the first probability corresponding to each text character; calculating a fourth probability of the current text word in the non-professional text library based on the second probability corresponding to each text character; and if the third probability is greater than the fourth probability and the third probability is greater than or equal to the preset probability, determining that the current text word is a professional term.

In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:

acquiring a document to be identified, and splitting the text in the document to be identified to obtain at least one text word, each text word comprising at least one text character; for each text word, acquiring the frequency of the current text word in the document to be identified, the first document number of documents in the professional text library that contain the current text word, and the total document number of the professional text library; determining the weight of the current text word in the document to be identified based on the frequency, the first document number and the total document number; in the case that the weight is smaller than a preset weight threshold, acquiring, for each text character in the current text word, a first probability that the current text character depends on the previous text character in the professional text library and a second probability that the current text character depends on the previous text character in a non-professional text library; and determining, based on the first probability and the second probability, whether the current text word is a professional term.

In one embodiment, the computer program when executed by the processor further performs the steps of:

performing sentence splitting on a text in a document to be identified to obtain at least one text sentence; and aiming at each text sentence, acquiring the number of characters of the text characters in the current text sentence, and splitting the vocabulary of the current text sentence based on the number of the characters to obtain at least one text word.

In one embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, performs the steps of:

acquiring a document to be identified, and splitting the text in the document to be identified to obtain at least one text word, each text word comprising at least one text character; for each text word, acquiring the frequency of the current text word in the document to be identified, the first document number of documents in the professional text library that contain the current text word, and the total document number of the professional text library; determining the weight of the current text word in the document to be identified based on the frequency, the first document number and the total document number; in the case that the weight is smaller than a preset weight threshold, acquiring, for each text character in the current text word, a first probability that the current text character depends on the previous text character in the professional text library and a second probability that the current text character depends on the previous text character in a non-professional text library; and determining, based on the first probability and the second probability, whether the current text word is a professional term.

calculating a third probability of the current text word in the professional text library based on the first probability corresponding to each text character; calculating a fourth probability of the current text word in the non-professional text library based on the second probability corresponding to each text character; and if the third probability is greater than the fourth probability and the third probability is greater than or equal to the preset probability, determining that the current text word is a professional term.

It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, displayed data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the relevant laws and regulations and standards of the relevant country and region.

It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include a Read-Only Memory (ROM), a magnetic tape, a floppy disk, a flash memory, an optical memory, a high-density embedded non-volatile memory, a Resistive Random Access Memory (ReRAM), a Magnetic Random Access Memory (MRAM), a Ferroelectric Random Access Memory (FRAM), a Phase Change Memory (PCM), a graphene memory, and the like. Volatile memory may include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of relational and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database. The processors referred to in the embodiments provided herein may be, without limitation, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, or data processing logic devices based on quantum computing.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, any combination of them should be considered within the scope of this specification as long as it involves no contradiction.

The above embodiments express only several implementations of the present application, and although their description is relatively specific and detailed, they should not be construed as limiting the scope of the present application. It should be noted that a person skilled in the art may make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the appended claims.

Claims

1. A method for identifying a term of art, the method comprising:

for each text word, obtaining the frequency of the current text word in the document to be identified, the first document number of the document containing the current text word in a professional text library and the total document number of the professional text library; determining the weight of the current text word in the document to be identified based on the frequency, the first document number and the total document number;

in a case where the weight is less than a preset weight threshold, acquiring, for each text word in the current text word, a first probability that the current text word depends on the previous text word in the professional text library and a second probability that the current text word depends on the previous text word in a non-professional text library; and determining whether the current text word is a professional term based on the first probability and the second probability.

2. The method according to claim 1, wherein splitting the text in the document to be identified to obtain at least one text word comprises:

performing sentence splitting on the text in the document to be identified to obtain at least one text sentence;

3. The method of claim 1, wherein the determining a weight of the current text word in the document to be identified based on the frequency, the first number of documents, and the total number of documents comprises:

and multiplying the frequency by the first frequency, and determining the resulting product as the weight of the current text word in the document to be identified.

4. The method of claim 1, wherein the obtaining the frequency of the current text word in the document to be recognized, the first document number of documents in a professional text library containing the current text word, and the total document number of the professional text library comprises:

traversing the blacklist vocabularies in a blacklist text library, and, if the current text word is inconsistent with every blacklist vocabulary, traversing the professional terms in a professional term library;

5. The method of claim 1, wherein determining whether the current text word is a term of art based on the first probability and the second probability comprises:

calculating a fourth probability of the current text word in a non-professional text library based on the second probability corresponding to each text word;

and if the third probability is greater than the fourth probability and the third probability is greater than or equal to a preset probability, determining that the current text word is a professional term.

6. The method according to any one of claims 1 to 5, further comprising:

and under the condition that the current text word is determined to be a professional term, adding the current text word to the professional text library to obtain an updated professional text library.

7. A term-of-art recognition apparatus, the apparatus comprising:

a weight acquisition module, configured to obtain, for each text word, the frequency of the current text word in the document to be identified, a first document number of documents containing the current text word in a professional text library, and the total document number of the professional text library; and to determine the weight of the current text word in the document to be identified based on the frequency, the first document number and the total document number;

a determining module, configured to acquire, in a case where the weight is less than a preset weight threshold and for each text word in the current text word, a first probability that the current text word depends on the previous text word in the professional text library and a second probability that the current text word depends on the previous text word in a non-professional text library; and to determine whether the current text word is a professional term based on the first probability and the second probability.

8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 6.

9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.

10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
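
For illustration only, the weighting and probability-comparison steps recited in claims 1, 3 and 5 might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation. All function and parameter names are hypothetical. The "first frequency" is assumed to be a smoothed inverse document frequency. The units inside a text word are assumed to be single characters. The "third" and "fourth" probabilities are assumed to be the product of the conditional probabilities in the professional and non-professional libraries respectively. The claims do not say what happens when the weight reaches the threshold, so the sketch returns `None` in that case.

```python
import math
from collections import Counter


def weight_in_document(word, doc_words, library_docs):
    """Frequency in the document times an IDF-style factor (assumed form)."""
    tf = doc_words.count(word) / len(doc_words)
    # First document number: professional-library documents containing the word.
    first_doc_number = sum(1 for doc in library_docs if word in doc)
    total_doc_number = len(library_docs)
    # Assumption: smoothed IDF, which stays non-negative and avoids division by zero.
    idf = math.log((1 + total_doc_number) / (1 + first_doc_number))
    return tf * idf


def bigram_counts(corpus):
    """Count single characters and adjacent character pairs over a list of strings."""
    unigrams, bigrams = Counter(), Counter()
    for text in corpus:
        unigrams.update(text)
        bigrams.update(zip(text, text[1:]))
    return unigrams, bigrams


def chain_probability(word, unigrams, bigrams):
    """Product of P(current | previous) over the characters of the word (assumption)."""
    prob = 1.0
    for prev, cur in zip(word, word[1:]):
        if unigrams[prev] == 0:
            return 0.0  # previous character never seen in this library
        prob *= bigrams[(prev, cur)] / unigrams[prev]
    return prob


def is_professional_term(word, doc_words, pro_docs, pro_corpus, general_corpus,
                         weight_threshold, min_prob):
    """Return True/False below the weight threshold, None otherwise (unspecified case)."""
    if weight_in_document(word, doc_words, pro_docs) >= weight_threshold:
        return None
    third = chain_probability(word, *bigram_counts(pro_corpus))
    fourth = chain_probability(word, *bigram_counts(general_corpus))
    # Decision rule of claim 5: third > fourth and third >= preset probability.
    return third > fourth and third >= min_prob
```

In this sketch `pro_docs` is a list of word sets used for document counting, while `pro_corpus` and `general_corpus` are lists of raw strings used for the character-pair statistics. A real system would likely precompute both sets of counts once rather than on every call.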